Regression Analysis
Regression analysis is a statistical technique used to find relationships between variables for the purpose of predicting future values. If a set of data is given, it may be useful to “fit” a function to the data to describe it. The question is: what would be the best “fit” for a set of data?
Mathematical Foundations of Regression Analysis
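Given a set of data, $\{(x_i, y_i)\}_{i=1}^{n}$, the line that best “fits” the data would be a line, $y = ax + b$, such that the sum of the squared vertical distances between each $y_i$ and $ax_i + b$ is minimized.
Let E = the total error for a straight line fitted to a set of data, thus
$$E = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$$
If the partial derivatives are taken with respect to a and b,
$$\frac{\partial E}{\partial a} = -2\sum_{i=1}^{n} x_i\left(y_i - a x_i - b\right), \qquad \frac{\partial E}{\partial b} = -2\sum_{i=1}^{n} \left(y_i - a x_i - b\right)$$
To minimize E, the equations are set to zero. This yields a system of two equations in two variables:
$$a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$$
$$a\sum_{i=1}^{n} x_i + b\,n = \sum_{i=1}^{n} y_i$$
Putting this system of equations into matrix form and solving by Gauss-Jordan gives real number values for a and b such that y = ax + b is the line of best fit.
$$\left[\begin{array}{cc|c} \sum x_i^2 & \sum x_i & \sum x_i y_i \\ \sum x_i & n & \sum y_i \end{array}\right]$$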
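The two equations obtained by setting the derivatives to zero can also be solved directly in code. The following is a minimal Python sketch, not part of the original MAPLE session: the function name `fit_line` and the sample points are assumptions chosen for illustration. It accumulates the sums of x^2, x, xy and y, and solves the 2×2 system by Cramer's rule rather than by Gauss-Jordan elimination.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b from the two normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of [[sum_xx, sum_x], [sum_x, n]]; it is zero when all x are equal.
    det = sum_xx * n - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (sum_xy * n - sum_x * sum_y) / det
    b = (sum_xx * sum_y - sum_x * sum_xy) / det
    return a, b


# Hypothetical points lying exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the sample points lie exactly on a line, the fitted slope and intercept recover that line; with scattered data the same function returns the least-squares compromise.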

Similarly, if the best “fit” to the data is represented by a parabola, the quadratic $y = ax^2 + bx + c$, and E = the total error for a quadratic fitted to a set of data, minimizing by taking the derivative with respect to a, b and c and setting each to zero yields a system of three equations in three variables represented by the matrix:
$$\left[\begin{array}{ccc|c} \sum x_i^4 & \sum x_i^3 & \sum x_i^2 & \sum x_i^2 y_i \\ \sum x_i^3 & \sum x_i^2 & \sum x_i & \sum x_i y_i \\ \sum x_i^2 & \sum x_i & n & \sum y_i \end{array}\right]$$
From this matrix the values of a, b, and c of the quadratic of “best fit” can be determined.
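The three-equation system for the quadratic can be sketched the same way. The Python below is an illustration under stated assumptions, not the author's MAPLE code: the helper names `gauss_jordan`, `power_sum`, `weighted_sum` and `fit_quadratic` are hypothetical, the coefficient a is taken to multiply x^2, and the sample points are made up to lie on a known parabola.

```python
def gauss_jordan(aug):
    """Reduce a square system given as an augmented matrix to reduced row echelon form."""
    m = [[float(v) for v in row] for row in aug]
    rows = len(m)
    for col in range(rows):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the x values do not determine a parabola")
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(rows):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return m


def power_sum(xs, p):
    """Sum of x^p over the data."""
    return sum(x ** p for x in xs)


def weighted_sum(xs, ys, p):
    """Sum of x^p * y over the data."""
    return sum((x ** p) * y for x, y in zip(xs, ys))


def fit_quadratic(xs, ys):
    """Least-squares parabola y = a*x^2 + b*x + c from the 3x3 normal equations."""
    aug = [
        [power_sum(xs, 4), power_sum(xs, 3), power_sum(xs, 2), weighted_sum(xs, ys, 2)],
        [power_sum(xs, 3), power_sum(xs, 2), power_sum(xs, 1), weighted_sum(xs, ys, 1)],
        [power_sum(xs, 2), power_sum(xs, 1), len(xs), weighted_sum(xs, ys, 0)],
    ]
    reduced = gauss_jordan(aug)
    return tuple(row[3] for row in reduced)  # (a, b, c)


# Hypothetical points lying exactly on y = x^2 - 2x + 3.
print(fit_quadratic([-1, 0, 1, 2, 3], [6, 3, 2, 3, 6]))
```

The last column of the reduced matrix holds a, b and c, which is the same place the solution is read from after a Gauss-Jordan reduction in MAPLE.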
If it is speculated that a set of data may behave exponentially, it would be useful to find the exponential of “best fit”. Consider the exponential model, $y = Ce^{kx}$. The natural log is taken on both sides of the equation,
$$\ln y = \ln C + kx$$
If the original data is exponential, taking the natural log of the y-values will yield data that behave linearly. This suggests that to find an exponential of “best fit,” first evaluate the data $(x_i, \ln y_i)$. If this is linear, then fit a linear equation to the data $(x_i, \ln y_i)$, say $\ln y = ax + b$. The exponential fit of the original data can be found by
$$y = e^{ax + b} = e^{b}e^{ax}$$
so that $C = e^{b}$ and $k = a$.
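The log-then-fit procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming the model is written as y = C*exp(k*x); the function name `fit_exponential` and the generated data are hypothetical and do not come from the original.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = C*exp(k*x) by fitting a least-squares line to the points (x, ln y)."""
    if any(y <= 0 for y in ys):
        raise ValueError("ln y requires every y value to be positive")
    log_ys = [math.log(y) for y in ys]
    n = len(xs)
    sum_x = sum(xs)
    sum_ly = sum(log_ys)
    sum_xx = sum(x * x for x in xs)
    sum_xly = sum(x * ly for x, ly in zip(xs, log_ys))
    det = n * sum_xx - sum_x * sum_x
    a = (n * sum_xly - sum_x * sum_ly) / det       # slope of ln y against x
    b = (sum_xx * sum_ly - sum_x * sum_xly) / det  # intercept of ln y against x
    return math.exp(b), a                          # C = e^b, k = a


# Hypothetical data generated from y = 3*exp(0.5*x).
xs = [0, 1, 2, 3, 4]
ys = [3 * math.exp(0.5 * x) for x in xs]
C, k = fit_exponential(xs, ys)
print(C, k)
```

Note that this minimizes the squared error in ln y, not in y itself, so for noisy data the result can differ from a direct nonlinear least-squares fit of the exponential.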
How Good is Your Fit?
One way to decide if your regression is accurate is to determine the correlation coefficient. Below is a quick overview followed by steps in MAPLE.
Calculate the correlation coefficient.
The correlation coefficient between two variables is a statistic that measures the strength of the linear relationship between them, on a unitless scale of -1 to +1. That is, it measures the extent to which a linear model can be used to predict the deviation of one variable from its mean given knowledge of the other's deviation from its mean at the same point in time.
The correlation coefficient is most easily computed if we first standardize each of the variables. The standardized value of X is commonly denoted by X*, and the value of X*(i), where i is the number of the data point, is defined as:
$$X^*(i) = \frac{X(i) - \bar{X}}{S}$$
Where
X*(i) = the standardized value of the ith data point
X(i) = the original value of the ith data point
$\bar{X}$ = the mean of the data points
S = the standard deviation of the data points
Do a similar set of calculations for the Y values, standardizing them as Y*.
Now, the correlation coefficient is equal to the average product of the standardized values of the two variables, where the average divides by n - 1 to match the sample standard deviation. That is, if we let X* and Y* denote the standardized values of X and Y:
$$r = \frac{1}{n-1}\sum_{i=1}^{n} X^*(i)\,Y^*(i)$$
Finally, square r. The closer $r^2$ is to 1, the greater the linear correlation between the two variables.
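The standardize-then-average recipe can be written compactly. The following Python sketch is an illustration only: the function names `standardize` and `correlation` are hypothetical, and it assumes the sample standard deviation (dividing by n - 1), as the MAPLE code in this section does.

```python
import math


def standardize(values):
    """Return (v - mean) / S for each value, with S the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]


def correlation(xs, ys):
    """Sum of the products of standardized values, divided by n - 1."""
    x_star = standardize(xs)
    y_star = standardize(ys)
    return sum(a * b for a, b in zip(x_star, y_star)) / (len(xs) - 1)


# The five points used in the MAPLE example in this section.
r = correlation([-1, -2, 0, 1, 5], [3, 4, 6, -4, -2])
print(round(r, 4), round(r * r, 4))
```

For these five points the rounded value of r agrees with the r = -.6404 quoted in the text, and squaring it gives roughly .41.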
MAPLE code for this would consist of a few loops. Consider the case where we have five specific points and are looking to return r = -.6404. Since n = 5, 1/(n-1) would be .25. Also, the means of the data points (.6 for x and 1.4 for y) can be found easily, so I used constants in this example. The formula for standard deviation is
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
and what the first loop does is sum up all the .25(x-mean)^2 terms; the square root is then taken after the loop.
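As a worked check of this formula (a hand calculation added for illustration, using the five points and the means .6 and 1.4 from the MAPLE example in this section):

```latex
\begin{aligned}
\sum_{i=1}^{5}(x_i - 0.6)^2 &= 2.56 + 6.76 + 0.36 + 0.16 + 19.36 = 29.2,
  & S_x &= \sqrt{29.2/4} = \sqrt{7.3} \approx 2.7019 \\
\sum_{i=1}^{5}(y_i - 1.4)^2 &= 2.56 + 6.76 + 21.16 + 29.16 + 11.56 = 71.2,
  & S_y &= \sqrt{71.2/4} = \sqrt{17.8} \approx 4.2190
\end{aligned}
```

These are the values that finalstandx and finalstandy should hold once the first loop and the square roots have been evaluated.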
> x_val:=[-1,-2,0,1,5]; y_val:=[3,4,6,-4,-2]; n:=5;
> # accumulate the variances: .25 = 1/(n-1); .6 and 1.4 are the means of x and y
> standx:=0; standy:=0;
> for i from 1 to n do
standx:=(standx + .25*(x_val[i] - .6)^2):
standy:=(standy + .25*(y_val[i] - 1.4)^2):
end do;
> finalstandx:=sqrt(standx); finalstandy:=sqrt(standy);
> # standardize each point and sum the products of the standardized values
> xstar:=0; ystar:=0; r:=0;
> for i from 1 to n do
xstar[i]:=(x_val[i] - .6)/finalstandx;
ystar[i]:=(y_val[i] - 1.4)/finalstandy;
r:= r + xstar[i]*ystar[i];
end do;
> aver:= r/(n-1);
{Another example will follow with variable names that are clearer.}
R is called the Pearson product-moment correlation coefficient, after Karl Pearson, who first coined the phrase “standard deviation”.
R = { Mean(xy) - Mean(x) Mean(y) } / { SD(x) SD(y) }
Here SD denotes the population standard deviation (dividing by n rather than n-1); with that convention this formula gives the same value as the standardized form above. R measures how correlated the y_n are to the x_n. It varies from -1 to +1.
http://home.golden.net/~pjponzo/R-squared.htm
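As a worked check of the Mean/SD form of R (a hand calculation added for illustration, using the five points (-1,3), (-2,4), (0,6), (1,-4), (5,-2) from this section and assuming SD is the population standard deviation, which divides by n):

```latex
\begin{aligned}
\mathrm{Mean}(xy) &= \tfrac{-3 - 8 + 0 - 4 - 10}{5} = -5, \qquad
\mathrm{Mean}(x) = 0.6, \qquad \mathrm{Mean}(y) = 1.4 \\
\mathrm{SD}(x) &= \sqrt{\mathrm{Mean}(x^2) - \mathrm{Mean}(x)^2} = \sqrt{6.2 - 0.36} = \sqrt{5.84} \approx 2.4166 \\
\mathrm{SD}(y) &= \sqrt{\mathrm{Mean}(y^2) - \mathrm{Mean}(y)^2} = \sqrt{16.2 - 1.96} = \sqrt{14.24} \approx 3.7736 \\
R &= \frac{-5 - (0.6)(1.4)}{\mathrm{SD}(x)\,\mathrm{SD}(y)} = \frac{-5.84}{\sqrt{5.84 \cdot 14.24}} \approx -0.6404
\end{aligned}
```

Using sample standard deviations (dividing by n - 1) in the denominator without also rescaling the numerator would give a different, incorrect value, so the two conventions should not be mixed.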
Another specific example follows along with code from MAPLE.
In this example we plot 5 points on the coordinate plane as follows:
> x_val:=[-1,-2,0,1,5]; y_val:=[3,4,6,-4,-2]; n:=5; with(plots);
> for i from 1 to n do
C[i]:=[x_val[i],y_val[i]];
end do;
> my_data_plot:=plot([seq(C[i],i=1..n)],style=point):
> display(my_data_plot);
The plot of the 5 data points is in display 1.1.
Now that you have your points, you can look for a line of best fit. MAPLE code allows you to use loops to sum up values and linear algebra to solve the system of two equations in two variables.
This sets up a matrix for you to use. In this case it is a 2×3 matrix, initialized to zeros.
> with(linalg);
> A:=matrix(2,3,[0,0,0,0,0,0]);
This loop fills up the matrix with values for all the sums needed.
> for i from 1 to n do
A[1,1]:=A[1,1]+x_val[i]^2;
A[1,2]:=A[1,2]+x_val[i];
A[2,1]:=A[2,1]+x_val[i];
A[1,3]:=A[1,3]+x_val[i]*y_val[i];
A[2,3]:=A[2,3]+y_val[i];
end do;
> A[2,2]:=n;
> evalm(A);
> Matt:=gaussjord(A);
Looking at the solution to the matrix, we see that our line of best fit would be y=-1x + 2.
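The entries of the matrix can be checked by hand. The following worked arithmetic is reconstructed from the five data points, not copied from the MAPLE output:

```latex
\begin{aligned}
\sum x_i^2 &= 1 + 4 + 0 + 1 + 25 = 31, & \sum x_i &= 3, \\
\sum x_i y_i &= -3 - 8 + 0 - 4 - 10 = -25, & \sum y_i &= 7, \quad n = 5
\end{aligned}
\qquad
\left[\begin{array}{cc|c} 31 & 3 & -25 \\ 3 & 5 & 7 \end{array}\right]
\;\longrightarrow\;
\left[\begin{array}{cc|c} 1 & 0 & -1 \\ 0 & 1 & 2 \end{array}\right]
```

Substituting back confirms the reduction: 31(-1) + 3(2) = -25 and 3(-1) + 5(2) = 7, so a = -1 and b = 2.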
That line is plotted below in display 1.2.
[display 1.1: plot of the 5 data points]
> best_plot:=plot(Matt[1,3]*x+Matt[2,3],x=-2..5,color=VIOLET,thickness=2):
> display( {my_data_plot,best_plot} );
[display 1.2: the 5 data points with the line of best fit]
As you can see, the line doesn’t pass through all the points. That is consistent with the modest correlation coefficient found earlier for these points (r = -.6404, so r^2 is only about .41).
An Application of Linear Regression:
Recently released data on NBA players shows a trend in Free Throw Percentage as the players get taller.
Height        Free Throw Percentage
5 ft 11 in    79.1%
6 ft 0 in     75.8%
6 ft 1 in     77.6%
6 ft 2 in     76.5%
6 ft 3 in     77.2%
6 ft 4 in     77.5%
6 ft 5 in     76.8%
6 ft 6 in     76.0%
6 ft 7 in     75.3%
6 ft 8 in     73.4%
6 ft 9 in     72.7%
6 ft 10 in    71.5%
6 ft 11 in    70.6%
7 ft 0 in     68.2%
In order to simplify calculations, we will change all the heights to inches.
Using MAPLE we would see the following code and results.
> heights:=[71,72,73,74,75,76,77,78,79,80,81,82,83,84];
> freethrows:=[79.1,75.8,77.6,76.5,77.2,77.5,76.8,76.0,75.3,73.4,72.7,71.5,70.6,68.2];
> n:=14; with(plots):
The next block initializes the running sums and sets up the data as coordinate points for plotting. The heights and free throw percentages are also left in separate lists because we need them separate in order to sum the x and y values.
> stdheight:=0; stdfree:=0; heightstar:=0; freestar:=0; averheight:=0; averfree:=0;
> multiplier:=1/(n-1); sumheight:=0.0; sumfree:=0; avercorrelation:=0;
> correlation:=0;
> for counter from 1 to n do
Points[counter]:=[heights[counter],freethrows[counter]];
end do;
Now you can plot the points and you have your initialization done for the sums.
The next set of code will give us the mean of both the heights and free throws.
> for counter from 1 to n do
sumheight:= sumheight + heights[counter];
sumfree:=sumfree + freethrows[counter];
end do;
> averheight:=sumheight/n; averfree:=sumfree/n;
The next set of statements will find the standard deviation of the heights and the free throws.
> for counter from 1 to n do
stdheight:=(stdheight + multiplier*(heights[counter]-averheight)^2);
stdfree:=(stdfree + multiplier*(freethrows[counter]-averfree)^2);
end do;
> finalstdheight:=sqrt(stdheight); finalstdfree:=sqrt(stdfree);
We need the standard deviation of the heights and free throws in order to standardize the x and y.
The following set of code will find the correlation coefficient.
> for counter from 1 to n do
heightstar[counter]:=(heights[counter]-averheight)/finalstdheight;
freestar[counter]:=(freethrows[counter] - averfree)/finalstdfree;
avercorrelation:=avercorrelation + heightstar[counter]*freestar[counter];
end do;
> correlation:=avercorrelation/(n-1);
Therefore the correlation coefficient is approximately -.9. This signifies a strong negative linear relationship: in this data, taller players tend to have lower free throw percentages.
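The value of about -.9 can be cross-checked outside MAPLE. The Python below is a minimal sketch, not the author's code: the function name `pearson_r` is hypothetical, and it uses the sums-of-deviations form of the coefficient, in which the 1/(n-1) factors of the standardized form cancel.

```python
import math

heights = [71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84]
freethrows = [79.1, 75.8, 77.6, 76.5, 77.2, 77.5, 76.8, 76.0,
              75.3, 73.4, 72.7, 71.5, 70.6, 68.2]


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, freethrows)
print(round(r, 2))
```

Squaring the result gives a rough measure of how much of the variation in free throw percentage is accounted for by a linear dependence on height.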
The plot of the points in our application and the corresponding line of regression are below in display 1.3.
The code below will row reduce a matrix created by summing the values needed, as described in the Mathematical Foundations section of this document.
> with(linalg):
> Regression:=matrix(2,3,[0,0,0,0,0,0]);
> for counter from 1 to n do
Regression[1,1]:=Regression[1,1] + heights[counter]^2;
Regression[1,2]:=Regression[1,2] + heights[counter];
Regression[1,3]:=Regression[1,3] + heights[counter]*freethrows[counter];
Regression[2,3]:=Regression[2,3] + freethrows[counter];
end do;
> Regression[2,1]:=Regression[1,2];
> Regression[2,2]:=n;
> Answer:=evalm(Regression);
> RowReduced:=gaussjord(Answer);
> my_data_plot:=plot([seq(Points[counter],counter=1..n)],style=point):
> best_plot:=plot(RowReduced[1,3]*x+RowReduced[2,3],x=50..90,color=VIOLET,thickness=2):
> display( {my_data_plot,best_plot} );
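The slope and intercept that the row reduction produces can be cross-checked with a short Python sketch. This is an illustration under stated assumptions: the function name `fit_line` and the query height of 76 inches are hypothetical choices, and the slope is computed from sums of deviations, which is algebraically equivalent to solving the two-equation system.

```python
heights = [71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84]
freethrows = [79.1, 75.8, 77.6, 76.5, 77.2, 77.5, 76.8, 76.0,
              75.3, 73.4, 72.7, 71.5, 70.6, 68.2]


def fit_line(xs, ys):
    """Least-squares slope and intercept from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(heights, freethrows)
print(round(slope, 4), round(intercept, 2))

# Fitted free throw percentage at 76 inches, a height inside the data range.
predicted = slope * 76 + intercept
print(round(predicted, 1))
```

The slope of roughly -.67 means that, in this table, each additional inch of height goes with a drop of about two thirds of a percentage point. The line describes the tabulated heights only, so values far outside 71 to 84 inches should be treated with caution.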

Technology
The major drawback to the use of the technology mentioned herein, MAPLE, is its limited availability in the secondary school classroom. MAPLE can be a very powerful demonstration tool. If one is fortunate enough to be in a school that has a computer lab equipped with MAPLE software, MAPLE could be used in various lab projects regarding linear, quadratic and exponential fitting.
Regression analysis in the secondary school classroom would involve use of the TI-83 calculator. Students would calculate lines of regression by hand and by using the calculator. Using the lines of best fit, students could extrapolate and predict future events. Connections can be made to real world problems and their application in the realm of mathematics. Students can also calculate quadratics and cubics of best fit using the TI-83. Higher level math students can use the summation as an introduction to Calculus and limit theory.
Appendix
Interpolation: The computation of points or values between ones that are known or tabulated using the surrounding points or values. http://mathworld.wolfram.com/Interpolation.html
Extrapolation: The computation of points or values outside the range of the data that has been collected.
Least Squares: A mathematical procedure for finding the best fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve. The method of least squares assumes that the best-fit curve of a given type is the curve that has the minimal sum of the deviations squared (least square error) from a given set of data. http://mathworld.wolfram.com/LeastSquaresFitting.html
Regression: A method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion. The most common type of regression is linear regression. http://mathworld.wolfram.com/Regression.html