Regression Analysis

Regression analysis is a statistical technique used to find relationships between variables for the purpose of predicting future values. Given a set of data, it may be useful to “fit” a function to the data to describe it. The question is: what is the best “fit” for a given set of data?

Mathematical Foundations of Regression Analysis

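Given a set of data, $\{(x_i, y_i)\}_{i=1}^{n}$, the line that best “fits” the data is the line $y = ax + b$ for which the sum of the squared vertical distances between each $y_i$ and $ax_i + b$ is minimized.

Let E = the total error for a straight line fitted to a set of data, thus

$$E = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2$$

If the partial derivatives are taken with respect to a and b,

$$\frac{\partial E}{\partial a} = -2 \sum_{i=1}^{n} x_i \left( y_i - a x_i - b \right), \qquad \frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - a x_i - b \right)$$

To minimize E, both derivatives are set to zero. This yields a system of two equations in two variables:

$$a \sum x_i^2 + b \sum x_i = \sum x_i y_i$$

$$a \sum x_i + b\,n = \sum y_i$$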

Putting this system of equations into matrix form and solving by Gauss-Jordan elimination gives real-number values for a and b such that y = ax + b is the line of best fit.
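As a minimal sketch of this step in MAPLE, the two equations can be assembled from sums and solved directly. It assumes the LinearAlgebra package rather than the Gauss-Jordan reduction used later in this document. The data are hypothetical points chosen to lie exactly on y = 2x + 1, and the names X, Y, M, v, and sol are illustrative only.

```maple
with(LinearAlgebra):

# Hypothetical data (an assumption for illustration): points on y = 2x + 1
X := [1, 2, 3, 4]:
Y := [3, 5, 7, 9]:
n := nops(X):

# Sums that appear in the two equations
Sxx := add(X[i]^2, i = 1 .. n):       # sum of x^2
Sx  := add(X[i], i = 1 .. n):         # sum of x
Sxy := add(X[i]*Y[i], i = 1 .. n):    # sum of x*y
Sy  := add(Y[i], i = 1 .. n):         # sum of y

# Solve   a*Sxx + b*Sx = Sxy
#         a*Sx  + b*n  = Sy
M := Matrix([[Sxx, Sx], [Sx, n]]):
v := Vector([Sxy, Sy]):
sol := LinearSolve(M, v):

a := sol[1];   # slope
b := sol[2];   # intercept
```

Because the sample points lie exactly on a line, the sketch should return a = 2 and b = 1; with real data the same steps give the least-squares slope and intercept.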

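Similarly, if the best “fit” to the data is a parabola, $y = ax^2 + bx + c$, let E = the total error for a quadratic fitted to the set of data, $E = \sum \left( y_i - (a x_i^2 + b x_i + c) \right)^2$. Minimizing E by taking the partial derivatives with respect to a, b, and c and setting each to zero yields a system of three equations in three variables, represented by the augmented matrix:

$$\left[\begin{array}{ccc|c} \sum x_i^4 & \sum x_i^3 & \sum x_i^2 & \sum x_i^2 y_i \\ \sum x_i^3 & \sum x_i^2 & \sum x_i & \sum x_i y_i \\ \sum x_i^2 & \sum x_i & n & \sum y_i \end{array}\right]$$

Row reducing this matrix gives the values of a, b, and c for the quadratic of “best fit.”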
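The quadratic case can be sketched the same way in MAPLE. This is an illustration under stated assumptions, not the worksheet used in this document: it uses the LinearAlgebra package, and the data are hypothetical points chosen to lie exactly on y = 2x² - 3x + 1 so the answer is easy to check.

```maple
with(LinearAlgebra):

# Hypothetical data (an assumption): points on y = 2x^2 - 3x + 1
X := [-2, -1, 0, 1, 2]:
Y := [15, 6, 1, 0, 3]:
n := nops(X):

# Power sums of x
S1 := add(X[i], i = 1 .. n):
S2 := add(X[i]^2, i = 1 .. n):
S3 := add(X[i]^3, i = 1 .. n):
S4 := add(X[i]^4, i = 1 .. n):

# Sums involving y
T0 := add(Y[i], i = 1 .. n):
T1 := add(X[i]*Y[i], i = 1 .. n):
T2 := add(X[i]^2*Y[i], i = 1 .. n):

# Three equations in a, b, c obtained by setting the partial derivatives of E to zero
M := Matrix([[S4, S3, S2],
             [S3, S2, S1],
             [S2, S1, n ]]):
v := Vector([T2, T1, T0]):

sol := LinearSolve(M, v);   # entries are a, b, c in that order
```

For these points the solution should be a = 2, b = -3, c = 1, since the data lie exactly on that parabola.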

If it is suspected that a set of data behaves exponentially, it is useful to find the exponential of “best fit.” Consider the exponential model $y = Ce^{kx}$. Taking the natural log of both sides of the equation gives

$$\ln y = \ln C + kx$$

If the original data is exponential, taking the natural log of the y-values will yield data that behaves linearly. This suggests that to find an exponential of “best fit,” first evaluate the data $(x_i, \ln y_i)$. If this is linear, then fit a linear equation to the data $(x_i, \ln y_i)$, say $\ln y = ax + b$. The exponential fit of the original data can then be found by

$$y = e^{ax + b} = e^{b} e^{ax}$$
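A short MAPLE sketch of this log-transform approach follows. The data are hypothetical (chosen as y = 2·3^x, an assumption for illustration), the names lnY, Cfit, and kfit are illustrative, and the slope and intercept come from the usual closed-form solution of the two-equation linear system rather than from row reduction.

```maple
# Hypothetical data (an assumption): y = 2*3^x, which is exactly exponential
X := [0, 1, 2, 3, 4]:
Y := [2, 6, 18, 54, 162]:
n := nops(X):

# Step 1: take the natural log of the y-values
lnY := [seq(evalf(ln(Y[i])), i = 1 .. n)]:

# Step 2: fit a line  ln y = a*x + b  to the points (x, ln y)
Sx  := add(X[i], i = 1 .. n):
Sxx := add(X[i]^2, i = 1 .. n):
Sl  := add(lnY[i], i = 1 .. n):
Sxl := add(X[i]*lnY[i], i = 1 .. n):

a := (n*Sxl - Sx*Sl)/(n*Sxx - Sx^2):   # slope of the fitted line
b := (Sl - a*Sx)/n:                    # intercept of the fitted line

# Step 3: convert back to the exponential form  y = exp(b)*exp(a*x)
Cfit := exp(b);   # multiplier, close to 2 for these data
kfit := a;        # growth rate, close to ln(3) for these data
```

Note that this minimizes the squared error in ln y, not in y itself, so it is a convenient approximation to the exponential of best fit rather than an exact least-squares fit of the original data.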

How Good is Your Fit?

One way to decide if your regression is accurate is to determine the correlation coefficient. Below is a quick overview followed by steps in MAPLE.

Calculate the correlation coefficient.

The correlation coefficient between two variables is a statistic that measures the strength of the linear relationship between them, on a unitless scale of -1 to +1. That is, it measures the extent to which a linear model can be used to predict the deviation of one variable from its mean given knowledge of the other's deviation from its mean at the same point in time.

The correlation coefficient is most easily computed if we first standardize each of the variables. The standardized value of X is commonly denoted by X*, and the value X*(i), where i is the index of the data point, is defined as:

$$X^*(i) = \frac{X(i) - \bar{X}}{S}$$

where

X*(i) = the standardized value of the i-th data point

X(i) = the original value of the i-th data point

$\bar{X}$ = the mean of the data points

S = the standard deviation of the data points

Do a similar set of calculations for the Y values, standardizing them as Y*.

Now, the correlation coefficient r is equal to the average product of the standardized values of the two variables. That is, if we let X* and Y* denote the standardized values of X and Y:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} X^*(i)\, Y^*(i)$$

Finally, square r. The closer r² is to 1, the stronger the linear correlation between the two variables.

MAPLE code for this consists of a few loops. Consider the case where we have five specific points and expect the result r = -.6404. Since n = 5, 1/(n-1) is .25. The means of the data points are easy to find by hand (.6 for x and 1.4 for y), so they are entered as constants in this example. The formula for standard deviation is

$$S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( X(i) - \bar{X} \right)^2}$$

The first loop sums the terms .25(x - mean)², and the square root is taken after the loop.

> x_val:=[-1,-2,0,1,5]; y_val:=[3,4,6,-4,-2]; n:=5;
> standx:=0; standy:=0;
> for i from 1 to n do
>   standx:=standx + .25*(x_val[i] - .6)^2:    # .25 = 1/(n-1), .6 = mean of x
>   standy:=standy + .25*(y_val[i] - 1.4)^2:   # 1.4 = mean of y
> end do;
> finalstandx:=sqrt(standx); finalstandy:=sqrt(standy);
> r:=0;
> for i from 1 to n do
>   xstar[i]:=(x_val[i] - .6)/finalstandx;     # standardized x
>   ystar[i]:=(y_val[i] - 1.4)/finalstandy;    # standardized y
>   r:=r + xstar[i]*ystar[i];                  # running sum of products
> end do;
> aver:=r/(n-1);    # aver holds the correlation coefficient
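As a cross-check on the loops, MAPLE's Statistics package has a built-in correlation command. The sketch below assumes a MAPLE version that includes the Statistics package; the name r_check is illustrative.

```maple
with(Statistics):

x_val := [-1, -2, 0, 1, 5]:
y_val := [3, 4, 6, -4, -2]:

# Built-in sample correlation of the two lists
r_check := Correlation(x_val, y_val);

# Squared correlation, used to judge the strength of the linear fit
r_squared := r_check^2;
```

For these five points the result should agree with the value r = -.6404 quoted earlier, to the precision shown there.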

Another example, with clearer variable names, follows later in this document.

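The coefficient r (written R below) is called the Pearson product-moment correlation coefficient, after Karl Pearson, who also coined the term “standard deviation.”

R = { Mean(xy) - Mean(x) Mean(y) } / { SD(x) SD(y) }

$$R = \frac{\frac{1}{N}\sum x_n y_n - \bar{x}\,\bar{y}}{SD(x)\,SD(y)} = \frac{\frac{1}{N}\sum (x_n - \bar{x})(y_n - \bar{y})}{SD(x)\,SD(y)}$$

Here SD is computed with the same divisor N as the means. The result agrees with the average product of standardized values described earlier, because the divisors cancel.

It measures how correlated the y_n are to the x_n. It varies from -1 to +1, and

R = +1 means perfect (linear) correlation
R = 0 means no linear correlation
R = -1 means perfect inverse correlation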
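The Mean/SD form of R can be checked numerically. The following MAPLE sketch is an illustration under one stated assumption: SD is taken as the population standard deviation (divisor N), so that it matches the divisor used in the means. The variable names are illustrative.

```maple
x_val := [-1, -2, 0, 1, 5]:
y_val := [3, 4, 6, -4, -2]:
N := nops(x_val):

# Means of x, y, and the products x*y
mean_x  := add(x_val[i], i = 1 .. N)/N:
mean_y  := add(y_val[i], i = 1 .. N)/N:
mean_xy := add(x_val[i]*y_val[i], i = 1 .. N)/N:

# Population standard deviations (divisor N, an assumption stated above)
sd_x := sqrt(add((x_val[i] - mean_x)^2, i = 1 .. N)/N):
sd_y := sqrt(add((y_val[i] - mean_y)^2, i = 1 .. N)/N):

# R = { Mean(xy) - Mean(x) Mean(y) } / { SD(x) SD(y) }
R := evalf((mean_xy - mean_x*mean_y)/(sd_x*sd_y));
```

For these five points R evaluates to about -0.6404, the same value the standardized-product method gives.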

For more on R², see http://home.golden.net/~pjponzo/R-squared.htm

Another specific example follows along with code from MAPLE.

In this example we plot 5 points on the coordinate plane as follows:

> x_val:=[-1,-2,0,1,5]; y_val:=[3,4,6,-4,-2]; n:=5; with(plots):
> for i from 1 to n do
>   C[i]:=[x_val[i],y_val[i]];    # pair each x with its y
> end do;
> my_data_plot:=plot([seq(C[i],i=1..n)],style=point):
> display(my_data_plot);

The plot of the 5 data points is shown in display 1.1.

Now that the points are plotted, the next step is to find the line of best fit. MAPLE lets you use loops to compute the required sums and linear algebra to solve the resulting system of two equations in two variables.

The following sets up a matrix to hold the system. In this case it is a 2×3 augmented matrix.

> with(linalg):
> A:=matrix(2,3,[0,0,0,0,0,0]);

This loop fills up the matrix with values for all the sums needed.

> for i from 1 to n do
>   A[1,1]:=A[1,1]+x_val[i]^2;           # sum of x^2
>   A[1,2]:=A[1,2]+x_val[i];             # sum of x
>   A[2,1]:=A[2,1]+x_val[i];             # sum of x
>   A[1,3]:=A[1,3]+x_val[i]*y_val[i];    # sum of x*y
>   A[2,3]:=A[2,3]+y_val[i];             # sum of y
> end do;
> A[2,2]:=n;
> evalm(A);
> Matt:=gaussjord(A);
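As a check on the worksheet, the sums for the five points can be worked by hand. The arithmetic below is a sketch that follows the same row layout as the matrix A, with the augmented matrix on the left and its row-reduced form on the right.

```latex
\[
\sum x_i^2 = 31, \quad \sum x_i = 3, \quad \sum x_i y_i = -25, \quad \sum y_i = 7, \quad n = 5
\]
\[
\left[\begin{array}{cc|c} 31 & 3 & -25 \\ 3 & 5 & 7 \end{array}\right]
\;\longrightarrow\;
\left[\begin{array}{cc|c} 1 & 0 & -1 \\ 0 & 1 & 2 \end{array}\right]
\]
```

The last column of the reduced matrix holds the slope (first row) and the intercept (second row).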

Looking at the row-reduced matrix, we see that the line of best fit is y = -x + 2.

That line is plotted below in display 1.2.

display 1.1

> best_plot:=plot(Matt[1,3]*x+Matt[2,3],x=-2..5,color=VIOLET,thickness=2):
> display( {my_data_plot,best_plot} );

display 1.2

As you can see, the points are scattered fairly widely about the line. This is reflected in the modest correlation coefficient, r = -.6404 (so r² is about .41).
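One way to make the link between the scatter and r² concrete is to compare the squared error about the fitted line with the squared error about the mean of y. The MAPLE sketch below does this for the five points and the line y = -x + 2; the names SSE, SST, and r_squared are illustrative.

```maple
x_val := [-1, -2, 0, 1, 5]:
y_val := [3, 4, 6, -4, -2]:
n := nops(x_val):

mean_y := add(y_val[i], i = 1 .. n)/n:

# Squared error of the points about the fitted line y = -x + 2
SSE := add((y_val[i] - (-x_val[i] + 2))^2, i = 1 .. n):

# Squared error of the points about the mean of y
SST := add((y_val[i] - mean_y)^2, i = 1 .. n):

# Fraction of the variation in y accounted for by the line
r_squared := evalf(1 - SSE/SST);
```

Here SSE = 42 and SST = 71.2, so r_squared is about 0.41, which matches the square of r = -.6404.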

An Application of Linear Regression:

Recently released data on NBA players shows a trend in Free Throw Percentage as the players get taller.

Height        Free Throw Percentage
5 ft 11 in    79.1%
6 ft 0 in     75.8%
6 ft 1 in     77.6%
6 ft 2 in     76.5%
6 ft 3 in     77.2%
6 ft 4 in     77.5%
6 ft 5 in     76.8%
6 ft 6 in     76.0%
6 ft 7 in     75.3%
6 ft 8 in     73.4%
6 ft 9 in     72.7%
6 ft 10 in    71.5%
6 ft 11 in    70.6%
7 ft 0 in     68.2%

In order to simplify calculations, we will change all the heights to inches. Using MAPLE we would see the following code and results.

> heights:=[71,72,73,74,75,76,77,78,79,80,81,82,83,84];
> freethrows:=[79.1,75.8,77.6,76.5,77.2,77.5,76.8,76.0,75.3,73.4,72.7,71.5,70.6,68.2];
> n:=14; with(plots):

The heights and free throw percentages are kept as separate lists because we need them separate in order to sum the x and y values. The loop below also pairs them up as coordinate points for plotting.

> stdheight:=0; stdfree:=0; averheight:=0; averfree:=0;
> multiplier:=1/(n-1); sumheight:=0.0; sumfree:=0; avercorrelation:=0;
> correlation:=0;
> for counter from 1 to n do
>   Points[counter]:=[heights[counter],freethrows[counter]];
> end do;

Now you can plot the points and you have your initialization done for the sums.

The next set of code will give us the mean of both the heights and free throws.

> for counter from 1 to n do
>   sumheight:=sumheight + heights[counter];
>   sumfree:=sumfree + freethrows[counter];
> end do;
> averheight:=sumheight/n; averfree:=sumfree/n;

The next set of statements will find the standard deviation of the heights and the free throws.

> for counter from 1 to n do
>   stdheight:=stdheight + multiplier*(heights[counter]-averheight)^2;
>   stdfree:=stdfree + multiplier*(freethrows[counter]-averfree)^2;
> end do;
> finalstdheight:=sqrt(stdheight); finalstdfree:=sqrt(stdfree);

We need the standard deviation of the heights and free throws in order to standardize the x and y.

The following set of code will find the correlation coefficient.

> for counter from 1 to n do
>   heightstar[counter]:=(heights[counter]-averheight)/finalstdheight;    # standardized height
>   freestar[counter]:=(freethrows[counter] - averfree)/finalstdfree;     # standardized percentage
>   avercorrelation:=avercorrelation + heightstar[counter]*freestar[counter];    # running sum of products
> end do;
> correlation:=avercorrelation/(n-1);    # the correlation coefficient r

Therefore the correlation coefficient is approximately -.90. This signifies a relatively strong negative linear relationship: as height increases, free throw percentage tends to decrease.
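The value can be checked by hand. Because the factors of 1/(n-1) in the two standard deviations and in the average product cancel, r reduces to a ratio of three sums of deviations. The following is a worked sketch with rounded intermediate values; the symbols S_xx, S_yy, and S_xy are shorthand introduced here.

```latex
\[
\bar{x} = 77.5, \qquad \bar{y} = \frac{1048.2}{14} \approx 74.87
\]
\[
S_{xx} = \sum (x_i - \bar{x})^2 = 227.5, \qquad
S_{yy} = \sum (y_i - \bar{y})^2 \approx 127.35, \qquad
S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = -153.4
\]
\[
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \approx \frac{-153.4}{170.21} \approx -0.901
\]
```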

The plot of the points in our application and the corresponding line of regression is below in display 1.3.

The code below will row reduce a matrix created by summing the values needed as stated in the first pages of this document.

> with(linalg):
> Regression:=matrix(2,3,[0,0,0,0,0,0]);
> for counter from 1 to n do
>   Regression[1,1]:=Regression[1,1] + heights[counter]^2;
>   Regression[1,2]:=Regression[1,2] + heights[counter];
>   Regression[1,3]:=Regression[1,3] + heights[counter]*freethrows[counter];
>   Regression[2,3]:=Regression[2,3] + freethrows[counter];
> end do;
> Regression[2,1]:=Regression[1,2];
> Regression[2,2]:=n;
> Answer:=evalm(Regression);
> RowReduced:=gaussjord(Answer);
> my_data_plot:=plot([seq(Points[counter],counter=1..n)],style=point):
> best_plot:=plot(RowReduced[1,3]*x+RowReduced[2,3],x=50..90,color=VIOLET,thickness=2):
> display( {my_data_plot,best_plot} );
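The row-reduced result can be compared against a built-in fit and then used for prediction. The sketch below assumes the CurveFitting package is available. The variable h and the names fit, inside, and outside are illustrative, and the two heights evaluated (78 and 90 inches) are arbitrary choices for demonstration.

```maple
with(CurveFitting):

heights := [71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84]:
freethrows := [79.1, 75.8, 77.6, 76.5, 77.2, 77.5, 76.8, 76.0,
               75.3, 73.4, 72.7, 71.5, 70.6, 68.2]:

# Built-in least-squares line through the (height, percentage) pairs
fit := LeastSquares(heights, freethrows, h);

# Interpolation: a height inside the observed range (71 to 84 inches)
inside := eval(fit, h = 78);

# Extrapolation: a height outside the observed range, to be read with caution
outside := eval(fit, h = 90);
```

The fitted line should come out close to y = 127.13 - 0.674h, a drop of roughly two thirds of a percentage point per inch of height. Predictions far outside the 71 to 84 inch range rest on the assumption that the same linear trend continues, which the data cannot confirm.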

Technology

The major drawback of the technology mentioned herein, MAPLE, is its limited availability in the secondary school classroom. MAPLE can be a very powerful demonstration tool. If one is fortunate enough to be in a school that has a computer lab equipped with MAPLE software, MAPLE could be used in various lab projects on linear, quadratic, and exponential fitting.

Regression analysis in the secondary school classroom would more likely involve the TI-83 calculator. Students would calculate lines of regression by hand and by using the calculator. Using the lines of best fit, students could extrapolate and predict future events. Connections can be made to real-world problems and their application in the realm of mathematics. Students can also calculate quadratics and cubics of best fit using the TI-83. Higher-level math students can use the summations as an introduction to calculus and limit theory.

Appendix

Interpolation: The computation of points or values between ones that are known or tabulated, using the surrounding points or values. http://mathworld.wolfram.com/Interpolation.html

Extrapolation: The computation of points or values outside the range of the data that has been collected. Richardson extrapolation is one of the key ideas used in the popular Bulirsch-Stoer algorithm for solving ordinary differential equations. http://mathworld.wolfram.com/RichardsonExtrapolation.html

Least Squares: A mathematical procedure for finding the best fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve. The method of least squares assumes that the best-fit curve of a given type is the curve that has the minimal sum of the deviations squared (least square error) from a given set of data. http://mathworld.wolfram.com/LeastSquaresFitting.html

Regression: A method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion. The most common type of regression is linear regression. http://mathworld.wolfram.com/Regression.html