Definition: A statistical
technique used to find relationships between variables for the purpose of
predicting future values.
Goal: The goal of regression analysis
is to determine the values of parameters for a function that cause the function
to best fit a set of data.
In linear regression, the function is a
linear (straight-line) equation.
In quadratic regression, the function is a
parabola.
In exponential regression, the function
is an exponential curve.
Extrapolation: To estimate (a value
of a variable outside a known range) from values within a known range by
assuming that the estimated value follows logically from the known values.
Interpolation: To estimate a value
of (a function or series) between two known values.
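As a quick illustration of the two ideas, here is a minimal Python sketch (the function name linear_estimate is an illustrative choice, not a library routine): estimating inside the known range is interpolation, while estimating outside it is extrapolation.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the line through (x0, y0) and (x1, y1).

    If x0 <= x <= x1 this is interpolation; outside that range it is
    extrapolation (assuming the linear trend continues)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known values: y = 10 at x = 2 and y = 20 at x = 4.
print(linear_estimate(2, 10, 4, 20, 3))   # interpolation -> 15.0
print(linear_estimate(2, 10, 4, 20, 6))   # extrapolation -> 30.0
```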
Least Squares: A method of determining the curve that best describes the relationship between expected and observed sets of data by minimizing the sum of the squares of the deviations between observed and expected values.
Technology Pros and Cons:
There are pros and cons of using technology for computing regression. One pro is that the user is relieved of tedious computations and can spend more time on data analysis. A significant con is that the user can obtain results without understanding how the regression is computed.
Regression in the Secondary
Curriculum:
Technology that can calculate regression can be very useful in the secondary curriculum, for example in Algebra and Statistics classes.
In
Algebra, students could predict what they think is a best-fit line for a given
set of data points, and use a calculator to verify their results.
Mathematical Foundation for Regression
Analysis:
Given
a set of data points, there may be a best-fit function which can be used to
predict results. Some possible best-fit functions include linear functions
(straight-line), quadratic functions (parabolic) or exponential functions
(exponential curve).
Once the given data is plotted, visual inspection is useful for determining the type of regression analysis to use. Another method of finding a best-fit function is to run several of the available regression analyses and plot the outcomes to determine which fits the data best.
Simple linear regression refers to fitting a straight-line model by the method of least squares and then assessing the model.
The method of least squares requires no distributional assumptions. If we let the best-fit line be Y = aX + b, then the method of least squares finds the coefficients a and b by minimizing the sum of the squares of the vertical distances from the data points to the best-fit line. Let E be the sum of the squared vertical distances of the y_i's from the best-fit line:

E = Σ [y_i - (a x_i + b)]^2
To minimize E, we must take the partial derivatives of E with respect to a and b:

∂E/∂a = -2 Σ x_i [y_i - (a x_i + b)]
∂E/∂b = -2 Σ [y_i - (a x_i + b)]

By setting each equation equal to zero, we get the following system of equations (the normal equations):

a Σ x_i^2 + b Σ x_i = Σ x_i y_i
a Σ x_i + b n = Σ y_i
By solving this system for a and b,
we can find the equation of the best-fit line for the data in the form y = ax +
b.
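As a sketch of how this system can be solved in practice, the following Python function (an illustrative implementation, not a library routine) builds the sums above and solves the 2x2 normal equations by Cramer's rule:

```python
def linear_fit(xs, ys):
    """Fit y = a*x + b by solving the least-squares normal equations:
        a*sum(x^2) + b*sum(x) = sum(x*y)
        a*sum(x)   + b*n      = sum(y)
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Cramer's rule on the 2x2 system.
    det = sxx * n - sx * sx
    a = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

# Points lying exactly on y = 2x + 1 are recovered exactly.
print(linear_fit([0, 1, 2, 3], [1, 3, 5, 7]))   # -> (2.0, 1.0)
```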
Similarly, the quadratic best-fit curve in the form y = ax^2 + bx + c can be derived using the least squares method. Again, let E be the sum of the squared vertical distances of the y_i's from the best-fit quadratic:

E = Σ [y_i - (a x_i^2 + b x_i + c)]^2
To minimize E, we must take the partial derivatives of E with respect to a, b and c. By setting each equation equal to zero, we get the following system of equations:

a Σ x_i^4 + b Σ x_i^3 + c Σ x_i^2 = Σ x_i^2 y_i
a Σ x_i^3 + b Σ x_i^2 + c Σ x_i = Σ x_i y_i
a Σ x_i^2 + b Σ x_i + c n = Σ y_i
By solving this system for a, b and c, we can find the equation of the best-fit quadratic for the data in the form y = ax^2 + bx + c.
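The 3x3 system above can be solved the same way. The sketch below (illustrative helper names, no external libraries assumed) builds the normal equations and solves them with a small Gaussian elimination:

```python
def solve3(m, v):
    """Solve a 3x3 linear system m*x = v by Gauss-Jordan elimination
    with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [a[r][k] - f * a[col][k] for k in range(4)]
    return [a[i][3] / a[i][i] for i in range(3)]

def quadratic_fit(xs, ys):
    """Fit y = a*x^2 + b*x + c using the least-squares normal equations."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    m = [[s(4), s(3), s(2)],
         [s(3), s(2), s(1)],
         [s(2), s(1), n]]
    return solve3(m, [sx2y, sxy, sy])

# Points on y = x^2 - 2x + 3 are recovered (up to rounding).
print(quadratic_fit([0, 1, 2, 3, 4], [3, 2, 3, 6, 11]))
```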
Similarly, the exponential best-fit curve in the form y = a e^(bx) can be derived using the least squares method. Again, let E be the sum of the squared vertical distances of the y_i's from the best-fit exponential curve. By taking the natural log of both sides of y = a e^(bx), we get ln y = ln a + bx. If we let Y = ln y and A = ln a, then Y = A + bx is linear. Therefore, we can do the same as above for the linear best-fit line to find A and b, and then recover a = e^A.
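The log-transform trick above can be sketched in Python as follows (illustrative code, stdlib only). One caveat worth noting: minimizing squared error on the logged data is not identical to minimizing it on the original data, so this gives a good fit in log space rather than the exact least-squares exponential.

```python
import math

def exponential_fit(xs, ys):
    """Fit y = a*exp(b*x) by log-linearization: ln(y) = ln(a) + b*x
    is linear in x, so do a linear least-squares fit of ln(y) against x
    and recover a = exp(intercept). Requires all y values positive."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    sx = sum(xs)
    sly = sum(lys)
    sxx = sum(x * x for x in xs)
    sxly = sum(x * ly for x, ly in zip(xs, lys))
    det = n * sxx - sx * sx
    b = (n * sxly - sx * sly) / det
    lna = (sly - b * sx) / n
    return math.exp(lna), b

# Points on y = 2*exp(0.5*x) are recovered (up to rounding).
xs = [0, 1, 2, 3]
ys = [2 * math.exp(0.5 * x) for x in xs]
a, b = exponential_fit(xs, ys)
print(round(a, 6), round(b, 6))   # -> 2.0 0.5
```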
How Good is the Fit?
Correlation
coefficient
R is the sample correlation coefficient. It measures the extent to which the plotted points cluster about a best-fit model equation. The correlation coefficient ranges from -1 to 1 inclusive. If the value of R is close to 1, then the data suggest a positive relationship. If the value of R is close to -1, then the data suggest a negative relationship. If the value of R is close to zero, then the data suggest no relationship. However, the correlation coefficient can be misleading; a visual inspection of the plotted data should accompany the correlation analysis. It is important not to confuse the correlation coefficient with the slope of the best-fit line; they are not the same.
For the linear correlation coefficient, R is the sum of the products of the two standardized variables divided by one less than the number of data points:

R = [1/(n-1)] Σ [(x_i - x̄)/s_x][(y_i - ȳ)/s_y]

where x̄ and ȳ are the sample means and s_x and s_y are the sample standard deviations. Another version of R is

R = [n Σ x_i y_i - (Σ x_i)(Σ y_i)] / sqrt{[n Σ x_i^2 - (Σ x_i)^2][n Σ y_i^2 - (Σ y_i)^2]}
In simple linear regression, the square of the linear correlation coefficient, R^2, is the proportion of the variation of the response variable, y, explained by the straight-line model. For example, if R^2 = 0.84, then we can say that 84% of the variation in y is explained by the linear model.
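A short Python sketch (illustrative, stdlib only) implements both versions of R above and confirms that they agree:

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient R via the standardized-variables
    definition: sum of products of z-scores divided by n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

def correlation_computational(xs, ys):
    """The equivalent computational form of R."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx * sx) * (n * syy - sy * sy))

xs, ys = [1, 2, 3, 4], [2, 4, 5, 9]
# The two formulas give the same value (up to rounding).
print(round(correlation(xs, ys), 9) == round(correlation_computational(xs, ys), 9))
```

Data that lie exactly on a line with positive slope give R = 1, matching the discussion above.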
Example of simple linear regression
The following is housing data for Sturbridge new construction on 1-acre lots; the axes are square footage vs. cost in thousands of dollars.
Square Footage | Cost in Thousands of Dollars
1750 | 239
1872 | 289.9
2100 | 299.5
2024 | 324.9
2600 | 432.9
2688 | 339
2740 | 404.9
2780 | 410
2900 | 459.9
3400 | 589.9
The following is a scatter plot of the
data.
From a visual inspection, the data appear to be linear. Using the simple linear regression techniques described above, we get the linear model y ≈ 0.182x - 73.1, where x is square footage and y is cost in thousands of dollars.
The following
is a graph of the data and the linear model of the data.
The best-fit linear model appears to be a very good representation of the data. Next, we will calculate the linear correlation coefficient using the equation above, which gives R ≈ 0.94.
Since R is very close to one, it suggests that the data have a strong positive linear relationship. In other words, a linear model is appropriate for this data.
Next, we will calculate R^2 ≈ 0.876. This value suggests that about 87.6% of the variation in the data can be explained by the linear model.
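For readers who want to reproduce the example, the following Python script (illustrative, stdlib only; minor rounding differences from the hand computation are expected) computes the model, R, and R^2 directly from the table:

```python
import math

# Sturbridge housing data from the table above.
sqft = [1750, 1872, 2100, 2024, 2600, 2688, 2740, 2780, 2900, 3400]
cost = [239, 289.9, 299.5, 324.9, 432.9, 339, 404.9, 410, 459.9, 589.9]

n = len(sqft)
sx, sy = sum(sqft), sum(cost)
sxx = sum(x * x for x in sqft)
syy = sum(y * y for y in cost)
sxy = sum(x * y for x, y in zip(sqft, cost))

# Least-squares slope and intercept of cost = a*sqft + b.
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Linear correlation coefficient and coefficient of determination.
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

print(f"cost = {a:.4f}*sqft {b:+.2f}")
print(f"R = {r:.3f}, R^2 = {r * r:.3f}")
```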