Lisa Sypek
Kathy Harty
WHAT IS REGRESSION ANALYSIS?
Regression analysis calculates an equation that provides values of y for given values of x. The goal of regression analysis is to determine the values of the constants in a function so that the function best fits a set of data. In linear regression, the function is a linear (straight-line) equation (y = b0 + b1x). Other equations may better describe the relationship between the two variables, such as quadratic (y = ax^2 + bx + c), exponential (y = ab^x), logarithmic (y = a log_b x), or higher-degree polynomial functions. The purpose of obtaining these equations is to use them to make predictions.
“Since the line or curve that results is actually one of ‘best fit’, the difference between the actual value of the dependent variable and its predicted value for a particular observation is the error of the estimate which is known as the ‘deviation’ or ‘residual’. The goal of regression analysis is to determine the values of the parameters that minimize the sum of the squared residual values for the set of observations. This is known as a ‘least squares’ regression fit.” (Source: http://www.nlreg.com/intro.htm)
MATHEMATICAL FOUNDATIONS OF REGRESSION ANALYSIS
The result of regression analysis is a mathematical equation that describes the line or curve that best fits the data. In general there is a difference between the observed value of y and the value of y predicted by the equation. This vertical offset is called a residual, and it measures the error of the fit at that point. The goal of regression analysis is to find the relationship while minimizing the total error. The sum of the squares of the residuals is used, rather than the sum of their absolute values, so that it can be treated as a “continuous differentiable quantity”.
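As a small illustration of residuals, the sketch below is written in Python (not one of the tools discussed in this article, so treat it as an outside aid). It borrows the six data points and the hand-picked parabola y = 0.5x^2 - 4x + 6 from the Maple worksheet later in this article, and computes each residual and the sum of their squares. The function names are illustrative assumptions.

```python
# Data points from the Maple worksheet in this article.
x_val = [0, 3, 2, 5, 5, 6]
y_val = [6, 0, 1, 1, 4, 6]

def guess(x):
    """Hand-picked parabola from the worksheet (not the least-squares fit)."""
    return 0.5 * x**2 - 4 * x + 6

def residuals(f, xs, ys):
    """Observed y minus predicted y for each point (the vertical offsets)."""
    return [y - f(x) for x, y in zip(xs, ys)]

def sum_of_squares(f, xs, ys):
    """Sum of squared residuals: the quantity that least squares minimizes."""
    return sum(r * r for r in residuals(f, xs, ys))

res = residuals(guess, x_val, y_val)
sse = sum_of_squares(guess, x_val, y_val)
print(res)   # [0.0, 1.5, 1.0, 2.5, 5.5, 6.0]
print(sse)   # 75.75
```

Any other candidate curve can be passed in place of guess; the least-squares curve is the one for which this sum is smallest.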
The method of least squares finds the constants of the equation for which the sum of the squares of the differences in these y values is as small as possible. For a linear equation (y = mx + b), the values of two constants, m and b, must be obtained; for a quadratic (y = ax^2 + bx + c), the values of three constants must be found. The condition for the sum of squared residuals (written R^2 here, not to be confused with the squared correlation r^2 discussed later) to be a minimum is that its partial derivatives with respect to each constant must equal zero. What results is a set of equations (as many as there are unknowns: 2 for linear, 3 for quadratic, and so on), which can be solved by a variety of methods.
LINEAR Vs. QUADRATIC Vs. EXPONENTIAL
Linear regression analysis is used to find the best-fit straight line for a set of data. Since the equation y = mx + b has two constants, m and b, that need to be determined, it is necessary to take the partial derivatives of the sum-of-squares expression with respect to m and b individually. After calculating the necessary sums from the data, the result is a system of two linear equations in two unknowns that can be solved quite simply to determine m and b.
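The two-equation system described above can be sketched in Python (an assumption of this example, since the article itself uses Maple, Excel, and TI calculators). The function name fit_line is illustrative. The data are the six points from the Maple worksheet in this article, which were chosen to look parabolic.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations from setting both partial derivatives to zero:
    #   m*sxx + b*sx = sxy
    #   m*sx  + b*n  = sy
    det = n * sxx - sx * sx   # zero only when all x values are equal
    m = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return m, b

# Data points from the Maple worksheet in this article.
x_val = [0, 3, 2, 5, 5, 6]
y_val = [6, 0, 1, 1, 4, 6]

m, b = fit_line(x_val, y_val)
print(m, b)   # 0.0 3.0
```

For these particular points the best straight line is the horizontal line y = 3, which suggests that a line captures none of the curvature in the data and that a quadratic model is worth trying.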
Quadratic regression analysis is used to find the best-fit parabola for a set of data. Since the equation is y = ax^2 + bx + c, the values of three constants, a, b, and c, must be found. It is necessary to take the partial derivatives of the sum-of-squares expression with respect to a, b, and c individually. After calculating the necessary sums from the data, the result is a system of three linear equations in three unknowns that can be solved quite simply to determine a, b, and c. Below is the actual Maple code one could use to determine the quadratic equation that best fits some parabolic data.
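> # display comes from the plots package; matrix and gaussjord come from linalg
> with(plots): with(linalg):
> x_val:=[0,3,2,5,5,6];
> y_val:=[6,0,1,1,4,6];
> n:=6;
> for i from 1 to n do C[i]:=[x_val[i],y_val[i]]; od;
> our_data_plot:=plot([seq(C[i],i=1..n)],style=point):
> display(our_data_plot);
> # a hand-picked parabola, for comparison with the fitted one
> parab_graph:=plot(0.5*x^2-x*4+6,x=0..8,color=green,thickness=2):
> display(parab_graph);
> display({parab_graph,our_data_plot});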
> A:=matrix(3,4,[0,0,0,0,0,0,0,0,0,0,0,0]);
> for i from 1 to n do A[1,1]:=A[1,1] + x_val[i]^4; od;
> for i from 1 to n do A[1,2]:=A[1,2] + x_val[i]^3; od;
> for i from 1 to n do A[1,3]:=A[1,3] + x_val[i]^2; od;
> for i from 1 to n do A[1,4]:=A[1,4] + x_val[i]^2*y_val[i]; od;
> for i from 1 to n do A[2,1]:=A[2,1] + x_val[i]^3; od;
> for i from 1 to n do A[2,2]:=A[2,2] + x_val[i]^2; od;
> for i from 1 to n do A[2,3]:=A[2,3] + x_val[i]; od;
> for i from 1 to n do A[2,4]:=A[2,4] + x_val[i]*y_val[i]; od;
> for i from 1 to n do A[3,1]:=A[3,1] + x_val[i]^2; od;
> for i from 1 to n do A[3,2]:=A[3,2] + x_val[i]; od;
> A[3,3]:=n;
> for i from 1 to n do A[3,4]:=A[3,4] + y_val[i]; od;
> evalm(A);

Reduce the matrix using the Gauss-Jordan algorithm:
> Lisa:=gaussjord(A);

> Lisa[1,4];
> evalf(%);
> Lisa[2,4];
> evalf(%);
> Lisa[3,4];
> evalf(%);
> best_parab:=plot(Lisa[1,4]*x^2+Lisa[2,4]*x+Lisa[3,4], x=-5..10):
> display({best_parab,parab_graph,our_data_plot});
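For readers without Maple, the same calculation can be cross-checked with the Python sketch below. It is an illustrative re-implementation, not the authors' code: power_sum and gauss_jordan are made-up helper names, and exact fractions stand in for Maple's exact arithmetic. It builds the same 3-by-4 augmented matrix of sums and reduces it by Gauss-Jordan elimination.

```python
from fractions import Fraction

# Data points from the Maple worksheet in this article.
x_val = [0, 3, 2, 5, 5, 6]
y_val = [6, 0, 1, 1, 4, 6]

def power_sum(p, times_y=False):
    """Sum of x^p over the data, optionally multiplied by y."""
    return sum(Fraction(x) ** p * (y if times_y else 1)
               for x, y in zip(x_val, y_val))

# Augmented matrix of the normal equations, laid out like A in the worksheet:
# row r holds the sums of x^(4-r), x^(3-r), x^(2-r), and x^(2-r)*y.
A = [[power_sum(4 - r), power_sum(3 - r), power_sum(2 - r),
      power_sum(2 - r, times_y=True)] for r in range(3)]

def gauss_jordan(aug):
    """Reduce an augmented matrix to reduced row echelon form."""
    m = [row[:] for row in aug]
    for col in range(len(m)):
        # Swap in a row with a nonzero pivot, then scale the pivot to 1.
        pivot = next(r for r in range(col, len(m)) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return m

reduced = gauss_jordan(A)
a, b, c = (row[3] for row in reduced)
print(a, b, c)                          # 17/26 -103/26 79/13
print(float(a), float(b), float(c))     # decimal approximations
```

Under these assumptions the fitted parabola is roughly y = 0.654x^2 - 3.962x + 6.077, which is close to the hand-picked curve 0.5x^2 - 4x + 6 plotted in the worksheet.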
Exponential regression analysis is used to find the best-fit exponential curve for a set of data. Since higher-order polynomials can look exponential, a simple check is to graph x against ln y: if that graph appears linear, the original data are exponential. The equation y = Ae^(bx) can be made linear by taking the natural log of both sides, giving ln y = ln A + bx. This can be handled like the linear case, with slope b and intercept ln A; as a last step, compute e^(ln A) to recover A.
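The log transformation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name fit_exponential is made up, and the data are synthetic, generated from A = 2 and b = 0.5 so that the expected answer is known. Note that this approach minimizes the squared error in ln y rather than in y itself, so on noisy data it can differ from a direct nonlinear fit.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = A*exp(b*x) by a least-squares line through (x, ln y)."""
    n = len(xs)
    ln_y = [math.log(y) for y in ys]       # requires every y > 0
    mean_x = sum(xs) / n
    mean_ln_y = sum(ln_y) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_ln_y) for x, v in zip(xs, ln_y))
    b = sxy / sxx                           # slope of the fitted line
    ln_a = mean_ln_y - b * mean_x           # intercept of the line is ln A
    return math.exp(ln_a), b                # undo the log to recover A

# Synthetic data generated from A = 2, b = 0.5 (for illustration only).
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

A, b = fit_exponential(xs, ys)
print(A, b)
```

Because the synthetic points lie exactly on an exponential curve, the fit should return values very close to 2 and 0.5.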
HOW GOOD OF A FIT?
The use of r and r^2
| r | r^2 |
|---|---|
| 0.1 | 0.01 = 1% |
| 0.2 | 0.04 = 4% |
| 0.3 | 0.09 = 9% |
| 0.4 | 0.16 = 16% |
| 0.5 | 0.25 = 25% |
| 0.6 | 0.36 = 36% |
| 0.7 | 0.49 = 49% |
| 0.8 | 0.64 = 64% |
| 0.9 | 0.81 = 81% |
| 1.0 | 1.0 = 100% |
“For example, we found that the correlation between a nation's power and its defense budget was .66. This correlation squared is .45, which means that across the fourteen nations constituting the sample 45 percent of their variance on the two variables is in common (or 55 percent is not in common). In thus squaring correlations and transforming covariance to percentage terms we have an easy to understand meaning of correlation. And we are then in a position to evaluate a particular correlation. As a matter of routine it is the squared correlations that should be interpreted. This is because the correlation coefficient is misleading in suggesting the existence of more covariation than exists, and this problem gets worse as the correlation approaches zero.” (Source: http://www.mega.nu:8080/ampp/rummel/uc.htm#C8)
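To make r and r^2 concrete, the Python sketch below computes the Pearson correlation coefficient for the cigarette consumption and lung cancer table given in the Example at the end of this article. The function name correlation is an illustrative assumption, and the formula is the standard textbook one rather than anything taken from the quoted source.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Example table from the end of this article.
cigarettes = [270, 300, 350, 485, 505, 535]   # per capita consumption (x)
deaths = [97, 115, 165, 170, 190, 210]        # lung cancer deaths per 1 million (y)

r = correlation(cigarettes, deaths)
r_squared = r * r
print(round(r, 3), round(r_squared, 3))
```

For this table r comes out near 0.93 and r^2 near 0.87, so about 87 percent of the variance is shared. This shows the point of the table above: the squared value is noticeably smaller than r itself.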
TECHNOLOGY
“NLREG is a very powerful regression analysis program. Using it you can perform multivariate, linear, polynomial, exponential, logistic, and general nonlinear regression. What this means is that you specify the form of the function to be fitted to the data, and the function may include nonlinear terms such as variables raised to powers and library functions such as log, exponential, sine, etc. For complex analyses, NLREG allows you to specify function models using conditional statements (if, else), looping (for, do, while), work variables, and arrays. NLREG uses a state-of-the-art regression algorithm that works as well, or better, than any you are likely to find in any other, more expensive, commercial statistical packages.” (Source: http://www.nlreg.com/intro.htm)
Pros and cons of various pieces of technology
The LinReg function on a TI-89 calculator can be used for linear regression analysis. A number of programs that perform linear regression can also be downloaded to a TI-89 graphing calculator. These programs calculate the best-fit line for a set of points without using the built-in LinReg function and, unlike LinReg, graph the fitted line together with the points. The Regression Package for TI calculators allows the user to fit a set of points to a linear, logarithmic, sinusoidal, exponential, or power regression model, then visually compare the fitted curve and the original points on the graph. A quadratic regression program for TI calculators works in the same manner as the built-in linear regression.
There are actually two ways to do a linear regression
analysis using Excel. The first is done using the Tools menu, and results in a
tabular output that contains the relevant information. The second is done if
data have been graphed and you wish to plot the regression line on the graph.
In this version you have the choice of also having the equation for the line
and/or the value of R squared included on the graph.
The stats package in Maple provides a number of sub-packages and functions for data visualization, sorting, tabulating interval frequencies, computing measures of location and dispersion, computing distributions, and linear regression. Many of these functions are illustrated in the tutorial.
For ease of use, Excel outweighs the other methods, at least in the context of secondary students. Students often find the graphing calculator confusing, and for many Maple would seem like an outdated programming language. (Remember, none of these students have ever seen Fortran!)
The National Council of Teachers of Mathematics (NCTM) recommends that instructional programs in secondary schools should enable students to formulate questions that can be addressed through a multitude of mathematical procedures. These procedures most certainly would include the selection and use of appropriate statistical methods.
Statistics provides students with a rich opportunity to practice the development and evaluation of inferences and predictions that are based on data collection and analysis. The increased emphasis on data analysis and evaluation is supported by some of the more common themes in the standard algebra curriculum, yet the development of mathematical models based upon statistical procedures remains an infrequent experience in traditional algebra classes.
In studying data analysis and statistics, students often learn that solutions to some problems depend upon assumptions and involve a certain degree of uncertainty. Mathematical models that simulate linear relationships, for instance, are popular but not always realistic as taught in the context of a typical algebra class.
The simplest type of model relating a response variable y to a single quantitative independent variable x is given by the equation of a straight line, y = mx + b. Since this represents a deterministic model in which there is no error in y (that is to say, the value of y can be predicted exactly using the equation y = mx + b), it is fairly limited in its practical interpretation. Because a variable often can’t be represented by a simple deterministic equation in one or more quantitative independent variables, it becomes valuable for students to participate in classroom activities that force them to investigate deterministic linear equations in the context of a more realistic setting. This discussion ends up becoming their first introduction to statistics via regression analysis.
Regression: A method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion. The most common type of regression is linear regression.
Least squares fitting: A mathematical procedure for finding the best fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve. The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuous differentiable quantity.
Interpolation: The computation of points or values between ones that are known or tabulated using the surrounding points or values.
An estimate of future conditions based on the assumption that the current trends will continue.
Example
| Per capita cigarette consumption (x) | Lung cancer deaths (y) per 1 million |
|---|---|
| 270 | 97 |
| 300 | 115 |
| 350 | 165 |
| 485 | 170 |
| 505 | 190 |
| 535 | 210 |
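One way to use this table is to fit a least-squares line and then make predictions from it. The Python sketch below is an illustration only: the function names and the two query values (400 and 600) are assumptions chosen for the example, not part of the original exercise. The value 400 falls inside the range of the observed x values, while 600 lies beyond it, so the second prediction relies on the trend continuing.

```python
# Example table from this article.
cigarettes = [270, 300, 350, 485, 505, 535]   # per capita consumption (x)
deaths = [97, 115, 165, 170, 190, 210]        # lung cancer deaths per 1 million (y)

def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line, in deviation form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares_line(cigarettes, deaths)

def predict(x):
    """Predicted deaths per 1 million for a given consumption level."""
    return intercept + slope * x

inside = predict(400)    # between the smallest and largest observed x
outside = predict(600)   # beyond the largest observed x
print(round(slope, 4), round(intercept, 2))
print(round(inside, 1), round(outside, 1))
```

Under these assumptions the line is approximately y = 0.355x + 13.0, giving predictions of about 155 deaths per million at x = 400 and about 226 at x = 600. The second figure deserves more caution because no data were observed in that region.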
