Regression Analysis

MME 523

 

by Chris Benestad

St. John's High

Worcester, MA

 

Overview:  what is Regression Analysis?

The goal of regression analysis is to determine the parameter values that make a function best fit a set of data observations that you provide. In linear regression, the function is a linear (straight-line) equation.  In power or exponential regression, the function is a power function of the form $y = a x^b$ or an exponential function of the form $y = a b^x$.

 

Mathematical Foundations of Regression Analysis

Definition for line of best fit: A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.  We often use a regression line to predict the value of y for a given value of x.  Regression, unlike correlation, requires that we have an explanatory variable and a response variable.  The most common regression line is the Least-squares regression line (LSRL).  The LSRL of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. 
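Written out, the least-squares criterion and its solution take the following standard form. The symbols a (slope) and b (intercept) and the bar notation for means are notational assumptions, chosen to match the calculator's y = ax + b convention mentioned in the technology section.

```latex
\text{Choose } a \text{ and } b \text{ to minimize}\quad
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2

a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

Each term $y_i - (a x_i + b)$ is the vertical distance of a data point from the line, so S is the sum of squared vertical distances described above.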

 

Linear vs. Power vs. Exponential

A variable grows linearly over time if it adds a fixed increment in each equal time period.  Many situations in the real world exhibit growth that is not linear.  Two other functions that can model data are the power function and the exponential function.  A variable grows exponentially if it is multiplied by a fixed number greater than 1 in each equal time period.  Exponential decay occurs when the factor is between 0 and 1. 

In power regression, the response variable is modeled as proportional to the explanatory variable raised to a power. 
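The difference between adding a fixed increment and multiplying by a fixed factor can be seen in a short sketch. The starting value of 1000 and the 8% factor echo the bank example later in this document; the increment of 80 and the decay factor of 0.92 are illustrative assumptions only.

```python
# Linear growth adds a fixed increment each period;
# exponential growth (or decay) multiplies by a fixed factor each period.

def linear_growth(start, increment, periods):
    """Values after 0, 1, ..., periods steps of adding a fixed increment."""
    return [start + increment * t for t in range(periods + 1)]

def exponential_growth(start, factor, periods):
    """Values after 0, 1, ..., periods steps of multiplying by a fixed factor."""
    return [start * factor ** t for t in range(periods + 1)]

linear = linear_growth(1000, 80, 5)            # increment of 80 is an assumption
growth = exponential_growth(1000, 1.08, 5)     # factor > 1: growth
decay = exponential_growth(1000, 0.92, 5)      # factor between 0 and 1: decay

for t, (lin, grw, dec) in enumerate(zip(linear, growth, decay)):
    print(f"t={t}: linear={lin:8.2f}  growth={grw:8.2f}  decay={dec:8.2f}")
```

In the linear column successive differences are constant, while in the other two columns successive ratios are constant.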

Since both the exponential form and the power form involve exponents, we can construct the models in similar fashion: in each case we first take the log of both sides.  For exponential data, we plot log y against x, and if that produces a linear pattern, we perform a least-squares regression on the transformed data.  We then apply the inverse transformation and check whether the resulting exponential function captures the trend of the data.  For power functions, we again take the log of both sides but plot log y versus log x.  If the transformed points are linear, we find the LSRL for log y versus log x and apply the inverse transformation to obtain the power function. 
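The two transformations can be sketched in a few lines. This is a minimal illustration, assuming the exponential model is written y = a·b^x and the power model y = a·x^b; the helper names and the made-up check data are hypothetical and not part of the original document.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_exponential(xs, ys):
    """Fit y = a * b**x: regress log y on x, then invert the transformation."""
    slope, intercept = least_squares(xs, [math.log10(y) for y in ys])
    return 10 ** intercept, 10 ** slope   # (a, b)

def fit_power(xs, ys):
    """Fit y = a * x**b: regress log y on log x, then invert the transformation."""
    slope, intercept = least_squares([math.log10(x) for x in xs],
                                     [math.log10(y) for y in ys])
    return 10 ** intercept, slope         # (a, b)

# Made-up data that follow each model exactly, used only as a sanity check.
xs = [1, 2, 3, 4, 5]
exp_a, exp_b = fit_exponential(xs, [3 * 2 ** x for x in xs])   # y = 3 * 2^x
pow_a, pow_b = fit_power(xs, [2 * x ** 3 for x in xs])         # y = 2 * x^3

print(f"exponential: a={exp_a:.4f}, b={exp_b:.4f}")
print(f"power:       a={pow_a:.4f}, b={pow_b:.4f}")
```

Note that both x and y must be positive for the logs to exist; real data will not fall exactly on the transformed line, so the recovered a and b are estimates.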

 

How Good a Fit?   The use of $r$ and $r^2$

Correlation and regression are closely related.  The correlation $r$ is the slope of the LSRL when we measure both x and y in standardized units.  The square of the correlation, $r^2$, is the fraction of the variation of one variable that is explained by the least-squares regression on the other variable.  Correlation and regression should be interpreted with caution.  Watch out for extreme observations, and remember that correlation and linear regression describe only linear relationships.

You can examine the fit of a regression line or curve by studying the residuals, which are the differences between the observed and predicted values of y.  Watch for curved (non-linear) patterns in the residuals, uneven variation about the line or curve, and outlying points with large residuals.
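The statements about r, r-squared, and residuals can be checked numerically. The sketch below uses the studying data from the linear application later in this document; the helper functions are hypothetical names written for this illustration.

```python
import math

minutes = [10, 15, 20, 25, 30, 35, 40, 45, 50]
scores = [70, 72, 78, 83, 87, 90, 92, 95, 100]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def standardize(values):
    """Convert values to standardized units (z-scores)."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# r is the slope of the LSRL when both variables are standardized.
r, _ = least_squares(standardize(minutes), standardize(scores))

# Residuals: observed y minus predicted y.
slope, intercept = least_squares(minutes, scores)
residuals = [y - (slope * x + intercept) for x, y in zip(minutes, scores)]

# r^2 as the fraction of the variation in y explained by the regression.
total_variation = sum((y - mean(scores)) ** 2 for y in scores)
unexplained = sum(e ** 2 for e in residuals)
r_squared = 1 - unexplained / total_variation

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
for x, e in zip(minutes, residuals):
    print(f"x = {x:2d}  residual = {e:+.2f}")
```

Squaring the standardized slope reproduces the explained fraction, and the residuals of a least-squares line sum to zero apart from rounding.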

 

An Application of your choice

Linear: Minutes of studying versus performance on tests

Minutes    Average Score
10         70
15         72
20         78
25         83
30         87
35         90
40         92
45         95
50         100

Plot Data:

 

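Using a TI-83 Plus, the LSRL represented by the black line on the graph is determined to be approximately $\hat{y} = 0.747x + 62.8$, where x is the number of minutes spent studying and $\hat{y}$ is the predicted average score.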
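The calculator's result can be cross-checked by hand with the computational (sums) form of the least-squares formulas. This is a sketch of that by-hand calculation on the table above, not a reproduction of the calculator's internal method; the variable names are illustrative.

```python
minutes = [10, 15, 20, 25, 30, 35, 40, 45, 50]
scores = [70, 72, 78, 83, 87, 90, 92, 95, 100]

n = len(minutes)
sum_x = sum(minutes)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(minutes, scores))
sum_x2 = sum(x * x for x in minutes)

# Computational form of the least-squares slope and intercept.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

print(f"n = {n}, sum x = {sum_x}, sum y = {sum_y}, "
      f"sum xy = {sum_xy}, sum x^2 = {sum_x2}")
print(f"LSRL: y = {slope:.4f} x + {intercept:.4f}")

# Prediction inside the observed range of x (30 minutes of studying).
predicted_30 = slope * 30 + intercept
print(f"predicted score at 30 minutes: {predicted_30:.2f}")
```

Because 30 is the mean of the minutes column, the prediction there equals the mean score, which illustrates that the LSRL always passes through the point of means.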


Exponential: Growth of money in a bank account:  $1000 invested at 8%

Time (yrs)    Balance
1             $1,080.00
2             $1,166.40
3             $1,259.71
4             $1,360.49
5             $1,469.33
6             $1,586.87
7             $1,713.82
8             $1,850.93
9             $1,999.00
10            $2,158.92
11            $2,331.64
12            $2,518.17
13            $2,719.62
14            $2,937.19
15            $3,172.17
16            $3,425.94
17            $3,700.02
18            $3,996.02
19            $4,315.70
20            $4,660.96
21            $5,033.83
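The table can be checked for exponential behavior in two ways: successive balances should have a constant ratio, and regressing log(balance) on time should recover the starting amount and the growth factor. The sketch below assumes the model balance = a·b^t; the names a and b are notational assumptions.

```python
import math

years = list(range(1, 22))
balances = [1080.00, 1166.40, 1259.71, 1360.49, 1469.33, 1586.87, 1713.82,
            1850.93, 1999.00, 2158.92, 2331.64, 2518.17, 2719.62, 2937.19,
            3172.17, 3425.94, 3700.02, 3996.02, 4315.70, 4660.96, 5033.83]

# Check 1: each balance divided by the previous one should be constant.
ratios = [later / earlier for earlier, later in zip(balances, balances[1:])]

# Check 2: least-squares line of log10(balance) on time, then invert.
log_balances = [math.log10(b) for b in balances]
n = len(years)
t_bar = sum(years) / n
l_bar = sum(log_balances) / n
slope = (sum((t - t_bar) * (l - l_bar) for t, l in zip(years, log_balances))
         / sum((t - t_bar) ** 2 for t in years))
intercept = l_bar - slope * t_bar

a = 10 ** intercept   # estimated starting amount
b = 10 ** slope       # estimated yearly growth factor

print(f"ratios range from {min(ratios):.5f} to {max(ratios):.5f}")
print(f"fitted model: balance = {a:.2f} * {b:.5f}^t")
```

The small spread in the ratios comes only from rounding the balances to the nearest cent.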


Power: Cost of housing in MA: Year 1 represents 1993

Year    Average Price
1       163291
2       162854
3       167475
4       171702
5       178536
6       187213
7       200870
8       223539
9       261293

 

 

The curve that represents the data is a fourth-degree polynomial calculated by the TI-83 Plus.  The curve shows that housing prices in MA are growing rapidly.  If the trend continues, eventually no one will be able to buy a house in MA.
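For comparison with the polynomial curve, a power function can also be fitted to the same table by the log-log method described earlier. This is only a sketch, assuming the model price = a·year^b; it reports the r-squared of the transformed (log-log) points so the reader can judge how linear they are, and a value well below 1 would suggest a power model describes this table weakly.

```python
import math

years = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # Year 1 represents 1993
prices = [163291, 162854, 167475, 171702, 178536,
          187213, 200870, 223539, 261293]

# Transform both variables, then fit a least-squares line to the logs.
log_x = [math.log10(x) for x in years]
log_y = [math.log10(y) for y in prices]

n = len(years)
x_bar = sum(log_x) / n
y_bar = sum(log_y) / n
sxx = sum((x - x_bar) ** 2 for x in log_x)
syy = sum((y - y_bar) ** 2 for y in log_y)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_x, log_y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Invert the transformation: price = a * year**b.
a = 10 ** intercept
b = slope

# r^2 of the transformed data measures how linear the log-log plot is.
r_squared = sxy ** 2 / (sxx * syy)

print(f"power model: price = {a:.0f} * year^{b:.4f}")
print(f"r^2 of log-log data: {r_squared:.3f}")
```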

 

Technology: pros and cons of various pieces of technology

Technology can be used to determine least-squares regression lines.  On the TI-83 Plus, the STAT menu lets the student enter the data into the calculator, and the LinReg(ax+b) command then finds the slope and the y-intercept.  It will also give the $r$ and $r^2$ values.  One drawback of the calculator is that entering a large data set is time consuming.  

Another excellent tool is Excel.  All of the data and graphs in this document were produced in Excel.  It allows the student to enter the data and use the built-in tools to generate the graphs and the trend lines.  

 

 

The Secondary Curriculum

This type of work with regression lines is necessary in the secondary curriculum.  It requires the student to work with data and to calculate regression lines both by hand and with a calculator.  It also allows the student to see that mathematics applies to real-world data and that a regression line or curve can be used to forecast future data points.

 

 

 

Appendix: Terminology for the Novice

 

  • Regression: a line or a curve that describes how a response variable y changes as an explanatory variable x changes. 
  • Least squares: minimize the sum of the squares of the vertical distances of the observed y-values from the line or the curve.
  • Interpolation: the use of a regression line or curve for prediction inside the domain values of the explanatory variable x that you used to obtain the line or curve.  Such predictions are less risky than those of extrapolation. 
  • Extrapolation: the use of a regression line or curve for prediction outside the domain values of the explanatory variable x that you used to obtain the line or curve.  Such predictions cannot always be trusted.
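The difference between interpolation and extrapolation can be illustrated with the studying data from the linear application. In this sketch the `predict` helper is a hypothetical name, and the idea that a test score cannot exceed 100 is an assumption about the grading scale that the document does not state.

```python
minutes = [10, 15, 20, 25, 30, 35, 40, 45, 50]
scores = [70, 72, 78, 83, 87, 90, 92, 95, 100]

# Least-squares slope and intercept for score on minutes.
n = len(minutes)
x_bar = sum(minutes) / n
y_bar = sum(scores) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(minutes, scores))
         / sum((x - x_bar) ** 2 for x in minutes))
intercept = y_bar - slope * x_bar

def predict(x):
    """Return (predicted score, kind of prediction) for x minutes of studying."""
    inside = min(minutes) <= x <= max(minutes)
    kind = "interpolation" if inside else "extrapolation"
    return slope * x + intercept, kind

inside_value, inside_kind = predict(32)     # within the observed 10-50 range
outside_value, outside_kind = predict(60)   # beyond the observed range

print(f"32 minutes: {inside_value:.1f} ({inside_kind})")
print(f"60 minutes: {outside_value:.1f} ({outside_kind})")
# Assumption: scores are capped at 100, so a prediction above 100 is a
# sign that the line should not be trusted outside the observed domain.
```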