5: Regression Analysis

5: Regression Analysis Line of Best Fit 1 Adding and Modifying a Trendline 1 The function SLOPE and INTERCEPT 3 The FORECAST Function 4 The LINEST and...
Author: Marshall Warner
22 downloads 2 Views 80KB Size
5: Regression Analysis Line of Best Fit 1 Adding and Modifying a Trendline 1 The function SLOPE and INTERCEPT 3 The FORECAST Function 4 The LINEST and TREND functions 4 Using the Regression Analysis Add-in 6 Multiple Regression 6 Non-linear Regression - Polynomial 8 Perils of Forecasting 9

Line of Best Fit Figure 1 shows a chart generated from a table of data. The (x,y) pairs are plotted on the chart as diamond-shaped markers. A line of best fit has been drawn through the data points. This line is also called the regression line. And, to make things more complicated, Excel refers to it as a Trendline. A 1 2 3 4 5 6 7 8 9 10 11 12 13

B x 2 4 6 8

y 5.6 11.1 15.6 18.2

C

D

E

F

G

H

8

10

25 20

y = 2.115x + 2.050

15 10 5 0 0

2

4

6

Figure 1 The theory of regression is covered in chapters 14 and 15 of Quantitative Approaches in Business Studies. In this supplement we will concentrate on the use of Excel to get results but we must be sure of our terminology. In Figure 1, the line of best fit is a straight line. We say that the data fits a linear equation. This equation is generally written by statisticians in the form y = a + bx where a is the intercept (the y-value where the line hits the y-axis) and b is the slope (rise over run.) However, most scientists and engineers define a straight line by the equation y = mx + b where b is the intercept and m is the slope. The Help facility of Excel uses both forms for the linear equation! Moral: in regression analysis the 2b’s are not the same 2b’s so when working with a new book or new Help item, read the text to see which formula is being used. In Figure 1, the line of best fit has the form y = 2.115x + 2.050. So the slope of the line is 2.115 while the intercept is 2.050.

2 Regression Analysis

Adding and Modifying a Trendline Prepare a chart as shown in Figure 1 with the data displayed as markers. Right click on one of the markers to open the popup menu and select Add Trendline to bring up the Trendline dialog box as shown in Figure 2. We require a linear fit, so ensure that the first thumbnail sketch is selected. Open the Options tab (Figure 2) and click on the box Display equation on chart. After you have clicked the OK button, the chart should be similar to that in Figure 1.

Figure 2

Figure 3

The trendline has a number of useful options that may be set when it is initially added to the chart or subsequently by right clicking on it and selecting format trendline. This brings up the dialog box shown in Figure 4. By default, Excel draws a rather thick line for the

Regression Analysis 3 trendline but this may be altered using the Pattern tab of the dialog. The other two tabs in the Format Trendline dialog are the same as those of the Add Trendline dialog. The reader is asked to note two features: The first is the box labelled Display R-squared value on chart. This quantity is called the coefficient of determination and ranges from zero to 1 where the latter indicates a perfect correlation – the predicted y values are the same as the actual y values. The second feature is the Forecast area which allows one to extrapolate the trendline forwards and backwards. The trendline in chart in Figure 1 has been extended by 2 x units in both direction. Beware, however, of the perils attached to extrapolation (see Quantitative Approaches in Business Studies page 318.)

Figure 4

The function SLOPE and INTERCEPT We may wish to use the value of the slope and intercept in additional calculations on a worksheet. We could copy the values from the equation displayed on the chart but this would be dangerous for, if the XY data was changed for any reason, we would need to remember to transcribe the new slope and intercept values. The worksheet function SLOPE and INTERCEPT may be used to have the required values placed in cells. The syntax of SLOPE is =SLOPE(known_y’s, known_x’s) and the syntax for INTERCEPT is =INTERCEPT(known_y’s, known_x’s) Figure 5 shows a chart of the data from page 297 of Quantitative Approaches in Business Studies. The trendline equation is y = 8x + 48 with a low R2 (R-squared not R2) value of 0.64. We may also obtain the values for these regression coefficients in B8:B10 with the formulas =SLOPE(B2:B6, A2:A6), INTERCEPT(B2:B6, A2:A 6) and RSQ(B2:B6,A2:A6), respectively. The last function computes the R-squared quantity.

4 Regression Analysis

Figure 5

The FORECAST Function Knowing the slope and the intercept of the line of best fit, we can calculate the expected y values as shown in Figure 6. The equation in C2 is =$B$8*B2 + $B$9 where the use of $ symbols to make the references absolute permits us to copy the formula down to row 6. Excel provides the function FORECAST that gives us another way to get the expected y values. The formula in D2 is =FORECAST(A2,$B$2:$B$6,$A$2:$A$6). This computes the slope and intercept behind the scenes and then finds the expected y value. Of course, either method could be used to interpolate or extrapolate. Thus if A7 contained a value representing a possible advertising budget, we could predict the expected sales using either =B8*B2 + B9 or =FORECAST(A2,B$2:B6,A2:A6).

1 2 3 4 5 6 7 8 9 10

A B Advertising Sales 2 60 3 80 4 70 5 100 6 90 slope intercept rsq

C D y=mx+b FORECAST 64 64 72 72 80 80 88 88 96 96

8.0 48.0 0.64

Figure 6

The LINEST and TREND functions The LINEST function can be used in place of SLOPE and INTERCEPT to find the regression coefficients. Why would Excel provide two ways to do the same thing? SLOPE

Regression Analysis 5 and INTERCEPT are simpler to use while LINEST is more powerful in that it has options not present in the other functions. Similarly, TREND can do all that FORECAST can do and more. The syntax for LINEST is =LINEST(known_y’s, known_x’s, const, stats). The first argument is required while the other three are optional. If known_x’s is omitted, LINEST assumes the values to be {1, 2, 3, ...}. If the argument const is set to TRUE or omitted, then LINEST computes a slope and intercept in exactly the same way as the SLOPE and INTERCEPT functions. When the argument is set to FALSE, LINEST find the line of best fit that passes through the origin of the chart – i.e. it sets the intercept b to zero and finds the best fit for a line with the form y = mx. When the last argument, stat is set to TRUE, LINEST computes additional regression statistics. Since LINEST returns more than one value, it is an array function. Before typing the formula, you must select the required number of cells and you must complete the formula with C+S+R. In Figure 7 we show a worksheet with the same x and y values as in Figure 6. The regression coefficients have been computed with LINEST. The range B11:C15 was first selected and then the formula =LINEST(B2:B6,A2:A6,TRUE,TRUE) was entered and completed with C+S+R. When Excel completes the formula it adds braces { } around an array formula. A similar method was used to enter =TREND(B2:B6,A2:A6,A2:A6,TRUE) into C2:C6.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

A Advertising

B 2 3 4 5 6

slope intercept linest

C Sales 60 80 70 100 90

trend 64 72 80 88 96

8.0 48.0 8 3.464101615 0.64 5.333333333 640 slope slope error 2 R F statistic sum of squares

48 14.69693846 10.95445115 3 360 intercept intercept error y estimate error degrees of freedom residual sum of squares

Figure 7 We see that LINEST computed the same values for the slope, intercept and R2 values as before. The additional statistics are noted in the figure: see Excel Help for more details. Had we selected only the cells B11:C11 and used =LINEST(B2:B6,A2:A6), only the slope and the intercept would have resulted.

6 Regression Analysis It is left as an exercise for the reader to fit the data below to the equation y=mx using LINEST and show that the slope is 3.6057. x

5

10

12

15

y

19.5

38.2

41.1

53.9

You can also make a chart with a trendline passing through the origin to find the same slope.

Using the Regression Analysis Add-in The same information that LINEST gave may be obtained from the Regression Analysis add-in. With the same worksheet as in Figure 7, use the menu command Tools|Data Analysis|Regression to open the dialog box shown in Figure 8. When the OK button is clicked, the results shown in Figure 9 are obtained. NOTE: the information produced by a Data Analysis add-in is static unlike the dynamic results from formulas. If you change the XY data (information in A2:B6) you must regenerate the analysis data.

Figure 8

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

J SUMMARY OUTPUT

K

L

M

N

O

Regression Statistics Multiple R 0.8 R Square 0.64 Adjusted R Square 0.52 Standard Error 10.95 Observations 5 ANOVA df Regression Residual Total

Intercept Advertising

Figure 9

SS 1 3 4

MS 640 360 1000

F Significance F 640 5.333333333 0.104088039 120

Coefficients Standard Error t Stat P-value 48 14.69693846 3.265986324 0.046918927 8 3.464101615 2.309401077 0.104088039

Lower 95% 1.227738632 -3.024327728

Regression Analysis 7

Multiple Regression When the value of a quantity is dependent on two or more factors, we need to use multiple regression to analyse the problem. In Figure 10 we reproduce the data from page 329 of Quantitative Approaches in Business Studies. This shows the sales figures for ten regions with varying amounts spent on advertising and having different numbers of households. We wish to know how the two factors (advertising and households) affect the sales figure. We may get this data from a LINEST formula. Select F2:H6, enter =LINEST(B2:B11, C2:D11, TRUE, TRUE) and complete the formula with C+S+R. The meaning of each cell is explained by the blue notes in the figure. A 1 2 3 4 5 6 7 8

Region A B c d e f g

B C D Sales Advertising Households (£00 000) (£00 000) (00 000) 20 0.2 5.15 25 0.2 5.42 24 0.2 5.76 30 0.3 6.17 32 0.3 6.83 40 0.4 7.07 28 0.3 5.00

9 h

50

0.5

7.42

10 I 11 j 12 13 14 15 16

40 50

0.4 0.5

7.47 7.70

E

F

G

H

LINEST 2.168 1.052 0.982 189.828 998.490

73.754 9.295 1.622 7.000 18.410

-4.310 4.386 #N/A #N/A #N/A

b2

b1

b0

b2 error 2 R F statistic sum of squares

b1 error

b0 error

y estimate error degrees of freedom residual sum of squares

y = b0 + b1x1+b2x 2 Sales = -4.3 + 73.6xAdvertising + 2.2xHouseholds All variables in 00 000

Figure 10 We could also use the Regression tool as before. The y-values are B1:B11 and the xvalues are in C1:D11 (remember to check the labels box.) The results are as shown in Figure 11. J 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

K

L

M

N

O

SUMMARY OUTPUT Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations

0.991 0.982 0.977 1.622 10

ANOVA df Regression Residual Total

Intercept X Variable 1 X Variable 2

Figure 11

SS 998.490 18.410 1016.900

MS 499.245 2.630

Coefficients Standard Error -4.310 4.386 73.754 9.295 2.168 1.052

t Stat -0.983 7.935 2.061

2 7 9

F Significance F 189.828 0.000

P-value 0.359 0.000 0.078

Lower 95% -14.681 51.775 -0.319

8 Regression Analysis

Non-linear Regression - Polynomial We may wish to fit data to polynomial expressions in the form y = b0 + b1x + b2x2 + b3x3 ... For example, maybe we wish to fit the data below (taken from page 335 of Quantitative Approaches in Business Studies) to a second order polynomial y = b0 + b1x + b2x2. x

1

2

3

4

5

6

7

8

y

7

10

12

18

22

32

45

60

We will see two slightly different ways of attacking this type of problem. In the first way we include a column with x2 data. We can then think of the x data as one variable and the x2 as a second and use the same techniques as we did for multiple regression. This is shown in Figure 12. The x data is in column A, the x2 in column B and the y in column D. Cells D2:F2 were selected and the formula =LINEST(C2:C9, A2:B9) was entered and completed with C+S+R. Note that we have omitted the optional const and stat arguments so that we get just the regression coefficients. This tells us that the data can be represented by the quadratic equation y = 1.14x2 !3.02x + 10.21. It is left as an exercise for the reader to use the data in A1:C9 with the Regression tool. A

B x 1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8 9

C x2 1 4 9 16 25 36 49 64

D y LINEST 7 1.142857 b2 10 12 18 22 32 45 60

E

F

-3.02381 10.21429 b1 b0

Figure 12 A second method is used in Figure 13. Here there is no column for the x2 data. This is computed by the LINEST function. Cells D2:F2 were selected and the formula =LINEST(B14:B21, A14:A21^{1,2}) was entered and completed with C+S+R. The part of the formula reading ^{1,2} raises the x values to powers of 1 and of 2. This method may be used with higher order polynomials and negates the need for extra columns of data. It is left as an exercise for the reader to draw a chart of the data in A13:B21 and add a trendline for a second order polynomial. A 13 14 15 16 17 18 19 20 21

Figure 13

B x 1 2 3 4 5 6 7 8

C y 7 10 12 18 22 32 45 60

D LINEST 1.142857 b2

E

F

-3.02381 10.21429 b1 b0

Regression Analysis 9

Perils of Forecasting In Quantitative Approaches in Business Studies, and this supplement, the reader has been warned of the perils of extrapolating regression data. Here is an example to show you why it is often misleading. In A1:G3 of Figure 14 we have some costs and revenue data for a six year period. This is shown graphically in the left hand chart. Close examination of the chart shows the two lines are slowly converging. A 1 year 2 costs 3 revenue 4 5 120 6 110 7 100 8 90 9 80 10 70 11 60 12 50 13 40 14 0 15 16

B

C

1 70 90

2 74 94

D 3 79 97

E

F

G

4 83 101

5 88 105

6 94 109

H

costs revenue

cost = 4.7429x + 64.733 2 R = 0.9958 rev = 3.7714x + 86.133 R2 = 0.9983 2

4

6

8

I year costs revenue

J

K L M N O 1 2 3 4 5 6 1.845 1.869 1.898 1.919 1.944 1.973 1.954 1.973 1.987 2.004 2.021 2.037 costs

2.20 2.15 2.10 2.05 2.00 1.95 1.90 1.85 1.80

revenue

costs = 0.0254x + 1.8194 R2 = 0.9988

rev = 0.0165x + 1.9384 R2 = 0.999 0

2

4

6

8

10

12

14

Figure 14

The formula in J2 is =LOG(B2) and this has been copied to J2:O3. The right hand chart shows the logarithmic data. In each chart, trendlines have been added to the two data series. (After the trendlines were added, references to them in the legends were removed. Also, the formulas were carefully edited to replace the y’s by costs and rev). The original data gives very high R2 values indicating a reasonably good fit to a straight line. The logarithmic data has even higher R2 values. Note that it is a mathematical fact that if data is a reasonable fit to a straight line, then the logarithm of the data will apparently be an even better straight line. In the righthand chart, the trendlines were extended forward by 5 units. They appear to intersect at about year 13. Does the original data give the same prediction? Rather than extend the trendlines, we will find the answer mathematically. The two lines will intersect when these two values are the same. costs = 4.74years + 64.7 revenue = 3.77years + 86.1 At intersection: 4.74years + 64.7 = 3.77years + 86.1 Or (4.74 ! 3.77)years = 86.1 ! 64.7 Or 0.97years = 21.4, giving years 22. So, we have one prediction that the company will start to become unprofitable in 13 years and another predicting 22 years. We have extended six years of data by a factor of two in the first case and a factor of over 3 in the second. Would you trust six year’s worth of business data to predict what will happen seven years from now?