R egression is perhaps the most widely used data

Using Statistical Data to Make Decisions Module 4: Introduction to Regression Tom Ilvento, University of Delaware, College of Agriculture and Natura...

Author: Isabel Sharp

1 downloads 0 Views 144KB Size

Report

Download PDF

Recommend Documents

Concrete is one of the most widely

The most widely used cutting instrument in EXPERIMENTAL

Cinderella. The story of Cinderella is perhaps the most widely told tale in the. About the Tale

THE analysis of the ECG is widely used for diagnosing

It is perhaps unfortunate, but the most recognizable line from

WORD sense disambiguation (WSD) is perhaps the most

The fluidized-bed jet mill (FBJM) is widely used

THE term symptomatic gallstones is widely used and

Diabetes is one of the most widely spread modern lifestyle

Forced-air warming (FAW) is widely used to

Propofol is a general anesthetic widely used during

Cyclophosphamide (CYC) is widely used to treat systemic

Atrazine, one of the most widely used pesticides. The Economics of Atrazine FRANK ACKERMAN, PHD

The common femoral artery is the most frequently used site

Consumption taxes are the second-most widely used source of tax revenue in Illinois. General sales

Cassava is widely grown and is one of the most important

Unexplored Dimensions of EPL EPL is perhaps the most widely investigated institution in the LM. Many features of EPL have not been taken into account

The implantable cardioverter defibrillator (ICD) is perhaps

TRANSMISSION lines are widely used for transportation

Hydrophilic matrix tablets are widely used for

Although widely used and well established, physical

Chiropractic services are widely used in the United States

Sustainability Reporting Trends in North America Most Active Sectors and Widely Used Guidelines

Bonded labour - or debt bondage - is probably the least known. form of slavery today, and yet it is the most widely used method of

Using Statistical Data to Make Decisions

Module 4: Introduction to Regression Tom Ilvento,

University of Delaware, College of Agriculture and Natural Resources, Food and Resource Economics

egression is perhaps the most widely used data analysis tool in statistics. Many fields use it as away to analyze simple or very complex models. It has been found useful in business applications, economics, the other social sciences (sociology, political science, and psychology) and in the hard sciences. It links well with covariance and correlation and is related to Analysis of Variance in experimental designs. Perhaps it best asset is that the technique is very flexible and therefore lends itself to many applications. As a statistical technique it is intuitive in its interpretation, fairly robust when assumptions are violated, and it provides a means to understand relationship while simultaneously controlling for other variables. It is the closest thing to a scientific “vacuum” when dealing with nonexperimental data.

• Understand the principles of the regression approach

This module will introduce regression by focusing on a bivariate regression, with one explanatory variable modeled on a dependent variable. We will explore the regression line and the coefficients that are estimated from the model, the ANOVA table (Analysis of Variance), R-square (R2) for the model, and the interpretation of the regression model. We will also look at and interpret a dummy variable regression and a trendline regression. Later modules will explore a multi-variate approach with several independent variables and more sophisticated methods of validating a regression model.

• Understand how to use regression with a single independent variable and a dummy variable; introduce a trendline analysis of data over time

R

BASICS OF REGRESSION Regression fits a linear function to a set of data. It requires that we have a single dependent variable that we are trying to model, explain, or understand. The dependent variable must be a quantitative variable (preferably measured on a continuous level). Regression estimates a set of coefficients that represent the effect of a single variable or a set of independent variables on the dependent variable. The independent variables can be measured on a qualitative level (as in categorical variables represented by dummy variables), an ordinal level, or at a continuous level.

• Understand the estimated equation of a line, the coefficients from regression, and the nature of the error term • Understand how to interpret regression coefficients and make a predicted value based on the model.

In this Module We Will: • Run a regression using Excel for a bivariate relations (single independent variable) • Look at and interpret the ANOVA table in regression • Estimate a regression model using a dummy variable and a trendline.

Key Objectives For more information, contact: Tom Ilvento

213 Townsend Hall, Newark, DE 19717

302-831-6773

[email protected]

Using Statistical Data to Make Decisions: Introduction to Regression

Page 2

A Quick Example. I recently went to Europe and had to deal with temperatures in Celsius. If you have had this experience and forget the formula for conversion, you can get frustrated. However, I remembered a friend once told me a quick “rule of thumb” was to double C and add 30 to find out an estimate of the Fahrenheit temperature. I used my calculator to convert several C values to F and then used these in a regression equation in Excel. The plot of the data are given below (Figure1). As you can see from the plot, the relationship is a perfect positive linear relationship. We would expect the correlation between these two variables to be 1.0 and that is confirmed by the R2 of 1 on the plot. Notice that the best fitting line is: y = 32 + 1.8x Fahrenheit vs Celsius 100

y = 1.8x + 32

Fahrenheit

80

2

R =1 60 40 20 0 -20

-10

0

10

20

30

40

Celsius

Figure 1. Scatter Plot of Fahrenheit versus Celsius Temperature

The question is, will regression in Excel give us an estimate of the linear function. I regressed Fahrenheit on Celsius using Excel, which produced the following output (see Table 1). The output in table is only part of the output from Excel. We will explore this output in detail, but for right now I want you to focus on a few things. In the top is the Multiple R and R Square and both are one. These indicate that the model fit the data perfectly. This will be rare in regression analysis. The other thing I want you to focus on is the bottom of the output, first column, where it gives the coefficients for the Intercept and for C. The values represent the intercept (a value of 32) and slope coefficient (a value of 1.8) for an equation of a line (F stands for Fahrenheit and C for Celsius): F = 32 + 1.8C

If there is a linear relationship in the data, regression will find it and estimate it

Using Statistical Data to Make Decisions: Introduction to Regression

Page 3

Table 1. Partial Regression Output from Excel for the Temperature Example. SUMMARY OUTPUT Regression Statistics 1 Multiple R 1 R Square Adjusted R Square 1 7E-05 Standard Error 18 Observations ANOVA df Regression Residual Total

1 16 17

Intercept C

Coeff 32 1.8

SS 12372.94 0.00 12372.94

MS 12372.94 0.00

Std Error t Stat 0.00 1519489.65 0.00 1567412.9

Notice that this equation is identical (as is R Square) to the trendline option in the Excel scatter plot. This confirms that the trend line option is in fact a bi-variate regression of one variable on the other. You state this as, we regress Y on X (or a set of X variables). I should also point out that regression coefficient for the slope is a measure of association, but unlike a correlation coefficient, it is not symmetrical. While R Square remains the same. the equation for regressing Celsius on Fahrenheit is: C = -17.78 + .56 F So, regression did fit a linear function to the data, and the resulting equation is in fact the actual formula for converting from Celsius to Fahrenheit. A key point to take away - if there is a linear function to the data, regression will find and estimate it. If the relationship is not perfect, regression will find the “best fitting” line to the data (we will define best fitting later). My friends advice about converting C to F temperature was not perfect, but it was reasonable. It is easy to add 30 and multiply by 2 to get an approximation of the Fahrenheit temperature. The Equation of a Line. If you remember back to introductory algebra, the equation of a line was given as: Y = a +bX

The equation of a line has an intercept and a slope coefficient.

Using Statistical Data to Make Decisions: Introduction to Regression Where a is the intercept and b is the slope. The slope coefficient (b) represents the change in Y for a unit change in X. The intercept (a) represents the value of Y when X equals zero (i.e., the line passes through the X axis). In this format we specify a dependent variable Y, and independent variable X. We can have more than one independent variable and we would use the following subscripts. Y =a + b1X1 + b2X2

Page 4

The slope coefficient (b) represents the change in Y for a unit change in X. The intercept (a) represents the value of Y when X equals zero.

The equation of a line is a deterministic model - it fits the data perfectly. In reality, we rarely will estimate a model that fits the data perfectly. We will shift to a probabilistic model which contains a deterministic component and a random error term (,i1) that represents the difference between what we observe in the data and what we estimate with our model. I will use Greek terms ($o and $1) to represent the coefficients in the equation and begin to think of this as representing a population.

Yi = $o + $1 Xi1 + ,i1 Our expectation of Y is the deterministic component, but we will recognize that the model won’t always fit the data perfectly. We use the following to represent the expectation. E(Yi) = $o + $1 Xi1 The error component is very important concept and links regression to the variance for the mean and the computation of a standard error from a sampling distribution. The following equations show the link to the error term. Notice that even in the population our model may not fit the data perfectly.

Yi = β 0 + β1 X i1 + ε i ) Yi = β o + β 1 X i1 )

ε i = Yi − Yi

What we observe in the population What we predict from our model The error is the difference between what we observe and what we predict

We have seen a similar error term when dealing with the mean, the variance, and a standard error. Consider a simple model where our prediction of a variable is equal to the population mean and an error term. The following set of equations walk you through to the variance.

In a probabilistic model we will work with an error term, even when dealing with the population

Using Statistical Data to Make Decisions: Introduction to Regression

Page 5

Equations Showing the Error Term in a Model Using the Mean Leads Toward the Variance Yi = : + ,i

Simple prediction model based on the mean

,i = Yi - :

Deviations about the mean

G,i2 = G(Yi - :)2

Sum of Squared deviations

G,i2/n = G(Yi - :)2/n

Mean squared deviation

G,i2/n = F2

Population Variance

So, just like the model for the mean, the error term in regression helps us determine a variance around the fitted regression line that will ultimately lead us toward a standard error for our estimates and a way to make an inference. In regression we assume equal variances for Y (dependent variable) across each level of X (independent variable). In essence, we will pool the measure of the variance in regression using information from each level of the independent variable(s). A Second Example: Catalog Sales Regressed on Customer Salary. The following example involves data we have already worked with in Module 3, catalog sales and the salary of the customer. For this example we will use SALES as the dependent variable and we want to explain the sales as a function of the SALARY of the customer. The scatter plot of the data are given in Figure 2. The correlation between these two variables is .70 which indicates a strong positive relationship between the two variables. As the salary of the customer increases so does the amount of catalog sales. However, the relationship is not a perfect one, and there is scatter around the regression line. You might notice that as Salary increases there appears to be more scatter around the regression line. This is called heteroscadasticity and is a potential problem we might want to examine further. The regression output is given in Table 2. The multiple R is the same at the correlation between the two variables. With a bivariate regression (one independent variable) the Multiple R will always be the absolute value of the correlation between the variables. The equation of the line is given by the coefficients is the regression output. Est SALES = -15.332 + .022*SALARY

The error term in regression provides a measure of variance around the fitted regression line that will ultimately lead us toward a standard error and a way to make an inference

Using Statistical Data to Make Decisions: Introduction to Regression

Page 6

Sales

Figure 2. Scatter Plot of Sales versus Salary $7,000 $6,000 $5,000 $4,000 $3,000 $2,000 $1,000 $0 $0

$50,000

$100,000

$150,000

$200,000

Salary

I usually am careful to note that we have an estimate of Y it is a probabilistic model with a random error component. The intercept is negative which is outside the range of our data and doesn’t make a lot of sense. However, we still need the intercept to make predictions with our model. Once we fit a line to the data, we can make predictions from out equation. I call this “solving the equation” based on the predicted model. If SALARY is $50,000, what is our estimate of SALES? Est SALES = -15.332 + .022*($50,000) Est SALES = $1,115.33 Our model says that the estimated catalog sales of someone with a salary of $50,000 will be $1,115. We can

Table 2. Regression of Catalog Sales on Customer Salary Regression Statistics Multiple R 0.700 R Square 0.489 Adjusted R Square 0.489 Standard Error 687.068 Observations 1000 ANOVA df Regression Residual Total

1 998 999

Intercept Salary

Coeff -15.332 0.022

SS MS 451624335.68 451624335.68 471117860.07 472061.98 922742195.74 Std Error 45.374 0.001

F 956.71

Signif F 0.00

t Stat P-value Low 95% CI High 95% CI -0.338 0.736 -104.373 73.708 30.931 0.000 0.021 0.023

Using Statistical Data to Make Decisions: Introduction to Regression also say that a dollar increase in salary will result in $.022 increase in sales. I could also express this as a $1,000 increase in salary results in a $22 increase in catalog sales.

Page 7

When making predictions, care should be taken when making estimates outside the range of the data.

When making predictions, care should be taken when making estimates outside the range of the data. The relationship may or may not hold for these values. For example, we could estimate the sales for a millionaire as: Est SALES = -15.332 + .022*($1,000,000) Est SALES = $22,015 However, it would not be wise to put a lot of faith in this estimate. The relationship between SALES and SALARY might change at lower or higher values. For example, the relationship might be nonlinear or curved at the extremes.

THE LEAST SQUARES APPROACH How do we find a “best fitting line,” or perhaps more to the point, what is a best fitting line? In regression, we will use the property of Least Squares to find estimates for B0 and B1 that will minimize the squared deviations about the fitted line. The “Best Fit” means the squared differences between the actual Y values and the predicted Y values are a minimum. These squared differences are referred to as the Sum of Squared Errors (SSE). The following equations show this minimum function. n n ) ) SSE = ∑ (Yi − Y ) 2 = ∑ ε 2 = minimum i =1

i =1

The Least Squares approach generates a set of coefficients that minimizes the Sum of the Squared Errors (SSE). This is done with matrix algebra and need not concern us for this course. However, we can show the formulas for the bi-variate regression coefficients just to emphasize the connection of this approach with covariance, correlation, and the variance.

)

β1 =

SS XY SS X

)

)

β 0 = Y − β1 X

where SS XY = ∑ ( X i − X )(Yi − Y )

We will use the property of Least Squares to find estimates for B0 and B1 that will minimize the squared deviations about the fitted line.

Using Statistical Data to Make Decisions: Introduction to Regression The formula for the intercept term shows that the regression line will always go through the means of the dependent and independent variables. The formula for the slope coefficient shows the connection with covariance, and that regression coefficients are not symmetric - the coefficient for SALARY regressed on SALES would have SSy in the denominator. The least squares approach can also be seen from the following graph (Figure 3). The numerator of the variance of Y, also known as the Total Sum of Squares (TSS) can be broken down into a part that is explained and a part that is unexplained. The strategy is to compare the regression line as a predictor of Y to the mean as a predictor of Y. The one dot represents a value of Y as an observation. If we had a very simple model as a predictor, just the mean, we would have a lot of error associated with our model (equal to the TSS). TSS is measured by

TSS =

∑ (Y − Y )

2

i

However, the regression line improves the fit, and the improvement is measured as the explained or Regression Sum of Squares (RSS). This is equal to:

RSS =

∑ (Y$ − Y )

2

and,

Regression coefficients are not symmetric - it matters which variable is considered the dependent variable.

The regression model expresses the total variation in the dependent variable (Y) as a part due to the regression model and a part that is due to the error term. Another way to express this is a part tht is “explained” and a part that is “unexplained.”

i

The final portion is the residual. This is the difference between the actual value and the prediction from the regression equation. This part is referred to as the SSE or the residual sum of squares.

SSE =

Page 8

∑ (Y

i

− Y$i ) 2

TSS = RSS + SSE

TSS = RSS + SSE

Y

Yi Total sum of squares (Yi -Y)2

Unexplained sum ^ of squares (Yi - Yi)2 Yˆi = βˆ0 + βˆ1 X i

Explained sum of ^ -Y)2 squares (Y i

Y Xi

This equality is shown in the Analysis of Variance Table (ANOVA) in the output. The equality is expressed as:

X

Figure 3. Breakdown of Total Sum of Squares into Parts Explained and Unexplained.

Using Statistical Data to Make Decisions: Introduction to Regression

Page 9

USING EXCEL TO ESTIMATE REGRESSION Excel has the capability to run a regression for simple or complex models. Add in Programs also can provide additional tools for advanced regression techniques. Here are some key things about regression to remember. Regression in Excel requires that the data be organized in columns, with the independent variables in contiguous columns. If you have a large data set you might need to copy the data you want to work with to a new workbook. Regression also requires the same number of observations for all variables in the regression. If there are missing values for any variable in the analysis, these observations will be removed from the analysis. This is called “pair-wise” deletion. The following are the suggested steps when using Excel for regression. •

Organize data in columns

•

First column should contain Y (dependent variable)

•

Remaining Columns should contain contiguous Xs (independent variables)

•

TOOLS Data Analysis Regression

•

Specify Y variable

•

Specify X variables – need to be contiguous columns

•

Remember to specify if first row has labels

•

Specify where the output should go - I recommend a new worksheet

•

I generally modify the output for presentation •

How many decimal places are showing (3 to 4)

•

Change Headings to make them fit

•

Bold Headers

Excel can estimate the regression equation and produce the output very quickly, even for a large data set it may only take a few seconds. When modifying the output, you should decide how many decimal places to use, depending upon the nature of the data. I would recommend at least two decimal places, but in some cases more may be warranted. Next we will examine the regression output in detail.

Excel will estimate a regression equation for you data using: TOOLS DATA ANALYSIS REGRESSION

Using Statistical Data to Make Decisions: Introduction to Regression Excel Regression Output. We will use a new example from the catalog sales data to examine the output from Excel in detail. Excel’s output for regression is similar to many other software packages, although there are some terms that have slightly different names. However, once you become comfortable with one software package you will be able to read and interpret the output from other packages. For this example we will examine a regression of SALES on CATALOGS (the number of catalogs mailed to the household. The correlation between these two variables is .473. The scatter plot between the two variables is given in Figure 4. Sacatter Plot of Sales versus Catalogs $7,000 $6,000

Sales

$5,000 $4,000 $3,000 $2,000 $1,000 $0

0

5

10

15

20

25

30

Catalogs M ailed

Figure 4. Scatter Plot of SALES versus CATALOGS I will examine the regression output in parts to emphasize each component. The first part of the output contains the Regression Statistics and provides some summary measures of the Regression and are labeled as Regression Statistics. These include: Multiple R – in a bivariate regression this is the absolute value of the correlation coefficient |r|. In a multivariate regression it is the square root of R2 R-Square – a measure of association that gives us an indication of the fit of the model. R Square ranges from 0 (nothing explained by the model) to 1 (a perfect fit). Like correlation it is based a linear fit of the data. SUMMARY OUTPUT Regression Statistics 0.473 Multiple R 0.223 R Square 0.223 Adjusted R Square 847.375 Standard Error 1000 Observations

Page 10

Using Statistical Data to Make Decisions: Introduction to Regression

Page 11

Adjusted R Square – R Square will always increase as you add independent variables to a model. To adjust for this to make it a better indicator of the fit of a model, the adjusted R Square adjusts the figure for the number of independent variables in the model. Standard Error – This is an overall standard error of the model and is used in calculating the standard error of the coefficients in the model. The standard error is the square root of the MSE, which will be discussed in a later section. Observations – the number of observations in the model. The second part of the output is known as the ANOVA Table. ANOVA stands for ANalysis Of VAriance and it contains the breakdown of the Total Sum of Squares (TSS) into a part due to the regression model (RSS) and a part due to the residuals or error term (SSE). It provides an overall significance test for the model based on a sample of data. We will also have to deal with degrees of freedom. Regression Sum of Squares (RSS) - The sum of squares due to the fit of the model. The degrees of freedom for regression is equal to the number of independent variables in the model and is denoted by k. The Mean Square due to Regression in the next column is equal to the Sum of Squares divided by the degrees of freedom. Residual or Sum of Squares Error (SSE) - this is the part of the Total Sum of Squares that is unexplained by the model. The degrees of freedom for the SSE is equal the sample size (n) minus 1 minus the degrees of freedom for regression: n - 1 - k. The Mean Square Error (MSE) is equal to the SSE divided by its degrees of freedom. The MSE is the variance of the model. Total Sum of Squares (TSS) - This is the numerator of the Variance, the total squared deviations of each value from the mean. The degrees of freedom for TSS is n - 1.

ANOVA df Regression Residual Total

1 998 999

SS MS 206133280.82 206133280.82 716608914.92 718045.00 922742195.74

F 287.08

Signif F 0.00

Using Statistical Data to Make Decisions: Introduction to Regression

Page 12

F - The F-value is the ratio of two variances. In this case it is the ratio of The Mean Square due to Regression divided by the Mean Square Error. The F- distribution is a probability distribution with two degrees of freedom. The essence of the test is whether the ratio is nearly equal to one in a probability sense. A ratio of one (or close to one based on a sample) would imply that the model was a poor fit and there is no relationship of any of the independent variables with the dependent variable. Significance F - The significance level associated with the F-value is the level of alpha to reject a null hypothesis that the model is a poor fit (all the coefficients for the independent variables are equal to zero). In most cases we are looking for a significance level of less than .05 in order to conclude that the model has something to offer in terms of explaining the dependent variable. The last part of the Excel regression output gives the coefficients estimated by the regression model; a standard error for each coefficient so we can make an inference in a hypothesis test or confidence interval; a t-test based on a null hypothesis that the coefficient is really zero; and the significance level of this test. The Coefficients - the regression estimates an intercept term and a slope coefficient(s) for the model. Excel assumes these are based on a sample. For this model the intercept is 209.77 and the slope is 68.587. This means that if no catalogs are sent we expect sales of $209.77, and for each catalog shipped we expect additional sales of $68.59. The intercept may not always make sense by itself (it may be out of the range of the data). Standard Error - the standard error is the standard deviation of the sampling distribution for the regression coefficients. Just like with the mean, it is the spread of a repeated samples of size 1000 used to estimate the same model. Once again we will rely on the statisticians to calculate the standard error. All we need to do is to know how to use the standard error in a hypothesis test or a confidence interval. t-Statistic - the t-statistic is based on a null hypothesis that the coefficient is really zero. Particularly in the case of the slope coefficient (in this case for CATALOGS), a slope

Intercept Catalogs

Coeff 209.770 68.587

Std Error 65.195 4.048

t Stat P-value Lower 95% Upper 95% 3.218 0.001 81.835 337.705 16.943 0.000 60.644 76.531

Using Statistical Data to Make Decisions: Introduction to Regression

Page 13

of zero implies there is no linear relationship between CATALOGS and SALES - no matter how many catalogs are sent the level of SALES will not change. The t-statistic for the coefficient for CATALOGS is based on the following calculation: t = (68.587 - 0)/4.048 = 16.943 Zero is included in the calculation because this is the null value that Excel uses - we are testing to see if our coefficient is significantly different from zero. The denominator is the standard error of our estimate. The meaning of this result is that we have taken a random sample of 1,000 people and found a value for CATALOGS that is so different from zero (it is over 16 standard deviations away from zero) that we can safely conclude that there is a relationship between the number of catalogs sent and the volume of sales. P-value - The p-value shows the level of significance for the test. It is based on a two-tailed test and shows the probability of finding a sample with an estimated slope given from the model when the real value is zero. A low pvalue (usually less than .05) means we can have confidence in rejecting the null hypothesis and in concluding there is a relationship. Confidence Intervals - Excel also puts a confidence interval around our estimates. You can specify the level of the confidence interval, but the default is a 95% C.I. Excel will give the lower and upper bound of the interval. In our example, the confidence interval for the coefficient for CATALOGS (which is 68.587) is 60.644 to 76.531. We are implying the true population value is somewhere between these two values, based on a sampling distribution and repeated calculations of a 95% C.I. The confidence interval is based on the following formula: b ± t"/2, n-1 d.f. * Std error A Little More on R2. R Square (R2) ends to get a lot of attention in a regression model. It provides a summary measure of the linear fit of the model - it shows us how good is the fit. It is a PRE measure of association (Proportional Reduction in Error) and actually compares our model from regression to a simpler model based on only the mean level of the dependent variable as a predictor. In essence it says how much better our model is in predicting the dependent variable when compared to using the mean. R2 ranges from zero to one.

R Square provides a summary measure of the linear fit of the model - it shows us how good is the fit. It is a PRE measure of association.

Using Statistical Data to Make Decisions: Introduction to Regression

Page 14

There are several ways to calculate R2 from the ANOVA table. R2 = SSR/SST R2 = 1 – SSE/SST The popular interpretation of R2 is how much we explain of the dependent variable by knowing something about the independent variable(s). If our R2 is high we can feel confident that our model fits the data quite well. However, the fact that R2 is low does not mean that our model is not useful. For some types of data (such as time series) we will expect R2 to be quite high because data over time is highly correlated with itself. For other types of data, such as cross-sectional data of people, R2 will tend to be low because there is a lot of variability in the data. HYPOTHESIS TESTS FOR REGRESSION There are two main tests that we focus on in regression, the overall F-test for the model and individual tests for the coefficients in the model (t-tests). Both of these tests are based on the notion of a sample of data being used to estimate a model and parameters from a population. They both use knowledge of a sampling distribution (either the F-distribution or a t-distribution) to help make the inference. This section will briefly walk through each test in more detail. The F-test is a general test of the whole model. The F-test uses a null hypothesis none of the independent variables are significantly different from zero, implying nothing is going on in the model. It can be used for a multi-variate analysis with many independent variables or in the bivariate case with one independent variable. If there is only one independent variable, the F-Test = (t-test)2. As noted earlier, The F-value is the ratio of The Mean Square due to Regression divided by the Mean Square Error (se formula below). The F- distribution is a probability distribution with two degrees of freedom (from the numerator and denominator). The essence of the test is whether the ratio of the variances is nearly equal to one.

F=

MS Regression MS Residual

The popular interpretation of R2 is how much we explain of the dependent variable by knowing something about the independent variable(s). If our R2 is high (i.e., closer to 1.0) we can feel confident that our model fits the data quite well.

Using Statistical Data to Make Decisions: Introduction to Regression The hypothesis test for the F-test takes the following form. Null Hypothesis

H0: b1 = b2 = bk = 0

Alternative Hypothesis

Ha: at least one bi … 0

Test Statistic

F = MSRegression/MSResidual

We make a conclusion if the F-value from our test statistic is greater than the F-value from a distribution with degrees of freedom from the numerator and denominator of the test statistic at a specified level of alpha (the chance of being wrong in rejecting the null hypothesis). We usually use an alpha level of .05. In most cases we simply look at the p-value of the test given on the output to see if it is less than the alpha level we are willing to accept. If the p-value is sufficiently small (e.g., less than .05) we can conclude that at least one of the coefficients for the independent variables are different from zero - something is going on in the model that helps explain the dependent variable. The second test is individual test for the coefficients estimated in the model. In most cases we will not be interested in the intercept term. However, we will have strong interest in the slope coefficient(s). The heart of the matter is whether we have evidence of a linear relationship between an independent variable (X) and the dependent variable (Y). This will involve a test to see if the slope coefficient is significantly different from zero, because of slope of zero means no relationship - no matter how much X changes it will not influence a change in Y. The hypothesis test for a regression (or slope) coefficient takes the following form. Null Hypothesis

H0: bi = 0 (No Linear Relationship)

Alternative

Ha: bi … 0 (Linear Relationship)

Test Statistic

t = (bi - 0)/Std Error

If the test statistic is sufficiently large (e.g., more than 2 standard deviations away) or the p-value is sufficiently small (usually p < .05) we can reject the null hypothesis and conclude there is a linear relationship between the two variables. There are several key points to remember about this test. First, the hypothesized value need not always be zero. Although a test to see if the coefficient is different from zero makes a great deal of sense, we could test to see if our coefficient is different from any value. For example, we might want to see if it is different from 1.

Page 15

Using Statistical Data to Make Decisions: Introduction to Regression

Page 16

Second, the alternative hypothesis could be a one or twotailed test. Most computer output for regression will assume a two tailed test. If you had prior expectations of the direction of the relationship (e.g., it is positive) you should conduct a one-tailed test. In this case you can divide the p-value in half when making a conclusion on a one-tailed test. Finally, it is important to remember that a significance test using a p-value does not mean that the relationship is important or even useful in making decisions. It simply means that you found a relationship and it didn’t occur by chance. Only your business knowledge or prior experience can tell if a finding is important. Do not confuse statistical significance with substantive importance - they are not one in the same. DUMMY VARIABLE REGRESSION A special case in regression is when the independent variable is represented by one or more dummy variables. We noted earlier that a dummy variable takes on one of two values - a zero for the absence of an attribute and one if the attribute is present. For any categorical variable with j categories, we can represent it in regression with j-1 dummy variables. One category must be left out of the regression, but its value will be captured in the intercept term. This category is referred to as the reference category. Let’s look at an example to see how this works, how to test for dummy variables, and how to interpret the output. In the catalog data set the original Age variable had three categories - 1 if the customer was less than 31; 2 if the customer was between 31 and 55; and 3 if the customer was over 55 years old. We could think of this variable as at least ordinal, but we would be very unsure of the difference between the levels of age - is the movement from 1 to 2 (less than 31 to 31 to 55) the same as the movement from 2 to 3 (31 to 55 compared with 55 and older)? I think a better strategy is to convert this variable into two dummy variables: AGE1 = 1 if less than 31 years old, zero for all else; AGE2 = 1 if 31 to 55, zero for all else. The left out category is customers 55 and older and is often referred to as the Reference Category. The correlations between AGE1 and AGE2 with SALES are .348 and -.435,respectively. I estimated a regression equation with both dummy variables included. Since we are really representing one variable in the equation (with two dummy variables) I still think of this as a bi-variate regression.

Dummy variables represent categorical variables and can be used as independent variables in a regression.

Using Statistical Data to Make Decisions: Introduction to Regression

Page 17

Table 3. Regression of SALES on AGE1 and AGE2 Regression Statistics Multiple R 0.436 R Square 0.190 Adjusted R Square 0.188 Standard Error 865.985 Observations 1000 ANOVA df Regression Residual Total

Intercept AGE1 AGE2

SS MS 2 175061377.66 87530688.83 997 747680818.08 749930.61 999 922742195.74

Coeff 1432.13 -873.51 69.56

Std Error 60.48 79.19 71.66

The intercept term is 1,432.13 which we will interpret as the mean level of SALES for the reference category (people aged 55 and older). The coefficient for AGE1 (people under age 31) is -873.51 and the coefficient for AGE2 (people 31 to 55) is 69.56. The interpretation of these coefficients is the difference in sales of these age groups from the reference category. If we solve the regression equation for each group it will make these values clearer. Since we have dummy variables, the values for the slope coefficients are either zero or one depending on the age category.

est SALES = 1432.13 - 873.51(0) + 69.56(0) est SALES = 1432.13

Sig F 0.00

t Stat P-value Lower 95% Upper 95% 23.68 0.00 1313.44 1550.82 -11.03 0.00 -1028.91 -718.11 0.97 0.33 -71.06 210.17

Let’s look at the output briefly. The R2 for the model is .19, indicating about 19 percent of the variability in SALES is explained by knowing something about the age of the customer. The F-value is 116.72 and the p-value associated with it is .00 indicating that something is going on in the model that helps to explain SALES (we would expect one or both of the dummy variables to be significantly different from zero).

For those 55 and over, the equation is:

F 116.72

Solving the regression equation for dummy variables helps to emphasize how to interpret the coefficients.

Using Statistical Data to Make Decisions: Introduction to Regression

Page 18

For those under 31, the equation is: est SALES = 1432.13 - 873.51(1) + 69.56(0) est SALES = 558.62 For those 31 to 55, the equation is: est SALES = 1432.13 - 873.51(0) + 69.56(1) est SALES = 1501.69 You can see from the equations that the coefficients for AGE1 and AGE2 represent the difference in the mean level of sales from those aged 55 and over (the Reference Group) for the two age groups. The significance tests for the coefficients for AGE1 and AGE2 will give us a test if the mean level of sales of each group is significantly different from the reference group. The test statistic and p-value for AGE1 is -11.03 and .00, respectively, indicating we have strong evidence that younger customers (those under age 31) spend less than those over age 55. The test statistic and p-value for AGE2 tells a different story (.97 and .33, respectively). In this case we cannot conclude there is any difference in sales between the group aged 31 to 55 and the age group over 55. While the coefficient is positive and indicates a slightly higher level of sales for the 31 to 55 group, we cannot conclude that this coefficient is any different than zero. That is the meaning of our test. Dummy variable regression must always leave out one category in order to complete the regression, otherwise there will be a perfect linear relationship between the independent variables and Excel will not be able to estimate the coefficients. However, it is somewhat arbitrary which category we use as the reference category. I could have created AGE3 for those 55 and older and included AGE2 and AGE3 in the model. In this case the R2 and the F-statistic would be identical as the previous model. However, the intercept would be 558.62 (the mean level of sales for the youngest age group) and the other coefficients would reflect the difference in means compared to the youngest age group. Be careful in how you interpret the coefficients in a dummy variable regression - focus on the reference category.

You can see from the equations that the coefficients for AGE1 and AGE2 represent the difference in the mean level of sales from the Reference Group.

Dummy variable regression must always leave out one category in order to complete the regression, otherwise there will be a perfect linear relationship between the independent variables.

Using Statistical Data to Make Decisions: Introduction to Regression

Page 19

LINEAR TREND ANALYSIS A very basic approach to time series data (data that is measured over discrete time periods) is to conduct a linear trend analysis. In this approach a regression model is estimated using the dependent variable measured over time regressed against the time periods as the independent variable. The independent variable could be years, months, days, or simply a number sequenced from 1 to the number of observations. This approach provides a simple and direct way to model the trend of the data as a linear function and can be used to make predictions or forecasts into the future based on the past trends. Let’s look at an example to be clear on this approach. I have a small data set on U.S. manufacturing sales from 1970 to 1991. The dependent variable is SALES in billions of dollars and the independent variable is the year. The data are given in the table to the right.

YEAR 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991

SALES 52.805 55.906 63.027 72.931 84.79 86.589 98.797 113.201 126.905 143.936 154.391 168.129 163.351 172.547 190.682 194.538 194.657 206.326 223.541 232.724 239.459 235.142

I can easily graph the data using Excel to show the linear trend. Excel will also allow me to add a trendline and display the trendline equation and R Square for this model. As noted earlier, the options for Trendline in Excel is simply a regression model based on the Y and X variables specified in the Chart options. This graph is given below (see Figure 5). We can see that the line shows a strong linear trend and R Square is close to 1.0 (R2 = .988). However, we can also see that there are some slight curves to the data and we will explore a nonlinear approach to fitting this data in the next module.

Regression can be used to estimate a linear trend to data over time.

In a linear trend analysis, the independent variable is the time period, given as the year, month, day, or simply a linear sequence of 1 to the number of observations.

Using Statistical Data to Make Decisions: Introduction to Regression

U.S. Manufacturing Sales, 1970 to 1991

Billions $

$300 $250 $200

y = 9.5616x - 18788 2 R = 0.988

$150 $100 $50 $0 1968

1973

1978

1983

1988

1993

Year

Figure 5. Linear Trend Example Using U.S. Manufacturing Sales Data, 1970 to 1991

I have also included the regression output for this model to show the details of the ANOVA Table and the hypothesis tests (see Table 4). In terms of the analysis, SALES is the dependent variable and YEAR is the independent variables. The model fits the data quite well based on an R Square of .988. The F-value for the model is very large - 1,648.46 - indicating that the trend is significant (p < .01). The coefficient for YEAR is 9.56 indicating that sales increased by $9.56 billion each year. This coefficient is clearly significant as indicated by a t-value of 40.60 and a p-value < .01 for a Null Hypothesis test that the coefficient is really zero.

Table 4. Trendline Regression Output Regression Statistics 0.994 Multiple R 0.988 R Square 0.987 Adjusted R Square 7.008 Standard Error 22 Observations ANOVA Regression Residual Total

Intercept YEAR

df 1 20 21

SS 80955.50 982.19 81937.69

Coef. Std Error -18787.84 466.41 9.56 0.24

MS 80955.50 49.11

F 1648.46

t Stat P-value -40.28 0.00 40.60 0.00

Sig. F 0.00

Page 20

Using Statistical Data to Make Decisions: Introduction to Regression While care should be taken to expand this trend too far into the future, forecasts could be made for future sales based on the model by solving the equation for a particular year. We would simply plug the year into the equation and solve for our forecast. The forecast for 1995, assuming the trend would continue, is: Estimated Sales = -18,787.84 + 9.56(1995) Estimated Sales = $287.478

CONCLUSIONS This module provided a brief introduction to regression. We looked at the basic regression model that fits a linear function of the relationship between a dependent variable and an independent variable. We showed how the regression line is estimated, the meaning of the coefficients and how we can solve for predicted values, and even how we make inferences from a sample to a population. Excel can estimate a regression equation and we looked at and interpreted the output from a regression using Excel. Finally, we looked at two unique examples of bi-variate regression - dummy variable regression and a linear trend analysis. In dummy variable in regression we entered a variable that denoted membership isn a group which was coded as a zero or one. We noted that this approach gives predicted mean levels of the dependent variable for each group. We also used regression in a trend analysis using the time period as the dependent variable tpo produce a simple trend analysis that can be used to make a forecast into the future. The next module will provide details on the regression model including the use and interpretation of a multivariate model - when there is more than one independent variable. The multi-variate model is a very powerful tool because it provides us with a means of assessing the effects of independent variables while controlling for other variables.

Page 21