Multiple Regression Analysis
Author: Horace Mason

Multiple Regression Analysis
y = β0 + β1x1 + β2x2 + ... + βkxk + u
1. Estimation

Introduction
- The main drawback of simple regression is that, with only one RHS variable, it is unlikely that u is uncorrelated with x.
- Multiple regression allows us to control for those "other" factors.
- The more variables we have, the more of y we will be able to explain (better predictions).
- e.g. Earn = β0 + β1Educ + β2Exper + u allows us to measure the effect of education on earnings holding experience fixed.

Parallels with Simple Regression
- β0 is still the intercept.
- β1 to βk are all called slope parameters.
- u is still the error term (or disturbance).
- We still need a zero conditional mean assumption, so now assume that E(u|x1, x2, ..., xk) = 0: other factors affecting y are not related on average to x1, x2, ..., xk.
- We are still minimizing the sum of squared residuals.

Estimating Multiple Regression
- The fitted equation can be written:
  ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk
- We want to minimize the sum of squared residuals:
  Σ (yi − β̂0 − β̂1xi1 − β̂2xi2 − ... − β̂kxik)²
- We now have k+1 first-order conditions:
  Σ (yi − β̂0 − β̂1xi1 − β̂2xi2 − ... − β̂kxik) = 0
  Σ xi1(yi − β̂0 − β̂1xi1 − β̂2xi2 − ... − β̂kxik) = 0
  Σ xi2(yi − β̂0 − β̂1xi1 − β̂2xi2 − ... − β̂kxik) = 0
  etc.
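The k+1 first-order conditions above are exactly the "normal equations," and stacking the regressors into a matrix X (with a column of ones for the intercept) lets a computer solve them directly. A minimal numpy sketch on simulated data (the coefficients 1, 2, −3 and the seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)  # true betas: 1, 2, -3

X = np.column_stack([np.ones(n), x1, x2])  # column of ones carries the intercept
# The k+1 first-order conditions in matrix form are the normal equations:
#   X'(y - X beta_hat) = 0,  i.e.  X'X beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

u_hat = y - X @ beta_hat
# At the minimum every FOC holds: the residuals are orthogonal
# to the constant and to each regressor.
print(beta_hat)
print(X.T @ u_hat)  # all entries numerically zero
```

The printed orthogonality vector is the set of first-order conditions evaluated at the estimates.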

Interpreting the Coefficients
ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk
- We can obtain the predicted change in y given changes in the x variables:
  Δŷ = β̂1Δx1 + β̂2Δx2 + ... + β̂kΔxk
- The intercept drops out because it is a constant, so its change is zero.
- If we hold x2, ..., xk fixed, then Δŷ = β̂1Δx1.

Interpreting the Coefficients
- Note that this is the same interpretation we gave β̂1 in the univariate regression model: it tells us how much a one-unit change in x1 changes y.
- In this more general case, it tells us the effect of x1 on y holding the other x's constant. We can call this a ceteris paribus effect.
- Of course β̂2 tells us the ceteris paribus effect of a one-unit change in x2, and so forth.


Ceteris Paribus Interpretation
- Imagine we rerun our education/earnings regression, but include lots of controls on the RHS (x2, x3, ..., xk).
- If we control for everything that affects earnings and is correlated with education, then we can truly claim that "holding everything else constant, β̂1 gives an estimate of the effect of an extra year of education on earnings."
- Because if everything else is controlled for, the zero conditional mean assumption holds.

Ceteris Paribus Interpretation
- This is a very powerful feature of multiple regression.
- Suppose only two things determine earnings: education and experience.
- In a laboratory setting, you would pick a group of people with the same experience (perhaps average experience), vary education across them, and look for the resulting differences in earnings.
- This is obviously impractical/unethical.

Ceteris Paribus Interpretation
- If you could not conduct an experiment, you could go out and select only people with average experience and then ask them how much education they have and how much they earn.
- Then run a univariate regression of earnings on education.
- It is inconvenient to have to pick your sample so carefully.
- Multiple regression allows you to control (almost as if you were in a lab; more on that qualifier later in the course) for differences in individuals along dimensions other than their level of education.
- Even though we have not picked a sample holding experience constant, we can interpret the coefficient on education as if we had held experience constant in our sampling.

A "Partialling Out" Interpretation
- Suppose that x2 affects y and that x1 and x2 are correlated. This means the zero conditional mean assumption is violated if we simply estimate a univariate regression of y on x1 (convince yourself of this).
- The usual way to deal with this is to "control for x2" by including x2 on the right-hand side (multiple regression). In practice, this is how we will usually deal with the problem.
- In principle, there is another way. If we could come up with a measure of x1 that is purged of its relationship with x2, then we could estimate a univariate regression of y on this "purged" version of x1. x2 would still be contained in the error term, but it would no longer be correlated with x1, so the zero conditional mean assumption would not be violated.

A "Partialling Out" Interpretation
- Consider the case where k = 2, so if we do multiple regression we estimate
  yi = β0 + β1xi1 + β2xi2 + ui
- The OLS estimate of β1 can be written
  β̂1 = Σ r̂i1 yi / Σ r̂i1²
  where the r̂i1 are residuals from the estimated regression of x1 on x2:
  x̂i1 = δ̂0 + δ̂2xi2,   r̂i1 = xi1 − x̂i1
- Equivalently, β̂1 is the slope estimate from the univariate regression
  yi = α0 + α1r̂i1 + εi
  (the OLS residuals r̂i1 have mean zero, so demeaning them changes nothing).

"Partialling Out" Continued
- This implies that we estimate the same effect of x1 either by:
  1. regressing y on x1 and x2 (multiple regression), or
  2. regressing y on the residuals from a regression of x1 on x2 (univariate regression).
- These residuals are what is left of the variation in x1 after x2 has been partialled out; they contain the variation in x1 that is independent of x2.
- Therefore, only the part of x1 that is uncorrelated with x2 is being related to y.
- By regressing y on the residuals of the regression of x1 on x2, we estimate the effect of x1 on y after the effect of x2 on x1 has been "netted out" or "partialled out."
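The equivalence of the two routes can be checked numerically. A sketch on simulated data (the data-generating coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)          # x1 and x2 are correlated
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
# (1) multiple regression of y on x1 and x2
b_multi = ols(np.column_stack([ones, x1, x2]), y)

# (2) partial x2 out of x1, then run a univariate regression of y on the residuals
g = ols(np.column_stack([ones, x2]), x1)    # regress x1 on x2
r1 = x1 - (g[0] + g[1] * x2)                # residuals r_hat_i1
b_part = ols(np.column_stack([ones, r1]), y)

print(b_multi[1], b_part[1])                # the two slopes coincide
```

The slope on the residuals equals the multiple-regression coefficient on x1 exactly, not just approximately.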

Simple vs Multiple Regression Estimates
- Compare the simple regression ỹ = β̃0 + β̃1x1 with the multiple regression ŷ = β̂0 + β̂1x1 + β̂2x2.
- Generally, β̃1 ≠ β̂1 unless:
  1. β̂2 = 0 (i.e. no partial effect of x2 on y); the first-order conditions will be the same in either case if this is true, or
  2. x1 and x2 are uncorrelated in the sample; in that case regressing x1 on x2 results in no partialling out (the residuals give the complete variation in x1).
- In the general case (with k regressors), we need x1 uncorrelated with all the other x's.
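The gap between the two estimates is governed by an exact in-sample identity: β̃1 = β̂1 + β̂2·δ̃1, where δ̃1 is the slope from regressing x2 on x1. A sketch verifying it on simulated data (coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_simple = ols(np.column_stack([ones, x1]), y)      # tilde-beta_1 in slot 1
b_multi = ols(np.column_stack([ones, x1, x2]), y)   # hat-beta_1, hat-beta_2
d = ols(np.column_stack([ones, x1]), x2)            # tilde-delta_1: x2 on x1

# In-sample identity: tilde-beta_1 = hat-beta_1 + hat-beta_2 * tilde-delta_1
lhs = b_simple[1]
rhs = b_multi[1] + b_multi[2] * d[1]
print(lhs, rhs)
```

Either condition on the slide (β̂2 = 0, or zero sample correlation so δ̃1 = 0) makes the second term vanish, which is exactly when β̃1 = β̂1.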

Goodness-of-Fit
- As in the univariate case, we can think of each observation as being made up of an explained part and an unexplained part:
  yi = ŷi + ûi
- We can then define the following:
  Σ (yi − ȳ)² is the total sum of squares (SST)
  Σ (ŷi − ȳ)² is the explained sum of squares (SSE)
  Σ ûi² is the residual sum of squares (SSR)
- Then SST = SSE + SSR.

Goodness-of-Fit (continued)
- We can use this to think about how well our sample regression line fits our sample data.
- The fraction of the total sum of squares (SST) that is explained by the model is called the R-squared of the regression:
  R² = SSE/SST = 1 − SSR/SST
- We can also think of the R-squared as the squared correlation coefficient between the actual and fitted values of y:
  R² = [Σ (yi − ȳ)(ŷi − ȳ)]² / [Σ (yi − ȳ)² · Σ (ŷi − ȳ)²]
  (with an intercept included, the fitted values ŷi have mean ȳ).
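The decomposition and the three equivalent expressions for R² can be checked directly. A sketch on simulated data (the coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta
u_hat = y - y_hat

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSE = np.sum((y_hat - y.mean()) ** 2)  # explained (fitted values have mean y-bar)
SSR = np.sum(u_hat ** 2)               # residual sum of squares

r2_ratio = SSE / SST
r2_resid = 1.0 - SSR / SST
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared corr of actual and fitted
print(SST, SSE + SSR)
print(r2_ratio, r2_resid, r2_corr)          # all three R^2 values agree
```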

More about R-squared
- R² can never decrease when another independent variable is added to a regression, and it usually increases.
- When you add another variable to the RHS, SSE will either stay the same (if that variable explains none of the variation in y) or increase (if it explains some of the variation in y); SST will not change.
- So while it is tempting to interpret an increase in the R-squared after adding a variable to the RHS as evidence that the new variable is a useful addition to the model, one should not read too much into it.
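The "R² can never decrease" claim is easy to see in action: even a regressor that has nothing to do with y cannot lower the fit. A sketch with a deliberately irrelevant variable (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
x1 = rng.normal(size=n)
y = 2.0 + 1.0 * x1 + rng.normal(size=n)
junk = rng.normal(size=n)                   # regressor unrelated to y

def r_squared(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - ssr / sst

ones = np.ones(n)
r2_small = r_squared(np.column_stack([ones, x1]), y)
r2_big = r_squared(np.column_stack([ones, x1, junk]), y)
print(r2_small, r2_big)   # r2_big >= r2_small even though `junk` is irrelevant
```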

Assumptions for Unbiasedness
- The assumptions needed to get unbiased estimates in simple regression can be restated:
1. The population model is linear in parameters:
   y = β0 + β1x1 + β2x2 + ... + βkxk + u
2. We have a random sample of size n from the population model, {(xi1, xi2, ..., xik, yi): i = 1, 2, ..., n}, i.e. no sample selection bias. The sample model is then
   yi = β0 + β1xi1 + β2xi2 + ... + βkxik + ui

Assumptions for Unbiasedness
3. None of the x's is constant, and there are no exact linear relationships among the x's.
   - We say "perfect collinearity" exists when one of the x's is an exact linear combination of the other x's.
   - It is OK for x's to be correlated with each other, but perfect correlation makes estimation impossible; high correlation can be problematic.
4. E(u|x1, x2, ..., xk) = 0
   - All of the explanatory variables are "exogenous."
   - We call an x variable "endogenous" if it is correlated with the error term.
   - We need to respecify the model if at least one x variable is endogenous.

Examples of Perfect Collinearity
1. x1 = 2x2: one of these is redundant.
   - x2 perfectly predicts x1, so we should throw out either x1 or x2.
   - If an x is constant (e.g. x = c), then it is a perfect linear function of the intercept term: c = aβ0 for some constant a.
2. The linear combinations can be more complicated.
   - Suppose you had data on people's age and education, but not on their experience. If you wanted to measure experience, you could assume they started work right out of school and calculate Exp = Age − (Educ + 5), where the 5 accounts for the years before we start school as kids.

Examples of Perfect Collinearity
   - But then these three variables would be perfectly collinear (each a linear combination of the others).
   - In this case, only two of those variables could be included on the RHS.
3. Including income and income² does not violate this assumption (income² is not an exact linear function of income).
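What perfect collinearity breaks, mechanically, is the solvability of the normal equations: X'X loses full rank. A sketch contrasting an exact linear dependence with the (legal) income/income² case, on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x2 = rng.normal(size=n)
x1 = 2.0 * x2                                # perfect collinearity: x1 = 2*x2
X = np.column_stack([np.ones(n), x1, x2])

# X'X is singular, so the normal equations have no unique solution
rank_collinear = np.linalg.matrix_rank(X.T @ X)
print(rank_collinear)                        # 2, not 3

# income and income^2 are fine: the relationship is not linear
inc = np.abs(rng.normal(size=n)) + 1.0
Xq = np.column_stack([np.ones(n), inc, inc ** 2])
rank_quadratic = np.linalg.matrix_rank(Xq.T @ Xq)
print(rank_quadratic)                        # 3
```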

Unbiasedness
- If assumptions 1-4 hold, it can be shown that
  E(β̂j) = βj,  j = 0, 1, ..., k
  i.e. the estimated parameters are unbiased.
- What happens if we include variables in our specification that do not belong (over-specify)?
  - This is called "overspecification" or "inclusion of irrelevant regressors." It is not a problem for unbiasedness.
- What if we exclude a variable from our specification that does belong?
  - This violates assumption 4. In general, OLS will be biased; this is called the "omitted variables problem."

Omitted Variable Bias
- This is one of the greatest problems in regression analysis.
- If you omit a variable that affects y and is correlated with x, then the coefficient you estimate on x will partly pick up the effect of the omitted variable, and hence will misrepresent the true effect of x on y.
- This is why we need the zero conditional mean assumption to hold; and to make sure it holds, we need to include all relevant x variables.

iClickers
Question: Suppose the true model of y is given by
  y = β0 + β1x1 + β2x2 + u
but you don't have data on x2, so you estimate
  ỹ = β̃0 + β̃1x1 + v
instead. Assume that x1 and x2 are correlated, but not perfectly. This is a problem because:
A) x1 and x2 are perfectly collinear.
B) u and v are different from each other.
C) v contains x2.
D) None of the above.

Omitted Variable Bias
- Suppose the true model is given by
  y = β0 + β1x1 + β2x2 + u
- But we estimate ỹ = β̃0 + β̃1x1; then
  β̃1 = Σ (xi1 − x̄1)yi / Σ (xi1 − x̄1)²
- Recall the true model, so that
  yi = β0 + β1xi1 + β2xi2 + ui

Omitted Variable Bias (cont.)
- So the numerator of β̃1 becomes
  Σ (xi1 − x̄1)(β0 + β1xi1 + β2xi2 + ui)
   = β0 Σ (xi1 − x̄1) + β1 Σ (xi1 − x̄1)xi1 + β2 Σ (xi1 − x̄1)xi2 + Σ (xi1 − x̄1)ui
- Recall: Σ (xi1 − x̄1) = 0 and Σ (xi1 − x̄1)xi1 = Σ (xi1 − x̄1)².
- So the numerator becomes
  β1 Σ (xi1 − x̄1)² + β2 Σ (xi1 − x̄1)xi2 + Σ (xi1 − x̄1)ui

Omitted Variable Bias (cont.)
- Returning to our formula for β̃1:
  β̃1 = Σ (xi1 − x̄1)yi / Σ (xi1 − x̄1)²
     = [β1 Σ (xi1 − x̄1)² + β2 Σ (xi1 − x̄1)xi2 + Σ (xi1 − x̄1)ui] / Σ (xi1 − x̄1)²
     = β1 + β2 · Σ (xi1 − x̄1)xi2 / Σ (xi1 − x̄1)² + Σ (xi1 − x̄1)ui / Σ (xi1 − x̄1)²

Omitted Variable Bias (cont.)
- So,
  β̃1 = β1 + β2 · Σ (xi1 − x̄1)xi2 / Σ (xi1 − x̄1)² + Σ (xi1 − x̄1)ui / Σ (xi1 − x̄1)²
- Taking expectations (conditional on the x's) and recalling that E(ui) = 0, we get
  E(β̃1) = β1 + β2 · Σ (xi1 − x̄1)xi2 / Σ (xi1 − x̄1)²
- In other words, our estimate is (generally) biased. :(

Omitted Variable Bias (cont.)
- How might we interpret the bias? Consider the regression of x2 on x1:
  x̃2 = δ̃0 + δ̃1x1,  where  δ̃1 = Σ (xi1 − x̄1)xi2 / Σ (xi1 − x̄1)²
- So E(β̃1) = β1 + β2δ̃1, and the omitted variable bias equals zero if:
  1. β2 = 0 (e.g. if x2 does not belong in the model), or
  2. δ̃1 = 0 (e.g. if x1 and x2 do not covary).

Summary of Direction of Bias
- The direction of the bias, β2 · δ̃1, depends on:
  1. the sign of δ̃1: positive if x1 and x2 are positively correlated, negative if x1 and x2 are negatively correlated;
  2. the sign of β2.

Summary table:

             Corr(x1, x2) > 0    Corr(x1, x2) < 0
  β2 > 0     Positive bias       Negative bias
  β2 < 0     Negative bias       Positive bias
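The table can be verified by simulation: generate data where the omitted x2 has a negative effect (β2 < 0) and is negatively correlated with x1, and the misspecified slope should come out above the true β1 (bottom-right cell). The coefficients below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
beta1, beta2 = 2.0, -1.5                     # beta2 < 0
n, reps = 500, 200
slopes = []
for _ in range(reps):
    x2 = rng.normal(size=n)
    x1 = -0.7 * x2 + rng.normal(size=n)      # corr(x1, x2) < 0
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    xd = x1 - x1.mean()
    slopes.append(np.sum(xd * y) / np.sum(xd ** 2))  # misspecified: y on x1 only

mean_slope = np.mean(slopes)
# beta2 < 0 and corr(x1, x2) < 0: positive bias, per the summary table
print(mean_slope)   # well above the true beta1 = 2.0
```

Here the population value of β2·δ̃1 is (−1.5)·(−0.7/1.49) ≈ +0.70, so the simulated slopes should average around 2.7 rather than 2.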

iClickers
Question: Suppose the true model of y is given by
  y = β0 + β1x1 + β2x2 + u
but you estimate ỹ = β̃0 + β̃1x1 + v instead. Suppose that x1 and x2 are negatively correlated, and that x2 has a negative effect on y. What will be the sign of the bias of β̃1?
A) Positive.
B) Negative.
C) The bias will be zero.
D) None of the above.

iClickers
Suppose that two things, education and ability, determine earnings. Assume that both matter positively for earnings, and that ability and education positively covary.
Question: If you estimate a model of earnings that excludes ability, the coefficient estimate on education will be
A) positively biased.
B) negatively biased.
C) unbiased.
D) none of the above.

Omitted Variable Bias Summary
- The size of the bias is also important.
- Typically we don't know the magnitude of β2 or δ̃1, but we can usually make an educated guess about whether each is positive or negative.
- What about the more general case (k+1 variables)? Technically, we can only sign the bias in the more general case if all the included x's are uncorrelated with one another.

OVB Example 1
- Suppose that the true model of the wage is
  wage = β0 + β1educ + β2exper + u
  but you estimate wage = β0 + β1educ + v. Which way will the estimated coefficient on educ be biased?
1) δ̃1 has the same sign as corr(educ, exper) < 0 (people with more education tend to have less experience).
2) β2 > 0 (because people with more experience tend to be more productive).
- Therefore bias = β2 · δ̃1 < 0; the estimate is negatively biased.

OVB Example 2
- Now suppose the true model is
  wage = β0 + β1educ + β2ability + u
  but you again estimate wage = β0 + β1educ + v.
1) δ̃1 > 0 (education and ability positively covary).
2) β2 > 0 (because people with more ability tend to be more productive).
- Therefore bias = β2 · δ̃1 > 0; the estimate is positively biased.
- In other words, if you omit ability when it should be included, the estimated coefficient on educ will be positively biased (too big on average).

The More General Case
- For example, suppose that the true model is
  y = β0 + β1x1 + β2x2 + β3x3 + u
  and we estimate
  ỹ = β̃0 + β̃1x1 + β̃2x2
- Suppose x1 and x3 are correlated. In general, both β̃1 and β̃2 will be biased; β̃2 will only be unbiased if x1 and x2 are uncorrelated.

Variance of the OLS Estimators
- We now know (if assumptions 1-4 hold) that the sampling distribution of our estimate is centered around the true parameter.
- We want to think about how spread out this distribution is.
- It is much easier to think about this variance under an additional assumption, so:
- Assumption 5: Var(u|x1, x2, ..., xk) = σ² (homoskedasticity).
- This implies that Var(y|x1, x2, ..., xk) = σ².

Variance of OLS (cont.)
- The four assumptions for unbiasedness, plus this homoskedasticity assumption, are known as the Gauss-Markov assumptions.
- These can be used to show:
  Var(β̂j) = σ² / [SSTj(1 − Rj²)]
  where SSTj = Σ (xij − x̄j)² is the total sample variation in xj, and Rj² refers to the R-squared obtained from a regression of xj on all the other independent variables (including an intercept).

It Is Important to Understand What Rj² Is
- It is not the R-squared for the regression of y on the x's (the usual R-squared we think of).
- It is an R-squared, but for a different regression.
- Suppose we are calculating the variance of the estimator for x1; then R1² is the R-squared from a regression of x1 on x2, x3, ..., xk.
- If we are calculating the variance of the estimator for x2, then R2² is the R-squared from a regression of x2 on x1, x3, ..., xk, etc.
- Whichever coefficient you are calculating the variance for, you just regress that variable on all the other RHS variables (this measures how closely that variable is related, linearly, to the other RHS variables).

Components of OLS Variances
Thus, how precisely we can estimate the coefficients (the variance) depends on:
1. The error variance (σ²):
   - A larger σ² implies a higher variance of the slope coefficient estimators.
   - The more "noise" (unexplained variation) in the equation, the harder it is to get precise estimates.
   - Just like the univariate case.
2. The total variation in xj:
   - A larger SSTj gives us more variation in xj from which to discern an impact on y, lowering the variance.
   - Just like the univariate case.

Components of OLS Variances
3. The linear relationships among the independent variables (Rj²). Note that we did not need this term in the univariate case:
   - A larger Rj² implies a larger variance for the estimator.
   - The more correlated the RHS variables, the higher the variance of the estimators (the more imprecise the estimates).
- The problem of highly correlated x variables is called "multicollinearity."
  - It is not clear what the solution is. You could drop some of the correlated x variables, but then you may end up with omitted variable bias.
  - You can increase your sample size, which will boost the SSTj, and maybe let you shrink the variance of your estimators that way.
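The factor 1/(1 − Rj²) is what practitioners call the variance inflation factor (VIF). A sketch computing R1² and its VIF for a regressor built to be highly correlated with another (the 0.95 loading is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.95 * x2 + 0.2 * rng.normal(size=n)    # x1 nearly collinear with x2

def r_squared(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return 1.0 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
# R_1^2 comes from regressing x1 on the OTHER regressors (x2, x3), not on y
r2_1 = r_squared(np.column_stack([ones, x2, x3]), x1)
inflation = 1.0 / (1.0 - r2_1)   # factor by which Var(beta_1_hat) is scaled up
print(r2_1, inflation)
```

With this design, R1² is around 0.95, so Var(β̂1) is inflated by roughly a factor of twenty relative to the uncorrelated case.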

iClickers
- Recall the expression for the variance of the slope OLS estimators in multivariate regression:
  Var(β̂j) = σ² / [Σ (xij − x̄j)²(1 − Rj²)]
Question: Suppose you estimate
  yi = β0 + β1x1i + β2x2i + β3x3i + ui
and want an expression for the variance of β̂3. You would obtain the needed R3² by estimating which of the following equations?
A) yi = β0 + β1x1i + β2x2i + β3x3i + ui
B) yi = β0 + β3x3i + vi
C) x3i = δ0 + δ1x1i + δ2x2i + wi
D) None of the above.

Example of Multicollinearity (High Rj²)
- Suppose that you want to try to explain differences in mortality rates across different states.
- You might use variation in medical expenditures across the states (e.g. expenditures on hospitals, medical equipment, doctors, nurses, etc.).
- The problem? Expenditures on hospitals will tend to be highly correlated with expenditures on doctors, etc., so estimates of the impact of hospital spending on mortality rates will tend to be imprecise.

Variances in Misspecified Models
- Consider again the misspecified model (x2 omitted): ỹ = β̃0 + β̃1x1.
- We showed that the estimate of β1 is biased if β2 · δ̃1 does not equal zero, and that including x2 in the model does not bias the estimate even if β2 = 0; this suggests that adding x variables can only reduce bias.
- So, Q: why not always throw in every possible RHS variable, just to be safe? A: multicollinearity.
- We know that in the misspecified model
  Var(β̃1) = σ² / SST1
  (R1² = 0 if there are no other variables on the RHS); adding variables raises R1² and hence the variance.

Misspecified Models (cont.)
- Thus Var(β̃1) < Var(β̂1) (holding σ² constant), unless x1 and x2 are uncorrelated (R1² = 0), in which case the variances are the same (holding σ² constant).
- Assuming that x1 and x2 are correlated:
  1. If β2 = 0: β̃1 and β̂1 will both be unbiased, and Var(β̃1) < Var(β̂1).
  2. If β2 ≠ 0: β̃1 will be biased and β̂1 will be unbiased, and the variance of the misspecified estimator could be greater or less than the variance of the correctly specified estimator, depending on what happens to σ².

Misspecified Models (cont.)
- So, if x2 does not affect y (i.e. β2 = 0), it is best to leave x2 out, because we will get a more precise estimate of the effect of x1 on y.
- Note that if x2 is uncorrelated with x1 but β2 ≠ 0, then including x2 will actually lower the variance of the estimated coefficient on x1, because its inclusion lowers σ².
- Generally, we need to include variables whose omission would cause bias, but we often do not want to include other variables if their correlation with other RHS variables will reduce the precision of the estimates.
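The bias/variance trade-off in case 1 (β2 = 0, x1 and x2 correlated) shows up clearly in a simulation: both estimators center on the true slope, but the over-specified one is noisier. The data-generating numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 100, 500
simple_slopes, multi_slopes = [], []
for _ in range(reps):
    x2 = rng.normal(size=n)
    x1 = 0.9 * x2 + 0.3 * rng.normal(size=n)   # x1, x2 strongly correlated
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # beta2 = 0: x2 is irrelevant
    ones = np.ones(n)
    bs = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)[0]
    bm = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)[0]
    simple_slopes.append(bs[1])
    multi_slopes.append(bm[1])

# Both estimators are unbiased here, but including the irrelevant,
# correlated x2 inflates the sampling variance of the slope on x1.
print(np.mean(simple_slopes), np.mean(multi_slopes))
print(np.var(simple_slopes), np.var(multi_slopes))
```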

Estimating the Error Variance
- We do not know what the error variance σ² is, because we do not observe the errors ui; what we observe are the residuals ûi.
- We can use the residuals to form an estimate of the error variance, as we did for simple regression:
  σ̂² = Σ ûi² / (n − k − 1) = SSR/df
- df = (# of observations) − (# of estimated parameters) = n − (k+1).

Error Variance Estimate (cont.)
- Thus, an estimate of the standard error of the jth coefficient is:
  se(β̂j) = σ̂ / [Σ (xij − x̄j)²(1 − Rj²)]^(1/2) = σ̂ / [SSTj(1 − Rj²)]^(1/2)
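Both pieces (σ̂² = SSR/df and the standard-error formula) can be computed by hand and checked against the equivalent matrix expression σ̂²·(X'X)⁻¹, whose diagonal entries are the coefficient variances. A sketch on simulated data (coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 300, 2
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta
sigma2_hat = np.sum(u_hat ** 2) / (n - k - 1)    # SSR / df

# se(beta_1_hat) from the slide's formula ...
def r_squared(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return 1.0 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)

sst1 = np.sum((x1 - x1.mean()) ** 2)
r2_1 = r_squared(np.column_stack([np.ones(n), x2]), x1)
se_formula = np.sqrt(sigma2_hat / (sst1 * (1.0 - r2_1)))

# ... and from the matrix form sigma2_hat * (X'X)^-1, diagonal entry for x1
se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])
print(se_formula, se_matrix)   # identical
```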

The Gauss-Markov Theorem
- Allows us to say something about the efficiency of OLS: why we should use OLS instead of some other estimator.
- Given our five Gauss-Markov assumptions, it can be shown that the OLS estimates are "BLUE."
- Best: smallest variance among the class of linear unbiased estimators:
  Var(β̂1) ≤ Var(β̈1), where β̈1 denotes any other linear unbiased estimator of β1.

The Gauss-Markov Theorem
- Linear: can be expressed as a linear function of y:
  β̂j = Σ wij yi
  where the wij are functions of the sample x's.
- Unbiased: E(β̂j) = βj, for all j.
- Estimator: provides an estimate of the underlying parameters.
- When the assumptions hold, use OLS; when the assumptions fail, OLS is not BLUE and a better estimator may exist.