2008 American Economic Association Summer Program Department of Economics University of California Santa Barbara

Brian Duncan ECON 194SE: Econometrics (Session 2)

Summer 2010 MWF 9:30 – 11:00 (NH 2212)

Lecture #1:
I.    The Gauss-Markov Theorem.
II.   OLS coefficients are linear estimators.
III.  Properties of OLS weights, k_i.
IV.   The expected value of β̂.
V.    Variance of β̂.
VI.   OLS estimate of the variance of β̂.
VII.  Omitted variable bias.

I.    The Gauss-Markov Theorem: Given the Classical Linear Regression Model (CLRM) assumptions, the ordinary least-squares (OLS) estimates are the Best Linear Unbiased Estimates (BLUE).

We have already proven each part of the Gauss-Markov Theorem. The assumptions we used to prove that the OLS estimates are linear, unbiased, and best are:

(i)    Y = Xβ + ε, where Y is (n×1), X is (n×m), β is (m×1) and constant, and ε is an (n×1) random error.
(ii)   (X'X)⁻¹ exists (i.e., it can be calculated).
(iii)  X is non-stochastic (not a random variable).
(iv)   E(ε_i) = 0 ∀i.
(v)    E(ε_i²) = σ² ∀i.
(vi)   E(ε_iε_j) = 0 ∀i ≠ j.

Another way to write conditions (iv), (v), and (vi) is: ε ~ iid(0, σ²I). The six assumptions above are called the classical linear regression model (CLRM) assumptions. They are also known as the Gauss-Markov assumptions.


II.   OLS coefficients are linear estimators.

Recall the OLS deviations-from-mean formula:

β̂ = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)².

β̂ = Σ[(X_i − X̄)Y_i − (X_i − X̄)Ȳ] / Σ(X_i − X̄)².

β̂ = [Σ(X_i − X̄)Y_i − Ȳ Σ(X_i − X̄)] / Σ(X_i − X̄)².

β̂ = [Σ(X_i − X̄)Y_i − Ȳ(ΣX_i − nX̄)] / Σ(X_i − X̄)².   Note: ΣX_i − nX̄ = nX̄ − nX̄ = 0.

β̂ = Σ(X_i − X̄)Y_i / Σ(X_i − X̄)².

β̂ = Σ k_iY_i,   where k_i = (X_i − X̄) / Σ(X_i − X̄)².

Each k_i is non-stochastic, because it is a function of X_i (which we assume to be non-stochastic). β̂ is a linear estimator because it is a linear function of Y_i, with the k_i serving as weights. Therefore, β̂ is a weighted average of Y_i. In fact, the formula for β̂ is similar to the formula for Ȳ: both β̂ and Ȳ are weighted averages of Y_i. In the formula for Ȳ, the weights are k_i = 1/n ∀i.

A.    In matrix form: β̂ = (X'X)⁻¹X'Y. Here (X'X)⁻¹X' is analogous to k_i: the vector β̂ is a linear function of the vector Y, with (X'X)⁻¹X' serving as the weights.
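The equivalence above can be checked numerically. The following Python/numpy sketch is not part of the original notes; it uses made-up data and simply confirms that the weighted-average formula Σ k_iY_i and the matrix formula (X'X)⁻¹X'Y give the same slope.

# A minimal numerical check (not from the notes): compute the OLS slope both as a
# weighted average of Y (sum of k_i * Y_i) and with the matrix formula (X'X)^{-1} X'Y.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(10, 3, n)                 # treated as fixed once drawn, as in the notes
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # OLS weights k_i
beta_hat_weights = np.sum(k * y)                   # beta_hat = sum k_i Y_i

X = np.column_stack([np.ones(n), x])               # design matrix with a constant
coef = np.linalg.solve(X.T @ X, X.T @ y)           # (X'X)^{-1} X'Y
print(beta_hat_weights, coef[1])                   # the two slope estimates agree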


III.  Properties of OLS weights, k_i.

Our ultimate goal is to derive the properties of the OLS estimator (i.e., its expected value, variance, etc.). Writing the OLS estimate as

β̂ = Σ k_iY_i,   where k_i = (X_i − X̄) / Σ(X_i − X̄)² is the OLS weight,

will make it easier for us to reach our goal. The next step is to establish some properties of k_i that we will use later.

1.   k_i is a function of X_i. Anything we assume about X_i will be true of k_i. In particular, we will assume that X_i is non-stochastic (not a random variable) and, therefore, k_i is non-stochastic.

2.   Σ k_i = 0.
     Proof: Σ k_i = Σ[(X_i − X̄) / Σ(X_j − X̄)²] = [1 / Σ(X_j − X̄)²] Σ(X_i − X̄) = 0.
     Note: (1) Σ(X_i − X̄)² is a constant. (2) Σ(X_i − X̄) = nX̄ − nX̄ = 0.

3.   Σ k_i² = 1 / Σ(X_i − X̄)².
     Proof: Σ k_i² = Σ[(X_i − X̄) / Σ(X_j − X̄)²]² = Σ(X_i − X̄)² / [Σ(X_i − X̄)²]² = 1 / Σ(X_i − X̄)².

4.   Σ k_i(X_i − X̄) = Σ k_iX_i = 1.
     Proof: (1) Σ k_i(X_i − X̄) = Σ k_iX_i − X̄ Σ k_i = Σ k_iX_i.
            (2) Σ k_i(X_i − X̄) = Σ[(X_i − X̄)² / Σ(X_j − X̄)²] = Σ(X_i − X̄)² / Σ(X_i − X̄)² = 1.
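As a quick numerical check (not in the original notes), the following Python snippet verifies properties 2 through 4 of the OLS weights on an arbitrary set of X values.

# A small sketch (not from the notes) verifying the k_i properties on arbitrary data.
import numpy as np

x = np.array([3.0, 5.0, 7.0, 11.0, 13.0])
k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)

print(np.isclose(k.sum(), 0.0))                                        # property 2
print(np.isclose(np.sum(k ** 2), 1.0 / np.sum((x - x.mean()) ** 2)))   # property 3
print(np.isclose(np.sum(k * x), 1.0))                                  # property 4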


IV.   The expected value of β̂.

Y_i = α + βX_i + ε_i.

β̂ = Σ k_iY_i = Σ k_i(α + βX_i + ε_i).
β̂ = α Σ k_i + β Σ k_iX_i + Σ k_iε_i.   (Note: Σ k_i = 0 and Σ k_iX_i = 1.)
β̂ = β + Σ k_iε_i.

Expected value:

E(β̂) = β + E(Σ k_iε_i).
E(β̂) = β + Σ k_iE(ε_i).
E(β̂) = β.

This uses: (i) β is a constant (non-stochastic); (ii) X_i is non-stochastic (and so is k_i); (iii) E(ε_i) = 0. Therefore, β̂ is an unbiased estimate of β.

A.    Expected value of β̂ in matrix form.

β̂ = (X'X)⁻¹X'Y.
β̂ = (X'X)⁻¹X'(Xβ + ε).
β̂ = β + (X'X)⁻¹X'ε.
E(β̂) = β + (X'X)⁻¹X'E(ε) = β.

This uses: (i) β is a constant (non-stochastic); (ii) X is non-stochastic; (iii) E(ε) = 0. Therefore, β̂ is an unbiased estimate of β.

V.    Variance of β̂.

var(β̂) = E[β̂ − E(β̂)]² = E(β̂ − β)².
var(β̂) = E(Σ k_iε_i)², because β̂ = β + Σ k_iε_i from above.
var(β̂) = E[(k₁ε₁)² + (k₂ε₂)² + ⋯ + (k_nε_n)² + 2k₁k₂ε₁ε₂ + ⋯ + 2k_nk_{n−1}ε_nε_{n−1}].

Let us assume ε_i ~ iid(0, σ²), which means: (v) E(ε_i²) = σ², and (vi) E(ε_iε_j) = 0 ∀i ≠ j. Then

var(β̂) = Σ k_i²E(ε_i²) = σ² Σ k_i².
var(β̂) = σ² / Σ(X_i − X̄)²,   from the 3rd property of k_i.

σ² is the variance of the disturbance term; Σ(X_i − X̄)² is the variation of X. The var(β̂) is directly proportional to the variance of the error term, and inversely proportional to the variation of X.

A.    The variance of β̂ in matrix form.

var(β̂) = E[(β̂ − β)(β̂ − β)'].

From above, β̂ = β + (X'X)⁻¹X'ε, so β̂ − β = (X'X)⁻¹X'ε.

var(β̂) = E[(Aε)(Aε)'], where A = (X'X)⁻¹X'.
var(β̂) = E[Aεε'A'].
var(β̂) = A E(εε') A'.   Note that E(εε') = σ²I (see below).
var(β̂) = Aσ²IA' = σ²AA',

where AA' = (X'X)⁻¹X'X(X'X)⁻¹ = (X'X)⁻¹, because (X'X)⁻¹ is symmetric, so [(X'X)⁻¹]' = (X'X)⁻¹. Therefore,

var(β̂) = σ²(X'X)⁻¹.

Variance-covariance matrix: εε' is the (n×1)(1×n) outer product

εε' = [ ε₁²     ε₁ε₂    ⋯   ε₁ε_n  ]
      [ ε₂ε₁    ε₂²     ⋯   ε₂ε_n  ]
      [  ⋮       ⋮      ⋱     ⋮    ]
      [ ε_nε₁   ε_nε₂   ⋯   ε_n²   ]

so E(εε') has E(ε_i²) = var(ε_i) on the diagonal and E(ε_iε_j) = cov(ε_i, ε_j) off the diagonal. If we assume that ε_i ~ iid(0, σ²), then (v) E(ε_i²) = σ² and (vi) E(ε_iε_j) = 0 ∀i ≠ j. Therefore, E(εε') = σ²I.
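A small Monte Carlo experiment can illustrate both results at once. The Python sketch below is not part of the original notes: it holds X fixed, redraws the errors many times, and compares the simulated mean and variance of β̂ with β and σ²/Σ(X_i − X̄)². All parameter values are made up.

# A Monte Carlo sketch (not from the notes): unbiasedness and the variance formula.
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, sigma = 30, 2.0, 0.5, 1.5
x = np.linspace(0, 10, n)                        # fixed (non-stochastic) regressor
k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)

draws = np.array([np.sum(k * (alpha + beta * x + rng.normal(0, sigma, n)))
                  for _ in range(20000)])

print(draws.mean())                                          # close to beta = 0.5 (unbiased)
print(draws.var(), sigma**2 / np.sum((x - x.mean())**2))     # simulated and theoretical variance agree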

VI.   OLS estimate of the variance of β̂.

The OLS estimate of σ² is:

σ̂² = [1/(n − 2)] Σ ε̂_i².

(Remember that the mean of ε̂ is zero.)

Therefore, the OLS estimate of the variance of β̂ is:

vâr(β̂) = σ̂² / Σ(X_i − X̄)².

In matrix form:

σ̂² = [1/(n − m)] ε̂'ε̂,   and so   vâr(β̂) = σ̂²(X'X)⁻¹,

where m represents the number of X variables, including the constant. Why divide by (n − m)? (Answer: to produce an unbiased estimate of σ².)
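The following Python lines (not from the notes; the data are simulated purely for illustration) compute σ̂² from the residuals and form vâr(β̂) = σ̂²(X'X)⁻¹ for a regression with a constant and one X variable.

# A short sketch (not from the notes): estimated error variance and coefficient variances.
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, n)
y = 1.0 + 0.3 * x + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), x])             # m = 2 columns, including the constant
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
m = X.shape[1]
sigma2_hat = resid @ resid / (n - m)             # (1/(n-m)) * e'e
vcov = sigma2_hat * np.linalg.inv(X.T @ X)       # estimated variance-covariance matrix
se_slope = np.sqrt(vcov[1, 1])                   # standard error of the slope
print(sigma2_hat, se_slope)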


VII.  Omitted variable bias.

Consider the regression:

y_i = α + β₁x1i + β₂x2i + ε_i.   (1)

Suppose that (1) is the true model, but that we either do not have the x2 variable, or we have it but choose not to include it in our regression. That is, suppose we estimate the following regression:

y_i = α + β₁x1i + ε_i.   (2)

Is the OLS estimate β̂₁ from (2) an unbiased estimate of β₁ even if (1) is the true model, or does omitting variables from an OLS regression bias the coefficient estimates for the variables that we do include? The answer is "maybe." To see why, start with the OLS estimate of (2):

β̂₁ = Σ k_iy_i,   where k_i = (x1i − x̄1) / Σ(x1i − x̄1)².

Substituting in the correct definition of y_i (i.e., from (1)):

β̂₁ = Σ k_i(α + β₁x1i + β₂x2i + ε_i).   (3)
β̂₁ = α Σ k_i + β₁ Σ k_ix1i + β₂ Σ k_ix2i + Σ k_iε_i.   (4)
β̂₁ = β₁ + β₂ Σ k_ix2i + Σ k_iε_i.   (5)
E[β̂₁] = β₁ + β₂E[Σ k_ix2i] + Σ k_iE[ε_i].   (6)
E[β̂₁] = β₁ + β₂E[Σ k_ix2i].   (7)

Note that β̂₁ is biased if β₂E[Σ k_ix2i] ≠ 0. To understand what this means, think about what Σ k_ix2i represents. It looks very similar to Σ k_iy_i, which is the formula for the OLS estimate of β₁ from (2). In fact, consider estimating a hypothetical regression with x2 as the dependent variable and x1 as the independent variable:

x2i = α + γx1i + ε_i,   (8)

where γ is the coefficient. The OLS estimate of γ is γ̂ = Σ k_ix2i, where k_i = (x1i − x̄1)/Σ(x1i − x̄1)². Therefore, (7) can be re-written as:

E[β̂₁] = β₁ + β₂E[γ̂].   (9)
E[β̂₁] = β₁ + β₂γ.   (10)

Therefore, if you omit the variable x2 from regression (1), the OLS estimate β̂₁ will, in general, be biased. There are, however, two specific situations in which β̂₁ will not be biased: the first is when β₂ = 0, the second is when γ = 0. If β₂ = 0, then x2 does not belong in (1), and so omitting it means omitting an irrelevant variable; in this case it is not surprising that β̂₁ is unbiased. The case γ = 0 is more interesting. If γ = 0, then the x2 variable is uncorrelated with x1 (regression (8) would have R² = 0). In this case, β̂₁ is unbiased. In other words, if you omit a relevant variable, but the omitted variable is uncorrelated with the variable that you do include, then the OLS coefficients remain unbiased. What is perhaps more useful is that, when OLS is biased, we can tell the direction of the bias.

Summary:

Suppose the true model is:      (1)  y_i = α + β₁x1i + β₂x2i + ε_i,
but instead we estimate:        (2)  y_i = α + β₁x1i + ε_i.

The direction of the omitted variable bias of β̂₁ estimated from (2) is:

                         Correlation between x2 and x1
Effect of x2 on y        Positive              Negative              Zero
Positive                 β̂₁ bias positive      β̂₁ bias negative      β̂₁ is unbiased
Negative                 β̂₁ bias negative      β̂₁ bias positive      β̂₁ is unbiased
Zero                     β̂₁ is unbiased        β̂₁ is unbiased        β̂₁ is unbiased
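The bias formula E[β̂₁] = β₁ + β₂γ can be illustrated with a short simulation. The Python sketch below is not part of the notes and all coefficient values are invented; it fits the short regression (2) when the true model is (1) and compares the estimated slope with β₁ + β₂γ.

# A simulation sketch (not from the notes) of the omitted variable bias formula.
import numpy as np

rng = np.random.default_rng(3)
n, beta1, beta2, gamma = 5000, 1.0, 2.0, 0.5
x1 = rng.normal(size=n)
x2 = gamma * x1 + rng.normal(size=n)            # x2 correlated with x1 (slope gamma)
y = 0.5 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

X_short = np.column_stack([np.ones(n), x1])     # regression (2): x2 omitted
b_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)
print(b_short[1], beta1 + beta2 * gamma)        # biased slope is close to beta1 + beta2*gamma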

Note: Omitted variable bias in matrix form. Consider the regression:

Y = Xβ + zπ + ε,   (1)

where Y is (n×1), X is (n×m), β is (m×1), z is (n×1), π is (1×1), and ε is (n×1); X represents a matrix of m variables (including the constant), z represents one additional variable, and π represents the coefficient on z. Now consider the same regression without the z variable:

Y = Xβ + ε.   (2)

What happens if (1) is the correct model, but we estimate (2)? In other words, we omit the relevant variable z. Our OLS estimate of (2) is:

β̂ = (X'X)⁻¹X'Y.

Substituting in the correct definition of Y (i.e., from (1)):

β̂ = (X'X)⁻¹X'(Xβ + zπ + ε).   (3)
β̂ = (X'X)⁻¹X'Xβ + (X'X)⁻¹X'zπ + (X'X)⁻¹X'ε.   (4)
β̂ = β + (X'X)⁻¹X'zπ + (X'X)⁻¹X'ε.   (5)
E[β̂] = β + E[(X'X)⁻¹X'zπ].   (6)

Note that β̂ is biased if E[(X'X)⁻¹X'zπ] ≠ 0. To understand what this means, think about what (X'X)⁻¹X'z represents. Consider estimating a hypothetical regression with z as the dependent variable and X as the independent variables:

z = Xγ + ε,   (7)

where γ is (m×1). The OLS estimate of γ is γ̂ = (X'X)⁻¹X'z. Therefore, (6) can be re-written as:

E[β̂] = β + E[γ̂π].   (8)
E[β̂] = β + γπ,   (9)

where γ is (m×1) and π is (1×1).


2008 American Economic Association Summer Program Department of Economics University of California Santa Barbara

Brian Duncan ECON 194SE: Econometrics (Session 2)

Summer 2010 MWF 9:30 – 11:00 (NH 2212)

Lecture #2:
I.    Hypothesis testing.
      A. Null and alternative hypothesis.
      B. Statistical test.
         1. Critical region and acceptance region.
      C. Type I and Type II error.
      D. Power of a test.
      E. Confidence and significance level.
      F. Test of significance, p-value, and confidence interval.

I.    Hypothesis testing. Testing a hypothesis that a true parameter equals a particular value, against an alternative hypothesis that the true parameter takes on other values.

Test statistics:

A normally distributed random variable:
      y ~ N(μ_y, σ_y²).
      z = (y − μ_y)/σ_y ~ N(0, 1).

The sample mean:
      ȳ ~ N(μ_y, σ_y²/n).
      z = (ȳ − μ_y)/(σ_y/√n) ~ N(0, 1).
      t = (ȳ − μ_y)/(S_y/√n) ~ t(n−1).
      Note: S_y is the sample standard deviation; S_y/√n is the sample standard error.

An OLS regression, model Y_i = α + βX_i + ε_i:
      ε ~ N(0, σ_ε²).
      Y_i ~ N(α + βX_i, σ_ε²).
      β̂ ~ N[β, σ_ε² / Σ(X_i − X̄)²].
      z = (β̂ − β) / [σ_ε / √(Σ(X_i − X̄)²)] ~ N(0, 1).
      t = (β̂ − β) / se(β̂) ~ t(n−k),   where se(β̂) = σ̂_ε / √(Σ(X_i − X̄)²) and σ̂_ε = √[(1/(n − k)) Σ ε̂_i²].
      Note: k represents the number of coefficients, including the constant.

Null hypothesis:        The hypothesis that a parameter takes on a particular value. For example, H₀: μ_x = 20, or H₀: β = 0.

Alternative hypothesis: Other values that the parameter might take on.
                        H₁: β > 0  ⇒  one-sided test.
                        H₁: β < 0  ⇒  one-sided test.
                        H₁: β ≠ 0  ⇒  two-sided test.

Note: What you are trying to prove becomes the alternative hypothesis. Therefore, if we "reject" the null hypothesis, we support our theory. If we "fail to reject" the null hypothesis, then we fail to support our theory. For example, in a criminal court, the null hypothesis is that the defendant is innocent. The court must reject the null hypothesis to send the defendant to jail.

Statistical test:       A decision rule used to "fail to reject the null hypothesis" or "reject the null hypothesis." A statistical test produces a test statistic.

Critical region:        The range of values of the test statistic for which the null hypothesis is rejected.

Type I and Type II error:
                        Type I error:  α = Prob(rejecting H₀ | H₀ is true).
                        Type II error: β = Prob(failing to reject H₀ | H₀ is false).
                        A researcher tries to minimize α (the probability of sending an innocent person to jail).

Power of a test:        Prob(rejecting H₀ | H₀ is false) = 1 − β.

Significance level:     Prob(rejecting H₀ | H₀ is true) = α.

Confidence level:       Prob(failing to reject H₀ | H₀ is true) = 1 − α.


Example:   Suppose we estimate the regression Y_i = α + βX_i + ε_i from a sample with 42 observations. The OLS estimate of β is 0.156, with a standard error of 0.084.

Test of significance:   Test the hypothesis that the marginal effect of X on Y is zero.

Given: β̂ = 0.156, se(β̂) = 0.084, n = 42, d.f. = 40.

H₀: β = 0.   (Null hypothesis)
H₁: β ≠ 0.   (Alternative hypothesis)

t = (β̂ − β)/se(β̂) = 0.156/0.084 = 1.857 ~ t(40).   (Test statistic)

At the 95% confidence level (5% significance level):

t*(40, 0.025) = ±2.021.   (Critical value)

[Figure: the t(40) density with 0.025 in each tail; the critical regions lie below −2.021 and above 2.021, and the test statistic 1.857 falls outside them.]

The test statistic does not fall in the critical region. Therefore, we fail to reject the null hypothesis at the 95% confidence level.


At the 90% confidence level (10% significance level):

t*(40, 0.05) = ±1.684.   (Critical value)

[Figure: the t(40) density with 0.05 in each tail; the critical regions lie below −1.684 and above 1.684, and the test statistic 1.857 falls in the upper critical region.]

The test statistic falls in the critical region. Therefore, we reject the null hypothesis at the 90% confidence level. Notice that changing the confidence level does not change the test statistic; it only changes the size of the critical region. The same is true if we change from a two-tailed test to a one-tailed test.

P-value:   At what significance level would we reject the null hypothesis? (Equivalently, at what confidence level would we reject the null hypothesis?)

For the test statistic 1.857, the p-value = 0.07.

If we use a significance level greater than 0.07, we will reject the null hypothesis. (If we use a confidence level less than 93% (1 − 0.07), we will reject the null hypothesis.)


90% confidence interval:

The formula to standardize a random variable is:

t = (β̂ − β)/se(β̂) ~ t(n−k).

Look up the t-distribution with 40 d.f.:

P(−1.684 ≤ t ≤ 1.684) = 0.90.

Plug in the definition of t:

P(−1.684 ≤ (β̂ − β)/se(β̂) ≤ 1.684) = 0.90.
P(β̂ − 1.684·se(β̂) ≤ β ≤ β̂ + 1.684·se(β̂)) = 0.90.
P(0.156 − 1.684(0.084) ≤ β ≤ 0.156 + 1.684(0.084)) = 0.90.
P(0.0145 ≤ β ≤ 0.297) = 0.90.

There is a 90% probability that the random confidence interval {0.0145, 0.297} contains the true parameter β.

The confidence interval formula is: β̂ ± t*(α, n−k)·se(β̂).
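The numbers in this example can be reproduced in Python with scipy.stats. This snippet is not part of the original notes; it computes the t statistic, its two-sided p-value, the 5% and 10% critical values, and the 90% confidence interval.

# A sketch (not from the notes) reproducing the example's test statistic, p-value, and CI.
from scipy import stats

beta_hat, se, df = 0.156, 0.084, 40
t_stat = beta_hat / se                                   # H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df)                # two-sided p-value, about 0.07
crit_95 = stats.t.ppf(0.975, df)                         # about 2.021
crit_90 = stats.t.ppf(0.95, df)                          # about 1.684
ci_90 = (beta_hat - crit_90 * se, beta_hat + crit_90 * se)
print(round(t_stat, 3), round(p_value, 3), round(crit_95, 3), ci_90)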


2008 American Economic Association Summer Program Department of Economics University of California Santa Barbara

Brian Duncan ECON 194SE: Econometrics (Session 2)

Summer 2010 MWF 9:30 – 11:00 (NH 2212)

Lecture #3:
I.    Goodness of fit.
II.   R-squared.
III.  Adjusted R-squared.
      A. Comparing R-squared and adjusted R-squared.
IV.   F-Test for overall significance.
V.    F-Test for joint significance.

I.    Goodness of fit. A measure of fit between the estimated regression line and the data. Large residuals imply a poor fit; small residuals imply a good fit.

A.    Residual sum of squares (RSS): RSS = Σ ε̂_i².

      The problem with using the RSS as a measure of goodness of fit is that it depends on the units in which Y is measured. We want to develop a measure of goodness of fit that is unit free.

B.    Variation of Y: Variation(Y) = Σ(Y_i − Ȳ)².

      The variation of Y, also called the total sum of squares, is TSS = Σ(Y_i − Ȳ)². Our measure of goodness of fit divides the variation of Y (the TSS) into two parts:

      Part 1: Variation that is explained by the estimated regression (the explained sum of squares, ESS).
      Part 2: Variation that is unexplained, i.e., the error term (the residual sum of squares, RSS).


In order to calculate a measure of goodness of fit, consider running a regression without the X variable. In a regression, the predicted value of the dependent variable is defined as Ŷ_i = α̂ + β̂X_i. Omitting the X variable is equivalent to imposing the restriction that β̂ = 0. Therefore,

α̂ = Ȳ − β̂X̄ = Ȳ,   Ŷ_i = α̂,   and so Ŷ_i = Ȳ.

Without any additional information (i.e., the value of X), the predicted value of Y_i is equal to its sample mean. In this case, the variation of Y is:

Σ(Y_i − Ȳ)² = Σ(Y_i − Ŷ_i)² = Σ ε̂_i²,

which means that TSS = RSS. When we run a regression without an X variable, the variation of Y is equal to the variation of the error term. In other words, we haven't explained any of the variation of Y; all of the variation of Y is unexplained. This is the starting point. The goal of a regression is not simply to do better than knowing nothing about Y, but to do better than knowing the mean of Y.

If we include the X variable, we can do better. When we include an X variable in our regression, the predicted value of Y is defined as Ŷ_i = α̂ + β̂X_i. The additional information (i.e., X) will hopefully explain some of the variation of Y.


[Figure: a data point Y_i plotted with the fitted line Ŷ_i = α̂ + β̂X_i and the horizontal line at Ȳ, showing the distance (Y_i − Ȳ) (TSS) split into (Y_i − Ŷ_i) (RSS) and (Ŷ_i − Ȳ) (ESS).]

Consider a decomposition of the variation of Y:

Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ).   (add and subtract Ŷ_i)
(Y_i − Ȳ)² = (Y_i − Ŷ_i)² + (Ŷ_i − Ȳ)² + 2(Y_i − Ŷ_i)(Ŷ_i − Ȳ).
Σ(Y_i − Ȳ)² = Σ(Y_i − Ŷ_i)² + Σ(Ŷ_i − Ȳ)² + 2Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ).

Focus on the last term on the right:

Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = Σ ε̂_i(Ŷ_i − Ȳ).
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = Σ ε̂_iŶ_i − Ȳ Σ ε̂_i.
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = Σ ε̂_iŶ_i, because Σ ε̂_i = 0.
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = Σ ε̂_i(α̂ + β̂X_i).
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = α̂ Σ ε̂_i + β̂ Σ ε̂_iX_i.
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = β̂ Σ ε̂_iX_i, because Σ ε̂_i = 0.
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = β̂ Σ(Y_i − α̂ − β̂X_i)X_i.
Σ(Y_i − Ŷ_i)(Ŷ_i − Ȳ) = 0, because Σ(Y_i − α̂ − β̂X_i)X_i = 0 is the second normal equation.

Therefore:

Σ(Y_i − Ȳ)² = Σ(Y_i − Ŷ_i)² + Σ(Ŷ_i − Ȳ)².
Total variation = Residual variation + Explained variation.
Total sum of squares (TSS) = Residual sum of squares (RSS) + Explained sum of squares (ESS).

II.   R²: A measure of goodness of fit.

Start with TSS = RSS + ESS. To normalize the above equation, divide both sides by TSS:

1 = RSS/TSS + ESS/TSS.

Define R² to be:

R² = ESS/TSS = 1 − RSS/TSS.

R² can also be written as:

R² = 1 − Σ ε̂_i² / Σ(Y_i − Ȳ)².

R² is the portion of the total variation of Y that is explained by the regression line. In the example where β̂ = 0, RSS = TSS and so R² = 0 (worst fit). If the regression line perfectly fits the data, then RSS = 0 and ESS = TSS, so R² = 1 (best fit). Therefore, 0 ≤ R² ≤ 1.
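The decomposition TSS = ESS + RSS and the two expressions for R² are easy to confirm numerically. The Python sketch below is not in the original notes and uses simulated data.

# A numerical sketch (not from the notes): TSS, ESS, RSS, and R^2 from a fitted regression.
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 5, n)
y = 3.0 + 1.2 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
print(np.isclose(TSS, ESS + RSS))        # the decomposition holds
print(ESS / TSS, 1 - RSS / TSS)          # both expressions equal R^2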


III.  Adjusted R-squared.

The R² measures how much of the variation of Y is explained by the estimated regression:

R² = ESS/TSS = 1 − RSS/TSS = 1 − Σ ε̂_i² / Σ(Y_i − Ȳ)².

R² does not take into account the degrees of freedom. Alternatively, we could measure how much of the variance of Y is explained by the estimated regression.

Instead of TSS (variation), consider: S_y² = Σ(Y_i − Ȳ)² / (n − 1).
Instead of RSS, consider: σ̂² = Σ ε̂_i² / (n − m), where m is the number of parameters in the model, including the constant.

Adjusted R-squared:

R̄² = 1 − σ̂²/S_y² = 1 − [Σ ε̂_i²/(n − m)] / [Σ(Y_i − Ȳ)²/(n − 1)].

A.    Comparing R-squared and adjusted R-squared.

R̄² = 1 − [Σ ε̂_i² / Σ(Y_i − Ȳ)²]·[(n − 1)/(n − m)] = 1 − (RSS/TSS)·[(n − 1)/(n − m)].

Rearranging terms: R̄² = 1 − (1 − R²)·(n − 1)/(n − m). Therefore:

1.  (n − 1)/(n − m) = (1 − R̄²)/(1 − R²) ≥ 1, with equality if m = 1.
2.  1 − R̄² ≥ 1 − R², with equality if m = 1.
3.  R̄² ≤ R², with equality if m = 1.

If m = 1, then R̄² = R² = 0. If m > 1, then R̄² < R² (provided that the regression includes a constant). R̄² can be negative.


IV.   F-Test for overall significance.

Consider the two-variable regression model: Y_i = α + βX_i + ε_i. If β = 0, then knowing the value of X does not help predict Y. In this case, the regression will do no better at predicting Y than the sample average of Y (i.e., Ȳ), and the regression is said to be insignificant. Testing the hypothesis that β = 0 is equivalent to testing whether the regression above is significant.

Now consider the multiple variable regression model:

Y_i = α + β₁X₁ + ⋯ + β_{m−1}X_{m−1} + ε_i.

Notice the slight change in notation: there are still m coefficients, one of which is the constant. Even if we fail to reject the hypothesis that each individual coefficient is zero, that is, that none of the variables is statistically significant, it is still possible that the regression will do better at predicting Y than the sample average of Y. This is because some of the X variables may be correlated. For example, if X₃ and X₄ are correlated, then both may explain the same variation in Y. Thus, while the regression explains some of the variation in Y, it cannot determine whether it is X₃ or X₄ that explains it. Therefore, we cannot reject the hypothesis that β₃ = 0 or the hypothesis that β₄ = 0. However, we may be able to reject the hypothesis that both β₃ and β₄ are equal to zero at the same time.

To test the hypothesis that the regression does better at predicting Y than the sample average of Y (i.e., that the regression is significant), we need to test the hypothesis that all of the coefficients (except the constant term) are zero at the same time. We write that joint hypothesis as:

H₀: β₁ = β₂ = ⋯ = β_{m−1} = 0.   (Not including the constant, α.)

This hypothesis is conceptually equivalent to testing the hypothesis that R² = 0, or that the explained sum of squares is zero (i.e., ESS = 0). Recall that:

TSS = ESS + RSS, with degrees of freedom (n − 1) = (m − 1) + (n − m).

Because TSS has (n − 1) degrees of freedom and RSS has (n − m) degrees of freedom, ESS must have (m − 1) degrees of freedom.


Constructing a test statistic is a necessary step in conducting a hypothesis test. For a test statistic to be useful in hypothesis testing, we must know its distribution. The F-test constructs a test statistic that follows the F-distribution.

F-Test statistic:

F = [ESS/(m − 1)] / [RSS/(n − m)] = [Σ(Ŷ_i − Ȳ)²/(m − 1)] / [Σ ε̂_i²/(n − m)] ~ F(m−1, n−m).

If the regression does not explain Y better than Ȳ, then ESS will not be significantly different from zero; thus the F-statistic will not be significantly different from zero, and we will fail to reject the null hypothesis that the regression is insignificant.

The F-statistic above can also be written in terms of R². Note that:

R² = ESS/TSS = 1 − RSS/TSS.

Dividing both the numerator and denominator of the F-test statistic by TSS yields:

F = [R²/(m − 1)] / [(1 − R²)/(n − m)] ~ F(m−1, n−m).

A final note: in the two-variable regression model, the F-test is equivalent to the t-test for significance of the coefficient on the X variable.


Example of F-Test for overall significance in Stata

. reg wage schoolyr exp female

      Source |       SS       df       MS              Number of obs =   39806
-------------+------------------------------           F(  3, 39802) = 4396.48
       Model |  541990.818     3  180663.606           Prob > F      =  0.0000
    Residual |  1635574.14 39802  41.0927627           R-squared     =  0.2489
-------------+------------------------------           Adj R-squared =  0.2488
       Total |  2177564.96 39805  54.7058148           Root MSE      =  6.4104

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    schoolyr |   1.832411   .0191913     95.481   0.000     1.794796    1.870027
         exp |   .1809176   .0031213     57.963   0.000     .1747998    .1870354
      female |  -1.646128   .0748767    -21.985   0.000    -1.792888   -1.499368
       _cons |  -17.46978   .3383515    -51.632   0.000    -18.13296   -16.80661
------------------------------------------------------------------------------

Null hypothesis:   H₀: β₁ = β₂ = β₃ = 0.

In the regression:   m − 1 = 3,   n − m = 39,802.

The F-statistic is:

F = [ESS/(m − 1)] / [RSS/(n − m)] ~ F(m−1, n−m).
F = (541,990.818/3) / (1,635,574.14/39,802) = 180,663.606/41.09276 = 4,396.48 ~ F(3, 39802).

Critical value:   F*(3, ∞) = 2.60 (95% confidence level).

Result:   F > F*. Therefore we reject the null hypothesis at the 95% confidence level. The regression is significant.

Alternative calculation of the F-statistic:

F = [R²/(m − 1)] / [(1 − R²)/(n − m)] ~ F(m−1, n−m).
F = (0.2489/3) / [(1 − 0.2489)/39,802] = 0.0829666/0.0000188709 = 4,396.54 ~ F(3, 39802).
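The same arithmetic can be done in a few lines of Python with scipy. This snippet is not part of the original notes; it reproduces the F statistic from the sums of squares in the Stata output, from R², and the 5% critical value.

# A sketch (not from the notes) reproducing the overall F statistic two ways.
from scipy import stats

ESS, RSS = 541990.818, 1635574.14
m_minus_1, n_minus_m = 3, 39802

F_ss = (ESS / m_minus_1) / (RSS / n_minus_m)           # about 4396.48
R2 = ESS / (ESS + RSS)
F_r2 = (R2 / m_minus_1) / ((1 - R2) / n_minus_m)       # same statistic computed from R^2
crit = stats.f.ppf(0.95, m_minus_1, n_minus_m)         # about 2.60
print(round(F_ss, 2), round(F_r2, 2), round(crit, 2))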


V.    F-Test for Joint Significance.

Testing the joint significance of a group of coefficients. Consider the model:

Y_i = α + β₁X₁ + ⋯ + β_kX_k + β_{k+1}X_{k+1} + ⋯ + β_{m−1}X_{m−1} + ε_i.

Notice the slight change in notation: there are still m coefficients, one of which is the constant. Suppose we want to test the hypothesis that a group of variables is jointly insignificant:

H₀: β₁ = β₂ = ⋯ = β_k = 0.   (Some subset of the coefficients.)

The joint F-test estimates two regressions, one that forces the null hypothesis to be true (the restricted model) and another that does not (the unrestricted model).

Unrestricted model (u):   Y_i = α + β₁X₁ + ⋯ + β_{m−1}X_{m−1} + ε_i.   (nested hypothesis)
Restricted model (r):     Y_i = α + β_{k+1}X_{k+1} + ⋯ + β_{m−1}X_{m−1} + ε_i.   (Omitting X₁, X₂, …, X_k.)

If the null hypothesis is true, then the restricted model will explain the variation in Y just as well as the unrestricted model, and so both models will have a similar residual sum of squares (RSS). Based on that logic, the F test statistic is:

F = [(RSS_r − RSS_u)/k] / [RSS_u/(n − m)] ~ F(k, n−m).

The degrees of freedom of the numerator come from the difference in residual degrees of freedom between the two models: [n − (m − k)] − (n − m) = k.

Note that R² = 1 − RSS/TSS, and that TSS_u = TSS_r by definition. Dividing both the numerator and denominator by TSS yields:

F = [(R²_u − R²_r)/k] / [(1 − R²_u)/(n − m)] ~ F(k, n−m).

Remember:
k:       the number of restrictions; the difference in degrees of freedom between the restricted and unrestricted models.
n − m:   the degrees of freedom in the unrestricted model.

Example of F-Test for Joint Significance in Stata

* Restricted Model A
. reg wage schoolyr exp female

      Source |       SS       df       MS              Number of obs =   39806
-------------+------------------------------           F(  3, 39802) = 4396.48
       Model |  541990.818     3  180663.606           Prob > F      =  0.0000
    Residual |  1635574.14 39802  41.0927627           R-squared     =  0.2489
-------------+------------------------------           Adj R-squared =  0.2488
       Total |  2177564.96 39805  54.7058148           Root MSE      =  6.4104

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    schoolyr |   1.832411   .0191913     95.481   0.000     1.794796    1.870027
         exp |   .1809176   .0031213     57.963   0.000     .1747998    .1870354
      female |  -1.646128   .0748767    -21.985   0.000    -1.792888   -1.499368
       _cons |  -17.46978   .3383515    -51.632   0.000    -18.13296   -16.80661
------------------------------------------------------------------------------

* Restricted Model B
. reg wage schoolyr exp

      Source |       SS       df       MS              Number of obs =   39806
-------------+------------------------------           F(  2, 39803) = 6277.00
       Model |  522129.906     2  261064.953           Prob > F      =  0.0000
    Residual |  1655435.05 39803  41.5907106           R-squared     =  0.2398
-------------+------------------------------           Adj R-squared =  0.2397
       Total |  2177564.96 39805  54.7058148           Root MSE      =  6.4491

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    schoolyr |   1.888841   .0191338     98.718   0.000     1.851338    1.926343
         exp |   .1828974   .0031388     58.269   0.000     .1767452    .1890495
       _cons |  -19.67009   .3251626    -60.493   0.000    -20.30742   -19.03276
------------------------------------------------------------------------------

* Unrestricted Model (AKA fully interacted model)
. reg wage schoolyr exp female femschyr femexp

      Source |       SS       df       MS              Number of obs =   39806
-------------+------------------------------           F(  5, 39800) = 2705.92
       Model |  552443.794     5  110488.759           Prob > F      =  0.0000
    Residual |  1625121.16 39800  40.8321901           R-squared     =  0.2537
-------------+------------------------------           Adj R-squared =  0.2536
       Total |  2177564.96 39805  54.7058148           Root MSE      =    6.39

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.        t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    schoolyr |   1.407679   .0422311     33.333   0.000     1.324905    1.490453
         exp |   .2503263   .0064145     39.025   0.000     .2377537     .262899
      female |  -8.575581   .8047316    -10.656   0.000    -10.15287   -6.998288
    femschyr |   .5168297   .0473953     10.905   0.000     .4239337    .6097257
      femexp |  -.0881529   .0073397    -12.010   0.000    -.1025389   -.0737669
       _cons |  -11.68424   .7169586    -16.297   0.000     -13.0895   -10.27899
------------------------------------------------------------------------------


* Test hypothesis (A)
. test femschyr femexp
 ( 1)  femschyr = 0.0
 ( 2)  femexp = 0.0
       F(  2, 39800) =  128.00
            Prob > F =  0.0000

* Test hypothesis (B)
. test female femschyr femexp
 ( 1)  female = 0.0
 ( 2)  femschyr = 0.0
 ( 3)  femexp = 0.0
       F(  3, 39800) =  247.47
            Prob > F =  0.0000

Unrestricted model (u):
wage_i = α + β₁schoolyr_i + β₂exp_i + β₃female_i + β₄(schoolyr_i)(female_i) + β₅(exp_i)(female_i) + ε_i.

Restricted model A (r):   wage_i = α + β₁schoolyr_i + β₂exp_i + β₃female_i + ε_i.
Restricted model B (r):   wage_i = α + β₁schoolyr_i + β₂exp_i + ε_i.

F-test:   F = [(RSS_r − RSS_u)/k] / [RSS_u/(n − m)] ~ F(k, n−m).

Hypothesis (A):   H₀: β₄ = β₅ = 0.   (Coefficients from the unrestricted model.)

F = [(1,635,574.14 − 1,625,121.16)/2] / [1,625,121.16/39,800] = 5,226.49/40.83219 = 128.00 ~ F(2, 39800).

Hypothesis (B):   H₀: β₃ = β₄ = β₅ = 0.   (Coefficients from the unrestricted model.)

F = [(1,655,435.05 − 1,625,121.16)/3] / [1,625,121.16/39,800] = 10,104.63/40.83219 = 247.47 ~ F(3, 39800).

Results:   At the 5% significance level, F*(2, ∞) = 3.00 and F*(3, ∞) = 2.60. Reject (A) and (B).
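As a cross-check (not part of the original notes), the following Python lines reproduce both joint F statistics directly from the restricted and unrestricted residual sums of squares reported above.

# A sketch (not from the notes) reproducing the two joint F statistics.
from scipy import stats

RSS_u, df_u = 1625121.16, 39800           # unrestricted model
tests = {"A": (1635574.14, 2),            # restricted model A, k = 2 restrictions
         "B": (1655435.05, 3)}            # restricted model B, k = 3 restrictions

for name, (RSS_r, k) in tests.items():
    F = ((RSS_r - RSS_u) / k) / (RSS_u / df_u)
    p = stats.f.sf(F, k, df_u)
    print(name, round(F, 2), p)           # F = 128.00 and 247.47, both with p < 0.0001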


2008 American Economic Association Summer Program Department of Economics University of California Santa Barbara

Brian Duncan ECON 194SE: Econometrics (Session 2)

Summer 2010 MWF 9:30 – 11:00 (NH 2212)

Lecture #4:
I.    Dummy variables.
II.   Mutually exclusive and exhaustive.
III.  Using dummy variables to define multiple categories.
IV.   Interaction terms.

I.    Dummy variables: Variables that take on the values 1 or 0.

For example:

female_i = 1 if female, 0 otherwise.

This variable is an indicator variable because it separates the sample into two groups. Defining a dummy variable to take the two particular values 0 and 1 is somewhat arbitrary; we could easily define an indicator variable that takes on the values 1 and 2. However, defining an indicator as 0 for those not in the group and 1 for those in the group has advantages when interpreting the coefficients. For example, consider running a regression with only a constant and the dummy variable "female", as defined above:

y_i = α̂ + β̂f_i + ε̂_i,   where f_i = 1 if female, 0 otherwise.

Note that one way to write the average y of women and of men is:

ȳ_f = Σ f_iy_i / Σ f_i,   and   ȳ_m = Σ(1 − f_i)y_i / Σ(1 − f_i).

The average y for observations in the female category is:

ȳ_f = Σ f_i(α̂ + β̂f_i + ε̂_i) / Σ f_i = [α̂ Σ f_i + β̂ Σ f_i² + Σ f_iε̂_i] / Σ f_i.
ȳ_f = α̂ + β̂ Σ f_i²/Σ f_i + Σ f_iε̂_i/Σ f_i.

Note that f_i² = f_i, and Σ f_iε̂_i = 0 because it is the second first-order condition (i.e., Σ x_iε̂_i = 0). Therefore,

ȳ_f = α̂ + β̂.

The average y for observations not in the female category (men) is:

ȳ_m = Σ(1 − f_i)(α̂ + β̂f_i + ε̂_i) / Σ(1 − f_i).
ȳ_m = [α̂ Σ(1 − f_i) + β̂ Σ(1 − f_i)f_i + Σ(1 − f_i)ε̂_i] / Σ(1 − f_i).
ȳ_m = α̂ + β̂ Σ(1 − f_i)f_i/Σ(1 − f_i) + [Σ ε̂_i − Σ f_iε̂_i]/Σ(1 − f_i).

Each of the last two terms is zero: (1 − f_i)f_i = 0 for every observation, Σ ε̂_i = 0, and Σ f_iε̂_i = 0. Therefore,

ȳ_m = α̂.

Therefore, for the regression y_i = α̂ + β̂f_i + ε̂_i, where f_i = 1 if female and 0 otherwise, the OLS estimated coefficients are:

α̂ = ȳ_m,
β̂ = ȳ_f − α̂ = ȳ_f − ȳ_m.

In other words:

α̂ ⇒ the average of y for the observations not in the category.
β̂ ⇒ the difference in the average of y between the observations in the category and those not in the category.
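This result is easy to verify numerically. The Python sketch below is not from the notes; it simulates a 0/1 dummy and shows that the fitted intercept equals the mean of the zero group and the fitted slope equals the difference in group means.

# A quick numerical sketch (not from the notes) of the dummy-variable regression result.
import numpy as np

rng = np.random.default_rng(5)
f = (rng.uniform(size=200) < 0.5).astype(float)     # female dummy
y = 10 + 2 * f + rng.normal(0, 1, 200)

X = np.column_stack([np.ones_like(f), f])
a_hat, b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat, y[f == 0].mean())                      # intercept equals the mean for men
print(b_hat, y[f == 1].mean() - y[f == 0].mean())   # slope equals female mean minus male mean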


Adding other control variables to the regression above only slightly changes the interpretation of β̂. For example, consider the regression:

wage_i = α + β₁female_i + β₂schoolyr_i + ε_i.

[Figure: graphing the regression lines yields two parallel lines in (schoolyr, wage) space, wage = α + β₂schoolyr for men and wage = (α + β₁) + β₂schoolyr for women, with common slope β₂ and intercepts α and α + β₁.]

The dummy variable allows there to be different intercepts for men and women; however, the slope for men and women is the same. The coefficient β₁ represents the expected difference in the wages of men and women, controlling for schooling. That is, β₁ represents the expected wage difference between a man and a woman with the same amount of education.

The regression line passes through the mean of the data. Therefore, the observed difference in the dependent variable reflects both β₁ and differences in the other independent variables.

[Figure: the male line α + β₂schoolyr and female line α + β₁ + β₂schoolyr with the group means marked at X̄_m and X̄_f. The gap Ȳ_m − Ȳ_f is split into a difference due to the different intercepts and a difference due to the different average X's. E(Y | f = 0, X = X̄_f) is where the average female would be if women had the same returns as men; E(Y | f = 1, X = X̄_m) is where a female would be if women had the same average X as men.]

A.    Dummy variable trap.

Consider estimating the regression:

wage_i = α + β₁female_i + β₂male_i + β₃schoolyr_i + ε_i.

Note that male_i = 1 − female_i. Therefore,

wage_i = α + β₁female_i + β₂(1 − female_i) + β₃schoolyr_i + ε_i.
wage_i = (α + β₂) + (β₁ − β₂)female_i + β₃schoolyr_i + ε_i.

The variables male and female form an exact linear relationship. The coefficients α, β₁, and β₂ are not identified.
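A tiny numerical illustration of the trap (not part of the notes): with a constant and both the female and male dummies in the design matrix, the columns are exactly collinear, so X'X cannot be inverted.

# A sketch (not from the notes) showing the dummy variable trap as a rank deficiency.
import numpy as np

female = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
male = 1.0 - female
X = np.column_stack([np.ones(5), female, male])     # constant = female + male

print(np.linalg.matrix_rank(X))                     # rank 2, not 3
print(np.linalg.det(X.T @ X))                       # (numerically) zero: X'X is singular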

II.   Mutually exclusive and exhaustive.

Mutually exclusive:   Two or more events that cannot occur at the same time.
Exhaustive:           A set of events that contains all possible outcomes.

Example:   Educational attainment among a group of randomly selected people.

Education            Completed Grade                 Highest Grade Completed
Elementary           A                               E
High School          B                               F
College              C                               G
Graduate School      D                               H
                     (Events are not                 (Events are
                     mutually exclusive)             mutually exclusive)

III.  Using dummy variables to define multiple categories.

Consider estimating the regression:

wage_i = α + βhs_i + ε_i,   where hs_i = 1 if graduated high school, 0 otherwise.

In this regression:

α is the average wage of individuals without a high school degree.

β is the difference in the average wage between high school graduates and those without a high school degree. If β̂ is positive, then high school graduates earn more, on average, than those without a high school degree.

Now consider adding other control variables to the regression:

wage_i = α + β₁hs_i + β₂exp_i + β₃IQ_i + ε_i,

where hs_i = 1 if graduated high school and 0 otherwise, exp_i = years of work experience, and IQ_i = IQ test score. In this regression:

α is the average wage of individuals without a high school degree, with zero years of experience, and with an IQ score of zero. It may be difficult to interpret the constant literally because it often represents a hypothetical observation that may not be realistic, or even possible. For example, it may not be possible to score a zero on an IQ test. The constant term is still a legitimate estimate; it ensures that the regression line passes through the mean of the data.

β₁ is the estimated difference in the wage of high school graduates and those without a high school degree, controlling for experience and IQ. If β̂₁ is positive, then a high school graduate earns more than someone with the same experience and IQ, but who did not graduate from high school.


β₂ is the marginal return to experience. An additional year of experience will increase a person's wage by β̂₂. Notice that, as a result of the functional form, the marginal return to experience is not a function of a person's wage, education, or IQ.

β₃ is the marginal return to IQ score. A one-unit increase in a person's IQ score will increase his or her wage by β̂₃. Notice that, as was the case with experience, the marginal return to IQ is not a function of a person's wage, education, or experience.

The regression above places individuals into two categories: those with a high school degree and those without a high school degree. However, if graduating from high school increases a person's wage, then perhaps graduating from college also increases a person's wage. Rather than placing all individuals into two categories, you may want to place individuals into several mutually exclusive and exhaustive categories. For example:

Dummy Variable     Definition (highest education level)
elem_i             Less than a high school education.
hs_i               High school diploma.
college_i          College degree.
ma_i               Masters degree.
phd_i              Ph.D.

When creating a set of dummy variables, it is important to make sure that they are mutually exclusive and exhaustive. Each observation must fall into one, and only one, category:

elem    = 1 if less than high school, 0 otherwise
hs      = 1 if high school diploma, 0 otherwise
college = 1 if college degree, 0 otherwise
ma      = 1 if masters degree, 0 otherwise
phd     = 1 if Ph.D. degree, 0 otherwise

If we have created the five education dummy variables correctly, then for every observation only one of the dummy variables will equal "1" with all of the other dummy variables equal to "0". One way to check this in Stata is to create a new variable equal to the sum of the dummy variables:

generate check = elem+hs+college+ma+phd
sum check
drop check

The variable "check" should equal 1 for all observations (i.e., the MAX and MIN of the variable check should both be equal to one).

The most common error when creating a set of dummy variables occurs when dealing with missing values. For example, which of the five education categories should a person fall into if he or she did not report a level of education? The answer is that they shouldn't fall into any of the categories above. This is not a problem, because we can always create a sixth category called missing:

missing = 1 if education is missing, 0 otherwise


Whenever you create two or more categories, you must omit one of the categorical variables from the regression to avoid the dummy variable trap. The variables in the example above represent a set of five mutually exclusive and exhaustive dummy variables (assume that there is no one with missing education). We must omit one variable from our regression:

wage_i = α + β₁elem_i + β₂college_i + β₃ma_i + β₄phd_i + β₅exp_i + β₆IQ_i + ε_i.

In this regression, high school (hs) is the omitted group. The interpretations of the coefficients are:

α is the average wage of individuals with a high school degree only, with zero years of experience, and with an IQ score of zero.

β₂ is the estimated difference in the wages of college graduates and high school graduates, controlling for experience and IQ. If β̂₂ is positive, then a college graduate earns more than a person with the same experience and IQ, but who only graduated from high school.

β₃ is the estimated difference in the wages of M.A. graduates and high school graduates, controlling for experience and IQ. If β̂₃ is positive, then a person with an M.A. degree earns more than a person with the same experience and IQ, but who only graduated from high school.

β₄ is the estimated difference in the wages of Ph.D. graduates and high school graduates, controlling for experience and IQ. If β̂₄ is positive, then a person with a Ph.D. earns more than a person with the same experience and IQ, but who only graduated from high school.

You can include more than one set of mutually exclusive and exhaustive dummy variables. For example, the following regression controls for both education categories and gender categories:

wage_i = α + β₁elem_i + β₂college_i + β₃ma_i + β₄phd_i + β₅female_i + β₆exp_i + β₇IQ_i + ε_i.

In this regression:

α is the average wage of males with a high school degree only, with zero years of experience, and with an IQ score of zero. α represents the intercept for high school educated men.


β₂ is defined as it was in the last regression, except now it also controls for gender. That is, β₂ is the estimated difference in the wages of college graduates and high school graduates, controlling for gender, experience, and IQ. Notice that the returns to education are not a function of gender. In other words, because of the functional form, a college education has the same effect on a man's wage as it does on a woman's wage.

β₅ is the estimated difference in the wages of men and women, controlling for education, experience, and IQ.

Notice that we can calculate relative wages between any two groups. For example:

a.  Wage of college men relative to high school men: β₂.
b.  Wage of college women relative to high school women: β₂.
c.  Wage of college women relative to high school men: (α + β₂ + β₅) − (α) = β₂ + β₅. (The constant term is a part of every group; it drops out when calculating relative wages.)
d.  Wage of masters women relative to masters men: β₅. (The estimated male/female wage differential is the same for all education groups.)
e.  Wage of college men relative to masters women: (β₂) − (β₃ + β₅) = β₂ − β₃ − β₅.

IV.   Interaction terms.

The functional forms of the regressions above restrict the returns to education, experience, and IQ to be the same for men and women. Perhaps the returns to any or all of the control variables are different for men and women. Interaction terms allow a regression to estimate different effects for different groups. For example, the following functional form allows the regression to estimate different effects of experience and IQ for men and women:

wage_i = α + β₁elem_i + β₂college_i + β₃ma_i + β₄phd_i + β₅female_i + β₆exp_i + β₇IQ_i + β₈(exp_i)(female_i) + β₉(IQ_i)(female_i) + ε_i.

Notice that β₈(exp_i)(female_i) = 0 for men. Therefore, whatever β₈ is, it has no influence on the wages of men. The same is true for β₉: both β₈ and β₉ only influence the wages of women. Conversely, β₆ and β₇ affect the wages of both men and women. In the regression above:

β₆ is the marginal effect of experience for men.
β₇ is the marginal effect of IQ for men.
β₈ is the difference in the marginal effect of experience between women and men.
β₉ is the difference in the marginal effect of IQ between women and men.

Therefore,

β₆ + β₈ is the marginal effect of experience for women.
β₇ + β₉ is the marginal effect of IQ for women.

You can also interact a dummy variable with other dummy variables. For example, the following regression estimates separate returns to education for men and women:

wage_i = α + β₁elem_i + β₂college_i + β₃ma_i + β₄phd_i + β₅f_i + β₆exp_i + β₇IQ_i + β₈(elem_i)(f_i) + β₉(college_i)(f_i) + β₁₀(ma_i)(f_i) + β₁₁(phd_i)(f_i) + ε_i,

where f_i = 1 if female and 0 otherwise.

As with the previous example, all of the terms containing f_i are zero for men, and so the coefficients associated with the f_i terms do not affect a man's wage.

β₂ is how much more a college educated man makes relative to a high school educated man, controlling for experience and IQ.


β₉ is the difference between the returns to a college education for a woman relative to a man. If β₉ is zero, then the returns to a college education are the same for men and women. The other education interaction terms are defined similarly to β₉.

Notice, as before, that we can calculate relative wages between any two groups. For example:

a.  Wage of college men relative to high school men: β₂.
b.  Wage of college women relative to high school women: β₂ + β₉.
c.  Wage of college women relative to high school men: (β₂ + β₅ + β₉) − (0) = β₂ + β₅ + β₉.
d.  Wage of masters women relative to masters men: (β₃ + β₁₀ + β₅) − (β₃) = β₁₀ + β₅. (The estimated male/female wage differential is different for each education group.)
e.  Wage of college men relative to masters women: (β₂) − (β₃ + β₅ + β₁₀) = β₂ − β₃ − β₅ − β₁₀.


2008 American Economic Association Summer Program Department of Economics University of California Santa Barbara

Brian Duncan ECON 194SE: Econometrics (Session 2)

Summer 2010 MWF 9:30 – 11:00 (NH 2212)

Lecture #5:
I.    Economic relationships and internal/external validity.
II.   The problem of determining causal relationships.
III.  The causal pathway.
IV.   Endogenous explanatory variables.

I.    Economic relationships and internal/external validity.

Deterministic relationship:  There is an exact mathematical relationship between variables.
Statistical relationship:    There is an association between variables that also contains variation.

What is wrong with the following regression?

miscarriages_i = α + β₁nutrition_i + β₂educ_i + β₃pregnancies_i + β₄abortions_i + β₅livebirths_i + ε_i.

The problem is that there is a deterministic relationship among the variables:

pregnancies_i = miscarriages_i + abortions_i + livebirths_i.

Regressions are used to estimate statistical relationships. In economics we use regression analysis to estimate economic relationships.

Internal validity:   The analysis correctly answers the research question within the sample population.
External validity:   The results are generalizable to other groups or to the general population.


II.   The problem of determining causal relationships.

Consider estimating the following regressions:

(1)   wage_i = α₁ + β₁age_i + ε_i.
(2)   age_i = α₂ + β₂wage_i + ε_i.

Model (1) implies that a person's wage is a function of his or her age. Model (2) implies that a person's age is a function of his or her wage. Let's assume that both β̂₁ and β̂₂ are positive. If we interpret these relationships to be causal, then (2) is clearly wrong: it would imply that a person could expect to become younger by accepting a lower paying job. Moreover, as you can easily verify, OLS cannot distinguish between these two models. That is, both models will produce the same R², and the t-stat for β₁ will be equal to the t-stat for β₂ (you should verify this on your own with a simple regression). The point is that OLS does not test for causal relationships. Therefore, you should avoid making statements like:

"The regression proves that age causes an increase in wage."
"Increasing age by one year will cause wage to increase by five cents."

Instead, you should use language like:

"The relationship between age and wage is statistically significant at the 95% confidence level."
"Getting one year older is associated with a five cent increase in wage."

III.  The causal pathway.

Suppose we take as given that smoking during pregnancy causes premature births, and that premature births result in low birth weight. A researcher is interested in estimating how smoking impacts birth weight. Which regression is appropriate?

(1)   weight_i = α₁ + β₁smoker_i + β₂gestationalage_i + β₃motherage_i + ε_i.
(2)   weight_i = α₂ + β₄smoker_i + β₅motherage_i + ε_i.

The answer is that both are appropriate (or inappropriate), and that each answers a different research question. Suppose that the true causal relationship can be summarized in the following diagram (with arrows representing a causal effect):

Smoking (x2) → Premature birth (x1) → Low birth weight (y)

This diagram indicates that smoking causes a baby to be born earlier and that babies born earlier have low birth weight. That is, smoking does not directly cause low birth weight; it indirectly causes low birth weight by causing babies to be born earlier. If this were the case, then we would expect β₁ from (1) to be zero. However, it would be incorrect to say that regression (1) indicates that smoking does not cause low birth weights. It would be correct to say that the estimated marginal effect of smoking on birth weight is zero. That is, holding constant when a baby is born, smoking is not associated with lower birth weights. In essence, we would expect this result if smoking did not cause babies to gain weight more slowly, just to be born earlier.

Now suppose that the true causal relationships can be summarized in the following diagram:

Smoking (x2) → Premature birth (x1) → Low birth weight (y),   and also   Smoking (x2) → Low birth weight (y)

This diagram indicates that smoking causes premature births and that premature birth causes low birth weights, but that smoking also directly causes low birth weight. That is, smoking causes babies to be born earlier, and it slows the growth of babies in the womb. If this were the case, we would expect β₁ from (1) to be negative.

Therefore, both regressions (1) and (2) are correct. Regression (2) can answer the question "is smoking associated with low birth weights?" Regression (1) can answer the question "does smoking slow the growth of babies in the womb?" If anything is incorrect, it would be the researcher incorrectly interpreting the regression. For example, claiming that because β₁ = 0 smoking will not cause lower birth weights would be an incorrect interpretation.


IV.   Endogenous explanatory variables.

An endogenous variable is a variable that is correlated with the error term. This occurs because of an omitted variable, measurement error, or simultaneity. Consider another example of the possible relationships between ability, education, and wage:

[Diagram: Experience (x) [Exogenous] and Education (y2) [Endogenous] each affect Wage (y1); Ability (unobserved) affects both Education (y2) and Wage (y1).]

If we want to estimate the marginal effect of education on wage, holding constant experience and ability, the "correct" model would be:

y1_i = α + β y2_i + γx_i + δ ability_i + ε_i.

However, if ability is unobserved, then we would be forced to estimate:

y1_i = α + β y2_i + γx_i + ε_i.

As we have seen, the estimate of β can suffer omitted variable bias if education is correlated with ability. Ways to control for endogenous explanatory variables include:

A.  Acknowledge / ignore / assign direction to the problem.
B.  Proxy variables.
C.  Fixed effects / first differencing.
D.  Instrumental variables and two stage least squares.
E.  Natural experiments.

A.    Acknowledge / ignore / assign direction to the problem.

Every regression has omitted variables. As we have already seen, omitted variables that are "irrelevant" will not cause bias. Also, omitted variables will not cause bias if they are uncorrelated with the variables included in the model. Acknowledging and ignoring the problem is acceptable if we can assign a direction to the bias (or argue that it is zero). For example, if we can argue that omitting a particular unobserved factor will bias a positive coefficient towards zero, then we can argue that any positive estimate is an underestimate (a.k.a. a "lower bound" estimate).

B.    Proxy variables.

One way to deal with omitted variable bias is to identify a proxy variable for the unobserved factor. A good proxy variable will be highly correlated with the unobserved factor. For example, in the example above, the omitted factor is ability. Often, omitted variables are things that are abstract and therefore unobservable: ability is a concept; human capital is a concept. We cannot hope to know "ability", but we can hope to have a measure of, or proxy for, ability. IQ scores are a possible measure of ability. Consider the hypothetical regression:

ability_i = π₀ + π₁IQ_i + ν_i.

Note that this is a hypothetical regression because "ability" is not observed. If π₁ = 0, then IQ is not a proxy for ability; in this case, we would expect that π₁ > 0. We could then estimate the regression:

wage_i = α + β educ_i + γ exp_i + φ IQ_i + ε_i.

In this case, we would argue that β̂ is not biased due to unobserved ability because IQ is a proxy for ability. In order for IQ to be a valid proxy for ability, the following condition must hold:

E(ability | educ, exp, IQ) = E(ability | IQ) = π₀ + π₁IQ.

In words, the average level of ability changes with IQ, but not with education or experience once IQ is controlled for. The counter-argument would be that IQ is not a perfect measure of ability, and so there is still unmeasured ability in the regression that could bias β̂. If they are available to us, we can use several variables to proxy for one or more unobserved factors.

C.    Fixed effects / first differencing.

If the unobserved factor is fixed over time (i.e., a time-consistent omitted variable), and if we have panel data, then we can estimate fixed effects or first-differenced models. More on this in an upcoming lecture.


D.    Instrumental variables and two stage least squares.

Instrumental variables is a way to isolate exogenous variation within the endogenous variable y2_i. We then use this exogenous variation to identify the relationship between y2_i and the dependent variable y1_i. In order to use the instrumental variables technique, you will need to identify an "instrument": a variable that satisfies two conditions that we will discuss later. As you will see, it is very difficult, and sometimes impossible, to find a variable that qualifies as an instrument.

E.    Natural experiments.

A situation where an otherwise endogenous variable is changed by some external random process, usually occurring by accident or by a policy or institutional change. Natural experiments can often lead to a good instrument or can be used to set up a control group and a treatment group.


2008 American Economic Association Summer Program Department of Economics University of California Santa Barbara

Brian Duncan ECON 194SE: Econometrics (Session 2)

Summer 2010 MWF 9:30 – 11:00 (NH 2212)

Lecture #6:  (15.1, 15.2, 15.3, 15.5)
I.    Instrumental variables and two stage least squares.

I.    Instrumental variables and two stage least squares.

Once again, consider the hypothetical causal relationships between ability, education, and wage:

[Diagram: Experience (x) [Exogenous] and Education (y2) [Endogenous] each affect Wage (y1); Ability (unobserved) affects both Education (y2) and Wage (y1).]

This diagram indicates that "education" is a determinant of wage, but that it is also a dependent variable (i.e., determined by other observed and unobserved factors which also influence wage). A model that reflects the diagram above is:

(1)   y1_i = α + β y2_i + γx_i + ε_i.
(2)   y2_i = π₀ + π₁z_i + π₂x_i + ν_i.

OLS estimates of β will be biased and inconsistent if cov(y2, ε) ≠ 0, which will be the case if there are unobserved factors (like ability) which influence both education and wage. The technique of instrumental variables can be used to obtain a consistent estimate of β provided that an appropriate instrument (z) can be found. An instrument is an observed variable (which we call z) that is correlated with y2 (the endogenous variable) but uncorrelated with ε (the error term in (1)). These two conditions are formally written as:

(i)   cov(z, y2) ≠ 0   (the instrument is "relevant").
(ii)  cov(z, ε) = 0    (the instrument is "valid").


The first condition is fairly straightforward: we need to find a variable that is correlated with education. Because both our instrument and education are observed, this will be easy to test (more on this below). The second condition is a bit more difficult: we need to find a variable that is uncorrelated with the error term in (1). The error term is unobserved, and so this condition will be difficult, if not impossible, to test. It is also not immediately obvious to most people what it means for a variable to be uncorrelated with the error term. As an example, a variable would be correlated with education but uncorrelated with the error in the wage equation if it influences education, but the only way it influences wage is through its effect on education. Consider the following diagram:

[Diagram: Instrument (z) → Education (y2) → Wage (y1); Experience (x) → Wage (y1); Ability (unobserved) → Education (y2) and Wage (y1).]

The point is that the instrument z influences y1 only through its influence on y2. If we drew another arrow from z directly to y1, or from z to ability, then z would be an invalid instrument. If we erased the arrow from z to y2, then z would be an irrelevant instrument. A common way of stating conditions (i) and (ii) is that "an instrument must be correlated with education (the endogenous variable) but uncorrelated with ability (the unobserved factor)." Note that a proxy variable for ability, like IQ, would make a very bad instrument: IQ would be correlated with education, meaning it would pass condition (i), but it would fail condition (ii) because IQ is correlated with ability.

A possible example of an instrument for education is quarter of birth. A person's quarter of birth determines when he or she starts kindergarten, which in turn determines what grade a person is in when he or she can legally drop out of school. Students who are allowed to drop out of school before they start their senior year are more likely to drop out (i.e., if you are forced to start your senior year, then you are more likely to finish). Therefore, a person's quarter of birth can influence dropping out, but is (hopefully) unrelated to a person's ability. Other possible examples are parental education, number of siblings, birth order, the number of two-year and four-year colleges within commuting distance of one's childhood home, etc. All of these instruments have some potential problems with condition (ii), but this is not uncommon. It is difficult to find a good instrument.


IV and 2SLS

Once we have identified an instrument (not an easy task) we can use it to obtain consistent estimates of β using the technique of instrumental variables (IV) or of two-stage least squares (2SLS). IV and 2SLS are two different estimation techniques that arrive at the same β̂. IV uses a different formula for calculating β̂ than does OLS. For example, start with the regression:

y = α + β·x + ε.

Suppose that x is an endogenous variable but that we have a variable z such that:

(i)  cov(z, x) ≠ 0.
(ii) cov(z, ε) = 0.

The regression equation implies that:

cov(z, y) = β·cov(z, x) + cov(z, ε).

Imposing the two instrument conditions above and solving for β:

β = cov(z, y) / cov(z, x).

Entering the sample estimates into this equation we get (note that the 1/n terms in the numerator and denominator cancel out):

β̂iv = Σ(zi − z̄)(yi − ȳ) / Σ(zi − z̄)(xi − x̄).
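As an illustration only (hypothetical variables y, x, and z, not from the data used later), the sample-covariance formula above can be computed directly in STATA and compared with the built-in IV command:

. quietly correlate z y, covariance
. scalar num = r(cov_12)
. quietly correlate z x, covariance
. scalar den = r(cov_12)
. display "IV estimate = " num/den
. ivreg y (x = z)

Both approaches should produce the same point estimate of β̂iv.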

Note that if zi = xi, then β̂iv = β̂ols. Thus, if x is exogenous then it would satisfy (i) and (ii) and could serve as an instrument for itself (but this would just lead us back to the OLS estimate). The IV estimator is consistent provided that the instrument conditions (i) and (ii) are met. The matrix version of the IV estimator is:

β̂iv = (Z'X)⁻¹Z'Y.

This matrix formula allows there to be more than one endogenous variable, along with one or more exogenous variables. However, it assumes that you have exactly one instrument for each of the endogenous variables in X. This is reflected in the fact that the matrix Z has the same dimensions as X. If it did not, then the formula would not be conformable. The matrix Z is created by starting with the matrix X and then replacing each endogenous variable (i.e., its column in the X matrix) with its instrument. The exogenous variables in X are left alone. Technically this means that an exogenous variable serves as its own instrument. This might sound a little strange, but if a variable is exogenous then it would satisfy condition (ii). A variable is obviously correlated with itself, and so would satisfy condition (i). Therefore, it follows that an exogenous variable is its own instrument.

What is two stage least squares (2SLS)? It is possible to use two or more instruments for one endogenous variable. In fact, if you can find two variables that meet the criteria for an instrumental variable, then using two is better than using one. However, the above formula would no longer work (because it would no longer be conformable). Instead, we can create a new variable which is a linear combination of our instruments. We can then use this one variable as our "super instrument." This technique is called the general form of the IV estimator, which is also known as the 2SLS estimator:

β̂2sls = [X'Z(Z'Z)⁻¹Z'X]⁻¹ X'Z(Z'Z)⁻¹Z'Y.

This general form of the IV estimator is often referred to as the 2SLS estimator because it can be calculated using the formula above, or using a two step estimation process. The two stage least squares estimation procedure is:

First stage:

Estimate regression model (2) from above (the endogenous variable is the dependent variable):

y2i = π0 + π1·z1i + π2·z2i + π3·xi + νi.

Note that we now have two instruments, called z1 and z2. Next we calculate the predicted values ŷ2i:

ŷ2i = π̂0 + π̂1·z1i + π̂2·z2i + π̂3·xi.

We can use STATA's "predict" command to create ŷ2i. Notice that ŷ2i is a linear combination of the instruments and the exogenous variables. This is our "super" instrument because it combines two or more instruments into one variable.


Second stage: Estimate regression model (1) from above, replacing y2i with ŷ2i:

y1i = α + β2sls·ŷ2i + γ·xi + εi.

The estimate β̂2sls is the two stage least squares estimator. As stated above, it is identical to the general form IV estimator. However, when using the two step procedure, the standard errors reported for β̂2sls will be incorrect. For this reason, STATA's "ivreg" command calculates IV/2SLS estimates using the general IV formula given above.

Why does IV/2SLS work? Think about the predicted values (ŷ2i) that are calculated in the first stage regression. Notice that the predicted values are a function of z and x, both of which are exogenous variables (i.e., not correlated with εi). Therefore, ŷ2i is exogenous. In essence, we have used z to isolate exogenous variation in y2. In the second stage, we determine the relationship between the exogenous variation in y2 and the dependent variable (y1).

Important note: In 2SLS, the first stage regression ALWAYS includes ALL of the variables that are in the second stage regression, plus at least one instrument. The instrument (z) is excluded from the second stage regression. This is called the identifying restriction. You must have at least as many instruments as you have endogenous variables, although you may have more.

A. Testing the Instruments.

There are three tests that are often used in the context of instrumental variables estimation.

1. Test for relevance of the instrument. An instrument, z, is said to be "relevant" if cov(z, y2) ≠ 0. That is, the instrument must be correlated with the endogenous variable that is causing the trouble. Although correlation is all that is required, we often argue that the instrument causes changes in the endogenous variable (i.e., quarter of birth causes differences in education). Testing that the instrument is correlated with the endogenous variable is straightforward because both z and y2 are actual variables in our data set. In fact, the first stage of the 2SLS procedure is to estimate the regression:


y2i = π0 + π1·z1i + π2·z2i + π3·xi + νi.

The relevance of z can be tested using a t-test if there is one instrument, or an F-test if there are two or more instruments. If the instruments are statistically significant in the first stage regression then they pass the relevance test. However, Staiger and Stock (1997) show that statistical significance at the 95% confidence level is not sufficient. The rule of thumb they developed is that the F-stat must be greater than 10 for us to conclude that the instruments are relevant.

2. Test for validity of the instrument. An instrument, z, is said to be "valid" if cov(z, ε) = 0. This condition is difficult to test because ε is not observed. If we have one instrument for each endogenous variable, then we cannot test this condition. However, if we have more than one instrument (i.e., more instruments than we need), then we can test whether some of the instruments meet the validity condition using the overidentifying restrictions test. I will not go into the details of the overidentifying restrictions test (although it is discussed in your textbook, if you are interested). The overidentifying restrictions test is a weak test. For example, it assumes that at least one instrument is valid (i.e., if all of your instruments are invalid, then you will pass the test!). The best test for the validity of an instrument is the economic theory test. Because there are no good tests for validity, we are often required to assume that the instruments are valid. In this situation, it is best to have a strong theoretical argument for why we should believe that the instrumental variables are uncorrelated with the error term.

3. Test for endogeneity of the explanatory variable. We use IV/2SLS when we have an endogenous explanatory variable and are able to identify a good instrument. If what we think is an endogenous explanatory variable turns out to be an exogenous variable, then 2SLS remains consistent. However, OLS would also be consistent and, in fact, more efficient than 2SLS. For example, if education really were exogenous (i.e., there is no problem with the education variable), then 2SLS is consistent, but we would rather use OLS because it is more efficient (i.e., a more accurate estimator). We can test the endogeneity of the explanatory variable using the Hausman (1978) model specification test. The null hypothesis is that the explanatory variable is exogenous. If the null hypothesis is true, then OLS and 2SLS are estimating the same thing. Therefore, the OLS and 2SLS coefficient estimates should be the same except for some sampling error.


Therefore, if the 2SLS coefficient estimates are the same as the OLS coefficient estimates, then we conclude that the explanatory variable is exogenous and that we should use OLS. To test this hypothesis, first estimate the regression:

y2 = π0 + π1·x + π2·z + ν.

Next, calculate the residuals, ν̂. Finally, estimate the regression:

y1 = α + β1·x + β2·y2 + δ·ν̂ + ε.

Test the hypothesis H0: δ = 0. If you reject H0, then we conclude that y2 is endogenous. If you fail to reject, then you should consider using OLS. A STATA sketch of these three tests is given below.
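A minimal sketch of how the three tests can be run in STATA 10 (hypothetical variable names y1, y2, x, z1, z2; these postestimation commands are available after ivregress):

. ivregress 2sls y1 x (y2 = z1 z2)
. estat firststage      // relevance: first-stage F-statistic for the excluded instruments
. estat overid          // validity: overidentifying restrictions test (requires more instruments than endogenous variables)
. estat endogenous      // endogeneity: Durbin and Wu-Hausman tests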


IV / 2SLS in STATA

This is the 1970 census data used in Angrist and Krueger (1991), "Does Compulsory School Attendance Affect Schooling and Earnings?" Quarterly Journal of Economics. These regressions are simplified versions of the ones that appear in the paper.

Variable Definitions:
  educ     = years of education
  age      = age
  age2     = age squared
  earnings = weekly earnings
  QOB      = quarter of birth

. tab QOB, sum(educ)

            |        Summary of educ
        QOB |       Mean   Std. Dev.      Freq.
------------+-----------------------------------
          1 |  11.399598    3.390094      62628
          2 |  11.443421   3.3860724      60888
          3 |  11.556048    3.341424      64088
          4 |  11.575434   3.3206336      59595
------------+-----------------------------------
      Total |  11.493343   3.3606635     247199

This tabulation shows that having a later quarter of birth is associated with higher levels of education.

* OLS Regression
. reg earnings educ age age2

      Source |         SS       df          MS
-------------+----------------------------------
       Model |   748343256        3   249447752
    Residual |  3.7329e+09   247195  15100.9863
-------------+----------------------------------
       Total |  4.4812e+09   247198  18128.1061

Number of obs =   247199
F(3, 247195)  = 16518.64
Prob > F      =   0.0000
R-squared     =   0.1670
Adj R-squared =   0.1670
Root MSE      =   122.89

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   16.37824   .0736116   222.50   0.000     16.23396    16.52251
         age |   19.28083   2.997912     6.43   0.000     13.40501    25.15666
        age2 |  -.2058034   .0332103    -6.20   0.000    -.2708946   -.1407122
       _cons |  -430.9793   67.44107    -6.39   0.000     -563.162   -298.7966
------------------------------------------------------------------------------

This OLS regression shows that one additional year of schooling is associated with $16.38 higher weekly earnings. We might be concerned that this estimate is biased upwards due to unobserved ability.


* First Stage Regression
. regress educ age age2 QOB1 QOB2 QOB3

      Source |         SS       df          MS
-------------+----------------------------------
       Model |  5911.05802        5   1182.2116
    Residual |  2785957.74   247193  11.2703747
-------------+----------------------------------
       Total |   2791868.8   247198   11.294059

Number of obs =  247199
F(5, 247193)  =  104.90
Prob > F      =  0.0000
R-squared     =  0.0021
Adj R-squared =  0.0021
Root MSE      =  3.3571

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .1168128   .0819057     1.43   0.154    -.0437202    .2773459
        age2 |  -.0018177   .0009073    -2.00   0.045     -.003596   -.0000393
        QOB1 |  -.1368288   .0193097    -7.09   0.000    -.1746753   -.0989824
        QOB2 |  -.1069991   .0193863    -5.52   0.000    -.1449958   -.0690024
        QOB3 |   -.007479   .0191147    -0.39   0.696    -.0449433    .0299853
       _cons |   10.00029   1.842381     5.43   0.000     6.389269    13.61131
------------------------------------------------------------------------------

. test QOB1 QOB2 QOB3

 ( 1)  QOB1 = 0
 ( 2)  QOB2 = 0
 ( 3)  QOB3 = 0

       F(  3,247193) =   26.15
            Prob > F =  0.0000

. predict educ_hat, xb

This is the first stage of the 2SLS regression. The QOB variables are jointly statistically significant: the F-stat on the excluded instruments is 26.15, which is above the Staiger-Stock (1997) rule of thumb that the F-stat should be greater than 10. The "predict" command creates the predicted values of education that will be used in the second stage.


* Second Stage Regression
. regress earnings educ_hat age age2

      Source |         SS       df          MS
-------------+----------------------------------
       Model |  868763.173        3  289587.724
    Residual |  4.4804e+09   247195  18124.8116
-------------+----------------------------------
       Total |  4.4812e+09   247198  18128.1061

Number of obs =  247199
F(3, 247195)  =   15.98
Prob > F      =  0.0000
R-squared     =  0.0002
Adj R-squared =  0.0002
Root MSE      =  134.63

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_hat |   9.888947   4.527654     2.18   0.029     1.014865    18.76303
         age |   20.04195   3.327018     6.02   0.000     13.52108    26.56282
        age2 |  -.2177852   .0373314    -5.83   0.000    -.2909538   -.1446166
       _cons |  -366.2517   86.59059    -4.23   0.000     -535.967   -196.5364
------------------------------------------------------------------------------

This is the second stage of the 2SLS regression. This 2SLS regression shows that one additional year of schooling is associated with $9.89 higher weekly earnings. The coefficient estimates in this regression are correct, but the standard errors are incorrect. To obtain correct standard errors, we should use STATA's "ivreg" command (or "ivregress" if we are using STATA 10). In addition to providing the correct standard errors, it is easier to run. Notice that we do not have to tell STATA to include age and age2 in the 1st stage; it knows that all of the x variables should be included in the 1st stage.

. ivreg earnings age age2 (educ = QOB1 QOB2 QOB3)

Instrumental variables (2SLS) regression

      Source |         SS       df          MS
-------------+----------------------------------
       Model |   630986436        3   210328812
    Residual |  3.8502e+09   247195  15575.7403
-------------+----------------------------------
       Total |  4.4812e+09   247198  18128.1061

Number of obs =  247199
F(3, 247195)  =   18.59
Prob > F      =  0.0000
R-squared     =  0.1408
Adj R-squared =  0.1408
Root MSE      =   124.8

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   9.888934    4.19721     2.36   0.018     1.662513    18.11536
         age |   20.04196   3.084202     6.50   0.000       13.997    26.08691
        age2 |  -.2177853   .0346069    -6.29   0.000    -.2856138   -.1499568
       _cons |  -366.2517   80.27094    -4.56   0.000    -523.5806   -208.9227
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   age age2 QOB1 QOB2 QOB3
------------------------------------------------------------------------------

STATA's "ivreg" command gives the same coefficient estimates as the manual 2SLS regressions above, but the standard errors are correct. Notice that the 2SLS coefficient on education is smaller than the OLS coefficient. This is consistent with our concern that the OLS estimate is biased upwards. However, also notice that the OLS estimate of 16.38 falls in the 95% confidence interval for the 2SLS estimate. Perhaps the OLS and 2SLS estimates are the same, except for sampling error. We should do the Hausman (1978) test to find out.

* Hausman (1978) test for endogeneity of the explanatory variable.
. reg educ age age2 QOB1 QOB2 QOB3

      Source |         SS       df          MS
-------------+----------------------------------
       Model |  5911.05802        5   1182.2116
    Residual |  2785957.74   247193  11.2703747
-------------+----------------------------------
       Total |   2791868.8   247198   11.294059

Number of obs =  247199
F(5, 247193)  =  104.90
Prob > F      =  0.0000
R-squared     =  0.0021
Adj R-squared =  0.0021
Root MSE      =  3.3571

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .1168128   .0819057     1.43   0.154    -.0437202    .2773459
        age2 |  -.0018177   .0009073    -2.00   0.045     -.003596   -.0000393
        QOB1 |  -.1368288   .0193097    -7.09   0.000    -.1746753   -.0989824
        QOB2 |  -.1069991   .0193863    -5.52   0.000    -.1449958   -.0690024
        QOB3 |   -.007479   .0191147    -0.39   0.696    -.0449433    .0299853
       _cons |   10.00029   1.842381     5.43   0.000     6.389269    13.61131
------------------------------------------------------------------------------

. predict e_hat, resid
. reg earnings age age2 educ e_hat

      Source |         SS       df          MS
-------------+----------------------------------
       Model |   748380501        4   187095125
    Residual |  3.7329e+09   247194  15100.8967
-------------+----------------------------------
       Total |  4.4812e+09   247198  18128.1061

Number of obs =   247199
F(4, 247194)  = 12389.67
Prob > F      =   0.0000
R-squared     =   0.1670
Adj R-squared =   0.1670
Root MSE      =   122.89

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   20.04196   3.036825     6.60   0.000     14.08987    25.99406
        age2 |  -.2177854   .0340753    -6.39   0.000     -.284572   -.1509988
        educ |   9.888888   4.132737     2.39   0.017     1.788833    17.98894
       e_hat |   6.491408   4.133392     1.57   0.116    -1.609931    14.59275
       _cons |  -366.2512    79.0379    -4.63   0.000    -521.1634     -211.339
------------------------------------------------------------------------------

In this case, we fail to reject the null hypothesis that the coefficient on e_hat is zero (it is not statistically significant). The OLS estimates are not significantly different from the 2SLS estimates. Therefore, there is no evidence that education is endogenous, and so we could consider using the OLS estimates. In fact, Angrist and Krueger (1991) concluded that there was little bias in the OLS coefficient on education.

* This is the way to do IV/2SLS in Stata 10
. ivregress 2sls earnings age age2 (educ = QOB1 QOB2 QOB3)
. estat endogenous

Tests of endogeneity
Ho: variables are exogenous

Durbin (score) chi2(1)    =  2.4664    (p = 0.1163)
Wu-Hausman F(1,247194)    =  2.46637   (p = 0.1163)


Lecture #7:
I. Panel data.
   A. Wide versus long data.
   B. Within versus between variation.
   C. The pooled regression.
   D. The fixed effects regression.
   E. The differenced regression.
   F. Random effects.

I. Panel data.

Panel data (also called longitudinal data) contains elements of both cross-sectional data and time-series data. In cross-sectional data, each unit (a person, a state, a country, etc…) is represented in the data by one observation. In time-series data, there is information on one unit at different points in time. Panel data contains information on several units at several points in time. The notation for panel data is:

Yit = α + β·Xit + εit, where i = 1, …, m and t = 1, …, T.

The "i" subscript indicates the unit (i.e., a person) and the "t" subscript represents the time period (i.e., month, year, etc…). The total number of observations in the data set (assuming that there are no missing observations) is n = m × T.

The advantages of panel data:
1. More information, more variation, more efficient estimates.
2. Can study the dynamics of change.
3. Control for time-invariant unobserved characteristics.


A. Wide versus long data.

There are two ways panel data can be entered into a statistics program: wide format or long format. Wide format looks like this:

Wide Format
id  female  wage80  wage90  wage00  age80  age90  age00  edu80  edu90  edu00
 1    1      12.56   13.22   14.34    21     31     41     12     16     16
 2    0      10.54   11.96   19.89    19     29     39     12     12     16
 3    0       7.89    7.99   10.50    26     36     46     16     16     16
 4    1      12.67   25.25   55.76    18     28     38     12     16     18

The table above lists four individuals, along with their gender and their wage, age, and years of education in the years 1980, 1990, and 2000. Although it is common for data to come in wide format, most regression analysis requires that the data be in long format. Writing the exact same information in long format looks like this:

Long Format
id  year  wage   age  education  female
 1  1980  12.56   21      12        1
 1  1990  13.22   31      16        1
 1  2000  14.34   41      16        1
 2  1980  10.54   19      12        0
 2  1990  11.96   29      12        0
 2  2000  19.89   39      16        0
 3  1980   7.89   26      16        0
 3  1990   7.99   36      16        0
 3  2000  10.50   46      16        0
 4  1980  12.67   18      12        1
 4  1990  25.25   28      16        1
 4  2000  55.76   38      18        1

Notice that the variables "wage" and "education" vary both between individuals and within individuals across time. We write this type of variable as "wage_it". Conversely, the variable "female" varies between individuals, but not within individuals. We write this type of variable as "female_i" (note that we do not write the t subscript). Finally, the variable "year" varies within individuals, but not across individuals. We write this type of variable as "year_t". You can use STATA's "reshape" command to convert a data set from wide to long or from long to wide.
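A minimal sketch of the reshape command, assuming the wide variables are named exactly as in the table above (wage80, wage90, wage00, age80, …, edu00):

. reshape long wage age edu, i(id) j(year)      // wide -> long
. reshape wide wage age edu, i(id) j(year)      // long -> wide

With two-digit suffixes, j(year) will contain 80, 90, and 0 rather than the four-digit years shown above, so in practice you may want to rename the variables (or recode year) first.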


B. Within versus between variation.

A regression estimates coefficients based on the variation in the data. In panel data, there are two types of variation: "between" variation and "within" variation. Between variation, also called cross-sectional variation, represents the variation from one individual to the next individual (or unit). Within variation, also called time-series variation, represents the variation within an individual (or unit) from one time period to the next.

[Diagram: the Yit and Xit data arrays, with units i = 1, …, m arranged across the page and time periods t = 1, …, T down the page; variation across units is labeled "between variation" and variation over time within a unit is labeled "within variation."]

We can use one or both types of variation to estimate our regression coefficients. The pooled regression model uses both types of variation, without any distinction.

C. The pooled regression.

The simplest approach to estimating a panel data model is to ignore the fact that the data is a panel. For example, the pooled regression estimates the model:

(1)  Yit = α + β·Xit + εit,

where the total number of observations is n = m × T. The pooled regression makes no reference to the fact that the data is organized by unit and time. The pooled regression is somewhat naïve because it makes the following assumptions:

1. The intercept is the same for each unit and for each time period.
2. The betas are the same for each unit and are constant over time.
3. The structure of the error is the same for each unit and for each time period.

While we can relax each of these assumptions with more sophisticated models, we cannot relax all of them at the same time. The pooled regression identifies β using both the within and between variation. In certain cases, this may be appropriate. However, in other cases, it can lead to pooled regression bias.


[Figure: Pooled Regression Bias. Y is plotted against X for Person 1 (means X1, Y1) and Person 2 (means X2, Y2); each person's points lie along a line with the same slope but a different intercept, and the single pooled regression line fit through both sets of points is biased.]

Here is a situation where the relationship between X and Y is the same for person 1 and person 2. However, because persons 1 and 2 have different intercepts, the pooled regression line is biased. The fixed effects regression gives each individual their own intercept, while restricting all individuals to have the same slope.

D. The fixed effects regression.

The fixed effects regression allows each unit to have a different intercept. The fixed effects regression model is written as:

(2)  Yit = α + β·Xit + γi + εit

The γi term represents an individual-specific fixed effect. Another way to think about the fixed effects regression is that it allows each person's regression line to pass through the mean of their individual data, while restricting everyone to the same beta (slope). Because each person gets their own regression line, the fixed effects regression does not compare one person to another person. This means that the fixed effects regression only uses the within variation to identify the slope coefficients. There are two ways to estimate a fixed effects regression: the least squares dummy variable model, and the deviations from group means model.


The least squares dummy variable model (LSDV). To estimate the least squares dummy variable model, you must first create m dummy variables, d1i, d2i, …, dmi, where d1i is a dummy variable equal to 1 if the observation comes from person 1 and zero otherwise. After you have created these dummy variables, you include all but one of them in the following regression:

(3)  Yit = α0 + β·Xit + α1·d1i + α2·d2i + … + αm−1·dm−1,i + εit,

This regression gives each individual their own intercept. For example, the intercept for person 5 is "α0 + α5". We had to omit one of the dummy variables to avoid the dummy variable trap. This model allows each person's regression line to pass through the mean of their individual data, but restricts all individuals to the same β. The LSDV model can become difficult to estimate if m is large. If all we are interested in is the estimate of β from (3), then we can use the deviations from group means model. You can use STATA's "xi" command to create the individual dummy variables automatically.
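A minimal sketch (hypothetical variable names Y, X, and a unit identifier personID); both commands give the same β̂ as the LSDV model:

. xi: regress Y X i.personID         // LSDV: xi creates the unit dummies, omitting one
. areg Y X, absorb(personID)         // same betas; the unit intercepts are absorbed rather than reported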

Deviations from group means model. Starting from (2):

Yit = α + β·Xit + γi + εit.

Sum over t:

Σt Yit = Σt (α + β·Xit + γi + εit).
Σt Yit = T·α + β·Σt Xit + T·γi + Σt εit.

Divide by T:

(6)  Ȳi = α + β·X̄i + γi + ε̄i.

Subtract (2) − (6):

Yit − Ȳi = (α + β·Xit + γi + εit) − (α + β·X̄i + γi + ε̄i).
Yit − Ȳi = β·(Xit − X̄i) + (εit − ε̄i).


Notice how the α and γi terms cancel out. The deviations from group means model is written as:

(7)  Ỹit = β·X̃it + ε̃it.   (the deviations from group means regression)

Ỹit and X̃it are deviations from group means, which is why this model is called the deviations from group means regression. What is important to note is that β̂ from this model is mathematically identical to the β̂ from the LSDV model. However, in (7) all of the individual intercept terms have dropped out. In fact, notice how every variable that does not vary across time drops out of this model. Things like gender and race may explain some of the variation in wages between individuals, but they cannot explain the variation within an individual (because they do not vary within an individual). Thus, the true power of the fixed effects regression is that it can control for things that are not observed. Anything that varies between individuals, but not within an individual, is controlled for (i.e., held constant) by the fixed effects regression.

Note: although the betas from the LSDV and deviations from group means models are mathematically identical, their standard errors are not. This is because the deviations from group means model uses an incorrect degrees of freedom (it does not account for the individual intercepts that dropped out of the model). You can use STATA's "xtreg" command to estimate a fixed effects regression with the correct standard errors.

E. The differenced regression.

The differenced regression is an intuitive model that is very similar to (and in some cases identical to) a fixed effects regression. The differenced regression also begins with model (2):

Yit = α + β·Xit + γi + εit.

If (2) is true, then the following is also true:

Yit−1 = α + β·Xit−1 + γi + εit−1.

Therefore,

Yit − Yit−1 = (α + β·Xit + γi + εit) − (α + β·Xit−1 + γi + εit−1).
Yit − Yit−1 = β·(Xit − Xit−1) + (εit − εit−1).


ΔYit = β·ΔXit + Δεit.   (the differenced regression)

This regression is very similar to the fixed effects regression. Anything that varies between individuals, but not within an individual, drops out of this regression.
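A minimal sketch of the differenced regression (hypothetical names; the data must first be declared a panel so that STATA's difference operator D. is defined):

. tsset personID year                 // or: xtset personID year in Stata 10
. regress D.Y D.X, noconstant

The noconstant option matches the differenced equation above, which has no intercept; including a constant instead would allow for a common time trend in the levels of Y.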

F. Random effects.

Consider the regression equation:

Yit = α + β·Xit + γi + εit.

The random effects estimator assumes that γi is randomly distributed. If this is the case, then the above equation can be written as:

Yit = α + β·Xit + νit, where νit = γi + εit.

The random effects assumption implies that γi is uncorrelated with the regressors in the model (i.e., the X variables). This might be appropriate if the X variables were randomly assigned. If the random effects assumption is true, then the random effects estimator is more efficient than the fixed effects estimator. However, if the assumption is not true, then the random effects estimator is biased and inconsistent.

1. The random effects (RE) estimator can be calculated using GLS.
2. The RE estimator is similar to the pooled regression in that it uses both the between and within variation.
3. The RE estimator does not control for individual time-invariant unobserved factors (the FE estimator does).
4. The fixed effects estimator is usually considered a more convincing estimator.
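One common way to gauge whether the random effects assumption is plausible is to compare the FE and RE coefficient estimates; under the RE assumption the two should differ only by sampling error. A minimal sketch (hypothetical names), using Stata's hausman command:

. xtreg Y X, i(personID) fe
. estimates store fe
. xtreg Y X, i(personID) re
. estimates store re
. hausman fe re                       // rejecting the null suggests the RE assumption fails, so prefer FE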


Pooled Regression / Fixed Effects / Random Effects in STATA

The data contains men between the ages of 30 and 49.

Variable Definitions:
  wage     = Hourly wage.
  educ     = Years of education (does not vary over time in this sample).
  black    = Dummy variable for black.
  hisp     = Dummy variable for Hispanic.
  occ_exp  = Years working in current occupation.
  occ_exp2 = occ_exp squared.
  married  = Dummy variable for married.
  union    = Dummy variable (the person is in a union).
  poorhlth = Dummy variable (the person is in poor health).
  year     = year (8 years, from 1980 to 1987).
  personID = Person identifier (545 individuals in the sample).

In this sample m = 545 and T = 8. Therefore m×T = 4360.

. * The pooled OLS regression
. regress wage educ black hisp occ_exp occ_exp2 married union poorhlth year

      Source |         SS     df          MS
-------------+--------------------------------
       Model |  7464.12379      9  829.347088
    Residual |  37234.1302   4350  8.55957017
-------------+--------------------------------
       Total |   44698.254   4359   10.254245

Number of obs =   4360
F(9, 4350)    =  96.89
Prob > F      = 0.0000
R-squared     = 0.1670
Adj R-squared = 0.1653
Root MSE      = 2.9257

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .4880365    .027161    17.97   0.000     .4347871    .5412859
       black |  -.6740385   .1432051    -4.71   0.000    -.9547934   -.3932836
        hisp |   .0256833   .1265909     0.20   0.839    -.2224994    .2738661
     occ_exp |   .0776697   .0242704     3.20   0.001     .0300873    .1252521
    occ_exp2 |  -.0062058   .0025019    -2.48   0.013    -.0111108   -.0013007
     married |    .550235   .0942421     5.84   0.000     .3654725    .7349975
       union |   .9150828    .104183     8.78   0.000      .710831    1.119335
    poorhlth |  -.2888598   .3436057    -0.84   0.401     -.962502    .3847824
        year |   .2995305   .0243322    12.31   0.000      .251827    .3472341
       _cons |  -594.4283    48.1833   -12.34   0.000    -688.8921   -499.9644
------------------------------------------------------------------------------

This is the pooled OLS regression. This regression uses both the within and between variation. Time-invariant variables (educ, black, hisp) are in the regression. The union coefficient suggests that union workers earn 92 cents more per hour than do non-union workers. This coefficient could be biased if there are unobserved factors that are correlated with union and wage.


. xtreg wage educ black hisp occ_exp occ_exp2 married union poorhlth year, i(personID) fe

Fixed-effects (within) regression
Group variable: personID

Number of obs      =   4360
Number of groups   =    545
Obs per group: min =      8
               avg =    8.0
               max =      8

R-sq:  within  = 0.1711
       between = 0.0494
       overall = 0.0887

corr(u_i, Xb) = 0.0327

F(6,3809) = 131.09
Prob > F  = 0.0000

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  (dropped)
       black |  (dropped)
        hisp |  (dropped)
     occ_exp |   .0455284   .0181076     2.51   0.012     .0100268      .08103
    occ_exp2 |  -.0056926   .0018696    -3.04   0.002    -.0093582   -.0020271
     married |    .314382   .1038621     3.03   0.002     .1107512    .5180127
       union |   .4495928   .1100836     4.08   0.000     .2337644    .6654213
    poorhlth |   .0368509   .2692501     0.14   0.891    -.4910373    .5647392
        year |   .3392024   .0184001    18.43   0.000     .3031274    .3752774
       _cons |  -667.1464   36.45789   -18.30   0.000    -738.6253   -595.6675
-------------+----------------------------------------------------------------
     sigma_u |  2.4191171
     sigma_e |  2.0037536
         rho |  .59309163   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:   F(544, 3809) = 10.05    Prob > F = 0.0000

This is the fixed effects regression. This regression uses the within variation. Time invariant variables (educ, black, hisp) drop out of this regression. The union coefficient remains statistically significant, but the magnitude is about half of the pooled OLS coefficient. Unobserved individual specific time invariant factors (i.e., things that make individuals different, but that do not change over time within an individual) cannot bias the coefficients in this regression even if they are unobserved. However, time varying unobserved factors could still bias the coefficients in this regression.


. xtreg wage educ black hisp occ_exp occ_exp2 married union poorhlth year, i(personID) re

Random-effects GLS regression
Group variable: personID

Number of obs      =   4360
Number of groups   =    545
Obs per group: min =      8
               avg =    8.0
               max =      8

R-sq:  within  = 0.1709
       between = 0.1583
       overall = 0.1632

Random effects u_i ~ Gaussian
corr(u_i, X) = 0 (assumed)

Wald chi2(9) = 885.93
Prob > chi2  = 0.0000

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .4787061   .0568358     8.42   0.000       .36731    .5901021
       black |  -.6425453   .3070805    -2.09   0.036    -1.244412   -.0406785
        hisp |   .0474549   .2750781     0.17   0.863    -.4916882     .586598
     occ_exp |   .0498912   .0179531     2.78   0.005     .0147039    .0850786
    occ_exp2 |  -.0057024   .0018531    -3.08   0.002    -.0093344   -.0020703
     married |   .3722155   .0965327     3.86   0.000      .183015    .5614161
       union |    .549521   .1032984     5.32   0.000     .3470599    .7519822
    poorhlth |  -.0022857   .2654205    -0.01   0.993    -.5225004    .5179289
        year |    .331984   .0181063    18.34   0.000     .2964963    .3674716
       _cons |  -658.4575   35.84474   -18.37   0.000    -728.7119   -588.2031
-------------+----------------------------------------------------------------
     sigma_u |  2.1338751
     sigma_e |  2.0037536
         rho |  .53141727   (fraction of variance due to u_i)
------------------------------------------------------------------------------

This is the random effects regression. This regression uses both the within and between variation. Time invariant variables (educ, black, hisp) are in this regression. The random effects assumption is that the individual random effects are uncorrelated with the variables included in the regression. This would be appropriate if the variable of interest were randomly assigned. In this case, if individuals were randomly assigned to a union, then this estimator would be appropriate. This is probably not the case in this example, and so the fixed effects regression is more appropriate.



Lecture #8:
I.   Maximum-likelihood estimation.
II.  Binary choice models.
III. The linear probability model.
IV.  Probit and logit models.

I. Maximum-likelihood estimation: Different distributions generate different samples; and so, any one sample is more likely to have come from some distributions than from others.

[Figure: a sample of observations X1, …, X7 plotted on a line beneath candidate probability distributions; some distributions are more likely than others to have generated these points.]

The observations select the distribution that is the most likely to have generated them. The maximum-likelihood estimator of a parameter β is the value β̂ that is most likely to have generated the sample. In general, if Yi is an independently and identically distributed (iid) random variable, then the probability of obtaining a given sample is:

p(Y1)·p(Y2) ⋯ p(Yn) = ∏ᵢ p(Yi), i = 1, …, n.   (the likelihood function)

The maximum-likelihood estimator maximizes the likelihood function.


The likelihood function depends on the sample, the functional form (i.e., the probability distribution) and the parameters.

Linear Example: Consider the model:

Yi = α + β·Xi + εi, where εi ~ N(0, σ²).

Then, Yi ~ N(α + β·Xi, σ²). Notice that by assuming that the error is normally distributed we have defined a functional form for the distribution of Yi. The probability distribution for Yi is written:

f(Yi) = (1/√(2πσ²)) · exp[ −(1/(2σ²))·(Yi − α − β·Xi)² ]

The likelihood function is:

L(Y1, Y2, ..., YN; α, β, σ²) = f(Y1)·f(Y2) ⋯ f(YN)
                             = ∏ᵢ (1/√(2πσ²)) · exp[ −(1/(2σ²))·(Yi − α − β·Xi)² ]

We want to maximize the likelihood function with respect to the parameters α, β, and σ². To maximize the likelihood function, it is necessary to differentiate it with respect to its parameters and set the resulting first-order equations equal to zero. It is easier to work with the log likelihood rather than the likelihood function itself (this is acceptable because L is non-negative and the log function is monotonic).

L = ∏ᵢ (2π)^(−1/2)·(σ²)^(−1/2)·exp[ −(1/(2σ²))·(Yi − α − β·Xi)² ].

ln(L) = Σᵢ [ −(1/2)·log(2π) − (1/2)·log(σ²) − (1/(2σ²))·(Yi − α − β·Xi)² ]

ln(L) = −(n/2)·log(2π) − (n/2)·log(σ²) − (1/(2σ²))·Σᵢ (Yi − α − β·Xi)².


The first-order conditions for choosing the values α, β, and σ² that maximize ln(L) are:

∂(ln L)/∂α  = (1/σ̂²)·Σ (Yi − α̂ − β̂·Xi) = 0.

∂(ln L)/∂β  = (1/σ̂²)·Σ Xi·(Yi − α̂ − β̂·Xi) = 0.

∂(ln L)/∂σ² = −n/(2σ̂²) + (1/(2σ̂⁴))·Σ (Yi − α̂ − β̂·Xi)² = 0.

The first two equations are equivalent to the OLS normal equations, and so,

α̂ = Ȳ − β̂·X̄,   β̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)².
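For completeness, the third first-order condition above can likewise be solved for the ML estimate of σ², a step the notes leave implicit (written here in LaTeX):

\[
\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{\alpha} - \hat{\beta}X_i\right)^{2},
\]

which divides by n rather than by the degrees-of-freedom correction (n − 2 in this model) used by the unbiased OLS estimator of σ².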

In this example, the ML estimates are the same as the OLS estimates. This is a bit misleading, because normally maximum likelihood estimates are different from OLS estimates. However, in a linear model with normally distributed errors, the ML estimates of α and β are equal to the OLS estimates.

II. Binary choice models.

A binary choice model is one in which the dependent variable represents a choice between two distinct outcomes. For example, consider the variable:

votei = 1 if person i voted, 0 otherwise.

The variable "vote" is a dummy, or binary, variable. Estimating a regression using OLS when the dependent variable is a dummy variable is called the linear probability model.

III. The linear probability model.

Consider the model:

votei = α + β·Xi + εi,

where votei = 1 if person i voted, 0 otherwise.


Estimating the above model with OLS is called the linear probability model. In the model above, the expected value that person i will "vote" given X is:

E(votei | Xi) = α + β·Xi.

The expected value of a zero/one variable represents the probability that this variable will be one. In other words, E(votei | Xi) represents the probability that person i will vote. For convenience, let pi represent the probability that person i will vote:

pi = E(votei | Xi) = α + β·Xi.

Of course, this means that the probability that person i will not vote is 1 − pi. After estimating an OLS regression, predicted values are defined as:

p̂i ≡ Ŷi = α̂ + β̂·Xi.

In econometrics, Yi and Ŷi have always had different interpretations: Yi is the observed value and Ŷi is the predicted value. When Yi is a dummy variable, this difference in interpretation is more pronounced: Yi is a zero/one variable but Ŷi is a continuous variable. This is because Ŷi represents a predicted probability, p̂i. For example, in the equation above, Yi represents whether or not person i actually voted, whereas Ŷi represents the estimated probability that person i will vote.

Similarly, in the regression above, the marginal effect of X on Y is:

∂votei/∂Xi = ∂pi/∂Xi = β.

The OLS coefficient β represents the change in the probability of voting due to a one-unit increase in X. Once you understand that the predicted values represent predicted probabilities, and that the coefficients represent marginal effects on those predicted probabilities, the linear probability model becomes an intuitive and powerful tool. However, there are some problems inherent in the linear probability model. The main problems are:

a. Heteroscedastic errors.
b. The condition 0 ≤ p̂i ≤ 1 does not always hold.
c. R² is a poor measure of goodness of fit.


However, it is important to note that, although the linear probability model has its problems, it is still a powerful tool when used and interpreted correctly. For example, α̂ and β̂ are unbiased estimates, and therefore, the predicted probabilities, p̂i, are unbiased.

A. Heteroscedastic errors.

In a binary choice model, the dependent variable, Yi, takes on two values, zero and one. This means that the error term, which is normally written as:

εi = Yi − α − β·Xi,

can be broken into two cases:

  Yi    εi                  Probability
  0     −α − β·Xi           1 − pi
  1     1 − α − β·Xi        pi

With the above table in mind, we can calculate the variance of the error term as var(εi) = E(εi²). Given Xi, the probability that Yi will be one is pi and the probability that Yi will be zero is 1 − pi. Therefore, the expected value of εi² can be written as:

var(εi) = (1 − pi)·(−α − β·Xi)² + (pi)·(1 − α − β·Xi)².

Substituting the definition of pi yields:

var(εi) = (1 − α − β·Xi)·(−α − β·Xi)² + (α + β·Xi)·(1 − α − β·Xi)².

The square of a number is equal to the square of the negative of that number:

var(εi) = (1 − α − β·Xi)·(α + β·Xi)² + (α + β·Xi)·(1 − α − β·Xi)².

Collecting terms gives:

var(εi) = (1 − α − β·Xi)·(α + β·Xi)·[(α + β·Xi) + (1 − α − β·Xi)].

The last term on the right-hand side is equal to one, and so:

var(εi) = (1 − α − β·Xi)·(α + β·Xi).

Substituting the definition of pi yields:

var(εi) = (1 − pi)·pi.
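Because var(εi) depends on Xi, the usual OLS standard errors are wrong. One standard remedy is heteroskedasticity-robust standard errors; a minimal sketch with hypothetical variables vote and X:

. regress vote X, robust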

B. The condition 0 ≤ p̂i ≤ 1 does not always hold.

The reason that estimating a binary choice model with OLS is called the linear probability model is that the predicted probabilities are calculated as a linear function of the independent variables. That is:

p̂i = α̂ + β̂·Xi.

If we graph the predicted probabilities, we get something like:

[Figure: the line p̂i = α̂ + β̂·Xi plotted against X, crossing 0 at one end of the range of X and 1 at the other.]

For some values of X, the predicted probabilities will be less than zero, while for others the predicted probabilities will be greater than one. This is an unintuitive notion, because probabilities normally lie between zero and one. Of course, p̂i is an estimated probability, and estimates often exceed what is actually possible. For example, a person's true probability of voting must be less than or equal to one. Suppose that a person's true (unobserved) probability of voting is 98%. There is always error in a regression, so we might estimate that the probability that this person will vote is 100%. Of course, this person might not actually vote. This type of error in a regression is nothing new, although it is disconcerting in a probability model. In fact, we might estimate that the probability a person will vote is 110%. All we are saying is that the person with the 110% estimate is more likely to vote than the person with the 100% estimate. Possible solutions:


1. Calculate predicted probabilities as (usually a bad solution):

   p̂i = 0              if α̂ + β̂·Xi < 0
   p̂i = α̂ + β̂·Xi       if 0 ≤ α̂ + β̂·Xi ≤ 1
   p̂i = 1              if α̂ + β̂·Xi > 1

2. Use probit or logit estimation.

C. R² is a poor measure of goodness of fit.

Consider the best possible fit for a linear probability model:

[Figure: observations at Y = 0 and Y = 1 plotted against X, with the fitted line p̂i = α̂ + β̂·Xi passing through 0.5; the residuals ε̂i are the vertical distances between the 0/1 observations and the line.]

In the figure above, every time the estimated probability is greater than ½, the observed outcome is 1, and every time the estimated probability is less than ½, the observed outcome is 0. In a goodness-of-fit sense, the regression line can't get any better, because it perfectly predicts every outcome. However, the R² from this regression will be low. The error in a regression is calculated from the difference between the predicted values and the observed values. In the linear probability model, the predicted value, p̂i, is a continuous variable, but the observed value, Yi, is a binary variable. Therefore, there will always be a lot of error in the regression, and R² will be low. However, R² isn't measuring how well the regression line predicts observations, and so it is a poor measure of goodness of fit.


IV. Probit and logit models.

The linear probability model is a powerful tool for estimating binary choice models. However, as we discussed in the last lecture, it has several limitations. Probit and logit models are maximum likelihood estimators that can overcome the limitations of the linear probability model. However, both the probit and logit models require an additional assumption: they must specify the actual distribution of the error term. If this specification is correct, then probit and logit produce estimates that are more efficient than the linear probability model. Consider, again, the binary choice model:

votei = α + β·Xi + εi,

where votei = 1 if person i voted, 0 otherwise.

If we want to restrict estimated probabilities to be between zero and one, then the linear probability model estimates look like:

   p̂i = 0              if α̂ + β̂·Xi < 0
   p̂i = α̂ + β̂·Xi       if 0 ≤ α̂ + β̂·Xi ≤ 1
   p̂i = 1              if α̂ + β̂·Xi > 1

The linear probability model's estimates can be interpreted as a cumulative distribution function (CDF).

CDF (cumulative distribution function): A function that returns the probability that a random variable will be less than or equal to a particular value.


[Figure: The relationship between a PDF and a CDF.]

The linear probability model estimates a linear CDF, the probit model estimates a normal CDF, and the logit model estimates a logistic CDF.

The normal distribution:

PDF:  φ(x) = (1/(σ·√(2π))) · exp[ −(1/2)·((x − μ)/σ)² ].

CDF:  Φ(x) = ∫ from −∞ to (x − μ)/σ of (1/√(2π))·exp[ −(1/2)·t² ] dt.

The logistic distribution:

PDF:  f(x) = eˣ/(1 + eˣ)²  or, equivalently,  f(x) = e⁻ˣ/(1 + e⁻ˣ)².

CDF:  F(x) = eˣ/(1 + eˣ)  or, equivalently,  F(x) = 1/(1 + e⁻ˣ).


A. Specifying the probit and logit likelihood function.

Consider the model:

(1)  Yi* = α + β·Xi + εi,

Yi* is a continuous variable. However, it is not observed. (1) is called the "latent" or "index" equation. The index equation captures the fact that everyone for whom Yi = 0 is not the same. For example, in the voting model, votei = 0 for all of those who did not vote. However, some of those who did not vote might have considered voting (been on the margin) while others would not have voted under any circumstances. Thus, Yi* captures a voting preference, while Yi captures voting activity. Yi* is like utility: it is a theoretical construct that is never directly observed. If an individual's preference for voting crosses a certain threshold, then he or she votes. Specifically, what we observe is defined as:

Yi = 1 if Yi* > 0, and Yi = 0 otherwise.

The outcome that we observe, Yi, depends on whether or not the index variable that we do not observe, Yi*, crosses a certain threshold (in this case zero). The probit model assumes that the index variable, Yi*, follows the normal distribution, whereas the logit model assumes it follows the logistic distribution. The two models are otherwise equivalent. Technically, there is no need for an error term in (1) because we assume a distribution rather than an error. However, economists usually formulate the model with an error term, in which case we assume that the error term is normally or logistically distributed (which is equivalent to directly assuming that Yi* is normally or logistically distributed). The probability that we will observe Yi = 1 (i.e., a person votes) is:

Prob(Yi = 1) = Prob(Yi* > 0).

Substituting in the definition of Yi* yields:

Prob(Yi = 1) = Prob(α + β·Xi + εi > 0).


Rearranging terms:

Prob(Yi = 1) = Prob(εi > −α − β·Xi).

The CDF represents the probability that a random variable will be less than or equal to a particular value. If the distribution is symmetric, then:

Prob(Yi = 1) = CDF(α + β·Xi),

where CDF is the cumulative distribution function of εi. Likewise, the probability that we will observe Yi = 0 (i.e., a person does not vote) is:

Prob(Yi = 0) = 1 − CDF(α + β·Xi).

The maximum likelihood function is:

L = ∏_{Yi=1} CDF(α + β·Xi) · ∏_{Yi=0} [1 − CDF(α + β·Xi)]

For the logit model:

L = ∏_{Yi=1} F(α + β·Xi) · ∏_{Yi=0} [1 − F(α + β·Xi)]

F(α + β·Xi) = exp[α + β·Xi] / (1 + exp[α + β·Xi]).

1 − F(α + β·Xi) = 1 / (1 + exp[α + β·Xi]).

For the probit model:

L = ∏_{Yi=1} Φ(α + β·Xi) · ∏_{Yi=0} [1 − Φ(α + β·Xi)]

Φ(α + β·Xi) = ∫ from −∞ to (α + β·Xi)/σ of (1/√(2π))·exp[ −(1/2)·t² ] dt.

A final note: above, we derived the likelihood function by specifying the latent equation and then assuming the distribution of εi in this equation. Economists generally think of maximum likelihood estimates in this way because that logic is similar to an OLS regression. However, that logic is, perhaps, not the most direct way to get to the likelihood function derived above. Many statisticians think of probit/logit in the following way. Returning to the voting example, suppose that every person has some probability of voting. If we assume that the probability of voting is normally distributed, then the probability that person i will vote is:

Prob(Yi = 1) = Φ(zi).

The function Φ is a CDF, and so it returns a probability between 0 and 1. For now, think of zi as a placeholder. If everyone had the same zi value, then everyone would have the same probability of voting. However, what makes people different from each other is that they can have different values of zi. As zi gets larger, a person's probability of voting approaches 1. Now, assume that a person's zi value is a linear function of his or her characteristics:

zi = α + β·Xi.

The normal CDF looks something like:

[Figure: the normal CDF plotted against zi = α + β·Xi, rising from 0 to 1; a person's position on the horizontal axis determines his or her predicted probability p̂i.]

Maximum likelihood picks the values of α and β, which determines where each person falls on the z-axis (which, in turn, determines each person's probability of voting). Everyone uses the same CDF function. People have different probabilities of voting because they fall on a different position on the z-axis. Likewise, changing a person's X value changes his or her position on the z-axis (provided that β is not zero).


B. Calculating marginal effects.

Probit and logit estimate the model:

(1)  Yi* = α + β·Xi + εi,

where Yi = 1 if Yi* > 0, and Yi = 0 otherwise.

In this model:

∂Y*/∂X = β.

In words, β̂ is the estimated marginal effect of X on Y*, the "index" variable. Unlike in the linear probability model, β̂ from a probit or logit model is not the estimated marginal effect on the predicted probability, pi. When someone says, "the marginal effect from a probit (or logit) model," they don't mean β̂. Instead, they mean the same thing they mean when they say, "the marginal effect from the linear probability model," namely, the marginal effect X has on the probability that Y will be 1. That is, in all models, marginal effect means:

∂Prob(Y = 1 | X)/∂X = ∂pi/∂X.   (marginal effect)

Unlike the coefficients from the linear probability model, the coefficients from a probit or logit model must be transformed into marginal effects. The difference between a change in Yi* and a change in p̂i is shown below:

[Figure: the CDF of p̂i plotted against Ŷi* = α̂ + β̂·Xi; a change in Ŷi* along the horizontal axis (the probit coefficient, β̂ = ∂Ŷi*/∂X) translates through the CDF into a change in p̂i (the marginal effect).]


To calculate marginal effects, recall that:

LPM:     pi = Prob(Yi = 1 | Xi) = α + β·Xi.
Probit:  pi = Prob(Yi = 1 | Xi) = Φ(α + β·Xi).
Logit:   pi = Prob(Yi = 1 | Xi) = F(α + β·Xi).

Thus, marginal effects are calculated as:

LPM:     ∂pi/∂Xi = β.
Probit:  ∂pi/∂Xi = φ(α + β·Xi)·β.
Logit:   ∂pi/∂Xi = f(α + β·Xi)·β = { exp[α + β·Xi] / (1 + exp[α + β·Xi])² }·β.

Note that the probit and logit marginal effects are not constant (they are a function of X i ). As usual, the marginal effects can be calculated at the mean of the data.
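A minimal STATA sketch (hypothetical variables vote and X; mfx is the Stata 10-era command for marginal effects evaluated at the means):

. probit vote X
. mfx compute           // marginal effects, evaluated at the means of the X variables
. logit vote X
. mfx compute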
