Chapter 6

Bivariate Least Squares

6.1 The Bivariate Linear Model

Consider the model

y_i = α + βx_i + u_i,    i = 1, 2, . . . , n,   (6.1)

where y_i is the dependent variable, x_i is the explanatory variable, and u_i is the unobservable disturbance. In matrix form, we have

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} α \\ β \end{pmatrix} +
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix},   (6.2)

or, more compactly, y = Xβ + u. This is the linear regression model. It is linear in the variables (given α and β), linear in the parameters (given x_i), and linear in the disturbances.

Example 6.1 Neither of the following two equations is linear:

y_i = ( α + βx_i ) u_i,   (6.3)

y_i = α x_i^β + u_i.   (6.4)

The first is not linear in the disturbance, and the second is not linear in the parameters.

6.1.1 Assumptions

For the disturbances, we assume that


(i) E( u_i ) = 0, for all i;
(ii) E( u_i² ) = σ², for all i;
(iii) E( u_i u_j ) = 0, for all i ≠ j.

For the independent variable, we suppose that

(iv) x_i is nonstochastic for all i;
(v) x_i is not constant across i.

For purposes of inference in finite samples, we sometimes assume

(vi) u_i ∼ iid N( 0, σ² ), for all i.

6.1.2 Line Fitting

Consider the scatter of points shown in Figure 6.1, which plots the observations

x:   2    3    4    5    6
y:  12    7    8    5    3

[Figure 6.1: Scatter plot of y against x.]

We assume that x and y are linearly related. That is,

y_i = α + βx_i.   (6.5)

Of course, no single line passes through all of the observations. Since no single line is entirely consistent with the data, we choose the α and β that best "fit" the data, in some sense. Define

u_i = y_i − ( α + βx_i ),   for i = 1, 2, . . . , n,   (6.6)


as the discrepancy between the chosen line and the observations. Our objective then is to choose α and β to minimize the discrepancy. A possible criterion for minimum discrepancy is

\min_{α,β} \sum_i u_i,   (6.7)

but this will be zero for any line passing through (x̄, ȳ). Another possibility is

\min_{α,β} \sum_i | u_i |,   (6.8)

which is called the minimum absolute distance (MAD) or L1 estimator. The MAD estimator is awkward to work with, since the mathematics (and the resulting statistical distributions) are intractable. A closely related choice is

\min_{α,β} \sum_i u_i^2.   (6.9)

This yields the least squares or L2 estimator.
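As a concrete illustration of the two workable criteria, the sketch below fits the scatter-plot data by least squares and by minimizing the sum of absolute deviations. It is only a numerical illustration of (6.8) and (6.9); the use of numpy and scipy.optimize is an assumption of convenience, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# Data from Figure 6.1
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])

def sum_abs(params):            # L1 (MAD) criterion, equation (6.8)
    a, b = params
    return np.abs(y - (a + b * x)).sum()

def sum_sq(params):             # L2 (least squares) criterion, equation (6.9)
    a, b = params
    return ((y - (a + b * x)) ** 2).sum()

l1 = minimize(sum_abs, x0=[0.0, 0.0], method="Nelder-Mead")
l2 = minimize(sum_sq, x0=[0.0, 0.0], method="Nelder-Mead")

print("MAD fit: alpha, beta =", l1.x)   # L1 solution (not unique in general)
print("OLS fit: alpha, beta =", l2.x)   # close to (15, -2), as in Section 6.5
```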

6.2 Least Squares Regression

6.2.1 The First-Order Conditions

Let

φ = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n ( y_i − α − βx_i )^2.   (6.10)

Then the minimizing values, α̂ and β̂ say, must satisfy the following first-order conditions:

0 = \frac{∂φ}{∂α} = −2 \sum_{i=1}^n ( y_i − α − βx_i ),   (6.11)

0 = \frac{∂φ}{∂β} = −2 \sum_{i=1}^n ( y_i − α − βx_i ) x_i.   (6.12)

Now, these first-order conditions may be written as

\sum_{i=1}^n y_i = \sum_{i=1}^n ( α + βx_i ),   (6.13)

\sum_{i=1}^n y_i x_i = \sum_{i=1}^n ( αx_i + βx_i^2 ).   (6.14)

Thus, (6.13) implies

α̂ = ȳ − β̂ x̄,   (6.15)

where ȳ = \sum_{i=1}^n y_i / n and x̄ = \sum_{i=1}^n x_i / n. Substituting (6.15) into (6.14) yields

\sum_{i=1}^n y_i x_i = \sum_{i=1}^n [ ( ȳ − β̂ x̄ ) x_i + β̂ x_i^2 ] = ȳ \sum_{i=1}^n x_i + β̂ \left( \sum_{i=1}^n x_i^2 − n x̄^2 \right),   (6.16)

and hence \sum_{i=1}^n y_i x_i − n ȳ x̄ = β̂ \sum_{i=1}^n x_i ( x_i − x̄ ), or equivalently

\sum_{i=1}^n ( y_i − ȳ )( x_i − x̄ ) = β̂ \sum_{i=1}^n ( x_i − x̄ )^2.   (6.17)

After solving for β̂, we have

β̂ = \frac{ \sum_{i=1}^n ( y_i − ȳ )( x_i − x̄ ) }{ \sum_{i=1}^n ( x_i − x̄ )^2 }.   (6.18)
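The closed-form expressions (6.15) and (6.18) are easy to verify numerically. The sketch below applies them to the data of Figure 6.1; numpy is assumed only for the arithmetic.

```python
import numpy as np

# Data from Figure 6.1 / Table 6.1
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])

xbar, ybar = x.mean(), y.mean()

# Equation (6.18): slope from centered cross-products
beta_hat = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()

# Equation (6.15): intercept from the sample means
alpha_hat = ybar - beta_hat * xbar

print(alpha_hat, beta_hat)   # 15.0, -2.0 (as in Section 6.5)
```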

6.2.2 The Second-Order Conditions

Note that the least squares estimators β̂, and in turn α̂, are unique. Taking the second derivatives of the objective function yields

\frac{∂^2 φ}{∂α^2} = 2n,   (6.19)

\frac{∂^2 φ}{∂β^2} = 2 \sum_{i=1}^n x_i^2,   (6.20)

and

\frac{∂^2 φ}{∂α ∂β} = 2 \sum_{i=1}^n x_i.   (6.21)

Thus, the Hessian matrix is

H = \begin{pmatrix} 2n & 2\sum_{i=1}^n x_i \\ 2\sum_{i=1}^n x_i & 2\sum_{i=1}^n x_i^2 \end{pmatrix},   (6.22)

which is a positive definite matrix (its determinant is 4n \sum_{i=1}^n ( x_i − x̄ )^2 > 0 under assumption (v)), so α̂ and β̂ do, in fact, minimize φ.


6.2.3 Matrix Interpretation

Now, the first-order conditions require (6.11) and (6.12), or

\sum_{i=1}^n y_i = \sum_{i=1}^n ( α̂ + β̂ x_i ),   (6.23)

\sum_{i=1}^n y_i x_i = \sum_{i=1}^n ( α̂ x_i + β̂ x_i^2 ).   (6.24)

These are the normal equations, and they are linear in α̂ and β̂. In matrix form, we have

\begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix} =
\begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}
\begin{pmatrix} α̂ \\ β̂ \end{pmatrix},   (6.25)

which has the solution

\begin{pmatrix} α̂ \\ β̂ \end{pmatrix}
= \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}^{-1}
  \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix}
= \frac{1}{n\sum_{i=1}^n x_i^2 − ( \sum_{i=1}^n x_i )^2}
  \begin{pmatrix} \sum_{i=1}^n x_i^2 & −\sum_{i=1}^n x_i \\ −\sum_{i=1}^n x_i & n \end{pmatrix}
  \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix}
= \frac{1}{n\sum_{i=1}^n x_i^2 − ( \sum_{i=1}^n x_i )^2}
  \begin{pmatrix} \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i − \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i \\ −\sum_{i=1}^n x_i \sum_{i=1}^n y_i + n \sum_{i=1}^n x_i y_i \end{pmatrix}.   (6.26)

Now, β̂, according to this formula, is

β̂ = \frac{n\sum_{i=1}^n x_i y_i − \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n\sum_{i=1}^n x_i^2 − ( \sum_{i=1}^n x_i )^2}
  = \frac{\sum_{i=1}^n x_i y_i − x̄\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i^2 − x̄\sum_{i=1}^n x_i}
  = \frac{\sum_{i=1}^n x_i y_i − n x̄ ȳ}{\sum_{i=1}^n x_i^2 − n x̄^2}
  = \frac{\sum_{i=1}^n ( y_i − ȳ )( x_i − x̄ )}{\sum_{i=1}^n ( x_i − x̄ )^2},

while

α̂ = \frac{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i − \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i}{n\sum_{i=1}^n x_i^2 − ( \sum_{i=1}^n x_i )^2}
  = \frac{ȳ\sum_{i=1}^n x_i^2 − x̄\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 − x̄\sum_{i=1}^n x_i}
  = ȳ + \frac{x̄ \left( x̄\sum_{i=1}^n y_i − \sum_{i=1}^n x_i y_i \right)}{\sum_{i=1}^n x_i^2 − x̄\sum_{i=1}^n x_i}
  = ȳ − β̂ x̄.   (6.27)

Method Of Moments Interpretation

Suppose, for the moment, that xi is a random variable that is uncorrelated with ui . Let µx = E( xi ). Then µy = E( yi ) = E( α + βxi + ui ) = α + βµx .

(6.28)

E( yi − µy ) = α + βxi + ui − ( α + βµx ) = β( xi − µx ) + ui ,

(6.29)

Thus,

and E( yi − µy )( xi − µx ) = β E( xi − µx )2 + E( xi − µx )ui = β E( xi − µx )2 , (6.30) since xi and ui are uncorrelated. Solving for β, we have β=

E( xi − µx )( yi − µy ) , E( xi − µx )2

(6.31)

while α = µy − βµx .

(6.32)

Thus the least-squares estimators are method of moments estimators with sample moments replacing the population moments.

54

6.3.2

CHAPTER 6. BIVARIATE LEAST SQUARES

Mean of Estimates

Next, note that Pn ( xi − x ) yi b β = Pi=1 n ( xi − x )2 Pni=1 i − x )( α + βxi + ui ) i=1 ( x P = n 2 i=1 ( xi − x ) Pn Pn Pn ( xi − x ) ( xi − x ) ui i=1 ( xi − x ) xi P Pi=1 = α Pni=1 + β + n n 2 2 2 ( x − x ) ( x − x ) i i i=1 i=1 i=1 ( xi − x ) Pn ( xi − x ) ui = β + Pi=1 . (6.33) n 2 i=1 ( xi − x ) So, E( βb ) = β, since E( ui ) = 0 for all i. Thus, βb is an unbiased estimator of β. Further,

α b = y − βx (6.34) Pn n X ( xi − x ) yi yi = − x Pi=1 n 2 n i=1 ( xi − x ) i=1 ¶ n µ X xi − x 1 = yi − x Pn 2 n i=1 ( xi − x ) i=1 ¶ n µ X 1 xi − x = α − x Pn 2 n i=1 ( xi − x ) i=1 ¶ X ¶ n µ n µ X ( xi − x ) xi xi − x xi 1 + ui − x Pn − x Pn +β 2 2 n n i=1 ( xi − x ) i=1 ( xi − x ) i=1 i=1 ¶ n µ X 1 xi − x = α ui − x Pn 2 n i=1 ( xi − x ) i=1

So we have E( α b ) = α, and α b is an unbiased estimator of α.

6.3.3

Variance of Estimates

Now,

where

Pn n ( xi − x ) ui X b = wi ui , β − β = Pi=1 n 2 i=1 ( xi − x ) i=1 ( xi − x ) . wi = Pn 2 i=1 ( xi − x )

(6.35)

(6.36)

6.3. BASIC STATISTICAL PROPERTIES

55

So, Var( βb ) = E( βb − β )2 n X = E( wi ui )2 = E( w1 u1 + w2 u2 + · · · + wn un )2 i=1

= E[( w12 u21 + w22 u22 + · · · + wn2 u2n ) + (w1 u1 w2 u2 + · · · wn−1 un−1 wn un )] = w12 σ2 + w22 σ 2 + · · · + wn2 σ 2 Ã !2 n n X X ( x − x ) i Pn = σ2 wi2 = σ 2 2 j=1 ( xj − x ) i=1 i=1 =

σ2 . 2 i=1 ( xi − x )

Pn

(6.37)

Next, we note that

where

α b−α=

n µ X 1 i=1

xi − x − x Pn 2 n i=1 ( xi − x )

vi =

µ



ui =

1 xi − x − x Pn 2 n i=1 ( xi − x )

n X

vi ui ,

(6.38)

i=1



.

(6.39)

So, in a fashion similar to the one above, we find that

and

Var( α b −α) = σ

2

n

Pn

x2i , 2 i=1 ( xi − x )

Pn

i=1

Pn xi . Cov( α b, βb ) = E( α b − α )( βb − β ) = σ2 Pn i=1 n i=1 ( xi − x )2

6.3.4

(6.40)

(6.41)

Estimation of σ 2

Next, we would like to get an estimate of σ². Let

s^2 = \frac{1}{n − 2} \sum_{i=1}^n e_i^2,   (6.42)
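As a numerical check on the divisor n − 2 in (6.42), the sketch below simulates many samples and compares the average of s² with the average of the uncorrected estimator that divides by n; only the former centers on the true σ². The simulation design is illustrative, with numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma2 = 1.0, 2.0, 4.0
x = np.arange(1.0, 11.0)                 # n = 10 fixed regressor values
xbar = x.mean()

s2_vals, naive_vals = [], []
for _ in range(20000):
    u = rng.normal(0.0, np.sqrt(sigma2), size=x.size)
    y = alpha + beta * x + u
    b = ((x - xbar) * (y - y.mean())).sum() / ((x - xbar) ** 2).sum()
    a = y.mean() - b * xbar
    e = y - a - b * x                               # least squares residuals
    s2_vals.append((e ** 2).sum() / (x.size - 2))   # divides by n - 2
    naive_vals.append((e ** 2).sum() / x.size)      # divides by n

print(np.mean(s2_vals), np.mean(naive_vals))   # about 4.0 versus about 3.2
```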


where

e_i = y_i − α̂ − β̂ x_i
   = ( y_i − ȳ ) − β̂( x_i − x̄ )   (6.43)
   = β( x_i − x̄ ) + ( u_i − ū ) − β̂( x_i − x̄ )
   = −( β̂ − β )( x_i − x̄ ) + ( u_i − ū ),   (6.44)

using α̂ = ȳ − β̂ x̄ and y_i − ȳ = β( x_i − x̄ ) + ( u_i − ū ).

So,

\sum_{i=1}^n e_i^2 = \sum_{i=1}^n [ −( β̂ − β )( x_i − x̄ ) + ( u_i − ū ) ]^2
= ( β̂ − β )^2 \sum_{i=1}^n ( x_i − x̄ )^2 − 2( β̂ − β ) \sum_{i=1}^n ( x_i − x̄ )( u_i − ū ) + \sum_{i=1}^n ( u_i − ū )^2.   (6.45)

Now, we have

E\left[ ( β̂ − β )^2 \sum_{i=1}^n ( x_i − x̄ )^2 \right] = σ^2,   (6.46)

E\left( \sum_{i=1}^n ( u_i − ū )^2 \right) = E\left( \sum_{i=1}^n ( u_i^2 − ū^2 ) \right) = E\left( \sum_{i=1}^n u_i^2 \right) − \frac{1}{n} E\left( \sum_{i=1}^n u_i \right)^2 = ( n − 1 )σ^2,   (6.47)

and

E\left[ ( β̂ − β ) \sum_{i=1}^n ( x_i − x̄ )( u_i − ū ) \right]
= E\left[ \left( \sum_{i=1}^n w_i u_i \right) \left( \sum_{i=1}^n ( x_i − x̄ ) u_i − \sum_{i=1}^n ( x_i − x̄ ) ū \right) \right]
= E\left[ \left( \sum_{i=1}^n w_i u_i \right) \left( \sum_{i=1}^n ( x_i − x̄ ) u_i \right) \right]
= \sum_{i=1}^n w_i ( x_i − x̄ ) σ^2 = σ^2.   (6.48)

Therefore,

E\left( \sum_{i=1}^n e_i^2 \right) = σ^2 − 2σ^2 + ( n − 1 )σ^2 = ( n − 2 )σ^2   (6.49)

and

E( s^2 ) = σ^2.   (6.50)

6.4 Statistical Properties Under Normality

6.4.1 Distribution of β̂

Suppose that ui vN( 0, σ 2 ). Then,

where wi =

where q =

and

βb = β +

1

2 i=1 ( xi −x )

wi ui ,

(6.51)

i=1

Then βb is also a normal random variable. Specifically,

Sn xi −x 2. i=1 ( xi −x )

Sn

n X

βb ∼ N( β, σ 2 q ),

(6.52)

βb − β0 ∼ N( β − beta0 , σ 2 q ),

(6.53)

. Thus, for a given β0 ,

βb − β0 zβ = p ∼N σ2q

Ã

Now, suppose that H0 : β = β0 . Then,

βb − β0 p ,1 σ2q

!

.

βb − β0 p ∼ N( 0, 1 ), σ2q

(6.54)

(6.55)

while for H0 : β = β1 > β0 we have

βb − β0 p ∼N σ2 q

Ã

β1 − β0 p ,1 σ2q

!

,

(6.56)

and we would expect the statistic to be centered to the right of zero.

6.4.2

Distribution of α b iid

Again, suppose that ui vN( 0, σ 2 ). Then, α b=α+

n X i=1

vi ui ,

(6.57)

58

CHAPTER 6. BIVARIATE LEAST SQUARES

where vi =

where p =

1 n

n

) − Snx( x(ix−x b is a normal random variable. Specifically, 2 . Then α i −x ) i=1

α b ∼ N( α, σ 2 p ),

(6.58)

α b − α0 ∼ N( α − beta0 , σ 2 p ),

(6.59)

Sn x2 Sn i=1 i 2. ( x i −x ) i=1

and

Then, for a given α0 ,

α b − α0 zα = p ∼N σ2 p

Ã

Now, suppose that H0 : α = α0 . Then,

α b − α0 p ,1 σ2p

!

α b − α0 p ∼ N( 0, 1 ), σ2 p

while for H0 : α = α1 6= α0 we have

α b − α0 p ∼N σ2 p

Ã

α1 − α0 p ,1 σ2 p

.

(6.60)

(6.61)

!

,

(6.62)

and we would expect the statistic not to be centered around zero. It is important to note that Pn x2 , p = Pn i=1 i n i=1 ( xi − x )2 p and so σ 2 p grows small as n grows large. Thus, the noncentrality of the distribution of the statistic zα will also grow under the alternative. A similar statemetn can be made concerning q and the statistic zβ .

6.4.3

t-Distribution

In most cases, we do not know the value of σ 2 . A possible alternative is to use s2 , whereupon we obtain

under H0 : α = α0 , and

α b − α0 p ∼ tn−2 , s2 p βb − β0 p ∼ tn−2 , s2 q

(6.63)

(6.64)

6.4. STATISTICAL PROPERTIES UNDER NORMALITY

59

under H0 : β = β0 . The t-distibution is quite similar to the standard normal except being slightly fatter. This reflects the added uncertainty introduced by using s2 rather than σ2 . However, as n increases and the precision of s2 becomes better, the tdistibution grows closer and closer to the N( 0, 1 ). We loose two degrees of freedom (and so tn−2 ) because of the fact that we estimated two coefficients, namely α and β. Just as in the case of zα and zβ , we would expect the t-distribution to be off-center if the null hypothesis were not true.

6.4.4

Maximum Likelihood iid

Suppose that ui vN( 0, σ 2 ). Then, iid

yi v N( α + βxi , σ2 ).

(6.65)

Then, the pdf of yi is given by ½ ¾ 1 2 f ( yi ) = √ exp − 2 [ yi − ( α + βxi )] . 2σ 2πσ 2 1

(6.66)

Since the observations are independent, we can write the joint likelihood function as f ( y1 , y2 , . . . , yn ) = f ( y1 )f ( y2 ) · · · f ( yn ) ( ) n 1 1 X 2 = − 2 [ yi − ( α + βxi )] n exp 2σ i=1 ( 2πσ 2 ) 2 = L( α, β, σ 2 |y, x ).

(6.67)

Now, let L = log L( α, β, σ 2 |y, x ). We seek to maximize n 1 X n n [ yi − ( α + βxi )]2 . L = − log( 2π ) − log( σ 2 ) − 2 2 2 2σ i=1

(6.68)

Note that for (??) to be a maximum with respect to α and β, we most minimize P n 2 i=1 [ yi − ( α + βxi )] . The first-order conditions for (??) are n ∂L 1 X b i )] = 0, [ yi − ( α b + βx = 2 ∂α σ i=1 n 1 X ∂L b i ) xi ] = 0, [ yi − ( α b + βx = 2 ∂β σ i=1

(6.69)

(6.70)

60

CHAPTER 6. BIVARIATE LEAST SQUARES

xi 2 3 4 5 6 20

yi 12 7 8 5 3 35

xi − x -2 -1 0 1 2 0

yi − y 5 0 1 -2 -4 0

(xi − x)2 4 1 0 1 4 10

(xi − x)(yi − y) -24 -7 0 5 6 -20

Table 6.1: Summary table. and n 1 X n ∂L + 4 [ yi − ( α + βxi )]2 . =− c2 2σ ∂β 2σ i=1

(6.71)

Note that the first two conditions imply that

b α = y − βx,

and βb =

Pn

( x − x )( yi − i=1 Pn i 2 i=1 ( xi − x )

(6.72)

y)

,

(6.73)

since these are the same as the normal equations (except for σ 2 ). The third condition yields Pn 2 i=1 [ yi − ( α + βxi )] c2 = σ Pn 2 n n−2 2 i=1 ei = (6.74) = s . n n

6.5

An Example

Consider the scatter graph given in Figure 6.1. From this, we construct Table 6.1. Thus, we have Pn ( x − x )( yi − y ) −20 b Pn i β = i=1 = = −2, 2 10 i=1 ( xi − x ) and

b = 7 − ( −2 )4 = 15. α b = y − βx

6.5. AN EXAMPLE

61

b i α b + βx 11 9 7 5 3

b i βx -4 -6 -8 -10 -12

ei 1 -2 1 0 0

e2i 1 4 1 0 0

Table 6.2: Residual calculations. Now, we calculate the residuals in Table 6.2, and find n X

e2i = 6

i=1

and 2

s =

Pn

2 i=1 ei

=

6 = 2. 3

n−2 b namely σ2 q, is provided by An estimate of the variance of β, 1 1 =2 = 0.2. 2 10 (x − x) i=1 i

s2 q = s2 Pn

Suppose we wish to test H0 : β = 0 against H1 : β 6= 0. Then

under the null hypothesis, but

βb − 0 p ∼ t3 s2 q

−2 −2 βb − 0 p = =√ = −4.2 2 0.45 0.2 s q

is clearly in the left-hand 2.5% tail of the t_3 distribution. Thus, we reject the null hypothesis at the 5% significance level.
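The whole worked example can be replicated in a few lines; the sketch below recomputes β̂, α̂, s², the standard error, and the t statistic, and pulls the 2.5% critical value of the t_3 distribution from scipy (numpy and scipy are assumptions of convenience).

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])
n = x.size

beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
e = y - alpha - beta * x
s2 = (e ** 2).sum() / (n - 2)                        # 2.0
se_beta = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())  # sqrt(0.2)

t_stat = beta / se_beta                              # about -4.47
crit = stats.t.ppf(0.025, df=n - 2)                  # left-hand 2.5% critical value, about -3.18

print(t_stat, crit, t_stat < crit)                   # True: reject H0 at the 5% level
```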

Chapter 7

Linear Least Squares

7.1 Multiple Regression Model

The general k-variable linear model can be written as

y_i = β_1 x_{i1} + β_2 x_{i2} + \cdots + β_k x_{ik} + u_i,   i = 1, 2, . . . , n.   (7.1)

Using matrix techniques, we can equivalently write this model as

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}
\begin{pmatrix} β_1 \\ β_2 \\ \vdots \\ β_k \end{pmatrix} +
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix},   (7.2)

or, more compactly, as

y = Xβ + u.   (7.3)

7.1.1 Assumptions

In the general k-variable linear model, we make the following assumptions about the disturbances:

(i) E( u_i ) = 0,   i = 1, 2, . . . , n;
(ii) E( u_i² ) = σ²,   i = 1, 2, . . . , n;
(iii) E( u_i u_j ) = 0,   i ≠ j.

These assumptions can be written in matrix notation as

E( u ) = 0   (7.4)

and

Cov( u ) = E( uu′ ) = σ² I_n,   (7.5)

where I_n is an n × n identity matrix. The nonstochastic assumptions are

(iv) X is nonstochastic;
(v) X has full column rank (the columns are linearly independent).

Sometimes, we will also assume that the u_i's are normally distributed:

(vi) u ∼ N( 0, σ² I_n ).

7.1.2

Plane Fitting

Suppose k = 3 and xi1 = 1. Then, yi = β1 + β2 xi2 + β3 xi3 + ui

i = 1, 2, . . . , n.

(7.6)

Now, ybi = βb1 + βb2 xi2 + βb3 xi3

i = 1, 2, . . . , n.

(7.7)

defines a plane in the three-dimensional space of y, x_2, and x_3. We seek to choose β̂_1, β̂_2, and β̂_3 so that the points on the plane corresponding to x_{i2} and x_{i3}, namely ŷ_i, will be close to y_i. That is, we will "fit" a plane to the observations. As we did in the two-dimensional case, we choose to measure closeness by the vertical distance. That is,

and

ei = yi − ( βb1 + βb2 xi2 + βb3 xi3 ) φ=

n X

i = 1, 2, . . . , n,

e2t

(7.8)

(7.9)

i=1

7.1.3

Least Squares

In general, we want to min β

n X [ yi − ( βb1 + βb2 xi2 + · · · + βbk xik )]2 i=1

(7.10)

64

CHAPTER 7. LINEAR LEAST SQUARES

7.2 7.2.1

Least Squares Regression The OLS Estimator

As was stated above, we seek to minimize φ=

n X [ yi − ( β1 + β2 xi2 + · · · + βk xik )]2

(7.11)

i=1

with respect to the coefficients β1 , β2 , . . . , βk . The first-order conditions are 0= 0=

∂φ ∂β1

= 2

∂φ ∂β2

= 2

n X [ yi − ( βb1 + βb2 xi2 + · · · + βbk xik )]xi1 , i=1

.. . 0=

∂φ ∂βk

[ yi − ( βb1 + βb2 xi2 + · · · + βbk xik )]xi2 ,

n X [ yi − ( βb1 + βb2 xi2 + · · · + βbk xik )]xik .

= 2

i=1

where (βb1 , βb2 , · · · , βbk ) are solutions. tions: βb1 βb1

βb1

n X i=1 n X i=1

n X i=1

x2i1 + βb2

(7.12)

i=1 n X

n X i=1

xi2 xi1 + βb2 xik xi1 + βb2

Rearranging, we have the normal equa-

xi1 xi2 + · · · + βbk

n X i=1

n X i=1

x2i2 + · · · + βbk

n X i=1 n X

xi1 xik xi2 xik

= =

i=1

xik xi2 + · · · + βbk

n X

xi1 yi

i=1 n X

xi2 yi

n X

xik yi

i=1

.. .

n X

x2ik

=

i=1

i=1

(7.13)

or ⎛ ⎜ ⎜ ⎜ ⎝

Pn 2 Pn i=1 xi1 i=1 xi2 xi1 .. Pn . i=1 xik xi1

Pn i=1 xi1 xi2 P n 2 i=1 xi2 .. Pn . i=1 xik xi2

P · · · Pni=1 xi1 xik n ··· i=1 xi2 xik .. .. . Pn . 2 ··· i=1 xik

⎞⎛ ⎟⎜ ⎟⎜ ⎟⎜ ⎠⎜ ⎝

βb1 βb2 .. . βbk



⎛ Pn ⎞ i=1 xi1 yi P ⎟ ⎜ n ⎟ ⎟ ⎜ i=1 xi2 yi ⎟ ⎟=⎜ ⎟. .. ⎟ ⎝ ⎠ . ⎠ Pn i=1 xik yi (7.14)

7.2. LEAST SQUARES REGRESSION

As in the bivariate model, ⎛ ⎜ ⎜ X=⎜ ⎝

so ⎛

and

⎜ ⎜ X0 X = ⎜ ⎝

Pn 2 Pn i=1 xi1 i=1 xi2 xi1 .. Pn . i=1 xik xi1

65

x11 x12 .. .

x21 x22 .. .

··· ··· .. .

x1n

x2n

· · · xnn

Pn i=1 xi1 xi2 P n 2 i=1 xi2 .. Pn . i=1 xik xi2

xn1 xn2 .. .



⎟ ⎟ ⎟, ⎠

Pn · · · Pi=1 xi1 xik n ··· i=1 xi2 xik .. .. . Pn . 2 ··· i=1 xik

⎛ Pn xi1 yi Pi=1 n ⎜ i=1 xi2 yi ⎜ X0 y = ⎜ .. ⎝ Pn . i=1 xik yi

(7.15)



⎟ ⎟ ⎟, ⎠



⎟ ⎟ ⎟. ⎠

(7.16)

(7.17)

Thus, we can write the normal equations in matrix notation: b = X0 y. X0 Xβ

(7.18)

b = ( X0 X )−1 X0 y, β

(7.19)

where β0 = (βb1 , βb2 , · · · , βbk ).Therefore, we have the unique solution as long as |X0 X| 6= 0, which is assured by Assumption (v).
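Equation (7.19) is exactly what a few lines of linear algebra compute. The sketch below builds X with an intercept column and two made-up regressors (all values are illustrative) and solves the normal equations; numpy's least-squares solver is also shown, since it is the numerically preferred way to evaluate (X′X)⁻¹X′y without forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x2 = rng.uniform(0.0, 10.0, n)
x3 = rng.uniform(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x2, x3])        # n x k design matrix
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 1.0, n)

# Normal equations: (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent, more stable:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)          # close to [1.0, 2.0, -0.5]
print(beta_hat_lstsq)    # same values
```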

7.2.2

Some Algebraic Results

b Define the fitted value for each i as ybi = xi1 βb1 + xi2 βb2 + · · · + xik βbk = x0i β whereupon b b = Xβ. y (7.20)

Next define the OLS residual for each i as ei = yi − ybi so Then,

b. e=y−y

b) X0 e = X0 ( y − y 0 = X y − X0 X(X0 X)−1 X0 y = 0,

(7.21)

(7.22)

66

CHAPTER 7. LINEAR LEAST SQUARES

and we say that the residuals are orthogonal to the regressors. Also, we find b0 y y

b 0 (Xβ b + e) = (Xβ)

b +β b 0 X0 e b 0 X0 Xβ = β b b 0 X0 Xβ = β

b0 y b. = y

(7.23)

Now, suppose that the first coefficient is the intercept. Then, the first column of X and hence the first row of X0 are all ones. This means that ⎛

⎜ ⎜ 0 = X0 e = ⎜ ⎝ ⎛

So,

Pn

i=1 ei

⎜ ⎜ = ⎜ ⎝

1 x12 .. .

1 x22 .. .

x1n x2n Pn Pn i=1 ei i=1 xi2 ei .. Pn . i=1 xik ei

··· ··· .. .

1 xn2 .. .

· · · xnn ⎞

⎞⎛ ⎟⎜ ⎟⎜ ⎟⎜ ⎠⎝

e1 e2 .. . en

⎟ ⎟ ⎟. ⎠

⎞ ⎟ ⎟ ⎟ ⎠ (7.24)

= 0, which means that n X i=1

Finally, we note that e = = = = = = =

yi =

n X i=1

ybi + ei =

n X i=1

ybi .

b y − Xβ y − (X0 X)−1 X0 y [ In − (X0 X)−1 X0 ]y My M( Xβ + u ) [ In − (X0 X)−1 X0 ]( Xβ ) + Mu Mu.

(7.25)

(7.26)

We see that the OLS residuals are a linear transformation of the underlying disturbances. The matrix M = I_n − X( X′X )⁻¹X′, which plays an important role in the sequel, is symmetric and has the property M = M·M, that is, it is idempotent.
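A quick numerical check of these algebraic facts (using an illustrative design matrix, with numpy assumed): M is idempotent, MX = 0, and applying M to y reproduces the OLS residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I_n - X (X'X)^{-1} X'
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

print(np.allclose(M @ M, M))      # True: M is idempotent
print(np.allclose(M @ X, 0.0))    # True: M annihilates the columns of X
print(np.allclose(M @ y, e))      # True: e = My
```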

7.2. LEAST SQUARES REGRESSION

7.2.3

67

The R2 Statistic

Define the following: SSE =

n X

e2i =

i=1

n X ( yi − ybi )2 ,

(7.27)

i=1

SST =

n X ( yi − y )2 ,

(7.28)

i=1

SSR =

n X ( ybi − y )2 .

(7.29)

i=1

Note that SSE is the variation of the actual values around the fitted plane and is called the unexplained (or error) sum-of-squares. SSR, the regression sum-of-squares, is the variation of the fitted values around the sample mean, and SST, the total sum-of-squares, is the variation of the actual values around the sample mean. The three sums-of-squares are closely related. Consider

SST − SSE =

n X [( yi − y )2 − ( yi − ybi )2 ] i=1

n X i=1

=

n X i=1

=

n X i=1

y 2 − 2y y 2 − 2y y 2 − 2y

n X

yi + 2

i=1

i=1

n X

yi +

i=1

ybi +

i=1

n X

n X

n X i=1

n X i=1

n X = ( ybi − y )2 = SSR.

yi ybi −

ybi2 ybi2

n X i=1

ybi2

(7.30)

i=1

Thus SST = SSE + SSR. We now define

R2 = 1 −

SSE SST SSE SSR = − = . SST SST SST SST

(7.31)

as the percent of of total variation explained by the model. This statistic can also be interpreted as a squared correlation coefficient. Consider the sample second moments, n

Var( y ) =

1X 1 ( yi − y )2 = SST, n i=1 n

(7.32)

68

CHAPTER 7. LINEAR LEAST SQUARES n

1X 1 b) = Var( y ( ybi − y )2 = SSR, n i=1 n

and

(7.33)

n

b) = Cov( y, y

=

1X ( yi − y )( ybi − y ) n i=1 n

1X yi ybi − yb yi − yyi + y 2 n i=1 n

=

1X 2 yb − 2yb yi + y 2 n i=1 i n

= =

1X ( ybi − y )2 n i=1 1 SSR. n

(7.34)

Then the correlation between yi and ybi can be written as, ρby,b = y

=

=

b) Cov( y, y p b ) Var( y b) Var( y q

1 n SSR

1 1 n SST n SSR

√ SSR √ , SST

so

SSR (7.35) = R2 . SST This statistic is variously called the coefficient of determination, the multiple correlation statistic and the ”R”-squared statistic. = ρb2y,b y

7.3 7.3.1

Basic Statistical Results b Mean and Covariance of β

From the previous section, under assumption (v), we have b β

= = = =

( X0 X )−1 X0 y ( X0 X )−1 X0 ( Xβ + u ) ( X0 X )−1 X0 Xβ + ( X0 X )−1 X0 u β + ( X0 X )−1 X0 u.

(7.36)

7.3. BASIC STATISTICAL RESULTS

69

Since X is nonstochastic by assumption (iv), we have, also using assumption (i),

(7.37)

Thus, the OLS estimator is unbiased. Also, using assumptions (ii) and (iii), b ) = E( β b − β )( β b − β )0 Cov( β = E[( X0 X )−1 X0 uu0 X( X0 X )−1 ] = ( X0 X )−1 X0 E( uu0 )X( X0 X )−1 = ( X0 X )−1 X0 σ 2 In X( X0 X )−1 = σ 2 ( X0 X )−1 .

7.3.2

(7.38)

Best Linear Unbiased Estimator (BLUE)

The OLS estimator is linear in y and it is an unbiased estimator, as we saw above. Let e = Ay, e β (7.39)

where A is a k×n matrix that is nonstochastic, be any other unbiased estimator. e ) = β. Define That is, E( β e − ( X0 X )−1 X0 . A=A

Then, e β

= [ A+( X0 X )−1 X0 ]y = [ A+( X0 X )−1 X0 ][ Xβ + u ] = AXβ + β + [ A + ( X0 X )−1 X0 ]u.

(7.40)

(7.41)

e is an unbiased estimator, so Now, β

e ) = AXβ + β + [ A + ( X0 X )−1 X0 ] E( u ) E( β = AXβ + β = β,

(7.42)

which implies that for all β, AX = 0. Thus,

and

e = β + [ A + ( X0 X )−1 X0 ]u, β e ) = E( β e − β )( β e − β )0 Cov( β = E{[ A + ( X0 X )−1 X0 ]uu0 [ A + ( X0 X )−1 X0 ]0 }

(7.43)

70

CHAPTER 7. LINEAR LEAST SQUARES [ A + ( X0 X )−1 X0 ] E( uu0 )[ A + ( X0 X )−1 X0 ]0 [ A + ( X0 X )−1 X0 ]σ 2 In [ A + ( X0 X )−1 X0 ]0 σ 2 [ AA0 + ( X0 X )−1 ] σ 2 AA0 + σ 2 ( X0 X )−1 .

= = = =

(7.44) This shows that the covariance matrix of any other linear unbiased estimator exceeds the covariance matrix of the OLS estimator by a possitive semi-definite matrix σ2 AA0 . Hence, OLS is said to be best linear unbiased estimator (BLUE). Note that we have used all of the assumptions (i)-(v) to get to this point.

7.3.3

Consistency

Typically, the elements of X0 X are unbounded (they go to infinity) P as n gets very large. For example, the 1, 1 element is n and the j, j element is ni=1 x2ij . Therefore, lim ( X0 X )−1 = 0, (7.45) n→∞

b converge to zero. This means that the distribution and the variances of β collapses about its expected value, namely β. So, b = β, plim β

(7.46)

e = Mu,

(7.47)

n→∞

and OLS estimation is consistent. A more formal proof of this property will be given in the chapter on stochastic regressors.

7.3.4

Estimation Of σ 2

Recall that 0

where M = In − X( X X )

−1

0

X . Then,

e0 e = ( Mu )0 Mu = u0 MMu = u0 Mu,

(7.48)

since M is symmetric and idempotent. Also, e0 e = tr e0 e = tr u0 Mu = tr Mu0 u,

(7.49)

since e0 e is a scalar, and tr AB = tr BA, when both multiplications are defined. Thus, 0 E( e e ) = = = =

0 E( tr Mu u ) tr M E( u0 u ) tr Mσ 2 In σ 2 tr M.

(7.50)

7.3. BASIC STATISTICAL RESULTS

71

But, tr M = = = =

tr( In − X( X0 X )−1 X0 ) tr In − tr(, X( X0 X )−1 X0 ) n − tr(( X0 X )−1 X0 X ) n − k.

Now, define s2 = Then

e0 e . n−k

(7.51)

(7.52)

σ2( n − k ) E( e0 e ) (7.53) = = σ2 , n−k n−k so s2 is an unbiased estimator of σ2 . We can also establish that s2 is a consistent estimator of σ 2 . That is, plim s2 = σ 2 . (7.54) 2 E( s ) =

n→∞

7.3.5

Prediction

Suppose that we wish to predict yp = β1 xp1 + β2 xp2 + · · · + βk xpk + up = xp 0 β + up .

(7.55)

Note that E( yp |xp ) = x0p β. A natural choice for a predictor is b ybp = x0 β =

=

=

p x0p ( X0 X )−1 X0 y x0p ( X0 X )−1 X0 ( X0 β + u x0p β + x0p ( X0 X )−1 X0 u.

Now, and

(7.56)

0

E( yp |xp ) = xp β,

(7.58)

E[( yp − ybp )|xp ] = 0. Hence, ybp is an unbiased predictor of yp . We also have 0

while

(7.57)

(7.59)

0

Var( ybp ) = E( ybp − xp β )2 = σ2 xp ( X0 X )−1 xp , 0

MSPE( ybp ) = E( yp − ybp ) = σ 2 [ 1 + xp ( X0 X )−1 xp ].

(7.60) (7.61)

It can be shown as above that ybp is the best (minimum variance) linear unbiased predictor (BLUP) of yp .
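The prediction formulas lend themselves to a short numerical sketch: given a fitted model, the point prediction is x_p′β̂ and its mean squared prediction error is σ²[1 + x_p′(X′X)⁻¹x_p], here estimated by replacing σ² with s². Everything below (the data and the particular x_p) is illustrative, with numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(0.0, 2.0, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
s2 = (e ** 2).sum() / (n - X.shape[1])

xp = np.array([1.0, 5.0])                    # regressors for the new observation
y_pred = xp @ beta_hat                        # point prediction x_p' beta_hat
mspe = s2 * (1.0 + xp @ XtX_inv @ xp)         # estimated mean squared prediction error

print(y_pred, np.sqrt(mspe))
```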

72

7.4 7.4.1

CHAPTER 7. LINEAR LEAST SQUARES

Statistical Properties Under Normality b Distribution Of β

Suppose that we introduce assumption (vi), so the ui ’s are normal: u ∼ E( 0, σ2 In ). Recall that

(7.62)

b = β + ( X0 X )−1 X0 u, β

(7.63)

b ∼ E( 0, σ 2 ( X0 X )−1 ). β

(7.64)

b is linear in u and X and hence ( X X ) so β also normally distributed: 0

−1

b is X are nonstochastic. Then, β 0

Thus, we may test H0 : βi = βi 0 with

b − βi 0 β q ∼ E( 0, 1 ). σ 2 ( X0 X )−1 ii

(7.65)

More will be said on this statistic for use in inference in the next chapter.

7.4.2

Maximum Likelihood Estimation

Now, yi = β1 xi1 + β2 xi2 + · · · + βk xik + ui = xi 0 β + ui ,

(7.66)

is linear in ui , so yi is also normal given xi : yi ∼ N ( xi 0 β, σ2 ). Further, the yi ’s are independent. Thus, the density for yi is given by ½ ¾ 1 1 2 0 exp − 2 [ yi − xi β ] . f ( yi ) = √ 2σ 2πσ 2

(7.67)

(7.68)

Since the yi ’s are independent, the joint likelihood function is f ( y1 , y2 , . . . , yn ) = f ( y1 )f ( y2 ) · · · f ( yn ) ( ) n 1 1 X 0 2 = − 2 [ yi − xi β ] n exp 2σ i=1 ( 2πσ 2 ) 2 = L( β, σ 2 |y, X ).

(7.69)

Let L = log L( β, σ 2 |y, X ). We wish to maximize L with respect β. However, this means that we minimize the sum of squares, so 0 −1 0 b X y, β MLE = ( X X )

(7.70)

7.4. STATISTICAL PROPERTIES UNDER NORMALITY

73

which is the OLS estimator from above. It is easily shown that

7.4.3

0 c2 MLE = e e = n − k s2 . σ n n

b and s2 Efficiency of β

(7.71)

Since β̂ is the MLE and is unbiased, it is the minimum variance unbiased estimator (BUE). s² is not the MLE, so it is not BUE. On the other hand, σ̂²_MLE is biased, so it is not BUE either. The two will be equivalent in large samples and are asymptotically BUE.

Chapter 8

Confidence Intervals and Hypothesis Tests

8.1 Introduction

8.1.1 Model and Assumptions

The model is a k-variable linear model: y = Xβ + u,

(8.1)

where y and u are both n × 1 vectors, X is an n × k matrix, and β is a k × 1 vector. We make the following assumptions about the disturbances:

(i) E( u ) = 0, and
(ii),(iii) Cov( u ) = E( uu′ ) = σ² I_n,

where I_n is an n × n identity matrix. The nonstochastic assumptions are

(iv) X is nonstochastic;
(v) X has full column rank (the columns are linearly independent).

For inference, we assume that the elements of u are normally distributed. That is,

(vi) u ∼ N( 0, σ² I_n ).

8.1. INTRODUCTION

8.1.2

75

Ordinary Least Squares Estimation

b of β, define For some estimate β

e = y − Xβ

(8.2)

φ = e0 e

(8.3)

and Choosing βb to minimize φ yields the ordinary least squares (OLS) estimator b = ( X0 X )−1 X0 y, β

Substitution yields b β

8.1.3

= ( X0 X )−1 X0 ( Xβ + u ) = ( X0 X )−1 X0 Xβ + ( X0 X )−1 X0 u = β + ( X0 X )−1 X0 u.

(8.4)

(8.5)

b Properties of β

Since X is nonstochastic,

b ] = β + E[( X0 X )−1 X0 u ] E[ β = β + ( X0 X )−1 X0 E[ u ] = β.

(8.6)

Thus, the OLS estimator is unbiased. Also, b ) = E( β b − β )( β b − β )0 Cov( β = E[( X0 X )−1 X0 uu0 X( X0 X )−1 ] = ( X0 X )−1 X0 E( uu0 )X( X0 X )−1 = ( X0 X )−1 X0 σ 2 In X( X0 X )−1 = σ 2 ( X0 X )−1 .

(8.7)

The elements of X0 X are unbounded as n gets very large. Therefore, lim ( X0 X )−1 = 0,

n→∞

(8.8)

b converge to zero. This means that the distribution and the variances of β collapses about its expected value, namely β. So, b = β, plim β

n→∞

and OLS estimation is consistent.

(8.9)

76 CHAPTER 8. CONFIDENCE INTERVALS AND HYPOTHESIS TESTS b are the best linear unbiased (BLUE) in that they have The OLS estimates β minimum variance in the class of unbiased estimators of β that are also linear in y. Suppose that u is normal, then the linear transformation b − β = ( X0 X )−1 X0 u β

is also normal.

b − β ∼ N( 0, σ 2 ( X0 X )−1 ). β

or

(8.10) (8.11)

b ∼ N( β, σ 2 ( X0 X )−1 ). β

(8.12) b Moreover, the β are maximum likelihood and hence minimum variance in the class of unbiased estimators.

8.1.4

Properties of e

Now, the OLS residuals are e = = = = = = =

b y − Xβ y − X(X0 X)−1 X0 y [ In − X(X0 X)−1 X0 ]y My M =In − X(X0 X)−1 X0 M( Xβ + u ) [ In − X(X0 X)−1 X0 ]( Xβ ) + Mu Mu.

(8.13)

(8.14)

Since MX = 0. Thus, the OLS residuals are a linear transformation of the underlying disturbances. Also, X0 e = X0 Mu = 0,

(8.15)

again, since MX = 0, and the OLS residuals are orthogonal or linearly unrelated to X. When u are normal, then the linear transformation e = Mu is also normal. Specifically, e ∼ N( 0, σ2 M ) (8.16) since

E e = E Mu = M E u = 0,

(8.17)

and 0

E ee

= E Muu0 M0 = M ( E uu0 ) M0 ¡ ¢ = M σ 2 I M0 = σ 2 M,

(8.18)

8.2. TESTS BASED ON THE χ2 DISTRIBUTION

77

since MM0 = M.

8.2 8.2.1

Tests Based on the χ2 Distribution The χ2 Distribution

Suppose that z1 , z2 , . . . , zn are iid N(0, 1) random variables. Then, n X i=1

8.2.2 Now,

ui = yi − x0i β ∼ N(0, σ 2 ),

(8.20)

ui ∼ N(0, 1) σ

(8.21)

n ³ X ui ´2 i=1

Now,

(8.19)

Distribution of (n − k)s2 /σ 2

so and

zi2 ∼ χ2n .

σ

=

n X u2 i

i=1

σ2

∼ χ2n .

b ei = yi − x0i β

(8.22)

(8.23)

is an estimate of ui and we might expect that

n X e2i ∼ χ2n . 2 σ i=1

(8.24)

However, this would be wrong as only n − k of the observations are independent since e satisfies the k equations X0 e = 0. The properties of e = Mu follow from the properties of M = In −X(X0 X)−1 X0 , which is symmetric idempotent and positive semi-definite and hence has some very special properties. First, rank(M) = tr(M) =n − k. Second, we can write the decomposition M = QDn−k Q0 where Dn−k is a diagonal matrix with its first n − k diagonals unity and the remainder zero, and Q0 Q = In so Q0 = Q−1 . Let v = Q0 u (8.25) then v ∼ N(0, In ) and u = Qv. Substitution yields 1 0 ee = σ2 =

1 0 u Mu σ2 1 0 u QDn−k Q0 u σ2

(8.26)

78 CHAPTER 8. CONFIDENCE INTERVALS AND HYPOTHESIS TESTS 1 0 0 v Q QDn−k Q0 Qv σ2 1 0 = v Dn−k v σ2 1 Pn−k 2 = v σ 2 i=1 i Pn−k vi 2 2 = i=1 ( ) ∼ χn−k σ =

Thus,

n X e2i ∼ χ2n−k σ2 i=1

and

Pn

i=1

(n−k)

n−k σ2

e2i

= (n−k)

(8.27)

s2 ∼ χ2n−k . σ2

(8.28)

b − β = (X0 X)−1 X0 u are jointly normal and Not only so but e = Mu and β b − β)0 ] = E[Muu0 X(X0 X)−1 ] E[e(β 2

0

= Mσ In X(X X)

(8.29)

−1

= σ 2 MX(X0 X)−1 = 0

b since it so they are uncorrelated and independent and s2 is independent of β, is a function only of e.

8.2.3

A Confidence Interval

Now, let a and b be numbers such that Pr( b ≤ χ2n−k ≤ a ) = 0.95,

(8.30)

say. Then a and b can be obtained from a table. Thus, ¶ s2 = 0.95 Pr b ≤ ( n − k ) 2 ≤ a σ µ ¶ 1 1 σ2 Pr ≥ ≥ = 0.95 b ( n − k )s2 a ¶ µ ( n − k )s2 ( n − k )s2 ≥ σ2 ≥ = 0.95 Pr b a µ

establishes a 95% confidence interval for σ 2 .

(8.31)

8.3. TESTS BASED ON THE T DISTRIBUTION

For example, for n − k = 14, we have µ ¶ 14s2 Pr 5.64 ≤ 2 ≤ 26.12 = 0.95 σ µ ¶ 14s2 14s2 Pr ≥ σ2 ≥ = 0.95 5.63 26.12

79

(8.32)

and if s2 = 4.0, then Pr or is the confidence interval.

8.2.4

µ

56 56 ≥ σ2 ≥ 5.63 26.12



= 0.95

¢ ¡ Pr 10 ≥ σ 2 ≥ 2.1 = 0.95

(8.33) (8.34)

A Hypothesis Test

Suppose that H0 : σ 2 = σ02 , Then we know that (n−k)

H1 : σ2 6= σ02 . s2 ∼ χ2n−k . σ02

(8.35)

under the null hypothesis. Choose α = 0.05, say, then critical values corresponding to 2.5% tails are 5.63 and 26.12 for n − k = 14. Thus, if 5.62 ≤ 14

s2 ≤ 26.12, σ02

(8.36)

we fail to reject the null hypothesis. Otherwise, we reject it at the 5% level of confidence. For example, suppose that s2 = 4.0 and σ02 = 1, then 14

s2 = 56 σ02

(8.37)

and we reject the null hypothesis since we fall into the right-hand 2.5% tail.
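Both the confidence interval in (8.32)-(8.34) and the test in (8.37) can be reproduced directly from χ² quantiles; scipy supplies them. The sketch below assumes the same numbers as the text (n − k = 14, s² = 4, σ₀² = 1).

```python
from scipy import stats

df, s2, sigma0_sq = 14, 4.0, 1.0

lo, hi = stats.chi2.ppf([0.025, 0.975], df)     # about 5.63 and 26.12
ci = (df * s2 / hi, df * s2 / lo)               # 95% CI for sigma^2: about (2.1, 10.0)

test_stat = df * s2 / sigma0_sq                 # 56
reject = test_stat < lo or test_stat > hi       # True: falls in the right-hand 2.5% tail

print(ci, test_stat, reject)
```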

8.3 8.3.1

Tests Based on the t Distribution The t Distribution

Suppose that z is a N(0, 1) random variable and that w ∼ χ2m independent of z. Then, z p w ∼ tm . (8.38) m

80 CHAPTER 8. CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

8.3.2

The Distribution of ( βbi − βi )/( s2 dii )1/2

We have seen that

βbi ∼ N( βi , σ 2 dii ), 0

(8.39) 1/2

where dii is the (i, i) element of the matrix (X X)

. Then,

βbi − βi ∼ N( 1, 0 ), z=√ 2 σ dii

(8.40)

while w = (n − k)

s2 ∼ χ2n−k . σ02

(8.41)

Since βb and s2 are independent, we have q

8.3.3

bi −βi β √

βbi − βi = √ 2 ∼ tn−k . s dii /( n − k )

σ 2 dii

( n−k )s2 σ02

(8.42)

Confidence Interval for βi

First, obtain a such that Pr( −a ≤ tn−k ≤ a ) = 0.95,

(8.43)

say, from a table. Then, Pr Pr

Ã

Ã

βbi − βi −a ≤ √ 2 ≤a σ dii

p p βbi − βi βbi + a σ 2 dii ≥ √ 2 ≥ βbi − a σ 2 dii σ dii

!

!

= 0.95 = 0.95

(8.44)

√ and βbi ± a σ 2 dii defines a 95% confidence interval for βi .

8.3.4

Testing a Hypothesis

Suppose that H0 : βi = βi0 , We know that

H1 : βi 6= βi0 .

βb − βi √i ∼ tn−k . σ 2 dii

(8.45)

8.4. TESTS BASED ON THE F DISTRIBUTION

81

Now, choose α = 0.05, say, then critical values corresponding to 2.5% tails of a t distribution with (n − k) degrees of freedom are ±a, say, so if βbi − βi −a ≤ √ 2 ≤ a, σ dii

(8.46)

we fail to reject the null hypothesis. Otherwise we reject the null hypothesis in favor of the alternative.

8.3.5

1/2

b − β )/ ( s2 c0 ( X0 X )−1 c ) The Distribution of c0 ( β

Consider the linear combination Then,

b c0 β.

c0 βb ∼ N( c0 β, σ 2 c0 ( X0 X )−1 c ),

Then,

b −β) c0 ( β p ∼ N( 1, 0 ). 2 0 σ c ( X0 X )−1 c

(8.47) (8.48)

(8.49)

As before, we use s2 instead of σ2 , so while

b −β) c0 ( β p ∼ tn−k . 2 0 s c ( X0 X )−1 c

(8.50)

We can perform inferences and calculate confidence intervals as before.

8.4 8.4.1

Tests Based on the F Distribution The F Distribution

Suppose that v ∼ χ2l and v ∼ χ2m If v and w are independent, then v/l ∼ Fl,m . w/m

8.4.2

(8.51)

Distribution of (Rβb − r)0 [s2 R(X 0 X)−1 R0 ]−1 (Rβb − r)/q

Suppose we are interested in testing a set of q linear restrictions. Examples would be β1 + β2 + ... + βk = 1 and β3 = 2β2 . More generally, we consider H0 : Rβ = r

H1 : Rβ 6= r

82 CHAPTER 8. CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

where r is a q × 1 known vector and R is a q × k known matrix. Due to the b then under the null hypothesis, we have multivariate normality of β, and hence

b Rβ−r ∼N (0, σ 2 R(X0 X)−1 R0 )

0 2 b b (Rβ−r) [σ R(X0 X)−1 R0 )]−1 (Rβ−r) ∼ χ2q .

(8.52)

(8.53)

b and s2 are independent so (n − k) s22 ∼χ2 Recall that β n−k is independent of σ the quadratic form in (8.53). Thus, under the null hypothesis, 0 2 b b [σ R(X0 X)−1 R0 )]−1 (Rβ−r)/q (Rβ−r) ∼ Fq,n−k s2 (n − k) σ2 /(n − k)

(8.54)

and after some simplification 0 2 b b (Rβ−r) [s R(X0 X)−1 R0 )]−1 (Rβ−r)/q ∼ Fq,n−k .

(8.55)

Under the alternative hypothesis Rβ 6= r, then the numerator diverges at the rate n and we expect large positive values of the statistic with high probability. Accordingly, we only consult the RHS tail values of the distribution to establish critical values. Values of the statistic exceeding these critical values are rare events under the null but typical under the alternative, so we reject when the realization exceeds the critical value.

8.4.3

The Distribution of [( SSEr − SSEu )/q ]/SSEu /( n − k )

The most common form of linear restrictions that occur are zero restrictions. Suppose the model of interest can be written as y = X1 β1 + X2 β2 + u

(8.56)

and H0 : β2 = 0

H1 : β2 6= 0.

Define the “unrestricted” residuals

and

b − X2 β b eu = y − X1 β 1 2 SSEu = e0u eu

(8.57)

(8.58)

from the OLS regression of y on X1 and X2 . Next, define the “restricted” residuals b er = y − X1 β 1

(8.59)

8.4. TESTS BASED ON THE F DISTRIBUTION

83

and SSEr = e0r er

(8.60)

from the OLS regression of y on X1 only. Now, SSEr ≥ SSEu , but under H0 : β2 = 0, we expect SSEu ∼ χ2n−(k1 +k2 ) σ2

and

SSEr ∼ χ2n−k1 σ2

(8.61)

to have similar values. We therefore might expect SSEr /(n − k1 ) ∼ Fn−k1 ,n−(k1 +k2 ) , SSEu /(n − (k1 + k2 ))

(8.62)

but unfortunately, SSEr and SSEu are not independent because they both satisfy X01 eu = X01 er = 0.

(8.63)

The appropriate ratio can be determined by applying the results of the previous section. Specifically, we take R = (0 : Ik2 ) and r = 0 whereupon the restictions Rβ = r are equivalent to β2 = 0. For this choice of R and r we b − r =β b and using the results for inverses of partitioned matrices have Rβ 2 R(X0 X)−1 R0

= (X02 X2 − X02 X1 (X01 X1 )−1 X01 X2 )−1 = (X02 M1 X2 )−1

(8.64)

where M1 = In − X1 (X01 X1 )−1 X01 . Substitution yields 0 2 b b 0 [σ 2 (X0 M1 X2 )−1 ]−1 β b b [σ R(X0 X)−1 R0 )]−1 (Rβ−r)/q = β (Rβ−r) 2 2 2 0 1 b 0 b = β X M1 X2 β 2 σ2 2 2 0 1 b 0 b = β X M1 M1 X2 β 2 σ2 2 2 1 0 = (M1 y − eu ) (M1 y − eu ) σ2 1 0 = (y M1 y − 2e0u M1 y − e0u eu ) σ2 1 0 = (y M1 y − e0u eu ) σ2 1 0 = (e er −e0u eu ) (8.65) σ2 r

where we use the results M1 y

b + X2 β b + eu ) = M1 (X1 β 1 2 b + eu ) = M1 (X β 2

2

b + M1 eu = M1 X2 β 2

(8.66)

84 CHAPTER 8. CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

and M_1 e_u = M_1 M y = M y = e_u. Since SSE_r = e_r′e_r and SSE_u = e_u′e_u, SSE_r − SSE_u is a sum of squares with k_2 degrees of freedom that is independent of SSE_u, and we form (canceling σ² in the numerator and denominator)

\frac{( SSE_r − SSE_u )/k_2}{SSE_u /( n − ( k_1 + k_2 ))} ∼ F_{k_2,\, n−(k_1+k_2)}.   (8.67)

Under the null hypothesis, this value will usually be small. Under the alternative of β2 6= 0, however, we would expect SSEu to be much smaller than SSEr and the above ratio to be large.
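The zero-restriction F test is easy to carry out by running the two regressions and comparing their residual sums of squares. The sketch below does this for simulated data in which the restriction β₂ = 0 is false for one of the two excluded regressors; the design, sample size, and coefficients are all illustrative, and scipy is used only for the F critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k1, k2 = 60, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # included regressors
X2 = rng.normal(size=(n, k2))                             # regressors under test
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([0.0, 0.8]) + rng.normal(size=n)

def sse(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ b) ** 2).sum()

sse_r = sse(X1, y)                         # restricted: beta_2 = 0 imposed
sse_u = sse(np.hstack([X1, X2]), y)        # unrestricted
F = ((sse_r - sse_u) / k2) / (sse_u / (n - k1 - k2))
crit = stats.f.ppf(0.95, k2, n - k1 - k2)

print(F, crit, F > crit)                   # with this design the restriction is usually rejected
```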

8.4.4

Testing a Hypothesis

We can consult the tables to find the critical point, c, corresponding to α = 0.05, say. Then, if

\frac{( SSE_r − SSE_u )/k_2}{SSE_u /( n − ( k_1 + k_2 ))} > c,   (8.68)

we reject the null hypothesis at the 5% level. Note that for k_2 = 1, that is, a single restriction,

\sqrt{\frac{( SSE_r − SSE_u )/k_2}{SSE_u /( n − ( k_1 + k_2 ))}} ∼ t_{n−(k_1+k_2)}

8.5

(8.69)

An Example

Consider the model Yt = β1 + β2 Xt,2 + β3 Xt,3 + ut ,

(8.70)

where Y_t is wheat yield, X_{t,2} is the amount of fertilizer applied, and X_{t,3} is the annual rainfall. The data are given in Table 8.1. After we rescale the data, we obtain the estimates of the βs given in Table 8.2. Now,

( X′X )^{-1} = \begin{pmatrix} 55.9141 & −0.4189 & −15.4560 \\ −0.4189 & 0.0371 & 0.0773 \\ −15.4560 & 0.0773 & 4.3277 \end{pmatrix},   (8.71)

and s² = \sum_t e_t^2 /( T − 3 ) = 0.5232/4 = 0.1308. Now, R² = 1 − 0.5232/13.5 = 0.9612 and R̄² = 1 − (6/4)·0.0388 = 0.9419. Recall

(8.72)

est. Cov( βb ) = s2 ( X0 X )−1 .

(8.73)

which we estimate using

8.5. AN EXAMPLE

85

Wheat Yield (Bushels/Acre) 40 45 50 65 70 70 80

Fertilizer (Pounds/Acre) 100 200 300 400 500 600 700

Rainfall (Inches/Year) 36 33 37 37 34 32 36

Table 8.1: Wheat yield data. Parameter β1 β2 β3

Estimate 1.1329 0.6893 0.6028

Table 8.2: Wheat yield parameter estimates. Thus, est. Var( βb1 ) = s2 d11 = 7.3133 est. Var( βb2 ) = s2 d22 = 0.0049 est. Var( βb3 ) = s2 d33 = 0.5660

For H0 : β1 = 0 vs H1 : β1 6= 0, we have

βb − β10 1.1329 q1 = = 0.4189 ∼ t4 . 2.7043 Var(βb1 )

(8.74)

Now, a 95% acceptance region for a t4 distribution is −2.776 ≤ t4 ≤ 2.776. Thus, we fail to reject the null hypothesis. For H0 : β2 = 0 vs H1 : β2 6= 0, we have βb − β20 0.6893 q2 = 9.8965 ∼ t4 . = 0.0697 Var(βb2 )

(8.75)

and we reject the null hypothesis at the 95% confidence level. In fact, we reject at the 99.9% confidence level, where the acceptance region is −7.173 ≤ t4 ≤ 7.173.
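The t statistics in this example follow mechanically from the reported estimates, s², and the diagonal of (X′X)⁻¹; the sketch below simply redoes that arithmetic and compares against the t₄ critical values (numpy and scipy assumed).

```python
import numpy as np
from scipy import stats

beta_hat = np.array([1.1329, 0.6893, 0.6028])
s2 = 0.1308
d = np.array([55.9141, 0.0371, 4.3277])     # diagonal of (X'X)^{-1}

se = np.sqrt(s2 * d)                         # about [2.704, 0.070, 0.752]
t_stats = beta_hat / se                      # about [0.42, 9.89, 0.80]
crit = stats.t.ppf(0.975, df=4)              # 2.776

print(t_stats, crit, np.abs(t_stats) > crit) # only beta_2 is significant at the 5% level
```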