Chapter 6

Bivariate Least Squares

6.1 The Bivariate Linear Model
Consider the model
\[
y_i = \alpha + \beta x_i + u_i, \qquad i = 1, 2, \ldots, n, \qquad (6.1)
\]
where $y_i$ is the dependent variable, $x_i$ is the explanatory variable, and $u_i$ is the unobservable disturbance. In matrix form, we have
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
+
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \qquad (6.2)
\]
or, more compactly, $y = X\beta + u$. This is the linear regression model. It is linear in the variables (given $\alpha$ and $\beta$), linear in the parameters (given $x_i$), and linear in the disturbances.

Example 6.1 Neither of the following two equations is linear:
\[
y_i = (\alpha + \beta x_i)\, u_i, \qquad (6.3)
\]
\[
y_i = \alpha x_i^{\beta} + u_i. \qquad (6.4)
\]
6.1.1 Assumptions

For the disturbances, we assume that

(i) E$(u_i) = 0$, for all $i$
(ii) E$(u_i^2) = \sigma^2$, for all $i$
(iii) E$(u_i u_j) = 0$, for all $i \neq j$

For the independent variable, we suppose

(iv) $x_i$ nonstochastic, for all $i$
(v) $x_i$ nonconstant

For purposes of inference in finite samples, we sometimes assume

(vi) $u_i \sim \text{iid } N(0, \sigma^2)$, for all $i$.
6.1.2 Line Fitting

Consider the scatter of points shown in Figure 6.1, which plots the data

x:  2   3   4   5   6
y: 12   7   8   5   3

[Figure 6.1: Scatter plot of y against x.]

We assume that $x$ and $y$ are linearly related. That is,
\[
y_i = \alpha + \beta x_i. \qquad (6.5)
\]
Of course, we see that no single line "matches" all the observations. Since no single line is entirely consistent with the data, we might choose the $\alpha$ and $\beta$ that best "fit" the data, in some sense. Define
\[
u_i = y_i - (\alpha + \beta x_i), \qquad i = 1, 2, \ldots, n, \qquad (6.6)
\]
as the discrepancy between the chosen line and the observations. Our objective then is to choose $\alpha$ and $\beta$ to minimize the discrepancy. A possible criterion for minimum discrepancy is
\[
\min_{\alpha,\beta} \sum_i u_i, \qquad (6.7)
\]
but this sum will be zero for any line passing through $(\bar x, \bar y)$. Another possibility is
\[
\min_{\alpha,\beta} \sum_i |u_i|, \qquad (6.8)
\]
which is called the minimum absolute distance (MAD) or $L_1$ estimator. The MAD estimator has problems since the mathematics (and the statistical distributions) are intractable. A closely related choice is
\[
\min_{\alpha,\beta} \sum_i u_i^2. \qquad (6.9)
\]
This yields the least squares or $L_2$ estimator.
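The failure of criterion (6.7) is easy to verify numerically. The following sketch, in pure Python on the Figure 6.1 data, shows that for any line forced through the point of means $(\bar x, \bar y)$, the raw residuals sum to zero no matter what slope is chosen, while the absolute and squared criteria still discriminate between slopes:

```python
# Numerical check that criterion (6.7) is useless: for ANY line passing
# through (x-bar, y-bar), the raw residuals sum to zero, whatever the slope.
x = [2, 3, 4, 5, 6]
y = [12, 7, 8, 5, 3]
n = len(x)
xbar = sum(x) / n   # 4.0
ybar = sum(y) / n   # 7.0

def residual_sums(beta):
    """For the line through (xbar, ybar) with slope beta, return
    (sum of u_i, sum of |u_i|, sum of u_i**2)."""
    alpha = ybar - beta * xbar          # forces the line through the means
    u = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    return sum(u), sum(abs(ui) for ui in u), sum(ui ** 2 for ui in u)

for beta in (-5.0, -2.0, 0.0, 3.0):
    s1, sabs, ssq = residual_sums(beta)
    print(f"beta={beta:5.1f}  sum u_i={s1:6.2f}  sum|u_i|={sabs:6.2f}  sum u_i^2={ssq:7.2f}")
```

The first column is zero for every slope tried; the squared criterion, by contrast, is smallest at the least squares slope.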
6.2 Least Squares Regression

6.2.1 The First-Order Conditions

Let
\[
\phi = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2. \qquad (6.10)
\]
Then the minimizing values, $\hat\alpha$ and $\hat\beta$ say, must satisfy the following first-order conditions:
\[
0 = \frac{\partial \phi}{\partial \alpha} = -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i), \qquad (6.11)
\]
\[
0 = \frac{\partial \phi}{\partial \beta} = -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i)\, x_i. \qquad (6.12)
\]
Now, these first-order conditions may be written as
\[
\sum_{i=1}^n y_i = \sum_{i=1}^n (\alpha + \beta x_i), \qquad (6.13)
\]
\[
\sum_{i=1}^n y_i x_i = \sum_{i=1}^n (\alpha x_i + \beta x_i^2). \qquad (6.14)
\]
Thus, (6.13) implies
\[
\hat\alpha = \bar y - \hat\beta \bar x, \qquad (6.15)
\]
where $\bar y = \sum_{i=1}^n y_i/n$ and $\bar x = \sum_{i=1}^n x_i/n$. Substituting (6.15) into (6.14) yields
\[
\sum_{i=1}^n y_i x_i
= \sum_{i=1}^n [(\bar y - \hat\beta \bar x)x_i + \hat\beta x_i^2]
= \bar y \sum_{i=1}^n x_i + \hat\beta \sum_{i=1}^n x_i (x_i - \bar x), \qquad (6.16)
\]
so
\[
\sum_{i=1}^n y_i x_i - n \bar y \bar x
= \hat\beta \left( \sum_{i=1}^n x_i^2 - n \bar x^2 \right),
\quad \text{i.e.} \quad
\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x) = \hat\beta \sum_{i=1}^n (x_i - \bar x)^2. \qquad (6.17)
\]
After solving for $\hat\beta$, we have
\[
\hat\beta = \frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}. \qquad (6.18)
\]
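Formulas (6.15) and (6.18) are simple enough to compute by hand, but a minimal pure-Python sketch makes the recipe explicit. It is applied here to the Figure 6.1 data, the same numbers worked by hand in Section 6.5:

```python
# A minimal sketch of the closed-form least squares estimators:
# beta-hat from (6.18), then alpha-hat from (6.15).
def bivariate_ols(x, y):
    """Return (alpha_hat, beta_hat) from the least squares formulas."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta_hat = sxy / sxx                  # equation (6.18)
    alpha_hat = ybar - beta_hat * xbar    # equation (6.15)
    return alpha_hat, beta_hat

alpha_hat, beta_hat = bivariate_ols([2, 3, 4, 5, 6], [12, 7, 8, 5, 3])
print(alpha_hat, beta_hat)  # 15.0 -2.0
```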
6.2.2 The Second-Order Conditions

Note that the least squares estimators $\hat\beta$, and in turn $\hat\alpha$, are unique. Taking the second derivatives of the objective function yields
\[
\frac{\partial^2 \phi}{\partial \alpha^2} = 2n, \qquad (6.19)
\]
\[
\frac{\partial^2 \phi}{\partial \beta^2} = 2 \sum_{i=1}^n x_i^2, \qquad (6.20)
\]
and
\[
\frac{\partial^2 \phi}{\partial \alpha\, \partial \beta} = 2 \sum_{i=1}^n x_i. \qquad (6.21)
\]
Thus, the Hessian matrix is
\[
H = \begin{pmatrix} 2n & 2\sum_{i=1}^n x_i \\ 2\sum_{i=1}^n x_i & 2\sum_{i=1}^n x_i^2 \end{pmatrix}, \qquad (6.22)
\]
which is positive definite, since $\det H = 4n \sum_{i=1}^n (x_i - \bar x)^2 > 0$ under assumption (v), so $\hat\alpha$ and $\hat\beta$ are, in fact, minimizers.
6.2.3 Matrix Interpretation

Now, the first-order conditions require (6.13) and (6.14), or
\[
\sum_{i=1}^n y_i = \sum_{i=1}^n (\hat\alpha + \hat\beta x_i), \qquad (6.23)
\]
\[
\sum_{i=1}^n y_i x_i = \sum_{i=1}^n (\hat\alpha x_i + \hat\beta x_i^2). \qquad (6.24)
\]
These are the normal equations, and they are linear in $\hat\alpha$ and $\hat\beta$. In matrix form, we have
\[
\begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n y_i x_i \end{pmatrix}
= \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}
\begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix}, \qquad (6.25)
\]
which has the solution
\begin{align*}
\begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix}
&= \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1}
\begin{pmatrix} \sum y_i \\ \sum y_i x_i \end{pmatrix} \\
&= \frac{1}{n \sum x_i^2 - (\sum x_i)^2}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
\begin{pmatrix} \sum y_i \\ \sum y_i x_i \end{pmatrix} \\
&= \frac{1}{n \sum x_i^2 - (\sum x_i)^2}
\begin{pmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ -\sum x_i \sum y_i + n \sum x_i y_i \end{pmatrix}.
\end{align*}
Now $\hat\beta$, according to this formula, is
\begin{align*}
\hat\beta &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}
= \frac{\sum x_i y_i - \bar x \sum y_i}{\sum x_i^2 - \bar x \sum x_i} \\
&= \frac{\sum x_i y_i - n \bar x \bar y}{\sum x_i^2 - n \bar x^2}
= \frac{\sum (y_i - \bar y)(x_i - \bar x)}{\sum (x_i - \bar x)^2}, \qquad (6.26)
\end{align*}
while
\begin{align*}
\hat\alpha &= \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - (\sum x_i)^2}
= \frac{\bar y \sum x_i^2 - \bar x \sum x_i y_i}{\sum x_i^2 - \bar x \sum x_i} \\
&= \bar y + \frac{\bar x \left( \bar y \sum x_i - \sum x_i y_i \right)}{\sum x_i^2 - \bar x \sum x_i}
= \bar y - \hat\beta \bar x, \qquad (6.27)
\end{align*}
where the last equality uses $\bar y \sum x_i = \bar x \sum y_i = n \bar x \bar y$.

6.3 Basic Statistical Properties

6.3.1 Method of Moments Interpretation
Suppose, for the moment, that $x_i$ is a random variable that is uncorrelated with $u_i$. Let $\mu_x = E(x_i)$. Then
\[
\mu_y = E(y_i) = E(\alpha + \beta x_i + u_i) = \alpha + \beta \mu_x. \qquad (6.28)
\]
Thus,
\[
y_i - \mu_y = \alpha + \beta x_i + u_i - (\alpha + \beta \mu_x) = \beta (x_i - \mu_x) + u_i, \qquad (6.29)
\]
and
\[
E(y_i - \mu_y)(x_i - \mu_x) = \beta E(x_i - \mu_x)^2 + E(x_i - \mu_x)u_i = \beta E(x_i - \mu_x)^2, \qquad (6.30)
\]
since $x_i$ and $u_i$ are uncorrelated. Solving for $\beta$, we have
\[
\beta = \frac{E(x_i - \mu_x)(y_i - \mu_y)}{E(x_i - \mu_x)^2}, \qquad (6.31)
\]
while
\[
\alpha = \mu_y - \beta \mu_x. \qquad (6.32)
\]
Thus the least-squares estimators are method-of-moments estimators, with sample moments replacing the population moments.
6.3.2 Mean of Estimates

Next, note that
\begin{align*}
\hat\beta &= \frac{\sum (x_i - \bar x) y_i}{\sum (x_i - \bar x)^2}
= \frac{\sum (x_i - \bar x)(\alpha + \beta x_i + u_i)}{\sum (x_i - \bar x)^2} \\
&= \alpha \frac{\sum (x_i - \bar x)}{\sum (x_i - \bar x)^2}
+ \beta \frac{\sum (x_i - \bar x) x_i}{\sum (x_i - \bar x)^2}
+ \frac{\sum (x_i - \bar x) u_i}{\sum (x_i - \bar x)^2} \\
&= \beta + \frac{\sum (x_i - \bar x) u_i}{\sum (x_i - \bar x)^2}, \qquad (6.33)
\end{align*}
since $\sum (x_i - \bar x) = 0$ and $\sum (x_i - \bar x)x_i = \sum (x_i - \bar x)^2$. So E$(\hat\beta) = \beta$, since E$(u_i) = 0$ for all $i$. Thus, $\hat\beta$ is an unbiased estimator of $\beta$. Further,
\begin{align*}
\hat\alpha &= \bar y - \hat\beta \bar x \qquad (6.34) \\
&= \sum_{i=1}^n \frac{y_i}{n} - \bar x \frac{\sum_{i=1}^n (x_i - \bar x) y_i}{\sum_{i=1}^n (x_i - \bar x)^2}
= \sum_{i=1}^n \left( \frac{1}{n} - \bar x \frac{x_i - \bar x}{\sum_{j=1}^n (x_j - \bar x)^2} \right) y_i \\
&= \alpha + \sum_{i=1}^n \left( \frac{1}{n} - \bar x \frac{x_i - \bar x}{\sum_{j=1}^n (x_j - \bar x)^2} \right) u_i,
\end{align*}
where the last line follows by substituting $y_i = \alpha + \beta x_i + u_i$: the $\alpha$ terms sum to $\alpha$ and the $\beta$ terms sum to $\beta(\bar x - \bar x) = 0$. So we have E$(\hat\alpha) = \alpha$, and $\hat\alpha$ is an unbiased estimator of $\alpha$.
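The unbiasedness results can be illustrated by a small Monte Carlo sketch: holding $x$ fixed across replications and drawing fresh disturbances each time, the averages of the estimates approach the true parameters. The values $\alpha = 15$, $\beta = -2$, $\sigma = 1$, the seed, and the replication count are illustrative choices, not anything from the text:

```python
# Monte Carlo sketch of E(beta-hat) = beta and E(alpha-hat) = alpha.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
alpha, beta, sigma = 15.0, -2.0, 1.0   # illustrative true values

xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()

a_hats, b_hats = [], []
for _ in range(20000):
    u = rng.normal(0.0, sigma, size=x.size)   # fresh disturbances
    y = alpha + beta * x + u
    b = ((x - xbar) * (y - y.mean())).sum() / sxx   # (6.18)
    a = y.mean() - b * xbar                         # (6.15)
    a_hats.append(a)
    b_hats.append(b)

print(np.mean(a_hats), np.mean(b_hats))  # close to 15 and -2
```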
6.3.3 Variance of Estimates

Now,
\[
\hat\beta - \beta = \frac{\sum_{i=1}^n (x_i - \bar x) u_i}{\sum_{i=1}^n (x_i - \bar x)^2} = \sum_{i=1}^n w_i u_i, \qquad (6.35)
\]
where
\[
w_i = \frac{x_i - \bar x}{\sum_{j=1}^n (x_j - \bar x)^2}. \qquad (6.36)
\]
So,
\begin{align*}
\mathrm{Var}(\hat\beta) &= E(\hat\beta - \beta)^2
= E\left( \sum_{i=1}^n w_i u_i \right)^2
= E(w_1 u_1 + w_2 u_2 + \cdots + w_n u_n)^2 \\
&= E[(w_1^2 u_1^2 + w_2^2 u_2^2 + \cdots + w_n^2 u_n^2) + 2(w_1 u_1 w_2 u_2 + \cdots + w_{n-1} u_{n-1} w_n u_n)] \\
&= w_1^2 \sigma^2 + w_2^2 \sigma^2 + \cdots + w_n^2 \sigma^2
= \sigma^2 \sum_{i=1}^n w_i^2
= \sigma^2 \sum_{i=1}^n \left( \frac{x_i - \bar x}{\sum_{j=1}^n (x_j - \bar x)^2} \right)^2 \\
&= \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad (6.37)
\end{align*}
using assumptions (ii) and (iii). Next, we note that
\[
\hat\alpha - \alpha = \sum_{i=1}^n \left( \frac{1}{n} - \bar x \frac{x_i - \bar x}{\sum_{j=1}^n (x_j - \bar x)^2} \right) u_i = \sum_{i=1}^n v_i u_i, \qquad (6.38)
\]
where
\[
v_i = \frac{1}{n} - \bar x \frac{x_i - \bar x}{\sum_{j=1}^n (x_j - \bar x)^2}. \qquad (6.39)
\]
So, in a fashion similar to the one above, we find that
\[
\mathrm{Var}(\hat\alpha) = \sigma^2 \frac{\sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar x)^2}, \qquad (6.40)
\]
and
\[
\mathrm{Cov}(\hat\alpha, \hat\beta) = E(\hat\alpha - \alpha)(\hat\beta - \beta)
= \sigma^2 \sum_{i=1}^n v_i w_i
= -\sigma^2 \frac{\bar x}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad (6.41)
\]
since $\sum_i w_i = 0$ and $\sum_i w_i (x_i - \bar x) = 1$.
6.3.4 Estimation of $\sigma^2$

Next, we would like to get an estimate of $\sigma^2$. Let
\[
s^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2, \qquad (6.42)
\]
where
\begin{align*}
e_i &= y_i - \hat\alpha - \hat\beta x_i \qquad (6.43) \\
&= (y_i - \bar y) - \hat\beta (x_i - \bar x) \qquad (6.44) \\
&= \beta (x_i - \bar x) + (u_i - \bar u) - \hat\beta (x_i - \bar x) \\
&= -(\hat\beta - \beta)(x_i - \bar x) + (u_i - \bar u),
\end{align*}
using $\hat\alpha = \bar y - \hat\beta \bar x$ and $y_i - \bar y = \beta (x_i - \bar x) + (u_i - \bar u)$.
So,
\begin{align*}
\sum_{i=1}^n e_i^2 &= \sum_{i=1}^n [-(\hat\beta - \beta)(x_i - \bar x) + (u_i - \bar u)]^2 \\
&= (\hat\beta - \beta)^2 \sum_{i=1}^n (x_i - \bar x)^2
- 2(\hat\beta - \beta) \sum_{i=1}^n (x_i - \bar x)(u_i - \bar u)
+ \sum_{i=1}^n (u_i - \bar u)^2. \qquad (6.45)
\end{align*}
Now, we have
\[
E\left[ (\hat\beta - \beta)^2 \sum_{i=1}^n (x_i - \bar x)^2 \right] = \sigma^2, \qquad (6.46)
\]
since $\mathrm{Var}(\hat\beta) = \sigma^2 / \sum_i (x_i - \bar x)^2$,
\[
E\left( \sum_{i=1}^n (u_i - \bar u)^2 \right)
= E\left( \sum_{i=1}^n u_i^2 - n \bar u^2 \right)
= E\left( \sum_{i=1}^n u_i^2 \right) - \frac{1}{n} E\left( \sum_{i=1}^n u_i \right)^2
= (n-1)\sigma^2, \qquad (6.47)
\]
and
\begin{align*}
E\left[ (\hat\beta - \beta) \sum_{i=1}^n (x_i - \bar x)(u_i - \bar u) \right]
&= E\left[ \left( \sum_{i=1}^n w_i u_i \right)\left( \sum_{i=1}^n (x_i - \bar x)u_i - \bar u \sum_{i=1}^n (x_i - \bar x) \right) \right] \\
&= E\left[ \left( \sum_{i=1}^n w_i u_i \right)\left( \sum_{i=1}^n (x_i - \bar x)u_i \right) \right]
= \sum_{i=1}^n w_i (x_i - \bar x)\sigma^2 = \sigma^2. \qquad (6.48)
\end{align*}
Therefore,
\[
E\left( \sum_{i=1}^n e_i^2 \right) = \sigma^2 - 2\sigma^2 + (n-1)\sigma^2 = (n-2)\sigma^2, \qquad (6.49)
\]
and
\[
E(s^2) = \sigma^2. \qquad (6.50)
\]

6.4 Statistical Properties Under Normality

6.4.1 Distribution of $\hat\beta$
Suppose that $u_i \sim \text{iid } N(0, \sigma^2)$. Then,
\[
\hat\beta = \beta + \sum_{i=1}^n w_i u_i, \qquad (6.51)
\]
where $w_i = (x_i - \bar x)/\sum_{j=1}^n (x_j - \bar x)^2$. Then $\hat\beta$ is also a normal random variable. Specifically,
\[
\hat\beta \sim N(\beta, \sigma^2 q), \qquad (6.52)
\]
where $q = 1/\sum_{i=1}^n (x_i - \bar x)^2$. Thus, for a given $\beta_0$,
\[
\hat\beta - \beta_0 \sim N(\beta - \beta_0, \sigma^2 q), \qquad (6.53)
\]
and
\[
z_\beta = \frac{\hat\beta - \beta_0}{\sqrt{\sigma^2 q}}
\sim N\!\left( \frac{\beta - \beta_0}{\sqrt{\sigma^2 q}},\, 1 \right). \qquad (6.54)
\]
Now, suppose that $H_0 : \beta = \beta_0$. Then,
\[
\frac{\hat\beta - \beta_0}{\sqrt{\sigma^2 q}} \sim N(0, 1), \qquad (6.55)
\]
while under the alternative $\beta = \beta_1 > \beta_0$ we have
\[
\frac{\hat\beta - \beta_0}{\sqrt{\sigma^2 q}}
\sim N\!\left( \frac{\beta_1 - \beta_0}{\sqrt{\sigma^2 q}},\, 1 \right), \qquad (6.56)
\]
and we would expect the statistic to be centered to the right of zero.
6.4.2 Distribution of $\hat\alpha$

Again, suppose that $u_i \sim \text{iid } N(0, \sigma^2)$. Then,
\[
\hat\alpha = \alpha + \sum_{i=1}^n v_i u_i, \qquad (6.57)
\]
where $v_i = 1/n - \bar x (x_i - \bar x)/\sum_{j=1}^n (x_j - \bar x)^2$. Then $\hat\alpha$ is a normal random variable. Specifically,
\[
\hat\alpha \sim N(\alpha, \sigma^2 p), \qquad (6.58)
\]
where $p = \sum_{i=1}^n x_i^2 / \left( n \sum_{i=1}^n (x_i - \bar x)^2 \right)$. Thus, for a given $\alpha_0$,
\[
\hat\alpha - \alpha_0 \sim N(\alpha - \alpha_0, \sigma^2 p), \qquad (6.59)
\]
and
\[
z_\alpha = \frac{\hat\alpha - \alpha_0}{\sqrt{\sigma^2 p}}
\sim N\!\left( \frac{\alpha - \alpha_0}{\sqrt{\sigma^2 p}},\, 1 \right). \qquad (6.60)
\]
Now, suppose that $H_0 : \alpha = \alpha_0$. Then,
\[
\frac{\hat\alpha - \alpha_0}{\sqrt{\sigma^2 p}} \sim N(0, 1), \qquad (6.61)
\]
while under the alternative $\alpha = \alpha_1 \neq \alpha_0$ we have
\[
\frac{\hat\alpha - \alpha_0}{\sqrt{\sigma^2 p}}
\sim N\!\left( \frac{\alpha_1 - \alpha_0}{\sqrt{\sigma^2 p}},\, 1 \right), \qquad (6.62)
\]
and we would expect the statistic not to be centered around zero. It is important to note that
\[
p = \frac{\sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar x)^2},
\]
and so $\sigma^2 p$ grows small as $n$ grows large. Thus, the noncentrality of the distribution of the statistic $z_\alpha$ will also grow under the alternative. A similar statement can be made concerning $q$ and the statistic $z_\beta$.
6.4.3 t-Distribution

In most cases, we do not know the value of $\sigma^2$. A possible alternative is to use $s^2$, whereupon we obtain
\[
\frac{\hat\alpha - \alpha_0}{\sqrt{s^2 p}} \sim t_{n-2} \qquad (6.63)
\]
under $H_0 : \alpha = \alpha_0$, and
\[
\frac{\hat\beta - \beta_0}{\sqrt{s^2 q}} \sim t_{n-2} \qquad (6.64)
\]
under $H_0 : \beta = \beta_0$. The t-distribution is quite similar to the standard normal except that it has somewhat fatter tails. This reflects the added uncertainty introduced by using $s^2$ rather than $\sigma^2$. However, as $n$ increases and the precision of $s^2$ improves, the t-distribution grows closer and closer to the $N(0,1)$. We lose two degrees of freedom (and so $t_{n-2}$) because we estimated two coefficients, namely $\alpha$ and $\beta$. Just as in the case of $z_\alpha$ and $z_\beta$, we would expect the t-statistic to be off-center if the null hypothesis were not true.
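The fatter tails matter in practice: a Monte Carlo sketch shows that judging the statistic $(\hat\beta - \beta_0)/\sqrt{s^2 q}$ against the normal critical value 1.96 rejects a true null far more often than 5% when $n$ is small. The design ($n = 5$, so 3 degrees of freedom, with illustrative parameter values and seed) is an assumption of the sketch, not taken from the text:

```python
# Under the null, (beta-hat - beta)/sqrt(s^2 q) is t with n-2 df; with
# n = 5 its tails are much fatter than N(0,1), so the N(0,1) cutoff 1.96
# over-rejects. (The correct two-sided 5% cutoff for t_3 is 3.182.)
import numpy as np

rng = np.random.default_rng(1)
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
alpha, beta, sigma = 15.0, -2.0, 1.0   # illustrative true values
xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()
q = 1.0 / sxx

rejects, reps = 0, 20000
for _ in range(reps):
    u = rng.normal(0.0, sigma, size=x.size)
    y = alpha + beta * x + u
    b = ((x - xbar) * (y - y.mean())).sum() / sxx
    a = y.mean() - b * xbar
    e = y - (a + b * x)
    s2 = (e ** 2).sum() / (x.size - 2)
    t = (b - beta) / np.sqrt(s2 * q)    # true null: testing beta_0 = beta
    if abs(t) > 1.96:                   # N(0,1) cutoff, wrong for 3 df
        rejects += 1

print(rejects / reps)  # well above 0.05
```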
6.4.4 Maximum Likelihood

Suppose that $u_i \sim \text{iid } N(0, \sigma^2)$. Then,
\[
y_i \sim \text{iid } N(\alpha + \beta x_i, \sigma^2). \qquad (6.65)
\]
Then, the pdf of $y_i$ is given by
\[
f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} [y_i - (\alpha + \beta x_i)]^2 \right\}. \qquad (6.66)
\]
Since the observations are independent, we can write the joint likelihood function as
\begin{align*}
f(y_1, y_2, \ldots, y_n) &= f(y_1) f(y_2) \cdots f(y_n) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n [y_i - (\alpha + \beta x_i)]^2 \right\} \\
&= L(\alpha, \beta, \sigma^2 \,|\, y, x). \qquad (6.67)
\end{align*}
Now, let $\mathcal{L} = \log L(\alpha, \beta, \sigma^2 \,|\, y, x)$. We seek to maximize
\[
\mathcal{L} = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n [y_i - (\alpha + \beta x_i)]^2. \qquad (6.68)
\]
Note that for (6.68) to be a maximum with respect to $\alpha$ and $\beta$, we must minimize $\sum_{i=1}^n [y_i - (\alpha + \beta x_i)]^2$. The first-order conditions for (6.68) are
\[
\frac{\partial \mathcal{L}}{\partial \alpha} = \frac{1}{\hat\sigma^2} \sum_{i=1}^n [y_i - (\hat\alpha + \hat\beta x_i)] = 0, \qquad (6.69)
\]
\[
\frac{\partial \mathcal{L}}{\partial \beta} = \frac{1}{\hat\sigma^2} \sum_{i=1}^n [y_i - (\hat\alpha + \hat\beta x_i)] x_i = 0, \qquad (6.70)
\]
and
\[
\frac{\partial \mathcal{L}}{\partial \sigma^2} = -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4} \sum_{i=1}^n [y_i - (\hat\alpha + \hat\beta x_i)]^2 = 0. \qquad (6.71)
\]
Note that the first two conditions imply that
\[
\hat\alpha = \bar y - \hat\beta \bar x, \qquad (6.72)
\]
and
\[
\hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad (6.73)
\]
since these are the same as the normal equations (except for the factor $\hat\sigma^2$). The third condition yields
\[
\hat\sigma^2 = \frac{\sum_{i=1}^n [y_i - (\hat\alpha + \hat\beta x_i)]^2}{n}
= \frac{\sum_{i=1}^n e_i^2}{n} = \frac{n-2}{n}\, s^2. \qquad (6.74)
\]

x_i | y_i | x_i − x̄ | y_i − ȳ | (x_i − x̄)² | (x_i − x̄)(y_i − ȳ)
 2  | 12  |   −2    |    5    |      4      |        −10
 3  |  7  |   −1    |    0    |      1      |          0
 4  |  8  |    0    |    1    |      0      |          0
 5  |  5  |    1    |   −2    |      1      |         −2
 6  |  3  |    2    |   −4    |      4      |         −8
20  | 35  |    0    |    0    |     10      |        −20

Table 6.1: Summary table.
6.5 An Example

Consider the scatter plot given in Figure 6.1. From this, we construct Table 6.1. Thus, we have
\[
\hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{-20}{10} = -2,
\]
and
\[
\hat\alpha = \bar y - \hat\beta \bar x = 7 - (-2)(4) = 15.
\]
Now, we calculate the residuals in Table 6.2.

y_i | β̂x_i | α̂ + β̂x_i | e_i | e_i²
12  |  −4   |    11     |  1  |  1
 7  |  −6   |     9     | −2  |  4
 8  |  −8   |     7     |  1  |  1
 5  | −10   |     5     |  0  |  0
 3  | −12   |     3     |  0  |  0

Table 6.2: Residual calculations.

We find
\[
\sum_{i=1}^n e_i^2 = 6
\]
and
\[
s^2 = \frac{\sum_{i=1}^n e_i^2}{n-2} = \frac{6}{3} = 2.
\]
An estimate of the variance of $\hat\beta$, namely $\sigma^2 q$, is provided by
\[
s^2 q = s^2 \frac{1}{\sum_{i=1}^n (x_i - \bar x)^2} = 2 \cdot \frac{1}{10} = 0.2.
\]
Suppose we wish to test $H_0 : \beta = 0$ against $H_1 : \beta \neq 0$. Then
\[
\frac{\hat\beta - 0}{\sqrt{s^2 q}} \sim t_3
\]
under the null hypothesis, but
\[
\frac{\hat\beta - 0}{\sqrt{s^2 q}} = \frac{-2}{\sqrt{0.2}} \approx -4.47
\]
is clearly in the left-hand 2.5% tail of the $t_3$-distribution (the critical value is $-3.182$). Thus, we would reject the null hypothesis at the 5% significance level.
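The whole worked example can be checked with a few lines of pure Python, reproducing Tables 6.1 and 6.2, $s^2 = 2$, and the $t$ statistic of about $-4.47$:

```python
# A pure-Python check of the worked example of Section 6.5.
x = [2, 3, 4, 5, 6]
y = [12, 7, 8, 5, 3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n            # 4, 7

sxx = sum((xi - xbar) ** 2 for xi in x)        # 10
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # -20

beta_hat = sxy / sxx                           # -2.0
alpha_hat = ybar - beta_hat * xbar             # 15.0

e = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]
sse = sum(ei ** 2 for ei in e)                 # 6.0
s2 = sse / (n - 2)                             # 2.0
t = (beta_hat - 0.0) / (s2 / sxx) ** 0.5       # about -4.47

print(beta_hat, alpha_hat, s2, round(t, 2))
```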
Chapter 7

Linear Least Squares

7.1 Multiple Regression Model

The general k-variable linear model can be written as
\[
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i, \qquad i = 1, 2, \ldots, n. \qquad (7.1)
\]
Using matrix techniques, we can equivalently write this model as
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
+ \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \qquad (7.2)
\]
or, more compactly, as
\[
y = X\beta + u. \qquad (7.3)
\]
7.1.1 Assumptions

In the general k-variable linear model, we make the following assumptions about the disturbances:

(i) E$(u_i) = 0$, $i = 1, 2, \ldots, n$
(ii) E$(u_i^2) = \sigma^2$, $i = 1, 2, \ldots, n$
(iii) E$(u_i u_j) = 0$, $i \neq j$

These assumptions can be written in matrix notation as
\[
E(u) = 0 \qquad (7.4)
\]
and
\[
\mathrm{Cov}(u) = E(uu') = \sigma^2 I_n, \qquad (7.5)
\]
where $I_n$ is an $n \times n$ identity matrix. The nonstochastic assumptions are

(iv) $X$ is nonstochastic.
(v) $X$ has full column rank (the columns are linearly independent).

Sometimes, we will also assume that the $u_i$'s are normally distributed:

(vi) $u \sim N(0, \sigma^2 I_n)$.
7.1.2 Plane Fitting

Suppose $k = 3$ and $x_{i1} = 1$. Then,
\[
y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + u_i, \qquad i = 1, 2, \ldots, n. \qquad (7.6)
\]
Now, the fitted values
\[
\hat y_i = \hat\beta_1 + \hat\beta_2 x_{i2} + \hat\beta_3 x_{i3}, \qquad i = 1, 2, \ldots, n, \qquad (7.7)
\]
define a plane in the three-dimensional space of $y$, $x_2$, and $x_3$. We seek to choose $\hat\beta_1$, $\hat\beta_2$, and $\hat\beta_3$ so that the points on the plane corresponding to $x_{i2}$ and $x_{i3}$, namely $\hat y_i$, will be close to $y_i$. That is, we will "fit" a plane to the observations. As we did in the two-dimensional case, we measure closeness by the vertical distance. That is,
\[
e_i = y_i - (\hat\beta_1 + \hat\beta_2 x_{i2} + \hat\beta_3 x_{i3}), \qquad i = 1, 2, \ldots, n, \qquad (7.8)
\]
and
\[
\phi = \sum_{i=1}^n e_i^2. \qquad (7.9)
\]

7.1.3 Least Squares

In general, we want to
\[
\min_\beta \sum_{i=1}^n [y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})]^2. \qquad (7.10)
\]
7.2 Least Squares Regression

7.2.1 The OLS Estimator

As was stated above, we seek to minimize
\[
\phi = \sum_{i=1}^n [y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})]^2 \qquad (7.11)
\]
with respect to the coefficients $\beta_1, \beta_2, \ldots, \beta_k$. The first-order conditions are
\begin{align*}
0 &= \frac{\partial \phi}{\partial \beta_1} = -2 \sum_{i=1}^n [y_i - (\hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik})] x_{i1}, \\
0 &= \frac{\partial \phi}{\partial \beta_2} = -2 \sum_{i=1}^n [y_i - (\hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik})] x_{i2}, \\
&\;\;\vdots \\
0 &= \frac{\partial \phi}{\partial \beta_k} = -2 \sum_{i=1}^n [y_i - (\hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik})] x_{ik}, \qquad (7.12)
\end{align*}
where $(\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k)$ are the solutions. Rearranging, we have the normal equations:
\begin{align*}
\hat\beta_1 \sum_{i=1}^n x_{i1}^2 + \hat\beta_2 \sum_{i=1}^n x_{i1} x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^n x_{i1} x_{ik} &= \sum_{i=1}^n x_{i1} y_i \\
\hat\beta_1 \sum_{i=1}^n x_{i2} x_{i1} + \hat\beta_2 \sum_{i=1}^n x_{i2}^2 + \cdots + \hat\beta_k \sum_{i=1}^n x_{i2} x_{ik} &= \sum_{i=1}^n x_{i2} y_i \\
&\;\;\vdots \\
\hat\beta_1 \sum_{i=1}^n x_{ik} x_{i1} + \hat\beta_2 \sum_{i=1}^n x_{ik} x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^n x_{ik}^2 &= \sum_{i=1}^n x_{ik} y_i \qquad (7.13)
\end{align*}
or
\[
\begin{pmatrix}
\sum x_{i1}^2 & \sum x_{i1} x_{i2} & \cdots & \sum x_{i1} x_{ik} \\
\sum x_{i2} x_{i1} & \sum x_{i2}^2 & \cdots & \sum x_{i2} x_{ik} \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_{ik} x_{i1} & \sum x_{ik} x_{i2} & \cdots & \sum x_{ik}^2
\end{pmatrix}
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_k \end{pmatrix}
= \begin{pmatrix} \sum x_{i1} y_i \\ \sum x_{i2} y_i \\ \vdots \\ \sum x_{ik} y_i \end{pmatrix}. \qquad (7.14)
\]
As in the bivariate model, write
\[
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}, \qquad (7.15)
\]
so
\[
X'X = \begin{pmatrix}
\sum x_{i1}^2 & \sum x_{i1} x_{i2} & \cdots & \sum x_{i1} x_{ik} \\
\sum x_{i2} x_{i1} & \sum x_{i2}^2 & \cdots & \sum x_{i2} x_{ik} \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_{ik} x_{i1} & \sum x_{ik} x_{i2} & \cdots & \sum x_{ik}^2
\end{pmatrix}, \qquad (7.16)
\]
and
\[
X'y = \begin{pmatrix} \sum x_{i1} y_i \\ \sum x_{i2} y_i \\ \vdots \\ \sum x_{ik} y_i \end{pmatrix}. \qquad (7.17)
\]
Thus, we can write the normal equations in matrix notation:
\[
X'X \hat\beta = X'y, \qquad (7.18)
\]
where $\hat\beta' = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k)$. Therefore, we have the unique solution
\[
\hat\beta = (X'X)^{-1} X'y \qquad (7.19)
\]
as long as $|X'X| \neq 0$, which is assured by Assumption (v).
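The normal equations (7.18) can be solved directly in NumPy. A minimal sketch, run on the bivariate data of Chapter 6 so the answer is known ($\hat\alpha = 15$, $\hat\beta = -2$); solving the system is preferred to forming an explicit inverse:

```python
# Solving the normal equations X'X beta-hat = X'y  (equations 7.18-7.19).
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # n x k design matrix, k = 2

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [15, -2]
```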
7.2.2 Some Algebraic Results

Define the fitted value for each $i$ as $\hat y_i = x_{i1}\hat\beta_1 + x_{i2}\hat\beta_2 + \cdots + x_{ik}\hat\beta_k = x_i'\hat\beta$, whereupon
\[
\hat y = X\hat\beta. \qquad (7.20)
\]
Next define the OLS residual for each $i$ as $e_i = y_i - \hat y_i$, so
\[
e = y - \hat y. \qquad (7.21)
\]
Then,
\[
X'e = X'(y - \hat y) = X'y - X'X(X'X)^{-1}X'y = 0, \qquad (7.22)
\]
and we say that the residuals are orthogonal to the regressors. Also, we find
\[
\hat y' y = (X\hat\beta)'(X\hat\beta + e)
= \hat\beta' X'X \hat\beta + \hat\beta' X'e
= \hat\beta' X'X \hat\beta
= \hat y' \hat y. \qquad (7.23)
\]
Now, suppose that the first coefficient is the intercept. Then the first column of $X$, and hence the first row of $X'$, is all ones. This means that
\[
0 = X'e = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1k} & x_{2k} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}
= \begin{pmatrix} \sum e_i \\ \sum x_{i2} e_i \\ \vdots \\ \sum x_{ik} e_i \end{pmatrix}. \qquad (7.24)
\]
So $\sum_{i=1}^n e_i = 0$, which means that
\[
\sum_{i=1}^n y_i = \sum_{i=1}^n (\hat y_i + e_i) = \sum_{i=1}^n \hat y_i. \qquad (7.25)
\]
Finally, we note that
\begin{align*}
e &= y - X\hat\beta \\
&= y - X(X'X)^{-1}X'y \\
&= [I_n - X(X'X)^{-1}X']y \\
&= My \\
&= M(X\beta + u) \\
&= [I_n - X(X'X)^{-1}X'](X\beta) + Mu \\
&= Mu, \qquad (7.26)
\end{align*}
since $MX = 0$. We see that the OLS residuals are a linear transformation of the underlying disturbances. The matrix $M$, which plays an important role in the sequel, has the property $M = M \cdot M$; that is, it is idempotent.
7.2.3 The R² Statistic

Define the following:
\[
\mathrm{SSE} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2, \qquad (7.27)
\]
\[
\mathrm{SST} = \sum_{i=1}^n (y_i - \bar y)^2, \qquad (7.28)
\]
\[
\mathrm{SSR} = \sum_{i=1}^n (\hat y_i - \bar y)^2. \qquad (7.29)
\]
SSE is the variation of the actual values around the fitted plane and is called the unexplained (or error) sum-of-squares. SSR, the regression (or explained) sum-of-squares, is the variation of the fitted values around the sample mean, and SST, the total sum-of-squares, is the variation of the actual values around the sample mean. The three sums-of-squares are closely related. Consider
\begin{align*}
\mathrm{SST} - \mathrm{SSE}
&= \sum_{i=1}^n [(y_i - \bar y)^2 - (y_i - \hat y_i)^2] \\
&= \sum_{i=1}^n [\bar y^2 - 2\bar y y_i + 2 y_i \hat y_i - \hat y_i^2] \\
&= \sum_{i=1}^n [\bar y^2 - 2\bar y \hat y_i + \hat y_i^2]
\qquad \text{(using } \textstyle\sum y_i \hat y_i = \sum \hat y_i^2 \text{ from (7.23) and } \sum y_i = \sum \hat y_i \text{ from (7.25))} \\
&= \sum_{i=1}^n (\hat y_i - \bar y)^2 = \mathrm{SSR}. \qquad (7.30)
\end{align*}
Thus $\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR}$. We now define
\[
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = \frac{\mathrm{SST}}{\mathrm{SST}} - \frac{\mathrm{SSE}}{\mathrm{SST}} = \frac{\mathrm{SSR}}{\mathrm{SST}} \qquad (7.31)
\]
as the percent of total variation explained by the model. This statistic can also be interpreted as a squared correlation coefficient. Consider the sample second moments,
\[
\widehat{\mathrm{Var}}(y) = \frac{1}{n} \sum_{i=1}^n (y_i - \bar y)^2 = \frac{1}{n} \mathrm{SST}, \qquad (7.32)
\]
\[
\widehat{\mathrm{Var}}(\hat y) = \frac{1}{n} \sum_{i=1}^n (\hat y_i - \bar y)^2 = \frac{1}{n} \mathrm{SSR}, \qquad (7.33)
\]
and
\begin{align*}
\widehat{\mathrm{Cov}}(y, \hat y)
&= \frac{1}{n} \sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar y)
= \frac{1}{n} \sum_{i=1}^n (y_i \hat y_i - \bar y \hat y_i - \bar y y_i + \bar y^2) \\
&= \frac{1}{n} \sum_{i=1}^n (\hat y_i^2 - 2\bar y \hat y_i + \bar y^2)
= \frac{1}{n} \sum_{i=1}^n (\hat y_i - \bar y)^2
= \frac{1}{n} \mathrm{SSR}. \qquad (7.34)
\end{align*}
Then the correlation between $y_i$ and $\hat y_i$ can be written as
\[
\hat\rho_{y,\hat y}
= \frac{\widehat{\mathrm{Cov}}(y, \hat y)}{\sqrt{\widehat{\mathrm{Var}}(y)\, \widehat{\mathrm{Var}}(\hat y)}}
= \frac{\frac{1}{n}\mathrm{SSR}}{\sqrt{\frac{1}{n}\mathrm{SST} \cdot \frac{1}{n}\mathrm{SSR}}}
= \frac{\sqrt{\mathrm{SSR}}}{\sqrt{\mathrm{SST}}},
\]
so
\[
\hat\rho_{y,\hat y}^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = R^2. \qquad (7.35)
\]
This statistic is variously called the coefficient of determination, the multiple correlation statistic, and the "R-squared" statistic.
Basic Statistical Results b Mean and Covariance of β
From the previous section, under assumption (v), we have b β
= = = =
( X0 X )−1 X0 y ( X0 X )−1 X0 ( Xβ + u ) ( X0 X )−1 X0 Xβ + ( X0 X )−1 X0 u β + ( X0 X )−1 X0 u.
(7.36)
7.3. BASIC STATISTICAL RESULTS
69
Since X is nonstochasticby assumption (iv), we have, also using assumption (i), b ) = β + E[( X0 X )−1 X0 u ] E( β = β + ( X0 X )−1 X0 E( u ) = β.
(7.37)
Thus, the OLS estimator is unbiased. Also, using assumptions (ii) and (iii), b ) = E( β b − β )( β b − β )0 Cov( β = E[( X0 X )−1 X0 uu0 X( X0 X )−1 ] = ( X0 X )−1 X0 E( uu0 )X( X0 X )−1 = ( X0 X )−1 X0 σ 2 In X( X0 X )−1 = σ 2 ( X0 X )−1 .
7.3.2
(7.38)
7.3.2 Best Linear Unbiased Estimator (BLUE)

The OLS estimator is linear in $y$, and it is an unbiased estimator, as we saw above. Let
\[
\tilde\beta = \tilde{A}y, \qquad (7.39)
\]
where $\tilde{A}$ is a $k \times n$ nonstochastic matrix, be any other linear unbiased estimator. That is, E$(\tilde\beta) = \beta$. Define
\[
A = \tilde{A} - (X'X)^{-1}X'. \qquad (7.40)
\]
Then,
\[
\tilde\beta = [A + (X'X)^{-1}X']y = [A + (X'X)^{-1}X'][X\beta + u] = AX\beta + \beta + [A + (X'X)^{-1}X']u. \qquad (7.41)
\]
Now, $\tilde\beta$ is an unbiased estimator, so
\[
E(\tilde\beta) = AX\beta + \beta + [A + (X'X)^{-1}X']E(u) = AX\beta + \beta = \beta, \qquad (7.42)
\]
which implies that $AX = 0$ (since the equality must hold for all $\beta$). Thus,
\[
\tilde\beta = \beta + [A + (X'X)^{-1}X']u, \qquad (7.43)
\]
and
\begin{align*}
\mathrm{Cov}(\tilde\beta) &= E(\tilde\beta - \beta)(\tilde\beta - \beta)' \\
&= E\{[A + (X'X)^{-1}X']uu'[A + (X'X)^{-1}X']'\} \\
&= [A + (X'X)^{-1}X']E(uu')[A + (X'X)^{-1}X']' \\
&= [A + (X'X)^{-1}X']\sigma^2 I_n [A + (X'X)^{-1}X']' \\
&= \sigma^2 [AA' + (X'X)^{-1}] \\
&= \sigma^2 AA' + \sigma^2 (X'X)^{-1}, \qquad (7.44)
\end{align*}
where the cross terms vanish since $AX = 0$. This shows that the covariance matrix of any other linear unbiased estimator exceeds the covariance matrix of the OLS estimator by a positive semi-definite matrix $\sigma^2 AA'$. Hence, OLS is said to be the best linear unbiased estimator (BLUE). Note that we have used all of the assumptions (i)-(v) to get to this point.
7.3.3 Consistency

Typically, the elements of $X'X$ are unbounded (they go to infinity) as $n$ gets very large. For example, with an intercept the $(1,1)$ element is $n$, and the $(j,j)$ element is $\sum_{i=1}^n x_{ij}^2$. Therefore,
\[
\lim_{n \to \infty} (X'X)^{-1} = 0, \qquad (7.45)
\]
and the variances of $\hat\beta$ converge to zero. This means that the distribution collapses about its expected value, namely $\beta$. So,
\[
\operatorname*{plim}_{n \to \infty} \hat\beta = \beta, \qquad (7.46)
\]
and OLS estimation is consistent. A more formal proof of this property will be given in the chapter on stochastic regressors.
7.3.4 Estimation of $\sigma^2$

Recall that
\[
e = Mu, \qquad (7.47)
\]
where $M = I_n - X(X'X)^{-1}X'$. Then,
\[
e'e = (Mu)'Mu = u'MMu = u'Mu, \qquad (7.48)
\]
since $M$ is symmetric and idempotent. Also,
\[
e'e = \operatorname{tr} e'e = \operatorname{tr} u'Mu = \operatorname{tr} Muu', \qquad (7.49)
\]
since $e'e$ is a scalar, and $\operatorname{tr} AB = \operatorname{tr} BA$ when both multiplications are defined. Thus,
\[
E(e'e) = E(\operatorname{tr} Muu') = \operatorname{tr} M\, E(uu') = \operatorname{tr} M\sigma^2 I_n = \sigma^2 \operatorname{tr} M. \qquad (7.50)
\]
But,
\begin{align*}
\operatorname{tr} M &= \operatorname{tr}(I_n - X(X'X)^{-1}X') \\
&= \operatorname{tr} I_n - \operatorname{tr}(X(X'X)^{-1}X') \\
&= n - \operatorname{tr}((X'X)^{-1}X'X) \\
&= n - k. \qquad (7.51)
\end{align*}
Now, define
\[
s^2 = \frac{e'e}{n-k}. \qquad (7.52)
\]
Then
\[
E(s^2) = \frac{E(e'e)}{n-k} = \frac{\sigma^2(n-k)}{n-k} = \sigma^2, \qquad (7.53)
\]
so $s^2$ is an unbiased estimator of $\sigma^2$. We can also establish that $s^2$ is a consistent estimator of $\sigma^2$. That is,
\[
\operatorname*{plim}_{n \to \infty} s^2 = \sigma^2. \qquad (7.54)
\]
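The key facts about $M$ used in this derivation are easy to verify numerically. A NumPy sketch with an arbitrary (randomly generated, illustrative) design matrix:

```python
# Verifying properties of M = I - X(X'X)^{-1}X': symmetry/idempotence,
# orthogonality X'e = 0, and tr(M) = n - k.
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

u = rng.normal(size=n)
e = M @ u                       # residuals, e = Mu

print(np.trace(M))              # n - k = 17
print(np.abs(M @ M - M).max())  # ~0  (idempotent)
print(np.abs(X.T @ e).max())    # ~0  (orthogonality)
```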
7.3.5 Prediction

Suppose that we wish to predict
\[
y_p = \beta_1 x_{p1} + \beta_2 x_{p2} + \cdots + \beta_k x_{pk} + u_p = x_p'\beta + u_p. \qquad (7.55)
\]
Note that E$(y_p | x_p) = x_p'\beta$. A natural choice for a predictor is
\begin{align*}
\hat y_p &= x_p'\hat\beta \\
&= x_p'(X'X)^{-1}X'y \\
&= x_p'(X'X)^{-1}X'(X\beta + u) \\
&= x_p'\beta + x_p'(X'X)^{-1}X'u. \qquad (7.56)
\end{align*}
Now,
\[
E(\hat y_p | x_p) = x_p'\beta, \qquad (7.57)
\]
and
\[
E(y_p | x_p) = x_p'\beta, \qquad (7.58)
\]
so
\[
E[(y_p - \hat y_p) | x_p] = 0. \qquad (7.59)
\]
Hence, $\hat y_p$ is an unbiased predictor of $y_p$. We also have
\[
\mathrm{Var}(\hat y_p) = E(\hat y_p - x_p'\beta)^2 = \sigma^2 x_p'(X'X)^{-1}x_p, \qquad (7.60)
\]
while
\[
\mathrm{MSPE}(\hat y_p) = E(y_p - \hat y_p)^2 = \sigma^2 [1 + x_p'(X'X)^{-1}x_p]. \qquad (7.61)
\]
It can be shown, as above, that $\hat y_p$ is the best (minimum variance) linear unbiased predictor (BLUP) of $y_p$.
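A sketch of the predictor (7.56) and the estimated MSPE (7.61), with $\sigma^2$ replaced by $s^2$, on the Chapter 6 data. The prediction point $x_p = (1, 7)$, one step beyond the sample, is an illustrative choice:

```python
# Point prediction and estimated mean squared prediction error.
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 7.0, 8.0, 5.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = (e @ e) / (n - k)                   # unbiased estimate of sigma^2

xp = np.array([1.0, 7.0])                # illustrative prediction point
yhat_p = xp @ beta_hat                   # point prediction (7.56)
h = xp @ np.linalg.solve(X.T @ X, xp)    # x_p'(X'X)^{-1} x_p
mspe_hat = s2 * (1.0 + h)                # estimated MSPE (7.61)

print(yhat_p, mspe_hat)
```

The leverage term $x_p'(X'X)^{-1}x_p$ grows as $x_p$ moves away from the sample, which is why extrapolation carries a larger prediction error.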
7.4 Statistical Properties Under Normality

7.4.1 Distribution of $\hat\beta$

Suppose that we introduce assumption (vi), so the $u_i$'s are normal:
\[
u \sim N(0, \sigma^2 I_n). \qquad (7.62)
\]
Recall that
\[
\hat\beta = \beta + (X'X)^{-1}X'u, \qquad (7.63)
\]
so $\hat\beta$ is linear in $u$, and $X$ (and hence $(X'X)^{-1}X'$) is nonstochastic. Then $\hat\beta$ is also normally distributed:
\[
\hat\beta \sim N(\beta, \sigma^2 (X'X)^{-1}). \qquad (7.64)
\]
Thus, we may test $H_0 : \beta_i = \beta_i^0$ with
\[
\frac{\hat\beta_i - \beta_i^0}{\sqrt{\sigma^2 [(X'X)^{-1}]_{ii}}} \sim N(0, 1). \qquad (7.65)
\]
More will be said on this statistic for use in inference in the next chapter.
7.4.2 Maximum Likelihood Estimation

Now,
\[
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i = x_i'\beta + u_i \qquad (7.66)
\]
is linear in $u_i$, so $y_i$ is also normal given $x_i$:
\[
y_i \sim N(x_i'\beta, \sigma^2). \qquad (7.67)
\]
Further, the $y_i$'s are independent. Thus, the density for $y_i$ is given by
\[
f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 \right\}. \qquad (7.68)
\]
Since the $y_i$'s are independent, the joint likelihood function is
\begin{align*}
f(y_1, y_2, \ldots, y_n) &= f(y_1) f(y_2) \cdots f(y_n) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i'\beta)^2 \right\} \\
&= L(\beta, \sigma^2 \,|\, y, X). \qquad (7.69)
\end{align*}
Let $\mathcal{L} = \log L(\beta, \sigma^2 \,|\, y, X)$. We wish to maximize $\mathcal{L}$ with respect to $\beta$. However, this means that we minimize the sum of squares, so
\[
\hat\beta_{\mathrm{MLE}} = (X'X)^{-1}X'y, \qquad (7.70)
\]
which is the OLS estimator from above. It is easily shown that
\[
\hat\sigma^2_{\mathrm{MLE}} = \frac{e'e}{n} = \frac{n-k}{n}\, s^2. \qquad (7.71)
\]

7.4.3 Efficiency of $\hat\beta$ and $s^2$

Since $\hat\beta$ is the MLE and unbiased, it is the minimum variance unbiased estimator (BUE). And $s^2$ is not the MLE, so it is not BUE on those grounds. On the other hand, $\hat\sigma^2_{\mathrm{MLE}}$ is biased, so it is not BUE either. The two will be equivalent in large samples and asymptotically BUE.
Chapter 8

Confidence Intervals and Hypothesis Tests

8.1 Introduction

8.1.1 Model and Assumptions

The model is a k-variable linear model:
\[
y = X\beta + u, \qquad (8.1)
\]
where $y$ and $u$ are both $n \times 1$ vectors, $X$ is an $n \times k$ matrix, and $\beta$ is a $k \times 1$ vector. We make the following assumptions about the disturbances:

(i) E$(u) = 0$, and
(ii),(iii) $\mathrm{Cov}(u) = E(uu') = \sigma^2 I_n$,

where $I_n$ is an $n \times n$ identity matrix. The nonstochastic assumptions are

(iv) $X$ is nonstochastic.
(v) $X$ has full column rank (the columns are linearly independent).

For inferences, we assume that the $u$ are normally distributed. That is,

(vi) $u \sim N(0, \sigma^2 I_n)$.
8.1.2 Ordinary Least Squares Estimation

For some estimate $\hat\beta$ of $\beta$, define
\[
e = y - X\hat\beta \qquad (8.2)
\]
and
\[
\phi = e'e. \qquad (8.3)
\]
Choosing $\hat\beta$ to minimize $\phi$ yields the ordinary least squares (OLS) estimator
\[
\hat\beta = (X'X)^{-1}X'y. \qquad (8.4)
\]
Substitution yields
\[
\hat\beta = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u. \qquad (8.5)
\]
8.1.3 Properties of $\hat\beta$

Since $X$ is nonstochastic,
\[
E[\hat\beta] = \beta + E[(X'X)^{-1}X'u] = \beta + (X'X)^{-1}X'E[u] = \beta. \qquad (8.6)
\]
Thus, the OLS estimator is unbiased. Also,
\begin{align*}
\mathrm{Cov}(\hat\beta) &= E(\hat\beta - \beta)(\hat\beta - \beta)' \\
&= E[(X'X)^{-1}X'uu'X(X'X)^{-1}] \\
&= (X'X)^{-1}X'E(uu')X(X'X)^{-1} \\
&= (X'X)^{-1}X'\sigma^2 I_n X(X'X)^{-1} \\
&= \sigma^2 (X'X)^{-1}. \qquad (8.7)
\end{align*}
The elements of $X'X$ are unbounded as $n$ gets very large. Therefore,
\[
\lim_{n \to \infty} (X'X)^{-1} = 0, \qquad (8.8)
\]
and the variances of $\hat\beta$ converge to zero. This means that the distribution collapses about its expected value, namely $\beta$. So,
\[
\operatorname*{plim}_{n \to \infty} \hat\beta = \beta, \qquad (8.9)
\]
and OLS estimation is consistent.

The OLS estimates $\hat\beta$ are best linear unbiased (BLUE) in that they have minimum variance in the class of unbiased estimators of $\beta$ that are also linear in $y$. Suppose that $u$ is normal; then the linear transformation
\[
\hat\beta - \beta = (X'X)^{-1}X'u \qquad (8.10)
\]
is also normal:
\[
\hat\beta - \beta \sim N(0, \sigma^2 (X'X)^{-1}), \qquad (8.11)
\]
or
\[
\hat\beta \sim N(\beta, \sigma^2 (X'X)^{-1}). \qquad (8.12)
\]
Moreover, the $\hat\beta$ are maximum likelihood and hence minimum variance in the class of unbiased estimators.
8.1.4 Properties of e

Now, the OLS residuals are
\begin{align*}
e &= y - X\hat\beta \\
&= y - X(X'X)^{-1}X'y \\
&= [I_n - X(X'X)^{-1}X']y \\
&= My, \qquad (8.13)
\end{align*}
where
\[
M = I_n - X(X'X)^{-1}X'. \qquad (8.14)
\]
Continuing, since $MX = 0$,
\[
e = M(X\beta + u) = [I_n - X(X'X)^{-1}X'](X\beta) + Mu = Mu.
\]
Thus, the OLS residuals are a linear transformation of the underlying disturbances. Also,
\[
X'e = X'Mu = 0, \qquad (8.15)
\]
again since $MX = 0$, and the OLS residuals are orthogonal or linearly unrelated to $X$. When the $u$ are normal, then the linear transformation $e = Mu$ is also normal. Specifically,
\[
e \sim N(0, \sigma^2 M), \qquad (8.16)
\]
since
\[
E\, e = E\, Mu = M\, E\, u = 0, \qquad (8.17)
\]
and
\[
E\, ee' = E\, Muu'M' = M(E\, uu')M' = M(\sigma^2 I)M' = \sigma^2 M, \qquad (8.18)
\]
since $MM' = M$.

8.2 Tests Based on the χ² Distribution

8.2.1 The χ² Distribution
Suppose that $z_1, z_2, \ldots, z_n$ are iid $N(0,1)$ random variables. Then,
\[
\sum_{i=1}^n z_i^2 \sim \chi^2_n. \qquad (8.19)
\]

8.2.2 Distribution of $(n-k)s^2/\sigma^2$

Now,
\[
u_i = y_i - x_i'\beta \sim N(0, \sigma^2), \qquad (8.20)
\]
so
\[
\frac{u_i}{\sigma} \sim N(0, 1) \qquad (8.21)
\]
and
\[
\sum_{i=1}^n \left( \frac{u_i}{\sigma} \right)^2 = \sum_{i=1}^n \frac{u_i^2}{\sigma^2} \sim \chi^2_n. \qquad (8.22)
\]
Now,
\[
e_i = y_i - x_i'\hat\beta \qquad (8.23)
\]
is an estimate of $u_i$, and we might expect that
\[
\sum_{i=1}^n \frac{e_i^2}{\sigma^2} \sim \chi^2_n. \qquad (8.24)
\]
However, this would be wrong, as only $n - k$ of the residuals are independent, since $e$ satisfies the $k$ equations $X'e = 0$.

The properties of $e = Mu$ follow from the properties of $M = I_n - X(X'X)^{-1}X'$, which is symmetric, idempotent, and positive semi-definite, and hence has some very special properties. First, $\mathrm{rank}(M) = \operatorname{tr}(M) = n - k$. Second, we can write the decomposition $M = Q D_{n-k} Q'$, where $D_{n-k}$ is a diagonal matrix with its first $n - k$ diagonal elements unity and the remainder zero, and $Q'Q = I_n$, so $Q' = Q^{-1}$. Let
\[
v = Q'u; \qquad (8.25)
\]
then $v \sim N(0, \sigma^2 I_n)$ and $u = Qv$. Substitution yields
\begin{align*}
\frac{1}{\sigma^2} e'e &= \frac{1}{\sigma^2} u'Mu \\
&= \frac{1}{\sigma^2} u' Q D_{n-k} Q' u \\
&= \frac{1}{\sigma^2} v' D_{n-k} v \\
&= \frac{1}{\sigma^2} \sum_{i=1}^{n-k} v_i^2
= \sum_{i=1}^{n-k} \left( \frac{v_i}{\sigma} \right)^2 \sim \chi^2_{n-k}. \qquad (8.26)
\end{align*}
Thus,
\[
\sum_{i=1}^n \frac{e_i^2}{\sigma^2} \sim \chi^2_{n-k}, \qquad (8.27)
\]
and, since $\sum_{i=1}^n e_i^2 = (n-k)s^2$,
\[
(n-k)\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}. \qquad (8.28)
\]
Not only that, but $e = Mu$ and $\hat\beta - \beta = (X'X)^{-1}X'u$ are jointly normal, and
\[
E[e(\hat\beta - \beta)'] = E[Muu'X(X'X)^{-1}] = M\sigma^2 I_n X(X'X)^{-1} = \sigma^2 MX(X'X)^{-1} = 0, \qquad (8.29)
\]
so they are uncorrelated, and hence independent, and $s^2$ is independent of $\hat\beta$, since $s^2$ is a function only of $e$.
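The distributional claim (8.28) can be illustrated by simulation: across many replications, $(n-k)s^2/\sigma^2$ should have the mean $n-k$ and variance $2(n-k)$ of a $\chi^2_{n-k}$ variate. The design ($n = 10$, $k = 3$, $\sigma^2 = 4$, seed) is an illustrative choice:

```python
# Monte Carlo sketch of (n-k) s^2 / sigma^2 ~ chi-square with n-k df.
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 10, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])   # illustrative true coefficients

stats = []
for _ in range(20000):
    u = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = (e @ e) / (n - k)
    stats.append((n - k) * s2 / sigma2)

stats = np.array(stats)
print(stats.mean(), stats.var())  # close to n-k = 7 and 2(n-k) = 14
```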
8.2.3 A Confidence Interval

Now, let $a$ and $b$ be numbers such that
\[
\Pr(b \le \chi^2_{n-k} \le a) = 0.95, \qquad (8.30)
\]
say. Then $a$ and $b$ can be obtained from a table. Thus,
\begin{align*}
\Pr\left( b \le (n-k)\frac{s^2}{\sigma^2} \le a \right) &= 0.95 \\
\Pr\left( \frac{1}{b} \ge \frac{\sigma^2}{(n-k)s^2} \ge \frac{1}{a} \right) &= 0.95 \\
\Pr\left( \frac{(n-k)s^2}{b} \ge \sigma^2 \ge \frac{(n-k)s^2}{a} \right) &= 0.95 \qquad (8.31)
\end{align*}
establishes a 95% confidence interval for $\sigma^2$.

For example, for $n - k = 14$, we have
\begin{align*}
\Pr\left( 5.63 \le \frac{14 s^2}{\sigma^2} \le 26.12 \right) &= 0.95 \\
\Pr\left( \frac{14 s^2}{5.63} \ge \sigma^2 \ge \frac{14 s^2}{26.12} \right) &= 0.95, \qquad (8.32)
\end{align*}
and if $s^2 = 4.0$, then
\[
\Pr\left( \frac{56}{5.63} \ge \sigma^2 \ge \frac{56}{26.12} \right) = 0.95, \qquad (8.33)
\]
or
\[
\Pr(10 \ge \sigma^2 \ge 2.1) = 0.95 \qquad (8.34)
\]
is the confidence interval.
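The interval computation in (8.31)-(8.34) is a one-liner once the critical values are in hand. A pure-Python sketch, using the text's tabulated $\chi^2_{14}$ values 5.63 and 26.12 (2.5% in each tail):

```python
# 95% confidence interval for sigma^2 from (n-k) s^2 / sigma^2 ~ chi2_{n-k}.
def chisq_ci_sigma2(s2, df, lower_crit, upper_crit):
    """Return (low, high) bounds for sigma^2; the critical values are the
    lower- and upper-tail chi-square cutoffs for the chosen df."""
    return df * s2 / upper_crit, df * s2 / lower_crit

lo, hi = chisq_ci_sigma2(s2=4.0, df=14, lower_crit=5.63, upper_crit=26.12)
print(round(lo, 2), round(hi, 2))  # 2.14 9.95
```

The bounds agree with the rounded values 2.1 and 10 reported in (8.34).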
8.2.4 A Hypothesis Test

Suppose that $H_0 : \sigma^2 = \sigma_0^2$, $H_1 : \sigma^2 \neq \sigma_0^2$. Then we know that
\[
(n-k)\frac{s^2}{\sigma_0^2} \sim \chi^2_{n-k} \qquad (8.35)
\]
under the null hypothesis. Choose $\alpha = 0.05$, say; then the critical values corresponding to 2.5% tails are 5.63 and 26.12 for $n - k = 14$. Thus, if
\[
5.63 \le 14 \frac{s^2}{\sigma_0^2} \le 26.12, \qquad (8.36)
\]
we fail to reject the null hypothesis. Otherwise, we reject it at the 5% level of significance. For example, suppose that $s^2 = 4.0$ and $\sigma_0^2 = 1$; then
\[
14 \frac{s^2}{\sigma_0^2} = 56 \qquad (8.37)
\]
and we reject the null hypothesis, since we fall into the right-hand 2.5% tail.
8.3 Tests Based on the t Distribution

8.3.1 The t Distribution

Suppose that $z$ is a $N(0,1)$ random variable and that $w \sim \chi^2_m$, independent of $z$. Then,
\[
\frac{z}{\sqrt{w/m}} \sim t_m. \qquad (8.38)
\]

8.3.2 The Distribution of $(\hat\beta_i - \beta_i)/(s^2 d_{ii})^{1/2}$

We have seen that
\[
\hat\beta_i \sim N(\beta_i, \sigma^2 d_{ii}), \qquad (8.39)
\]
where $d_{ii}$ is the $(i,i)$ element of the matrix $(X'X)^{-1}$. Then,
\[
z = \frac{\hat\beta_i - \beta_i}{\sqrt{\sigma^2 d_{ii}}} \sim N(0, 1), \qquad (8.40)
\]
while
\[
w = (n-k)\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}. \qquad (8.41)
\]
Since $\hat\beta$ and $s^2$ are independent, we have
\[
\frac{(\hat\beta_i - \beta_i)/\sqrt{\sigma^2 d_{ii}}}{\sqrt{(n-k)\frac{s^2}{\sigma^2} \big/ (n-k)}}
= \frac{\hat\beta_i - \beta_i}{\sqrt{s^2 d_{ii}}} \sim t_{n-k}. \qquad (8.42)
\]

8.3.3 Confidence Interval for $\beta_i$
First, obtain $a$ such that
\[
\Pr(-a \le t_{n-k} \le a) = 0.95, \qquad (8.43)
\]
say, from a table. Then,
\begin{align*}
\Pr\left( -a \le \frac{\hat\beta_i - \beta_i}{\sqrt{s^2 d_{ii}}} \le a \right) &= 0.95 \\
\Pr\left( \hat\beta_i + a\sqrt{s^2 d_{ii}} \ge \beta_i \ge \hat\beta_i - a\sqrt{s^2 d_{ii}} \right) &= 0.95, \qquad (8.44)
\end{align*}
and $\hat\beta_i \pm a\sqrt{s^2 d_{ii}}$ defines a 95% confidence interval for $\beta_i$.
8.3.4 Testing a Hypothesis

Suppose that $H_0 : \beta_i = \beta_i^0$, $H_1 : \beta_i \neq \beta_i^0$. We know that, under the null hypothesis,
\[
\frac{\hat\beta_i - \beta_i^0}{\sqrt{s^2 d_{ii}}} \sim t_{n-k}. \qquad (8.45)
\]
Now, choose $\alpha = 0.05$, say; then the critical values corresponding to 2.5% tails of a t distribution with $(n-k)$ degrees of freedom are $\pm a$, say. So if
\[
-a \le \frac{\hat\beta_i - \beta_i^0}{\sqrt{s^2 d_{ii}}} \le a, \qquad (8.46)
\]
we fail to reject the null hypothesis. Otherwise we reject the null hypothesis in favor of the alternative.
8.3.5 The Distribution of c′( β̂ − β )/( s²c′(X′X)⁻¹c )^{1/2}

Consider the linear combination

c′β̂.    (8.47)

Then,

c′β̂ ∼ N( c′β, σ²c′(X′X)⁻¹c ),    (8.48)

so

c′( β̂ − β )/√( σ²c′(X′X)⁻¹c ) ∼ N( 0, 1 ).    (8.49)

As before, we use s² in place of σ², whereupon

c′( β̂ − β )/√( s²c′(X′X)⁻¹c ) ∼ t_{n−k}.    (8.50)

We can perform inferences and calculate confidence intervals as before.
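The same machinery handles any linear combination; a sketch of the t statistic (8.50), where the function name and data are our own illustration:

```python
import numpy as np

def linear_combination_t(X, y, c, value=0.0):
    """t statistic for H0: c'beta = value, following eq. (8.50)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    s2 = e @ e / (n - k)
    se = np.sqrt(s2 * (c @ XtX_inv @ c))
    return (c @ beta_hat - value) / se, n - k  # statistic and d.f.

# with c = (0, 1)', this reduces to the single-coefficient statistic (8.45)
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
tstat, df = linear_combination_t(X, y, np.array([0.0, 1.0]))
```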
8.4 Tests Based on the F Distribution

8.4.1 The F Distribution

Suppose that v ∼ χ²_l and w ∼ χ²_m. If v and w are independent, then

(v/l)/(w/m) ∼ F_{l,m}.    (8.51)

8.4.2 Distribution of (Rβ̂ − r)′[ s²R(X′X)⁻¹R′ ]⁻¹(Rβ̂ − r)/q
Suppose we are interested in testing a set of q linear restrictions. Examples would be β₁ + β₂ + ... + βₖ = 1 and β₃ = 2β₂. More generally, we consider

H₀: Rβ = r    H₁: Rβ ≠ r,

where r is a q × 1 known vector and R is a q × k known matrix. Due to the multivariate normality of β̂, under the null hypothesis we have

Rβ̂ − r ∼ N( 0, σ²R(X′X)⁻¹R′ )    (8.52)

and hence

( Rβ̂ − r )′[ σ²R(X′X)⁻¹R′ ]⁻¹( Rβ̂ − r ) ∼ χ²_q.    (8.53)

Recall that β̂ and s² are independent, so (n − k)s²/σ² ∼ χ²_{n−k} is independent of the quadratic form in (8.53). Thus, under the null hypothesis,

[ ( Rβ̂ − r )′[ σ²R(X′X)⁻¹R′ ]⁻¹( Rβ̂ − r )/q ] / [ ((n − k)s²/σ²)/(n − k) ] ∼ F_{q,n−k}    (8.54)

and, after some simplification,

( Rβ̂ − r )′[ s²R(X′X)⁻¹R′ ]⁻¹( Rβ̂ − r )/q ∼ F_{q,n−k}.    (8.55)

Under the alternative hypothesis Rβ ≠ r, the numerator diverges at the rate n and we expect large positive values of the statistic with high probability. Accordingly, we consult only the right-hand tail of the distribution to establish critical values. Values of the statistic exceeding these critical values are rare under the null but typical under the alternative, so we reject when the realization exceeds the critical value.
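A sketch of the statistic (8.55); the function and the example restriction are our illustration, not the text's:

```python
import numpy as np
from scipy import stats

def wald_F(X, y, R, r):
    """F statistic of eq. (8.55) for H0: R beta = r, with q restrictions."""
    n, k = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    s2 = e @ e / (n - k)
    d = R @ beta_hat - r
    F = d @ np.linalg.inv(s2 * (R @ XtX_inv @ R.T)) @ d / q
    return F, stats.f.sf(F, q, n - k)          # statistic and p-value

# one restriction (q = 1): H0: beta_2 = 0 in a toy bivariate model
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
F, p = wald_F(X, y, np.array([[0.0, 1.0]]), np.array([0.0]))
```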
8.4.3 The Distribution of [( SSEᵣ − SSEᵤ )/q ]/[ SSEᵤ/( n − k ) ]

The most common form of linear restriction is the zero restriction. Suppose the model of interest can be written as

y = X₁β₁ + X₂β₂ + u    (8.56)

and

H₀: β₂ = 0    H₁: β₂ ≠ 0.

Define the “unrestricted” residuals

eᵤ = y − X₁β̂₁ − X₂β̂₂    (8.57)

and

SSEᵤ = eᵤ′eᵤ    (8.58)

from the OLS regression of y on X₁ and X₂. Next, define the “restricted” residuals

eᵣ = y − X₁β̂₁    (8.59)
and

SSEᵣ = eᵣ′eᵣ    (8.60)

from the OLS regression of y on X₁ only. Now, SSEᵣ ≥ SSEᵤ but, under H₀: β₂ = 0, we expect

SSEᵤ/σ² ∼ χ²_{n−(k₁+k₂)}  and  SSEᵣ/σ² ∼ χ²_{n−k₁}    (8.61)

to have similar values. We therefore might expect

[ SSEᵣ/(n − k₁) ] / [ SSEᵤ/(n − (k₁ + k₂)) ] ∼ F_{n−k₁, n−(k₁+k₂)},    (8.62)

but unfortunately SSEᵣ and SSEᵤ are not independent, because they both satisfy

X₁′eᵤ = X₁′eᵣ = 0.    (8.63)
The appropriate ratio can be determined by applying the results of the previous section. Specifically, we take R = ( 0 : I_{k₂} ) and r = 0, whereupon the restrictions Rβ = r are equivalent to β₂ = 0. For this choice of R and r we have Rβ̂ − r = β̂₂ and, using the results for inverses of partitioned matrices,

R(X′X)⁻¹R′ = ( X₂′X₂ − X₂′X₁(X₁′X₁)⁻¹X₁′X₂ )⁻¹ = ( X₂′M₁X₂ )⁻¹,    (8.64)

where M₁ = Iₙ − X₁(X₁′X₁)⁻¹X₁′. Substitution yields

( Rβ̂ − r )′[ σ²R(X′X)⁻¹R′ ]⁻¹( Rβ̂ − r ) = β̂₂′[ σ²( X₂′M₁X₂ )⁻¹ ]⁻¹β̂₂
    = (1/σ²) β̂₂′X₂′M₁X₂β̂₂
    = (1/σ²) β̂₂′X₂′M₁M₁X₂β̂₂
    = (1/σ²) ( M₁y − eᵤ )′( M₁y − eᵤ )
    = (1/σ²) ( y′M₁y − 2eᵤ′M₁y + eᵤ′eᵤ )
    = (1/σ²) ( y′M₁y − eᵤ′eᵤ )
    = (1/σ²) ( eᵣ′eᵣ − eᵤ′eᵤ ),    (8.65)

where we use the results

M₁y = M₁( X₁β̂₁ + X₂β̂₂ + eᵤ ) = M₁( X₂β̂₂ + eᵤ ) = M₁X₂β̂₂ + M₁eᵤ    (8.66)
and M₁eᵤ = M₁My = My = eᵤ. Since SSEᵣ = eᵣ′eᵣ and SSEᵤ = eᵤ′eᵤ, the difference SSEᵣ − SSEᵤ is a sum of squares with k₂ degrees of freedom that is independent of SSEᵤ, and (canceling σ² in the numerator and denominator) we form

[( SSEᵣ − SSEᵤ )/k₂ ] / [ SSEᵤ/( n − (k₁ + k₂) ) ] ∼ F_{k₂, n−(k₁+k₂)}.    (8.67)

Under the null hypothesis, this value will usually be small. Under the alternative β₂ ≠ 0, however, we would expect SSEᵤ to be much smaller than SSEᵣ and the above ratio to be large.
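The restricted/unrestricted form (8.67) is easy to compute directly; a sketch (helper names and toy data are ours), which for a single restriction reproduces the square of the corresponding t statistic:

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals from the OLS regression of y on X."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_hat
    return e @ e

def restriction_F(X1, X2, y):
    """F statistic of eq. (8.67) for H0: beta_2 = 0."""
    n = len(y)
    k1, k2 = X1.shape[1], X2.shape[1]
    sse_r = sse(X1, y)                     # restricted: y on X1 only
    sse_u = sse(np.hstack([X1, X2]), y)    # unrestricted: y on X1 and X2
    return ((sse_r - sse_u) / k2) / (sse_u / (n - k1 - k2))

X1 = np.ones((5, 1))                       # intercept only
X2 = np.arange(1.0, 6.0).reshape(-1, 1)    # the regressor being tested
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
F = restriction_F(X1, X2, y)               # ~1521, i.e. the squared t statistic
```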
8.4.4 Testing a Hypothesis

We can consult the tables to find the critical point c corresponding to α = 0.05, say. Then, if

[( SSEᵣ − SSEᵤ )/k₂ ] / [ SSEᵤ/( n − (k₁ + k₂) ) ] > c,    (8.68)

we reject the null hypothesis at the 5% level. Note that for k₂ = 1, that is, one restriction,

√( [( SSEᵣ − SSEᵤ )/k₂ ] / [ SSEᵤ/( n − (k₁ + k₂) ) ] ) ∼ t_{n−(k₁+k₂)}.    (8.69)

8.5 An Example
Consider the model

Y_t = β₁ + β₂X_{t,2} + β₃X_{t,3} + u_t,    (8.70)

where Y_t is wheat yield, X_{t,2} is the amount of fertilizer applied, and X_{t,3} is the annual rainfall. The data are given in Table 8.1. After we rescale the data, we obtain the estimates of the βs given in Table 8.2. Now,

( X′X )⁻¹ = ⎡  55.9141   −0.4189  −15.4560 ⎤
            ⎢  −0.4189    0.0371    0.0773 ⎥    (8.71)
            ⎣ −15.4560    0.0773    4.3277 ⎦

and s² = Σₜ e²ₜ/(T − 3) = 0.5232/4 = 0.1308. Now, R² = 1 − 0.5232/13.5 = 0.9612 and R̄² = 1 − (6/4)(0.0388) = 0.9419. Recall that

E[( β̂ − β )( β̂ − β )′] = Cov( β̂ ) = σ²( X′X )⁻¹,    (8.72)

which we estimate using

est. Cov( β̂ ) = s²( X′X )⁻¹.    (8.73)
Wheat Yield      Fertilizer       Rainfall
(Bushels/Acre)   (Pounds/Acre)    (Inches/Year)
40               100              36
45               200              33
50               300              37
65               400              37
70               500              34
70               600              32
80               700              36

Table 8.1: Wheat yield data.

Parameter   Estimate
β₁          1.1329
β₂          0.6893
β₃          0.6028

Table 8.2: Wheat yield parameter estimates.

Thus,

est. Var( β̂₁ ) = s²d₁₁ = 7.3133
est. Var( β̂₂ ) = s²d₂₂ = 0.0049
est. Var( β̂₃ ) = s²d₃₃ = 0.5660
For H₀: β₁ = 0 vs H₁: β₁ ≠ 0, we have

( β̂₁ − β₁⁰ )/√(est. Var(β̂₁)) = 1.1329/2.7043 = 0.4189 ∼ t₄.    (8.74)

Now, a 95% acceptance region for a t₄ distribution is −2.776 ≤ t₄ ≤ 2.776. Thus, we fail to reject the null hypothesis. For H₀: β₂ = 0 vs H₁: β₂ ≠ 0, we have

( β̂₂ − β₂⁰ )/√(est. Var(β̂₂)) = 0.6893/0.0697 = 9.8965 ∼ t₄,    (8.75)

and we reject the null hypothesis at the 5% level. In fact, we reject even at the 0.2% level, for which the acceptance region is −7.173 ≤ t₄ ≤ 7.173.
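The arithmetic of this example can be replayed from the reported numbers (SciPy supplies the table value; the β₃ statistic is our addition, and small differences from the text reflect rounding in the tabulated variances):

```python
from scipy import stats

# point estimates (Table 8.2) and estimated variances s^2 * d_ii
b1, b2, b3 = 1.1329, 0.6893, 0.6028
v1, v2, v3 = 7.3133, 0.0049, 0.5660

a = stats.t.ppf(0.975, 4)    # 95% critical value for t_4, ~2.776

t1 = b1 / v1 ** 0.5          # ~0.419: fail to reject H0: beta_1 = 0
t2 = b2 / v2 ** 0.5          # ~9.85 (text: 9.8965, from the unrounded variance)
t3 = b3 / v3 ** 0.5          # ~0.801: fail to reject H0: beta_3 = 0
```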