Econometrics 2 — Fall 2005
Maximum Likelihood (ML) Estimation
Heino Bohn Nielsen
1 of 30
Outline
(1) Introduction.
(2) ML estimation defined.
(3) Example I: Binomial trials.
(4) Example II: Linear regression.
(5) Classical test principles:
    • Wald test.
    • Likelihood ratio (LR) test.
    • Lagrange multiplier (LM) test or score test.
Introduction: The Main Idea
• Consider stochastic variables $y_t$ ($1 \times 1$) and $x_t$ ($K \times 1$) and the observations

    $(y_t, x_t')'$, $t = 1, 2, \ldots, T$.

Assume that we have a conditional model for $y_t \mid x_t$ in mind.
• The likelihood analysis is based on knowledge of the distribution:

    $y_t \mid x_t \sim \text{density}(\theta)$.   (∗)

For a given $\theta$, the density defines the probability of observing $y_t \mid x_t$. The ML estimator, $\hat\theta_{ML}$, maximizes the probability of the data.
• Formulating (∗) carefully, the ML estimator is based on a statistical description of the data. We can (and should) test the specification.
The Likelihood Principle
• The "probability" of observing $y_t \mid x_t$ is given by the conditional density $f(y_t \mid x_t; \theta)$. If the observations are IID, the probability of $y_1, \ldots, y_T$ (given $X$) is

    $f(y_1, y_2, \ldots, y_T \mid X; \theta) = \prod_{t=1}^{T} f(y_t \mid x_t; \theta)$.

• The likelihood contribution for observation $t$ is defined as $L_t(\theta) = f(y_t \mid x_t; \theta)$, and the likelihood function for the sample is defined as

    $L(\theta \mid Y; X) = f(y_1, y_2, \ldots, y_T \mid X; \theta) = \prod_{t=1}^{T} f(y_t \mid x_t; \theta) = \prod_{t=1}^{T} L_t(\theta \mid y_t; x_t)$.

• The ML estimator $\hat\theta_{ML}$ is chosen to maximize $L(\theta \mid Y; X)$, i.e. to maximize the likelihood of observing the data given the model.
The ML Estimator
• It is easier to maximize the log-likelihood function

    $\log L(\theta) = \sum_{t=1}^{T} \log L_t(\theta \mid y_t; x_t)$.

• Define the $(K \times 1)$ score vector as

    $s(\theta) = \frac{\partial \log L(\theta)}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial \log L_t(\theta)}{\partial \theta} = \sum_{t=1}^{T} s_t(\theta)$,

where $s_t(\theta)$ is the score for each observation.
• The first order conditions, the $K$ so-called likelihood equations, state

    $s(\hat\theta_{ML}) = \sum_{t=1}^{T} s_t(\hat\theta_{ML}) = 0$.

These might be difficult to solve in practice; then numerical optimization is used.
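As a minimal illustration of numerical likelihood maximization (not part of the original slides; the exponential model and data are assumed for the example), one can minimize the negative log-likelihood with a standard optimizer:

```python
# Numerical ML: minimize the negative log-likelihood.
# For an Exponential(rate) model the closed-form ML estimator is 1/mean(y),
# which the numerical optimum should reproduce.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
y = rng.exponential(scale=2.0, size=500)  # true rate = 0.5

def neg_loglik(rate):
    # log L(rate) = sum_t [log(rate) - rate * y_t]
    return -np.sum(np.log(rate) - rate * y)

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
rate_ml = res.x
```

The numerical optimum agrees with the analytical estimator $1/\bar y$ up to the optimizer's tolerance.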
Properties of ML
Assume that the true model is contained in the statistical model (correct specification). Then, under some regularity conditions:
• The ML estimator is consistent, $\text{plim}\, \hat\theta_{ML} = \theta$.
• The ML estimator is asymptotically normal,

    $\sqrt{T}\,(\hat\theta_{ML} - \theta) \to N(0, V)$.

Here $V = I(\theta)^{-1}$ is the asymptotic variance, and the negative expected Hessian,

    $I(\theta) = -E\!\left[ \frac{\partial^2 \log L_t(\theta)}{\partial \theta\, \partial \theta'} \right]$,

is called the information matrix. The more curvature the likelihood function has, the more precision.
• The ML estimator is asymptotically efficient: all other consistent and asymptotically normal estimators have an asymptotic variance no smaller than $I(\theta)^{-1}$, which is denoted the Cramér-Rao lower bound.
Asymptotic Inference
• Inference can be based on the asymptotic distribution, i.e.

    $\hat\theta_{ML} \overset{a}{\sim} N(\theta, T^{-1}\hat V)$,

where $\hat V = \widehat{I(\theta)}^{-1}$ is a consistent estimate of $V$.
• One possibility is based on the sample average of second derivatives,

    $\hat V_H = \left( -\frac{1}{T} \sum_{t=1}^{T} \left. \frac{\partial^2 \log L_t(\theta)}{\partial \theta\, \partial \theta'} \right|_{\theta = \hat\theta_{ML}} \right)^{-1}$   $(K \times K)$.

• It can be shown that $E[s_t(\theta) s_t(\theta)'] = I(\theta)$. An alternative estimator is the outer product of the scores,

    $\hat V_S = \left( \frac{1}{T} \sum_{t=1}^{T} s_t(\hat\theta_{ML})\, s_t(\hat\theta_{ML})' \right)^{-1}$.
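The two variance estimators can be compared directly in a model where both have closed forms. A Python sketch (illustrative, not from the slides), using the Bernoulli model treated in Example I below:

```python
# Compare the Hessian-based and outer-product variance estimators for
# the Bernoulli model. Per observation, at a given p:
#   Hessian:  H_t = -(y_t/p^2 + (1 - y_t)/(1 - p)^2)
#   score:    s_t =  y_t/p - (1 - y_t)/(1 - p)
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=1000)
p = y.mean()  # ML estimate

H_t = -(y / p**2 + (1 - y) / (1 - p)**2)
s_t = y / p - (1 - y) / (1 - p)

V_H = 1.0 / (-H_t.mean())    # inverse of minus the average Hessian
V_S = 1.0 / np.mean(s_t**2)  # inverse of the average squared score
```

In this model both estimators coincide exactly with $\hat p(1 - \hat p)$ when evaluated at $p = \hat p$; in general they differ in finite samples.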
Example I: Binomial Trials
• Consider random draws from a pool of red and blue balls. We are interested in the proportion of red balls, $p$.
• Let

    $y_t = \begin{cases} 1 & \text{if a ball is red} \\ 0 & \text{if a ball is blue.} \end{cases}$

Consider a data set of $T$ draws, $y_1, y_2, \ldots, y_T$, e.g. $1, 0, 0, 1, \ldots, 0, 1$.
• The model implies that $\text{prob}(y_t = 1) = p$ and $\text{prob}(y_t = 0) = 1 - p$. The density function for $y_t$ is given by the Binomial

    $f(y_t \mid p) = p^{y_t} \cdot (1 - p)^{(1 - y_t)}$.
• Since the draws are independent, the likelihood function is given by

    $L(p) = \prod_{t=1}^{T} L_t(p \mid y_t) = \prod_{t=1}^{T} p^{y_t} \cdot (1 - p)^{(1 - y_t)}$,

and the log-likelihood function is

    $\log L(p) = \sum_{t=1}^{T} \left[ y_t \cdot \log(p) + (1 - y_t) \cdot \log(1 - p) \right]$.

The ML estimator, $\hat p_{ML}$, is chosen to maximize this expression.
• The score for an individual observation is given by

    $s_t(p) = \frac{\partial \log L_t(p)}{\partial p} = \frac{y_t}{p} - \frac{1 - y_t}{1 - p}$.
• The likelihood equation is given by the first order condition

    $s(p) = \sum_{t=1}^{T} s_t(p) = \sum_{t=1}^{T} \left[ \frac{y_t}{p} - \frac{1 - y_t}{1 - p} \right] = 0$.

This implies that

    $\frac{\sum_{t=1}^{T} y_t}{p} = \frac{T - \sum_{t=1}^{T} y_t}{1 - p}$

    $(1 - p) \sum_{t=1}^{T} y_t = p \left( T - \sum_{t=1}^{T} y_t \right)$

    $\sum_{t=1}^{T} y_t - p \cdot \sum_{t=1}^{T} y_t = p \cdot T - p \cdot \sum_{t=1}^{T} y_t$

    $\hat p_{ML} = \frac{\sum_{t=1}^{T} y_t}{T}$.

That is just the proportion of red balls.
• The second derivative is

    $H_t = \frac{\partial^2 \log L_t(p)}{\partial p\, \partial p} = \frac{\partial}{\partial p} \left( \frac{y_t}{p} - \frac{1 - y_t}{1 - p} \right) = -\frac{y_t}{p^2} - \frac{1 - y_t}{(1 - p)^2}$.

• Now recall that $E[y_t] = 1 \cdot \text{prob}(y_t = 1) + 0 \cdot \text{prob}(y_t = 0) = p$. Inserting that, the information is given by

    $I(p) = -E\!\left[ \frac{\partial^2 \log L_t(p)}{\partial p\, \partial p} \right] = -E\!\left[ -\frac{y_t}{p^2} - \frac{1 - y_t}{(1 - p)^2} \right] = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}$.
• Inference on $p$ can be based on

    $\hat p_{ML} \to N\!\left( p,\; T^{-1} \cdot \hat p_{ML} (1 - \hat p_{ML}) \right)$.

• ML estimation can easily be done in PcGive using numerical optimization. Specify:
  — actual: $y_t$.
  — fitted: nothing in this case.
  — loglik: $\log L_t(p \mid y_t) = y_t \cdot \log(p) + (1 - y_t) \cdot \log(1 - p)$.
  — Initial values for the parameters, denoted &0, &1, ..., &k.
• In our case (where the data series is denoted Bin):

    actual = Bin;
    fitted = 0;
    loglik = actual*log(&0)+(1-actual)*log(1-&0);
    &0     = 0.5;
Example II: Linear Regression
• Consider a linear regression model

    $y_t = x_t'\beta + \varepsilon_t, \quad t = 1, 2, \ldots, T$.

For ML estimation, we have to specify a distribution for $\varepsilon_t$. Assume that the errors are IID and normal:

    $\varepsilon_t \mid x_t \sim N(0, \sigma^2)$.

This implies that

    $y_t \mid x_t \sim N(x_t'\beta, \sigma^2)$.

• The "probability" of observing $y_t$ given the model is the normal density

    $f(y_t \mid x_t; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\!\left\{ -\frac{1}{2} \frac{(y_t - x_t'\beta)^2}{\sigma^2} \right\}$.
• Since the observations are assumed IID, the probability of $y_1, \ldots, y_T$ is

    $f(y_1, \ldots, y_T \mid X; \beta, \sigma^2) = \prod_{t=1}^{T} f(y_t \mid x_t; \beta, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{T} \prod_{t=1}^{T} \exp\!\left\{ -\frac{1}{2} \frac{(y_t - x_t'\beta)^2}{\sigma^2} \right\}$.

• The likelihood function is given by

    $L(\beta, \sigma^2) = (2\pi\sigma^2)^{-\frac{T}{2}} \cdot \prod_{t=1}^{T} \exp\!\left\{ -\frac{1}{2} \frac{(y_t - x_t'\beta)^2}{\sigma^2} \right\}$,

where we have $\sigma^2$ as a parameter. Alternatively we could take $\sigma$ as the parameter. The log-likelihood function is

    $\log L(\beta, \sigma^2) = -\frac{T}{2} \cdot \left[ \log(2\pi) + \log(\sigma^2) \right] - \frac{1}{2} \sum_{t=1}^{T} \frac{(y_t - x_t'\beta)^2}{\sigma^2}$.

The ML estimators $\hat\beta_{ML}$ and $\hat\sigma^2_{ML}$ are chosen to maximize this expression.
• The log-likelihood contributions are

    $\log L_t(\beta, \sigma^2) = -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log(\sigma^2) - \frac{1}{2} \frac{(y_t - x_t'\beta)^2}{\sigma^2}$.

The individual scores are given by the first derivatives

    $s_t(\beta, \sigma^2) = \begin{pmatrix} \dfrac{\partial \log L_t(\beta, \sigma^2)}{\partial \beta} \\[1.5ex] \dfrac{\partial \log L_t(\beta, \sigma^2)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{x_t (y_t - x_t'\beta)}{\sigma^2} \\[1.5ex] -\dfrac{1}{2\sigma^2} + \dfrac{1}{2} \dfrac{(y_t - x_t'\beta)^2}{\sigma^4} \end{pmatrix} \quad ((K+1) \times 1)$.

• The first order conditions are given by

    $s(\theta) = \sum_{t=1}^{T} s_t(\beta, \sigma^2) = \begin{pmatrix} \sum_{t=1}^{T} \dfrac{x_t (y_t - x_t'\beta)}{\sigma^2} \\[1.5ex] -\dfrac{T}{2\sigma^2} + \dfrac{1}{2} \sum_{t=1}^{T} \dfrac{(y_t - x_t'\beta)^2}{\sigma^4} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$.
• The first condition,

    $\sum_{t=1}^{T} x_t (y_t - x_t'\beta) = \sum_{t=1}^{T} x_t y_t - \sum_{t=1}^{T} x_t x_t' \beta = 0$,

implies

    $\hat\beta_{ML} = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \sum_{t=1}^{T} x_t y_t = \hat\beta_{OLS}$.

In a regression model with normal errors, OLS is the maximum likelihood estimator.
• Letting $\hat\varepsilon_t = y_t - x_t'\hat\beta_{ML}$, the second condition yields

    $\frac{T}{2\sigma^2} = \frac{1}{2\sigma^4} \sum_{t=1}^{T} \hat\varepsilon_t^2$

    $\hat\sigma^2_{ML} = \frac{1}{T} \sum_{t=1}^{T} \hat\varepsilon_t^2$,

which is slightly different from the OLS estimator of the variance, which divides by $T - K$. The ML estimator of the variance is biased but consistent.
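These results can be checked numerically. A Python sketch (simulated data, illustrative, not from the slides) maximizing the normal log-likelihood and comparing with OLS:

```python
# Check that ML in the normal linear regression equals OLS, and that
# the ML variance estimator divides the residual sum of squares by T.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T, K = 200, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=T)

def neg_loglik(theta):
    beta, s2 = theta[:K], np.exp(theta[K])  # sigma^2 = exp(.) keeps it positive
    e = y - X @ beta
    return 0.5 * T * (np.log(2 * np.pi) + np.log(s2)) + 0.5 * np.sum(e**2) / s2

res = minimize(neg_loglik, np.zeros(K + 1), method="BFGS")
beta_ml, s2_ml = res.x[:K], np.exp(res.x[K])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_ols)**2)
s2_ml_closed = rss / T        # ML: biased but consistent
s2_unbiased = rss / (T - K)   # OLS: divides by T - K
```

The numerical maximizer reproduces the OLS coefficients and the RSS/T variance estimate up to the optimizer's tolerance.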
• The second derivatives are given by

    $\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial \beta\, \partial \beta'} = \frac{\partial}{\partial \beta'} \left\{ \frac{x_t (y_t - x_t'\beta)}{\sigma^2} \right\} = -\frac{x_t x_t'}{\sigma^2}$

    $\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial \beta\, \partial \sigma^2} = \frac{\partial}{\partial \sigma^2} \left\{ \frac{x_t (y_t - x_t'\beta)}{\sigma^2} \right\} = -\frac{x_t (y_t - x_t'\beta)}{\sigma^4} = -\frac{x_t \varepsilon_t}{\sigma^4}$

    $\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial \sigma^2\, \partial \sigma^2} = \frac{\partial}{\partial \sigma^2} \left\{ -\frac{1}{2\sigma^2} + \frac{1}{2} \frac{(y_t - x_t'\beta)^2}{\sigma^4} \right\} = \frac{1}{2\sigma^4} - \frac{(y_t - x_t'\beta)^2}{\sigma^6} = \frac{1}{2\sigma^4} - \frac{\varepsilon_t^2}{\sigma^6}$

    $\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial \sigma^2\, \partial \beta'} = -\frac{x_t' (y_t - x_t'\beta)}{\sigma^4} = -\frac{x_t' \varepsilon_t}{\sigma^4}$.

• Using that $E[\varepsilon_t] = 0$, $E[\varepsilon_t^2] = \sigma^2$, and $E[\varepsilon_t x_t] = 0$, gives the information matrix

    $I(\beta, \sigma^2) = -E \begin{pmatrix} -\dfrac{x_t x_t'}{\sigma^2} & -\dfrac{x_t \varepsilon_t}{\sigma^4} \\[1.5ex] -\dfrac{x_t' \varepsilon_t}{\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{\varepsilon_t^2}{\sigma^6} \end{pmatrix} = \begin{pmatrix} \sigma^{-2} E[x_t x_t'] & 0 \\ 0 & \dfrac{1}{2\sigma^4} \end{pmatrix}$.

Note that the information matrix is block diagonal.
• Recall that

    $\begin{pmatrix} \hat\beta_{ML} \\ \hat\sigma^2_{ML} \end{pmatrix} \to N\!\left( \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix}, \frac{1}{T} \cdot I(\theta)^{-1} \right)$.

• This implies

    $\hat\beta_{ML} \to N\!\left( \beta, \frac{\sigma^2}{T} \left( E[x_t x_t'] \right)^{-1} \right)$,

where the variance can be estimated by

    $\frac{\hat\sigma^2_{ML}}{T} \left( \frac{1}{T} \sum_{t=1}^{T} x_t x_t' \right)^{-1} = \hat\sigma^2_{ML} \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1}$.

• Furthermore,

    $\hat\sigma^2_{ML} \to N\!\left( \sigma^2, \frac{2\sigma^4}{T} \right)$.
Joint and Conditional Distributions
• So far we have considered a model for $y_t \mid x_t$, corresponding to the conditional distribution $f(y_t \mid x_t; \theta)$.
• From a statistical point of view, a natural alternative would be a model for all the data, $(y_t, x_t')'$, corresponding to the joint density $f(y_t, x_t \mid \psi)$.
• Recall the factorization

    $f(y_t, x_t \mid \psi) = f(y_t \mid x_t; \theta) \cdot f(x_t \mid \pi)$.

If the two sets of parameters, $\theta$ and $\pi$, are not related (and $x_t$ is exogenous in a certain sense), we can estimate $\theta$ in the conditional model

    $f(y_t \mid x_t; \theta) = \frac{f(y_t, x_t \mid \psi)}{f(x_t \mid \pi)}$.
Time Series (Non-IID) Data and Factorization
• The multiplicative form

    $f(y_1, y_2, \ldots, y_T \mid \theta) = \prod_{t=1}^{T} f(y_t \mid \theta)$

follows from the IID assumption, which cannot be made for many time series.
• For a time series, the object of interest is often $E[y_t \mid y_1, \ldots, y_{t-1}]$. This can be used to factorize the likelihood function. Recall again that

    $f(y_1, y_2, \ldots, y_T \mid \theta) = f(y_T \mid y_1, \ldots, y_{T-1}; \theta) \cdot f(y_1, y_2, \ldots, y_{T-1} \mid \theta) = \ldots = f(y_1 \mid \theta) \cdot \prod_{t=2}^{T} f(y_t \mid y_1, \ldots, y_{t-1}; \theta)$.

Conditioning on the first observation, $y_1$, gives a multiplicative structure

    $f(y_2, \ldots, y_T \mid y_1; \theta) = \frac{f(y_1, y_2, \ldots, y_T \mid \theta)}{f(y_1 \mid \theta)} = \prod_{t=2}^{T} f(y_t \mid y_1, \ldots, y_{t-1}; \theta)$.

We look at $y_2, \ldots, y_T \mid y_1$, where $y_1$ is the initial value.
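As an illustration (the Gaussian AR(1) model is assumed here, not taken from the slides), the conditional likelihood given the initial value factorizes into one-step-ahead densities $y_t \mid y_{t-1} \sim N(\rho y_{t-1}, \sigma^2)$:

```python
# Conditional (on y_1) ML for a Gaussian AR(1): the log-likelihood is a
# sum of one-step-ahead normal densities, t = 2, ..., T.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
T, rho_true = 500, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho_true * y[t - 1] + rng.normal()

def neg_cond_loglik(theta):
    rho, s2 = theta[0], np.exp(theta[1])
    e = y[1:] - rho * y[:-1]  # one-step-ahead errors
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + e**2 / s2)

res = minimize(neg_cond_loglik, np.array([0.0, 0.0]), method="BFGS")
rho_ml = res.x[0]

# The conditional ML estimator of rho coincides with OLS of y_t on y_{t-1}.
rho_ols = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
```

Because the conditional densities are normal with mean linear in $\rho$, the conditional ML estimator is again an OLS estimator.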
Three Classical Test Principles
Consider a null hypothesis of interest, $H_0$, and an alternative, $H_A$, e.g.

    $H_0: \underset{(J \times K)}{R}\,\theta = q$ against $H_A: R\theta \neq q$.

Let $\tilde\theta$ and $\hat\theta$ denote the ML estimates under $H_0$ and $H_A$, respectively.
• Wald test. Estimates the model only under $H_A$ and looks at the distance $R\hat\theta - q$, normalized by the covariance matrix.
• Likelihood ratio (LR) test. Estimates under both $H_0$ and $H_A$ and looks at the loss in likelihood, $\log L(\hat\theta) - \log L(\tilde\theta)$.
• Lagrange multiplier (LM) or score test. Estimates $\tilde\theta$ under $H_0$ and checks whether the FOCs $\sum s_t(\tilde\theta) = 0$ are significantly violated.
The tests are asymptotically equivalent.
[Figure: the log-likelihood function $\log L(\theta)$ plotted against $\theta$, with the unrestricted estimate $\hat\theta$ and the restricted value $\theta_0$ marked. The LR test corresponds to the vertical likelihood loss between $\hat\theta$ and $\theta_0$, the Wald (W) test to the horizontal distance $\hat\theta - \theta_0$, and the LM test to the slope of the log-likelihood at $\theta_0$.]
Wald Test
• Recall that $\hat\theta \to N(\theta, T^{-1}V)$, so that

    $R\hat\theta \to N(R\theta, T^{-1} R V R')$.

• If the null hypothesis is true, $R\theta = q$, and a natural test statistic is

    $\xi_W = T \cdot (R\hat\theta - q)' (R \hat V R')^{-1} (R\hat\theta - q)$.

Under the null this is distributed as $\xi_W \to \chi^2(J)$.
• An example is the $t$-ratio for $H_0: \theta_i = \theta_{i0}$,

    $t = \frac{\hat\theta_i - \theta_{i0}}{\sqrt{\hat V(\hat\theta_i)}} = \sqrt{\xi_W} \to N(0, 1)$.

• Requires only estimation under the alternative, $H_A$.
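As an illustration (not from the slides), a Wald test of $H_0: p = 0.5$ in the binomial model of Example I, with $J = 1$ restriction and $\hat V = \hat p(1 - \hat p)$:

```python
# Wald test: distance of the unrestricted estimate from the null value,
# normalized by the estimated variance; chi^2(1) under the null.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
y = rng.binomial(1, 0.55, size=400)
T = y.size
p_hat = y.mean()  # unrestricted ML estimate

xi_W = T * (p_hat - 0.5)**2 / (p_hat * (1 - p_hat))
p_value = chi2.sf(xi_W, df=1)
```

Only the unrestricted estimate $\hat p$ is needed, matching the remark that the Wald test estimates only under the alternative.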
Likelihood Ratio (LR) Test
• For the LR test we estimate both under $H_0$ and under $H_A$.
• The LR test statistic is given by

    $\xi_{LR} = -2 \cdot \log\!\left( \frac{L(\tilde\theta)}{L(\hat\theta)} \right) = -2 \cdot \left( \log L(\tilde\theta) - \log L(\hat\theta) \right)$,

where $L(\tilde\theta)$ and $L(\hat\theta)$ are the two likelihood values.
• Under the null, this is asymptotically distributed as $\xi_{LR} \to \chi^2(J)$.
• The test is insensitive to how the model and restrictions are formulated. The test is only appropriate when the models are nested.
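As an illustration (not from the slides), an LR test of $H_0: p = 0.5$ in the binomial model, comparing the restricted and unrestricted log-likelihood values:

```python
# LR test: -2 times the log-likelihood loss from imposing the restriction.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
y = rng.binomial(1, 0.55, size=400)
p_hat = y.mean()  # unrestricted ML estimate
p_0 = 0.5         # restricted value under H0

def loglik(p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

xi_LR = -2.0 * (loglik(p_0) - loglik(p_hat))
p_value = chi2.sf(xi_LR, df=1)
```

Since $\hat p$ maximizes the log-likelihood, $\xi_{LR}$ is non-negative by construction.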
Lagrange Multiplier (LM) or Score Test
• Let $s(\cdot)$ be the score function of the unrestricted model. Recall that the score is zero at the unrestricted estimate,

    $s(\hat\theta) = \sum_{t=1}^{T} s_t(\hat\theta) = 0$.

If the restriction is true,

    $s(\tilde\theta) = \sum_{t=1}^{T} s_t(\tilde\theta) \approx 0$.

• This can be tested by the quadratic form

    $\xi_{LM} = \left( \sum_{t=1}^{T} s_t(\tilde\theta) \right)' \left( \text{Var}\!\left( \sum_{t=1}^{T} s_t(\tilde\theta) \right) \right)^{-1} \left( \sum_{t=1}^{T} s_t(\tilde\theta) \right)$,

which under the null is $\chi^2(J)$.
• The variance of the individual score, $s_t(\theta)$, is the information matrix, $I(\theta) = E[s_t(\theta) s_t(\theta)']$, which can be estimated as

    $\hat I_G(\tilde\theta) = \frac{1}{T} \sum_{t=1}^{T} s_t(\tilde\theta) s_t(\tilde\theta)'$.

• The estimated variance of $\sum_{t=1}^{T} s_t(\tilde\theta)$ is therefore

    $T \cdot \hat I_G(\tilde\theta) = \sum_{t=1}^{T} s_t(\tilde\theta) s_t(\tilde\theta)'$.

• And the LM test can be written as

    $\xi_{LM} = s(\tilde\theta)' \left( \sum_{t=1}^{T} s_t(\tilde\theta) s_t(\tilde\theta)' \right)^{-1} s(\tilde\theta)$,

which will have a $\chi^2(J)$ distribution under the null.
• Note that the quadratic form is of dimension $K$, but $K - J$ elements are unrestricted and $\sum s_t(\tilde\theta) = 0$ for these elements. Hence a $\chi^2(J)$.
• In the case where the null hypothesis is $H_0: \theta_2 = 0$, for

    $\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}$ and $s_t(\theta) = \begin{pmatrix} s_{1t}(\theta) \\ s_{2t}(\theta) \end{pmatrix}$,

the LM statistic only depends on the scores corresponding to $\theta_2$:

    $\xi_{LM} = \left( \sum_{t=1}^{T} s_{2t}(\tilde\theta) \right)' \left( \text{Var}\!\left( \sum_{t=1}^{T} s_{2t}(\tilde\theta) \right) \right)^{-1} \left( \sum_{t=1}^{T} s_{2t}(\tilde\theta) \right)$.

• Note, however, that to calculate $\text{Var}\!\left( \sum_{t=1}^{T} s_{2t}(\tilde\theta) \right)$ we need the full score vector, except if the covariance matrix is block diagonal.
LM Tests by Auxiliary Regressions
LM tests are often written as $T \cdot R^2$ in an auxiliary regression. Here we see why:
• Define the matrices

    $\underset{(T \times K)}{S} = \begin{pmatrix} s_1(\tilde\theta)' \\ \vdots \\ s_T(\tilde\theta)' \end{pmatrix}$ and $\underset{(T \times 1)}{\iota} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$.

Then

    $\underset{(1 \times 1)}{\iota'\iota} = T, \quad \underset{(K \times 1)}{S'\iota} = \sum_{t=1}^{T} s_t(\tilde\theta), \quad \underset{(K \times K)}{S'S} = \sum_{t=1}^{T} s_t(\tilde\theta) s_t(\tilde\theta)'$.

• Therefore, we can write the LM statistic as

    $\xi_{LM} = \left( \sum_{t=1}^{T} s_t(\tilde\theta) \right)' \left( \sum_{t=1}^{T} s_t(\tilde\theta) s_t(\tilde\theta)' \right)^{-1} \left( \sum_{t=1}^{T} s_t(\tilde\theta) \right) = \iota' S (S'S)^{-1} S' \iota$.

This is just a way to compute $\xi_{LM}$.
• Now consider the regression model

    $\iota = S\gamma + \text{residual}. \quad (\ast)$

The OLS estimator and the predicted values are given by, respectively,

    $\hat\gamma = (S'S)^{-1} S' \iota$ and $\hat\iota = S\hat\gamma = S (S'S)^{-1} S' \iota$.

• The LM test can then be written as

    $\xi_{LM} = \iota' S (S'S)^{-1} S' \iota = T \cdot \frac{\iota' S (S'S)^{-1} S' \iota}{\iota'\iota} = T \cdot \frac{\hat\iota'\hat\iota}{\iota'\iota} = T \cdot \frac{ESS}{TSS} = T \cdot R^2$.

• The regression $(\ast)$ is not always the most convenient. Often, alternative auxiliary regressions are used.
Examples of LM tests in a linear regression

    $y_t = x_t'\beta + \varepsilon_t$,

based on $T \cdot R^2$ from auxiliary regressions:
• Test for omitted variables, $w_t$. Run the regression

    $\iota = \hat\varepsilon_t x_t'\gamma + \hat\varepsilon_t w_t'\delta + \text{residual}$, or $\hat\varepsilon_t = x_t'\gamma + w_t'\delta + \text{residual}$.

• Breusch-Godfrey test for no first-order autocorrelation. Run the regression

    $\hat\varepsilon_t = \delta \cdot \hat\varepsilon_{t-1} + x_t'\gamma + \text{residual}$.

• Breusch-Pagan test for no heteroskedasticity. Run the regression

    $\hat\varepsilon_t^2 = x_t'\gamma + \text{residual}$.
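A Python sketch of the Breusch-Godfrey version (simulated data, illustrative, not from the slides): the LM statistic is $T \cdot R^2$ from regressing the OLS residual on its own lag and the original regressors.

```python
# Breusch-Godfrey LM test for first-order autocorrelation as T * R^2
# from an auxiliary regression of OLS residuals on lagged residuals and x_t.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T = 300
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)

e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)  # OLS residuals

# Auxiliary regression: e_t on e_{t-1} and x_t (first observation dropped)
Z = np.column_stack([e[:-1], X[1:]])
e1 = e[1:]
fit = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e1)
R2 = 1.0 - np.sum((e1 - fit)**2) / np.sum((e1 - e1.mean())**2)

xi_LM = (T - 1) * R2           # J = 1 restriction (the lag coefficient)
p_value = chi2.sf(xi_LM, df=1)
```

Here the effective sample in the auxiliary regression has $T - 1$ observations because one observation is lost to the lag.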