Maximum Likelihood (ML) Estimation

Econometrics 2 — Fall 2005

Heino Bohn Nielsen


Outline
(1) Introduction.
(2) ML estimation defined.
(3) Example I: Binomial trials.
(4) Example II: Linear regression.
(5) Classical test principles:
    • Wald test.
    • Likelihood ratio (LR) test.
    • Lagrange multiplier (LM) test or score test.


Introduction: The Main Idea

• Consider stochastic variables $y_t$ ($1 \times 1$) and $x_t$ ($K \times 1$) and the observations
$$\begin{pmatrix} y_t \\ x_t \end{pmatrix}, \quad t = 1, 2, ..., T.$$
Assume that we have a conditional model for $y_t \mid x_t$ in mind.

• The likelihood analysis is based on knowledge of the distribution:
$$y_t \mid x_t \sim \text{density}(\theta). \qquad (*)$$
For a given $\theta$, the density defines the probability of observing $y_t \mid x_t$. The ML estimator, $\hat\theta_{ML}$, maximizes the probability of the data.

• Formulating (∗) carefully, the ML estimator is based on a statistical description of the data. We can (and should) test the specification.


The Likelihood Principle

• The "probability" of observing $y_t \mid x_t$ is given by the conditional density $f(y_t \mid x_t; \theta)$. If the observations are IID, the probability of $y_1, ..., y_T$ (given $X$) is
$$f(y_1, y_2, ..., y_T \mid X; \theta) = \prod_{t=1}^{T} f(y_t \mid x_t; \theta).$$

• The likelihood contribution for observation $t$ is defined as $L_t(\theta) = f(y_t \mid x_t; \theta)$, and the likelihood function for the sample is defined as
$$L(\theta \mid Y; X) = f(y_1, y_2, ..., y_T \mid X; \theta) = \prod_{t=1}^{T} f(y_t \mid x_t; \theta) = \prod_{t=1}^{T} L_t(\theta \mid y_t; x_t).$$

• The ML estimator $\hat\theta_{ML}$ is chosen to maximize $L(\theta \mid Y; X)$, i.e. to maximize the likelihood of observing the data given the model.


The ML Estimator

• It is easier to maximize the log-likelihood function
$$\log L(\theta) = \sum_{t=1}^{T} \log L_t(\theta \mid y_t; x_t).$$

• Define the score vector as
$$\underset{(K\times 1)}{s(\theta)} = \frac{\partial \log L(\theta)}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial \log L_t(\theta)}{\partial \theta} = \sum_{t=1}^{T} s_t(\theta),$$
where $s_t(\theta)$ is the score for each observation.

• The first order conditions, the $K$ so-called likelihood equations, state
$$s(\hat\theta_{ML}) = \sum_{t=1}^{T} s_t(\hat\theta_{ML}) = 0.$$
These may be difficult to solve in practice; use numerical optimization.
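Since the likelihood equations often have no closed-form solution, the maximization is usually handed to a numerical optimizer. Below is a minimal sketch (illustrative, not from the lecture) that minimizes the negative log-likelihood of a simple IID normal sample with SciPy; all names and the simulated data are ours.

    # A minimal sketch (illustrative): numerical ML by minimizing the negative
    # log-likelihood of an IID normal sample with SciPy.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    y = rng.normal(loc=1.0, scale=2.0, size=200)     # hypothetical data

    def neg_loglik(theta):
        mu, log_sigma = theta                        # log(sigma) keeps sigma > 0
        sigma2 = np.exp(2.0 * log_sigma)
        return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

    res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)   # close to the sample mean and (ML) std. deviation

A gradient-based optimizer such as BFGS is, in effect, searching for the point where the score vector is zero.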


Properties of ML

Assume that the true model is contained in the statistical model (correct specification). Then, under some regularity conditions:

• The ML estimator is consistent: $\text{plim}\,\hat\theta_{ML} = \theta$.

• The ML estimator is asymptotically normal:
$$\sqrt{T}\left(\hat\theta_{ML} - \theta\right) \rightarrow N(0, V).$$
Here $V = I(\theta)^{-1}$ is the asymptotic variance, and the negative expected Hessian,
$$I(\theta) = -E\left[\frac{\partial^2 \log L_t(\theta)}{\partial\theta\,\partial\theta'}\right],$$
is called the information matrix. The more curvature the likelihood function has, the more precise the estimator.

• The ML estimator is asymptotically efficient: all other consistent and asymptotically normal estimators have an asymptotic variance at least as large as $I(\theta)^{-1}$. $I(\theta)^{-1}$ is known as the Cramér-Rao lower bound.

Asymptotic Inference

• Inference can be based on the asymptotic distribution, i.e.
$$\hat\theta_{ML} \overset{a}{\sim} N\left(\theta,\; T^{-1}\hat V\right),$$
where $\hat V = \widehat{I(\theta)}^{-1}$ is a consistent estimate of $V$.

• One possibility is the sample average of the second derivatives:
$$\underset{(K\times K)}{\hat V_H} = \left(-\frac{1}{T}\sum_{t=1}^{T}\left.\frac{\partial^2 \log L_t(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta=\hat\theta_{ML}}\right)^{-1}.$$

• It can be shown that $E[s_t(\theta)s_t(\theta)'] = I(\theta)$. An alternative estimator is the outer product of the scores:
$$\hat V_S = \left(\frac{1}{T}\sum_{t=1}^{T} s_t(\hat\theta_{ML})s_t(\hat\theta_{ML})'\right)^{-1}.$$
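As a small illustration (anticipating the binomial example of the next section, with hypothetical data), both variance estimators can be computed directly from the analytical per-observation scores and second derivatives; the code below is our sketch, not part of the lecture.

    # Sketch: Hessian-based and outer-product variance estimators, evaluated
    # analytically for the binomial-trials model of Example I (hypothetical data).
    import numpy as np

    y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])   # hypothetical 0/1 draws
    p = y.mean()                                   # ML estimate (derived below)

    s_t = y / p - (1 - y) / (1 - p)                # per-observation scores
    H_t = -y / p**2 - (1 - y) / (1 - p)**2         # per-observation second derivatives

    V_H = 1.0 / (-H_t.mean())                      # Hessian-based estimate of V
    V_S = 1.0 / (s_t**2).mean()                    # outer-product-of-scores estimate
    print(V_H, V_S, p * (1 - p))                   # here all three coincide exactly

In this one-parameter model the two estimators happen to coincide exactly at $\hat p$; in general they differ in finite samples.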


Example I: Binomial Trials

• Consider random draws from a pool of red and blue balls. We are interested in the proportion of red balls, $p$.

• Let
$$y_t = \begin{cases} 1 & \text{if a ball is red} \\ 0 & \text{if a ball is blue.} \end{cases}$$
Consider a data set of $T$ draws, $y_1, y_2, ..., y_T$, e.g. $1, 0, 0, 1, ..., 0, 1$.

• The model implies that $\text{prob}(y_t = 1) = p$ and $\text{prob}(y_t = 0) = 1 - p$. The density function for $y_t$ is given by the Binomial
$$f(y_t \mid p) = p^{y_t}\cdot(1 - p)^{(1 - y_t)}.$$

• Since the draws are independent, the likelihood function is given by
$$L(p) = \prod_{t=1}^{T} L_t(p \mid y_t) = \prod_{t=1}^{T} p^{y_t}\cdot(1 - p)^{(1 - y_t)},$$
and the log-likelihood function is
$$\log L(p) = \sum_{t=1}^{T}\left[y_t\cdot\log(p) + (1 - y_t)\cdot\log(1 - p)\right].$$
The ML estimator, $\hat p_{ML}$, is chosen to maximize this expression.

• The score for an individual observation is given by
$$s_t(p) = \frac{\partial \log L_t(p)}{\partial p} = \frac{y_t}{p} - \frac{1 - y_t}{1 - p}.$$

• The likelihood equation is given by the first order condition
$$s(p) = \sum_{t=1}^{T} s_t(p) = \sum_{t=1}^{T}\left[\frac{y_t}{p} - \frac{1 - y_t}{1 - p}\right] = 0.$$
This implies that
$$\frac{\sum_{t=1}^{T} y_t}{p} = \frac{T - \sum_{t=1}^{T} y_t}{1 - p}$$
$$(1 - p)\sum_{t=1}^{T} y_t = p\left(T - \sum_{t=1}^{T} y_t\right)$$
$$\sum_{t=1}^{T} y_t - p\cdot\sum_{t=1}^{T} y_t = p\cdot T - p\cdot\sum_{t=1}^{T} y_t$$
$$\hat p_{ML} = \frac{\sum_{t=1}^{T} y_t}{T}.$$
That is just the proportion of red balls.

• The second derivative is
$$H_t = \frac{\partial^2 \log L_t(p)}{\partial p\,\partial p} = \frac{\partial}{\partial p}\left(\frac{y_t}{p} - \frac{1 - y_t}{1 - p}\right) = -\frac{y_t}{p^2} - \frac{1 - y_t}{(1 - p)^2}.$$

• Now recall that $E[y_t] = 1\cdot\text{prob}(y_t = 1) + 0\cdot\text{prob}(y_t = 0) = p$. Inserting that, the information is given by
$$I(p) = -E\left[\frac{\partial^2 \log L_t(p)}{\partial p\,\partial p}\right] = -E\left[-\frac{y_t}{p^2} - \frac{1 - y_t}{(1 - p)^2}\right] = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}.$$

• Inference on $p$ can be based on
$$\hat p_{ML} \overset{a}{\sim} N\left(p,\; T^{-1}\cdot\hat p_{ML}(1 - \hat p_{ML})\right).$$

• ML estimation can easily be done in PcGive using numerical optimization. Specify
  - actual: $y_t$.
  - fitted: nothing in this case.
  - loglik: $\log L_t(p \mid y_t) = y_t\cdot\log(p) + (1 - y_t)\cdot\log(1 - p)$.
  - Initial values for the parameters, denoted &0, &1, ..., &k.

• In our case (where the data series is denoted Bin):

    actual = Bin;
    fitted = 0;
    loglik = actual*log(&0)+(1-actual)*log(1-&0);
    &0     = 0.5;
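For readers without PcGive, a rough Python analogue of the same numerical setup might look as follows; the series Bin is hypothetical and the names are ours.

    # Rough Python analogue of the PcGive setup above (the series Bin is hypothetical).
    import numpy as np
    from scipy.optimize import minimize_scalar

    Bin = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])   # "actual"
    T = len(Bin)

    def neg_loglik(p):
        # loglik contribution: actual*log(p) + (1 - actual)*log(1 - p)
        return -np.sum(Bin * np.log(p) + (1 - Bin) * np.log(1 - p))

    res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    p_hat = res.x
    se = np.sqrt(p_hat * (1 - p_hat) / T)            # from the asymptotic variance above
    print(p_hat, Bin.mean(), se)                     # numerical vs. analytical estimate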

Example II: Linear Regression

• Consider a linear regression model
$$y_t = x_t'\beta + \epsilon_t, \quad t = 1, 2, ..., T.$$
For ML estimation, we have to specify a distribution for $\epsilon_t$. Assume that the errors are IID and normal:
$$\epsilon_t \mid x_t \sim N(0, \sigma^2).$$
This implies that
$$y_t \mid x_t \sim N(x_t'\beta, \sigma^2).$$

• The "probability" of observing $y_t$ given the model is the normal density
$$f(y_t \mid x_t; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\cdot\exp\left\{-\frac{1}{2}\frac{(y_t - x_t'\beta)^2}{\sigma^2}\right\}.$$

• Since the observations are assumed IID, the probability of $y_1, ..., y_T$ is
$$f(y_1, ..., y_T \mid X; \beta, \sigma^2) = \prod_{t=1}^{T} f(y_t \mid x_t; \beta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{T}\cdot\prod_{t=1}^{T}\exp\left\{-\frac{1}{2}\frac{(y_t - x_t'\beta)^2}{\sigma^2}\right\}.$$

• The likelihood function is given by
$$L(\beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-\frac{T}{2}}\cdot\prod_{t=1}^{T}\exp\left\{-\frac{1}{2}\frac{(y_t - x_t'\beta)^2}{\sigma^2}\right\},$$
where we have $\sigma^2$ as a parameter. Alternatively we could take $\sigma$ as the parameter. The log-likelihood function is
$$\log L(\beta, \sigma^2) = -\frac{T}{2}\cdot\left[\log(2\pi) + \log(\sigma^2)\right] - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{\sigma^2}.$$
The ML estimators $\hat\beta_{ML}$ and $\hat\sigma^2_{ML}$ are chosen to maximize this expression.

• The log-likelihood contributions are
$$\log L_t(\beta, \sigma^2) = -\frac{1}{2}\cdot\log(2\pi) - \frac{1}{2}\cdot\log(\sigma^2) - \frac{1}{2}\cdot\frac{(y_t - x_t'\beta)^2}{\sigma^2}.$$
The individual scores are given by the first derivatives
$$\underset{((K+1)\times 1)}{s_t(\beta, \sigma^2)} = \begin{pmatrix} \dfrac{\partial \log L_t(\beta, \sigma^2)}{\partial\beta} \\[2mm] \dfrac{\partial \log L_t(\beta, \sigma^2)}{\partial\sigma^2} \end{pmatrix} = \begin{pmatrix} \dfrac{x_t(y_t - x_t'\beta)}{\sigma^2} \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{1}{2}\dfrac{(y_t - x_t'\beta)^2}{\sigma^4} \end{pmatrix}.$$

• The first order conditions are given by
$$s(\theta) = \sum_{t=1}^{T} s_t(\beta, \sigma^2) = \begin{pmatrix} \displaystyle\sum_{t=1}^{T}\frac{x_t(y_t - x_t'\beta)}{\sigma^2} \\[2mm] \displaystyle -\frac{T}{2\sigma^2} + \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{\sigma^4} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

• The first condition,
$$\sum_{t=1}^{T} x_t(y_t - x_t'\beta) = \sum_{t=1}^{T} x_t y_t - \sum_{t=1}^{T} x_t x_t'\beta = 0,$$
implies
$$\hat\beta_{ML} = \left(\sum_{t=1}^{T} x_t x_t'\right)^{-1}\sum_{t=1}^{T} x_t y_t = \hat\beta_{OLS}.$$
In a regression model with normal errors, OLS is the maximum likelihood estimator.

• Letting $\hat\epsilon_t = y_t - x_t'\hat\beta_{ML}$, the second condition yields
$$\frac{T}{2\sigma^2} = \frac{1}{2}\sum_{t=1}^{T}\frac{\hat\epsilon_t^2}{\sigma^4}$$
$$\hat\sigma^2_{ML} = \frac{1}{T}\sum_{t=1}^{T}\hat\epsilon_t^2,$$
which is slightly different from the OLS estimator of the variance (which divides by $T - K$). The ML estimator of the variance is biased but consistent.
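A small simulation sketch (our own, with made-up data) can confirm the closed forms: the ML coefficient estimates coincide with OLS, while the ML variance estimator divides by $T$ rather than $T - K$.

    # Sketch: ML estimates in the Gaussian linear regression reduce to the
    # closed forms above (simulated data; all names are ours).
    import numpy as np

    rng = np.random.default_rng(1)
    T, K = 200, 3
    X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
    y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=1.5, size=T)

    beta_ml = np.linalg.solve(X.T @ X, X.T @ y)      # = OLS estimator
    resid = y - X @ beta_ml
    sigma2_ml = resid @ resid / T                    # ML: divide by T (biased)
    sigma2_ols = resid @ resid / (T - K)             # OLS: divide by T - K (unbiased)
    print(beta_ml, sigma2_ml, sigma2_ols)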

• The second derivatives are given by
$$\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial\beta\,\partial\beta'} = \frac{\partial}{\partial\beta'}\left\{\frac{x_t(y_t - x_t'\beta)}{\sigma^2}\right\} = -\frac{x_t x_t'}{\sigma^2}$$
$$\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial\beta\,\partial\sigma^2} = \frac{\partial}{\partial\sigma^2}\left\{\frac{x_t(y_t - x_t'\beta)}{\sigma^2}\right\} = -\frac{x_t(y_t - x_t'\beta)}{\sigma^4} = -\frac{x_t\epsilon_t}{\sigma^4}$$
$$\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial\sigma^2\,\partial\sigma^2} = \frac{\partial}{\partial\sigma^2}\left\{-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{(y_t - x_t'\beta)^2}{\sigma^4}\right\} = \frac{1}{2\sigma^4} - \frac{\epsilon_t^2}{\sigma^6}$$
$$\frac{\partial^2 \log L_t(\beta, \sigma^2)}{\partial\sigma^2\,\partial\beta'} = \frac{\partial}{\partial\beta'}\left\{-\frac{1}{2\sigma^2} + \frac{1}{2}\frac{(y_t - x_t'\beta)^2}{\sigma^4}\right\} = -\frac{x_t'(y_t - x_t'\beta)}{\sigma^4} = -\frac{x_t'\epsilon_t}{\sigma^4}.$$

• Using that $E[\epsilon_t] = 0$, $E[\epsilon_t^2] = \sigma^2$, and $E[\epsilon_t x_t] = 0$, gives the information matrix
$$I(\beta, \sigma^2) = -E\begin{pmatrix} -\dfrac{x_t x_t'}{\sigma^2} & -\dfrac{x_t\epsilon_t}{\sigma^4} \\[2mm] -\dfrac{x_t'\epsilon_t}{\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{\epsilon_t^2}{\sigma^6} \end{pmatrix} = \begin{pmatrix} \sigma^{-2}E[x_t x_t'] & 0 \\ 0 & \dfrac{1}{2\sigma^4} \end{pmatrix}.$$
Note that the information matrix is block diagonal.

• Recall that
$$\begin{pmatrix} \hat\beta_{ML} \\ \hat\sigma^2_{ML} \end{pmatrix} \overset{a}{\sim} N\left(\begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix},\; \frac{1}{T}\cdot I(\theta)^{-1}\right).$$

• This implies
$$\hat\beta_{ML} \overset{a}{\sim} N\left(\beta,\; \frac{\sigma^2}{T}\left(E[x_t x_t']\right)^{-1}\right),$$
where the variance can be estimated by
$$\frac{\hat\sigma^2_{ML}}{T}\left(\frac{1}{T}\sum_{t=1}^{T} x_t x_t'\right)^{-1} = \hat\sigma^2_{ML}\left(\sum_{t=1}^{T} x_t x_t'\right)^{-1}.$$

• Furthermore,
$$\hat\sigma^2_{ML} \overset{a}{\sim} N\left(\sigma^2,\; \frac{2\sigma^4}{T}\right).$$

Joint and Conditional Distributions

• So far we have considered a model for $y_t \mid x_t$, corresponding to the conditional distribution $f(y_t \mid x_t; \theta)$.

• From a statistical point of view, a natural alternative would be a model for all the data, $(y_t, x_t')'$, corresponding to the joint density $f(y_t, x_t \mid \psi)$.

• Recall the factorization
$$f(y_t, x_t \mid \psi) = f(y_t \mid x_t; \theta)\cdot f(x_t \mid \pi).$$
If the two sets of parameters, $\theta$ and $\pi$, are not related (and $x_t$ is exogenous in a certain sense), we can estimate $\theta$ in the conditional model
$$f(y_t \mid x_t; \theta) = \frac{f(y_t, x_t \mid \psi)}{f(x_t \mid \pi)}.$$

Time Series (Non-IID) Data and Factorization

• The multiplicative form
$$f(y_1, y_2, ..., y_T \mid \theta) = \prod_{t=1}^{T} f(y_t \mid \theta)$$
follows from the IID assumption, which cannot be made for many time series.

• For a time series, the object of interest is often $E[y_t \mid y_1, ..., y_{t-1}]$. This can be used to factorize the likelihood function. Recall again that
$$f(y_1, y_2, ..., y_T \mid \theta) = f(y_T \mid y_1, ..., y_{T-1}; \theta)\cdot f(y_1, y_2, ..., y_{T-1} \mid \theta) = ... = f(y_1 \mid \theta)\cdot\prod_{t=2}^{T} f(y_t \mid y_1, ..., y_{t-1}; \theta).$$
Conditioning on the first observation, $y_1$, gives a multiplicative structure
$$f(y_2, ..., y_T \mid y_1; \theta) = \frac{f(y_1, y_2, ..., y_T \mid \theta)}{f(y_1 \mid \theta)} = \prod_{t=2}^{T} f(y_t \mid y_1, ..., y_{t-1}; \theta).$$
We look at $y_2, ..., y_T \mid y_1$, where $y_1$ is the initial value.
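As a sketch of this factorization (our own example, not part of the slides), the Gaussian AR(1) model $y_t \mid y_{t-1} \sim N(\rho y_{t-1}, \sigma^2)$ can be estimated by maximizing the log-likelihood conditional on the initial value $y_1$; the data and names below are illustrative.

    # Sketch: conditional (on y_1) ML for a Gaussian AR(1), y_t = rho*y_{t-1} + e_t.
    # Simulated data; parameter names are ours.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    T, rho_true = 300, 0.7
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = rho_true * y[t - 1] + rng.normal()

    def neg_loglik(theta):
        rho, log_sigma = theta
        sigma2 = np.exp(2.0 * log_sigma)
        e = y[1:] - rho * y[:-1]                 # conditioning on the initial value
        return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + e ** 2 / sigma2)

    res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
    print(res.x[0], np.exp(res.x[1]))            # estimates of rho and sigma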

Three Classical Test Principles

Consider a null hypothesis of interest, $H_0$, and an alternative, $H_A$, e.g.
$$H_0: \underset{(J\times K)}{R}\theta = q \quad \text{against} \quad H_A: R\theta \neq q.$$
Let $\tilde\theta$ and $\hat\theta$ denote the ML estimates under $H_0$ and $H_A$, respectively.

• Wald test. Estimate the model only under $H_A$ and look at the distance $R\hat\theta - q$, normalized by the covariance matrix.

• Likelihood ratio (LR) test. Estimate under $H_0$ and under $H_A$ and look at the loss in likelihood, $\log L(\hat\theta) - \log L(\tilde\theta)$.

• Lagrange multiplier (LM) or score test. Estimate $\tilde\theta$ under $H_0$ and see if the first order conditions $\sum_t s_t(\tilde\theta) = 0$ are significantly violated.

The tests are asymptotically equivalent.

[Figure: the log-likelihood function $\log L(\theta)$ plotted against $\theta$, illustrating the Wald (W) distance along the parameter axis between $\hat\theta$ and $\theta_0$, the LR distance as the drop in the log-likelihood, and the LM distance as the slope (score) at $\theta_0$.]

Wald Test

• Recall that $\hat\theta \overset{a}{\sim} N(\theta, T^{-1}V)$, so that
$$R\hat\theta \overset{a}{\sim} N\left(R\theta,\; T^{-1}RVR'\right).$$

• If the null hypothesis is true, $R\theta = q$, and a natural test statistic is
$$\xi_W = T\cdot\left(R\hat\theta - q\right)'\left(R\hat V R'\right)^{-1}\left(R\hat\theta - q\right).$$
Under the null this is distributed as $\xi_W \to \chi^2(J)$.

• An example is the $t$-ratio for $H_0: \theta_i = \theta_{i0}$:
$$t = \frac{\hat\theta_i - \theta_{i0}}{\sqrt{\hat V(\hat\theta_i)}} = \sqrt{\xi_W} \to N(0, 1).$$

• Requires only estimation under the alternative, $H_A$.
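A sketch of the Wald test for a single exclusion restriction in the linear regression example (simulated data; all names are ours). Note that the estimated variance $\hat\sigma^2(\sum_t x_t x_t')^{-1}$ already contains the $1/T$ factor, so no explicit $T$ appears in the code.

    # Sketch: Wald test of H0: beta_3 = 0 in a Gaussian linear regression
    # (simulated data; all names are ours).
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(3)
    T = 200
    X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=T)   # H0 is true here

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / T
    V_beta = sigma2_hat * np.linalg.inv(X.T @ X)             # estimated Var(beta_hat)

    R = np.array([[0.0, 0.0, 1.0]])                          # R beta = q with q = 0
    q = np.array([0.0])
    diff = R @ beta_hat - q
    xi_W = diff @ np.linalg.solve(R @ V_beta @ R.T, diff)
    print(xi_W, chi2.sf(xi_W, df=1))                         # statistic and p-value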

Likelihood Ratio (LR) Test

• For the LR test we estimate both under $H_0$ and under $H_A$.

• The LR test statistic is given by
$$\xi_{LR} = -2\cdot\log\left(\frac{L(\tilde\theta)}{L(\hat\theta)}\right) = -2\cdot\left(\log L(\tilde\theta) - \log L(\hat\theta)\right),$$
where $L(\tilde\theta)$ and $L(\hat\theta)$ are the two likelihood values.

• Under the null, this is asymptotically distributed as $\xi_{LR} \to \chi^2(J)$.

• The test is insensitive to how the model and restrictions are formulated. The test is only appropriate when the models are nested.
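A sketch of the corresponding LR test for the same exclusion restriction: estimate the Gaussian regression with and without the third regressor and compare the maximized log-likelihoods (simulated data; all names are ours).

    # Sketch: LR test of H0: beta_3 = 0, comparing restricted and unrestricted
    # fits (simulated data; all names are ours).
    import numpy as np
    from scipy.stats import chi2

    def max_loglik(y, X):
        # maximized Gaussian log-likelihood of a linear regression
        T = len(y)
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        sigma2 = np.sum((y - X @ beta) ** 2) / T
        return -0.5 * T * (np.log(2.0 * np.pi) + np.log(sigma2) + 1.0)

    rng = np.random.default_rng(3)
    T = 200
    X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=T)

    xi_LR = -2.0 * (max_loglik(y, X[:, :2]) - max_loglik(y, X))   # restricted vs. unrestricted
    print(xi_LR, chi2.sf(xi_LR, df=1))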

Lagrange Multiplier (LM) or Score Test

• Let $s(\cdot)$ be the score function of the unrestricted model. Recall that the score is zero at the unrestricted estimate,
$$s(\hat\theta) = \sum_{t=1}^{T} s_t(\hat\theta) = 0.$$
If the restriction is true,
$$s(\tilde\theta) = \sum_{t=1}^{T} s_t(\tilde\theta) \approx 0.$$

• This can be tested by the quadratic form
$$\xi_{LM} = \left(\sum_{t=1}^{T} s_t(\tilde\theta)\right)'\left(\text{Var}\left(\sum_{t=1}^{T} s_t(\tilde\theta)\right)\right)^{-1}\left(\sum_{t=1}^{T} s_t(\tilde\theta)\right),$$
which under the null is $\chi^2(J)$.

• The variance of the individual score, $s_t(\theta)$, is the information matrix, $I(\theta) = E[s_t(\theta)s_t(\theta)']$, which can be estimated as
$$\hat I_G(\tilde\theta) = \frac{1}{T}\sum_{t=1}^{T} s_t(\tilde\theta)s_t(\tilde\theta)'.$$

• The estimated variance of $\sum_{t=1}^{T} s_t(\tilde\theta)$ is therefore
$$T\cdot\hat I_G(\tilde\theta) = \sum_{t=1}^{T} s_t(\tilde\theta)s_t(\tilde\theta)'.$$

• And the LM test can be written as
$$\xi_{LM} = \left(\sum_{t=1}^{T} s_t(\tilde\theta)\right)'\left(\sum_{t=1}^{T} s_t(\tilde\theta)s_t(\tilde\theta)'\right)^{-1}\left(\sum_{t=1}^{T} s_t(\tilde\theta)\right),$$
which will have a $\chi^2(J)$ distribution under the null.

• Note that the quadratic form is of dimension $K$, but $K - J$ elements are unrestricted and $\sum_t s_t(\tilde\theta) = 0$ for these elements. Hence a $\chi^2(J)$ distribution.

• In the case where the null hypothesis is $H_0: \theta_2 = 0$, for
$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \quad \text{and} \quad s_t(\theta) = \begin{pmatrix} s_{1t}(\theta) \\ s_{2t}(\theta) \end{pmatrix},$$
the LM statistic only depends on the scores corresponding to $\theta_2$:
$$\xi_{LM} = \left(\sum_{t=1}^{T} s_{2t}(\tilde\theta)\right)'\left(\text{Var}\left(\sum_{t=1}^{T} s_{2t}(\tilde\theta)\right)\right)^{-1}\left(\sum_{t=1}^{T} s_{2t}(\tilde\theta)\right).$$

• Note, however, that to calculate
$$\left(\text{Var}\left(\sum_{t=1}^{T} s_{2t}(\tilde\theta)\right)\right)^{-1}$$
we need the full score vector, except if the covariance matrix is block diagonal. A sketch of this case is given below.
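The following sketch applies this to the exclusion restriction $H_0$: the coefficient on the third regressor is zero, using the per-observation scores of the unrestricted Gaussian regression evaluated at the restricted estimates (simulated data; all names are ours). Only the score for the restricted coefficient is non-zero on average, but the full score matrix enters the variance.

    # Sketch: LM (score) test of H0: beta_3 = 0, using the outer-product form of
    # the variance and the full score vector (simulated data; names are ours).
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(3)
    T = 200
    X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=T)

    # restricted estimates: regression without the third regressor
    Xr = X[:, :2]
    beta_r = np.linalg.solve(Xr.T @ Xr, Xr.T @ y)
    e = y - Xr @ beta_r                              # restricted residuals
    sigma2_r = e @ e / T

    # per-observation scores of the unrestricted model at the restricted estimates
    s_beta = X * (e / sigma2_r)[:, None]             # T x K block for beta
    s_sig = (-1.0 / (2.0 * sigma2_r) + e**2 / (2.0 * sigma2_r**2))[:, None]
    S = np.hstack([s_beta, s_sig])                   # T x (K + 1)

    g = S.sum(axis=0)                                # only the beta_3 entry is non-zero
    xi_LM = g @ np.linalg.solve(S.T @ S, g)
    print(xi_LM, chi2.sf(xi_LM, df=1))               # chi^2(1) under the null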

LM Tests by Auxiliary Regressions

LM tests are often written as $T\cdot R^2$ in an auxiliary regression. Here we see why:

• Define the matrices
$$\underset{(T\times K)}{S} = \begin{pmatrix} s_1(\tilde\theta)' \\ \vdots \\ s_T(\tilde\theta)' \end{pmatrix} \quad \text{and} \quad \underset{(T\times 1)}{\iota} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}.$$
Then
$$\underset{(1\times 1)}{\iota'\iota} = T, \qquad \underset{(K\times 1)}{S'\iota} = \sum_{t=1}^{T} s_t(\tilde\theta), \qquad \underset{(K\times K)}{S'S} = \sum_{t=1}^{T} s_t(\tilde\theta)s_t(\tilde\theta)'.$$

• Therefore, we can write the LM statistic as
$$\xi_{LM} = \left(\sum_{t=1}^{T} s_t(\tilde\theta)\right)'\left(\sum_{t=1}^{T} s_t(\tilde\theta)s_t(\tilde\theta)'\right)^{-1}\left(\sum_{t=1}^{T} s_t(\tilde\theta)\right) = \iota'S\left(S'S\right)^{-1}S'\iota.$$
This is just a way to compute $\xi_{LM}$.

• Now consider the auxiliary regression
$$\iota = S\gamma + \text{residual}. \qquad (**)$$
The OLS estimator and the predicted values are given by, respectively,
$$\hat\gamma = \left(S'S\right)^{-1}S'\iota \quad \text{and} \quad \hat\iota = S\hat\gamma = S\left(S'S\right)^{-1}S'\iota.$$

• The LM test can then be written as
$$\xi_{LM} = \iota'S\left(S'S\right)^{-1}S'\iota = T\cdot\frac{\iota'S\left(S'S\right)^{-1}S'\iota}{\iota'\iota} = T\cdot\frac{\hat\iota'\hat\iota}{\iota'\iota} = T\cdot\frac{ESS}{TSS} = T\cdot R^2.$$

• The regression $(**)$ is not always the most convenient. Often, alternative auxiliary regressions are used.
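A quick numerical check of this identity (our sketch; the score matrix below is just arbitrary numbers standing in for $s_t(\tilde\theta)'$, since the algebra holds for any $S$):

    # Sketch: the quadratic form iota'S(S'S)^{-1}S'iota equals T times the
    # uncentered R^2 from regressing iota on S. S is arbitrary stand-in data.
    import numpy as np

    rng = np.random.default_rng(5)
    T, K = 200, 3
    S = rng.normal(size=(T, K)) + 0.1                        # stand-in for the scores
    iota = np.ones(T)

    quad_form = iota @ S @ np.linalg.solve(S.T @ S, S.T @ iota)

    iota_hat = S @ np.linalg.lstsq(S, iota, rcond=None)[0]   # fitted values of (**)
    R2_uncentered = (iota_hat @ iota_hat) / (iota @ iota)    # ESS/TSS with TSS = iota'iota
    print(quad_form, T * R2_uncentered)                      # identical up to rounding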

Examples of LM tests in a linear regression
$$y_t = x_t'\beta + \epsilon_t$$
based on $T\cdot R^2$ from auxiliary regressions:

• Test for omitted variables, $w_t$. Run the regression
$$\iota = \hat\epsilon_t x_t'\gamma + \hat\epsilon_t w_t'\delta + \text{residual}, \quad \text{or} \quad \hat\epsilon_t = x_t'\gamma + w_t'\delta + \text{residual}.$$

• Breusch-Godfrey test for no first-order autocorrelation. Run the regression
$$\hat\epsilon_t = \delta\cdot\hat\epsilon_{t-1} + x_t'\gamma + \text{residual}.$$

• Breusch-Pagan test for no heteroskedasticity. Run the regression
$$\hat\epsilon_t^2 = x_t'\gamma + \text{residual}.$$
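As an illustration of the second bullet, here is a sketch of the Breusch-Godfrey test computed as $T\cdot R^2$ from the auxiliary regression of the residual on its own lag and the original regressors (simulated data without autocorrelation; all names are ours).

    # Sketch: Breusch-Godfrey LM test for first-order autocorrelation as T * R^2
    # from an auxiliary regression (simulated data; all names are ours).
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    T = 200
    X = np.column_stack([np.ones(T), rng.normal(size=T)])
    y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)        # no autocorrelation

    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]         # original residuals

    # auxiliary regression: e_t on e_{t-1} and x_t (first observation dropped)
    Z = np.column_stack([e[:-1], X[1:]])
    d = e[1:]
    u = d - Z @ np.linalg.lstsq(Z, d, rcond=None)[0]         # auxiliary residuals
    R2 = 1.0 - (u @ u) / np.sum((d - d.mean()) ** 2)
    xi_LM = (T - 1) * R2                                     # T * R^2 with T - 1 usable obs.
    print(xi_LM, chi2.sf(xi_LM, df=1))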