Chapter 4
Maximum Likelihood Estimation

4.1 Maximum likelihood method of estimation

We have already seen at the beginning of chapter 1 that, for a given observed value x of the sample X, the joint p.m/d.f. $f_X(x|\theta)$, regarded as a function of θ, indicates how the chances of obtaining the observed results x vary with θ; it is therefore referred to as the likelihood function of the observed results x and denoted by L(θ; x). If the true value of θ is unknown to us then, naturally, the obvious estimate of θ is the value $\dot\theta$ which makes the observed results most probable, i.e. $\dot\theta$ is the value of θ which maximises the chances of getting the results we did get. Thus $\dot\theta$ maximises the likelihood L(θ; x), i.e.
$$L(\dot\theta; x) = \sup_{\theta \in \Theta} L(\theta; x).$$

$\dot\theta = \dot\theta(x)$ is called the maximum likelihood estimate of θ and $\dot\theta(X)$ is called the maximum likelihood estimator of θ (or m.l.e. of θ). Usually, but not always, the maximum likelihood estimate $\dot\theta$ is found by differentiating L(θ; x) with respect to θ, equating to zero and then solving for $\dot\theta$ or, equivalently, since log is a monotonic function, by differentiating ℓ(θ; x) = log L(θ; x),
equating to zero and then solving for $\dot\theta$. The function ℓ(θ; x) is called the log-likelihood function. If θ is a vector with elements $(\theta_1, \theta_2, \ldots, \theta_k)$ then the m.l.e. $\dot\theta$ of θ consists of elements $(\dot\theta_1, \dot\theta_2, \ldots, \dot\theta_k)$ which maximise L(θ; x) and are usually, but not always, obtained by differentiating ℓ(θ; x) with respect to $\theta_1, \theta_2, \ldots, \theta_k$, equating the resulting expressions to zero and solving simultaneously for $\dot\theta_1, \dot\theta_2, \ldots, \dot\theta_k$. These equations,
$$\frac{\partial}{\partial\theta_1}\ell(\theta; x) = 0,\qquad \frac{\partial}{\partial\theta_2}\ell(\theta; x) = 0,\qquad \ldots,\qquad \frac{\partial}{\partial\theta_k}\ell(\theta; x) = 0,$$
are called the maximum likelihood equations and their solutions are the m.l.e. $\dot\theta = (\dot\theta_1, \dot\theta_2, \ldots, \dot\theta_k)$. You may need to solve these equations numerically.

4.1.1 Things to watch out for

1. The m.l.e. may not be a turning point, i.e. it may not be a point at which the first derivative of the likelihood (and log-likelihood) function vanishes (see figure 4.1).

2. The m.l.e. may not be unique (see figure 4.2).

3. If the m.l.e. is found numerically by an iterative procedure, it may take many iterations to converge because the likelihood may be very flat. Conversely, if you allow your iterative procedure to stop too early by not demanding high accuracy, the m.l.e. obtained may be quite distant from the true point of maximum when the likelihood function is very flat.

4. There may be local maxima, so numerical solution of the likelihood equations need not provide the global maximum (see figure 4.3); a short numerical sketch illustrating this appears after figure 4.3 below.

Figure 4.1: The m.l.e. is a boundary point (plot of the likelihood L(θ) against the parameter θ).

Figure 4.2: Any point between a and b is an m.l.e.

Figure 4.3: Likelihood function exhibits more than one maximum (a local and a global maximum).
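Points 3 and 4 above are easy to reproduce numerically. The following minimal sketch is not part of the original notes; the Cauchy location model, the simulated data and all variable names are illustrative assumptions. It restarts a numerical optimiser of the log-likelihood from several starting values and keeps the best result, which is the usual safeguard when the likelihood may be very flat or have several local maxima.

```python
# Sketch: multi-start numerical maximisation of a log-likelihood.
# The Cauchy location model is a standard example whose likelihood can be multimodal.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.standard_cauchy(5) + 3.0          # small sample with location about 3

def neg_log_lik(theta):
    # Cauchy(theta, 1) negative log-likelihood, up to an additive constant
    return np.sum(np.log1p((x - theta) ** 2))

# Restart the optimiser from several starting values and keep the best result.
starts = np.linspace(x.min(), x.max(), 10)
fits = [minimize(neg_log_lik, np.array([s]), method="BFGS") for s in starts]
best = min(fits, key=lambda f: f.fun)
print("converged points:", sorted({round(float(f.x[0]), 3) for f in fits}))
print("m.l.e. (best of all restarts):", round(float(best.x[0]), 3))
```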

Example: Let $X_1, X_2, \ldots, X_n$ be the lifetimes of n randomly selected components produced by a certain manufacturer, which are observed to take the values $x_1, x_2, \ldots, x_n$. Assuming lifetimes are exponentially distributed with p.d.f.
$$f(x|\theta) = \theta e^{-\theta x}, \qquad x > 0,$$
find the m.l.e. of θ on the basis of these n observations.

Solution: The likelihood is
$$L(\theta; x) = \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^n \exp\left(-\theta \sum_{i=1}^{n} x_i\right)$$
and the log-likelihood is
$$\ell(\theta; x) = n \log\theta - \theta \sum_{i=1}^{n} x_i.$$
Either of these functions of θ attains its maximum at a turning point; hence the m.l.e. is obtained by differentiation. In particular,
$$\frac{\partial \ell(\theta; x)}{\partial\theta} = 0 \iff \frac{n}{\theta} - \sum_{i=1}^{n} x_i = 0,$$
i.e.
$$\dot\theta(x) = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}.$$
Hence the maximum likelihood estimator is
$$\dot\theta(X) = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar{X}}.$$
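As a quick check of this closed form, a minimal sketch (simulated data and names are illustrative, not from the notes) compares $\dot\theta = 1/\bar{x}$ with a direct numerical maximisation of the log-likelihood:

```python
# Sketch: exponential-lifetime m.l.e., closed form versus numerical maximisation.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)      # true theta = 1/scale = 0.5

def neg_log_lik(theta):
    # negative of  n*log(theta) - theta*sum(x)
    return -(len(x) * np.log(theta) - theta * x.sum())

closed_form = 1.0 / x.mean()
numerical = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded").x
print(f"closed-form m.l.e.: {closed_form:.4f}   numerical m.l.e.: {numerical:.4f}")
```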

Example: In a survival time study n cancer patients were observed for a fixed time T after their operation and, if the symptoms reappeared, the time X since the operation at which this happened was recorded. For r of these patients symptoms reappeared at times $x_1, x_2, \ldots, x_r$ after their operation, and the remaining n − r patients were still free of symptoms at the end of the time period T. If the time X to the return of symptoms has the exponential distribution with p.d.f.
$$f(x|\theta) = \theta e^{-\theta x}, \qquad x > 0,$$
find the m.l.e. of θ on the basis of the study results.

Solution: The information from the study is as follows:
$$X_1 = x_1,\ X_2 = x_2,\ \ldots,\ X_r = x_r,\qquad X_{r+1} > T,\ X_{r+2} > T,\ \ldots,\ X_n > T.$$
The likelihood of these results is
$$L(\theta) = \theta e^{-\theta x_1}\,\theta e^{-\theta x_2} \cdots \theta e^{-\theta x_r}\, e^{-\theta T} e^{-\theta T} \cdots e^{-\theta T}
= \theta^r \exp\left(-\theta \sum_{i=1}^{r} x_i\right)\exp\left(-(n-r)\theta T\right)
= \theta^r \exp\left(-\theta\left[\sum_{i=1}^{r} x_i + (n-r)T\right]\right)$$
and the log-likelihood is
$$\ell(\theta) = r\log\theta - \theta\left[\sum_{i=1}^{r} x_i + (n-r)T\right].$$
This clearly has its maximum at a turning point. Thus, differentiating and equating to zero, we get
$$\frac{\partial\ell}{\partial\theta} = 0 \iff \frac{r}{\theta} - \left[\sum_{i=1}^{r} x_i + (n-r)T\right] = 0,$$
i.e.
$$\dot\theta = \frac{r}{\sum_{i=1}^{r} x_i + (n-r)T}.$$
Hence the maximum likelihood estimator of θ is
$$\dot\theta(X) = \frac{R}{\sum_{i=1}^{R} X_i + (n-R)T},$$
where R is the number of patients whose symptoms return within time T after their operation.
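A small numerical sketch of this censored-data m.l.e. (the simulated study, its size n, follow-up time T and true rate are illustrative assumptions, not from the notes):

```python
# Sketch: m.l.e. of theta from right-censored exponential data.
import numpy as np

rng = np.random.default_rng(2)
n, T, theta_true = 100, 3.0, 0.4
times = rng.exponential(scale=1.0 / theta_true, size=n)   # latent relapse times

observed = times[times <= T]          # relapse times actually recorded
r = observed.size                     # number of relapses within follow-up T
theta_hat = r / (observed.sum() + (n - r) * T)
print(f"r = {r}, censored = {n - r}, m.l.e. = {theta_hat:.4f} (true theta = {theta_true})")
```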

Example: A random sample $X_1, X_2, \ldots, X_n$ is drawn from the $N(\mu, \sigma^2)$ distribution, providing the results $x_1, x_2, \ldots, x_n$. Find the m.l.e. of $\theta = (\mu, \sigma^2)$.

Solution: The likelihood of the results is
$$L(\theta; x) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{1}{2\sigma^2}(x_i-\mu)^2\right) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right)$$
and the log-likelihood is
$$\ell(\theta; x) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 + \text{constant}.$$
Differentiating with respect to µ and σ² we obtain the maximum likelihood equations
$$\frac{\partial\ell}{\partial\mu} = 0,\quad \frac{\partial\ell}{\partial\sigma^2} = 0 \iff \frac{1}{\dot\sigma^2}\sum_{i=1}^{n}\left(x_i - \dot\mu\right) = 0,\quad -\frac{n}{2\dot\sigma^2} + \frac{1}{2\dot\sigma^4}\sum_{i=1}^{n}\left(x_i-\dot\mu\right)^2 = 0.$$
From the first likelihood equation we get
$$\dot\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$
Putting this solution in the second equation we get
$$\dot\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.$$
Hence the maximum likelihood estimators are
$$\dot\mu(X) = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} \qquad\text{and}\qquad \dot\sigma^2(X) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$$
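A minimal sketch of this two-parameter case (simulated data only): note that the m.l.e. of σ² uses the divisor n rather than n − 1.

```python
# Sketch: m.l.e. of (mu, sigma^2) for a Normal sample.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=10.0, scale=2.0, size=500)

mu_hat = x.mean()                          # m.l.e. of mu
sigma2_hat = np.mean((x - mu_hat) ** 2)    # m.l.e. of sigma^2 (divisor n, not n - 1)
print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```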

Example: Let $X_1, X_2, \ldots, X_n$ be a random sample from the Uniform$\left[\theta - \tfrac12,\ \theta + \tfrac12\right]$ distribution. Find the m.l.e. of θ.

Solution: Given the results $(x_1, x_2, \ldots, x_n) = x$, their likelihood is
$$L(\theta; x) = \prod_{i=1}^{n} I_{[\theta-\frac12,\,\theta+\frac12]}(x_i)
= \prod_{i=1}^{n} I_{(-\infty,\,\theta+\frac12]}(x_i)\, I_{[\theta-\frac12,\,\infty)}(x_i)
= I_{(-\infty,\,\theta+\frac12]}\left(\max_{1\le i\le n} x_i\right) I_{[\theta-\frac12,\,\infty)}\left(\min_{1\le i\le n} x_i\right)
= I_{[\max x_i-\frac12,\,\infty)}(\theta)\, I_{(-\infty,\,\min x_i+\frac12]}(\theta)
= I_{[\max x_i-\frac12,\ \min x_i+\frac12]}(\theta),$$
where the set function $I_A(u)$ is defined as
$$I_A(u) = \begin{cases} 1 & \text{if } u \in A \\ 0 & \text{if } u \notin A. \end{cases}$$
From the plot of L(θ; x) in figure 4.4 we see that the likelihood is maximised by any $\dot\theta$ in the interval $\left[\max_{1\le i\le n} x_i - \tfrac12,\ \min_{1\le i\le n} x_i + \tfrac12\right]$, i.e. the maximum likelihood estimator is not unique. Possible candidates are
$$\theta_1(X) = \min_{1\le i\le n} X_i + \tfrac12,\qquad
\theta_2(X) = \max_{1\le i\le n} X_i - \tfrac12,\qquad
\theta_3(X) = \tfrac12\left(\max_{1\le i\le n} X_i + \min_{1\le i\le n} X_i\right).$$
This example also demonstrates that the m.l.e. is not always obtained by differentiating and equating to zero.

Figure 4.4: Likelihood function for observations $x_1, x_2, \ldots, x_n$ sampled from the Uniform$[\theta - \frac12, \theta + \frac12]$ distribution.
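Continuing the Uniform example, a short sketch (simulated data; not from the notes) computes the interval of maximisers and the three candidate estimators:

```python
# Sketch: non-unique m.l.e. for Uniform[theta - 1/2, theta + 1/2].
import numpy as np

rng = np.random.default_rng(4)
theta_true = 7.0
x = rng.uniform(theta_true - 0.5, theta_true + 0.5, size=50)

lower, upper = x.max() - 0.5, x.min() + 0.5   # every theta in [lower, upper] maximises L
theta1 = x.min() + 0.5
theta2 = x.max() - 0.5
theta3 = 0.5 * (x.max() + x.min())
print(f"interval of maximisers: [{lower:.4f}, {upper:.4f}]")
print(f"theta1 = {theta1:.4f}, theta2 = {theta2:.4f}, theta3 = {theta3:.4f}")
```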

4.2 Properties of Maximum likelihood estimators

4.2.1 Invariance principle

Suppose that φ(θ) is a function of θ and that $\dot\theta$ is the m.l.e. of θ. Then the m.l.e. of φ is given by
$$\dot\phi = \phi(\dot\theta),$$
i.e. it is obtained by evaluating the function φ at $\theta = \dot\theta$. If φ is a one-to-one function then this is obvious. If, however, φ is not one-to-one then the justification is not straightforward and is omitted. For example, since the m.l.e. of the exponential rate θ in the first example above is $\dot\theta = 1/\bar{x}$, the m.l.e. of the mean lifetime $\phi(\theta) = 1/\theta$ is $\dot\phi = \phi(\dot\theta) = \bar{x}$.

4.2.2 M.l.e. and most efficient estimators

If a most efficient unbiased estimator $\hat\theta$ of θ exists then, by 2.14,
$$\frac{\partial}{\partial\theta}\log f_X(x|\theta) = I(\theta)\left[\hat\theta(x) - \theta\right].$$
But under the same regularity conditions under which this result is true, the m.l.e. emerges as the solution of the likelihood equation
$$\frac{\partial}{\partial\theta}\log f_X(x|\theta) = 0,$$
i.e. the m.l.e. $\dot\theta$ satisfies
$$I(\dot\theta)\left[\hat\theta(x) - \dot\theta\right] = 0,$$
and since I(θ) > 0 for all θ we must have that
$$\dot\theta = \hat\theta(x),$$
i.e. we have the following result.

Result: If a most efficient unbiased estimator $\hat\theta$ of θ exists (i.e. $\hat\theta$ is unbiased and its variance is equal to the CRLB) then the maximum likelihood method of estimation will produce it.

4.2.3 M.l.e. and sufficiency

Recall that if T is a sufficient statistic for θ then, by the factorization theorem,
$$f_X(x|\theta) = g(t, \theta)\,h(x) \qquad\text{with } t = T(x).$$
Since the m.l.e. $\dot\theta(x)$ maximises $f_X(x|\theta)$ with respect to θ, it follows that when a sufficient statistic exists the m.l.e. $\dot\theta(x)$ maximises g(t, θ) with respect to θ; $\dot\theta$ must therefore depend on the sample observations only through the value t of the sufficient statistic. Since the sufficient statistic is arbitrary we have the following result.

Result: A maximum likelihood estimator is a function of every sufficient statistic of θ, including the minimal sufficient statistic.

4.2.4 Asymptotic properties of m.l.e.

By far the best justification for the use of the maximum likelihood method of estimation is the asymptotic behaviour of the maximum likelihood estimator. In particular, under some regularity conditions and provided the sample size is large enough, the maximum likelihood estimator produces, with very high probability, estimates very close to the true value of the parameter it is estimating (i.e. the m.l.e. is consistent). Further, under the same conditions the m.l.e. is approximately unbiased, has variance approximately equal to the CRLB and its distribution is approximately Normal. Thus for large sample sizes maximum likelihood estimators are approximately most efficient unbiased estimators of the parameters they estimate. In particular:

Theorem 4.2.1 Subject to mild regularity conditions the maximum likelihood estimator $\dot\theta(X)$ of θ, where X is a random sample of size n from the distribution with p.m/d.f. f(x|θ), is

1. weakly consistent, i.e.
$$\Pr\left(\left|\dot\theta(X) - \theta_0\right| \le \varepsilon\right) \to 1 \qquad\text{as } n \to \infty$$
however small the value ε > 0, where $\theta_0$ is the true value of the parameter θ;

2. asymptotically most efficient, unbiased and Normally distributed, i.e.
$$\dot\theta(X) \sim N\left(\theta_0, \frac{1}{I(\theta_0)}\right) \qquad\text{as } n \to \infty,$$
where $I(\theta_0)$ is the sample Fisher information evaluated at $\theta = \theta_0$.

If θ is a vector parameter of dimension k and $\dot\theta(X)$ is the vector m.l.e. of θ, then
$$\dot\theta(X) \sim MN_k\left(\theta_0, \mathbf{I}^{-1}(\theta_0)\right)$$
for large n, where $\mathbf{I}(\theta_0)$ is the sample Fisher information matrix.

Sketch of Proof: The proof requires more mathematics than we are prepared to use in this course but, for the case when θ is scalar, it runs along the following lines (the case when θ is a vector is very similar).

1. Consider the random variable
$$Z(\theta) = \log f(X|\theta),$$
which depends on θ. Its mean with respect to the true distribution of X is (assuming X to be continuous)
$$\mu(\theta) = \int_{\mathcal{X}} \log f(x|\theta)\, f(x|\theta_0)\,dx.$$
As a function of θ, µ(θ) attains a maximum at $\theta = \theta_0$. This follows from the fact that for all $\theta \neq \theta_0$
$$\mu(\theta) - \mu(\theta_0) = \int_{\mathcal{X}} \log\left(\frac{f(x|\theta)}{f(x|\theta_0)}\right) f(x|\theta_0)\,dx
< \int_{\mathcal{X}} \left\{\frac{f(x|\theta)}{f(x|\theta_0)} - 1\right\} f(x|\theta_0)\,dx \qquad (4.1)$$
since for any $u \neq 1$, $\log u < u - 1$. But the integral in 4.1 is zero, so we get that $\mu(\theta) < \mu(\theta_0)$ for all $\theta \neq \theta_0$, i.e. µ(θ) attains a maximum at $\theta = \theta_0$. Now
$$\log f_X(X|\theta) = \log\left(\prod_{i=1}^{n} f(X_i|\theta)\right) = \sum_{i=1}^{n} \log f(X_i|\theta) = \sum_{i=1}^{n} Z_i(\theta),$$
where the $Z_i(\theta)$ are independent random variables having the same distribution as Z(θ) defined above. But by the Law of Large Numbers, as $n \to \infty$,
$$\frac{1}{n}\sum_{i=1}^{n} Z_i(\theta) \to \mu(\theta),$$
i.e. for large n, $\frac{1}{n}\log f_X(X|\theta)$ is close to µ(θ) for all θ. Consequently the point of maximum of $\frac{1}{n}\log f_X(X|\theta)$, namely $\dot\theta(X)$, must be close to the point of maximum of µ(θ), which is $\theta_0$, provided the convergence of $\frac{1}{n}\log f_X(X|\theta)$ to µ(θ) is uniform in θ. Indeed, as $n \to \infty$, $\dot\theta(X) \to \theta_0$ in probability, which implies weak consistency.


2. Assuming that $\dot\theta(X)$ is a turning point of the log-likelihood, i.e. that it is a solution of the likelihood equation,
$$0 = \frac{\partial}{\partial\theta}\ell(\dot\theta; X) \qquad (4.2)$$
or, dropping X for convenience,
$$0 = \frac{\partial}{\partial\theta}\ell(\dot\theta),$$
where $\frac{\partial}{\partial\theta}\ell(\dot\theta)$ means $\left.\frac{\partial}{\partial\theta}\ell(\theta)\right|_{\theta=\dot\theta}$, the first derivative of ℓ(θ) evaluated at $\theta = \dot\theta$. Expanding the r.h.s. of 4.2 about $\theta_0$ we get
$$0 = \frac{\partial}{\partial\theta}\ell(\theta_0) + \left(\dot\theta - \theta_0\right)\frac{\partial^2}{\partial\theta^2}\ell(\theta_0) + \text{Remainder term}. \qquad (4.3)$$
Since $\dot\theta$ is weakly consistent, for large n the value of $\dot\theta$ will, with high probability, be close to $\theta_0$, so that with high probability the Remainder term in 4.3 will be negligible and can be ignored. Hence
$$\dot\theta - \theta_0 = -\left[\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)\right]^{-1}\frac{\partial}{\partial\theta}\ell(\theta_0) = -\left[\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)\right]^{-1} S(X, \theta_0),$$
or
$$\sqrt{n}\left(\dot\theta - \theta_0\right) = -\left[\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)\right]^{-1}\frac{1}{\sqrt{n}}\,S(X, \theta_0), \qquad (4.4)$$
where
$$S(X, \theta_0) = \left.\frac{\partial}{\partial\theta}\log f_X(X|\theta)\right|_{\theta=\theta_0},$$
the score function evaluated at $\theta = \theta_0$.

However, since the observations in a random sample are independent from the same distribution with p.m/d.f. f(x|θ), we have that
$$f_X(X|\theta) = \prod_{i=1}^{n} f(X_i|\theta)$$
and
$$\log f_X(X|\theta) = \sum_{i=1}^{n} \log f(X_i|\theta), \qquad (4.5)$$
or
$$S(X, \theta) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(X_i|\theta).$$
Hence
$$\frac{1}{\sqrt{n}}\,S(X, \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} S_i, \qquad (4.6)$$
where $S_i = \left.\frac{\partial}{\partial\theta}\log f(X_i|\theta)\right|_{\theta=\theta_0}$. But by the Central Limit Theorem, and recalling that $E(S_i) = 0$ (see 2.11),
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} S_i \to N\left(0, \operatorname{Var}(S_i)\right) = N\left(0, i(\theta_0)\right), \qquad (4.7)$$
since $\operatorname{Var}(S_i) = i(\theta_0)$ (see 2.12), where
$$i(\theta_0) = E\left(S_i^2\right) = \text{Fisher information in one observation} = E\left(-\left.\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta)\right|_{\theta=\theta_0}\right).$$

Further, because of 4.5, the term $\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\ell(\theta_0)$ in 4.4 satisfies
$$\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\ell(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0) \to E\left(\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta_0)\right) = -i(\theta_0) \qquad (4.8)$$
by the Law of Large Numbers. Hence from 4.4, 4.7 and 4.8 we have
$$\sqrt{n}\left(\dot\theta(X) - \theta_0\right) \to \left[i(\theta_0)\right]^{-1} N\left(0, i(\theta_0)\right) = N\left(0, \frac{1}{i(\theta_0)}\right)$$


or
$$\dot\theta(X) \to \frac{1}{\sqrt{n}}\,N\left(0, \frac{1}{i(\theta_0)}\right) + \theta_0 = N\left(\theta_0, \frac{1}{n\,i(\theta_0)}\right) = N\left(\theta_0, \frac{1}{I(\theta_0)}\right).$$

Corollary to the last theorem: Let X be a random sample of size n from a distribution with p.m/d.f. f(x|θ), where θ is a scalar parameter. Under mild regularity conditions
$$\frac{1}{\sqrt{n}}\,S(X, \theta) \to N\left(0, i(\theta)\right) \qquad\text{as } n \to \infty,$$
where S(X, θ) is the score statistic; i.e. for large sample size n, S(X, θ) is approximately $N(0, n\,i(\theta))$ distributed, or
$$S(X, \theta) \sim N\left(0, I(\theta)\right)$$
approximately.
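The theorem and its corollary are easy to visualise by simulation. A minimal sketch, assuming an Exponential(θ) model purely for illustration: the standardised m.l.e. $(\dot\theta - \theta_0)\sqrt{I(\theta_0)}$, with $I(\theta_0) = n/\theta_0^2$ for this model, should behave approximately like N(0, 1) for large n.

```python
# Sketch: asymptotic normality of the m.l.e. 1/x-bar for an Exponential(theta) sample.
import numpy as np

rng = np.random.default_rng(5)
theta0, n, reps = 0.5, 400, 2000

samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
mles = 1.0 / samples.mean(axis=1)              # m.l.e. 1/x-bar for each replicate
z = (mles - theta0) * np.sqrt(n) / theta0      # standardise by sqrt(I(theta0)) = sqrt(n)/theta0
print(f"mean of standardised m.l.e.s: {z.mean():.3f}  (theory: approx 0)")
print(f"s.d. of standardised m.l.e.s: {z.std():.3f}  (theory: approx 1)")
```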

The asymptotic properties of the maximum likelihood estimator of a scalar parameter θ can be extended to the maximum likelihood estimator of a real-valued function φ(θ) of the scalar parameter θ.

Theorem: Subject to mild regularity conditions the maximum likelihood estimator $\dot\phi(X)$ of a real-valued function φ(θ) of θ, where X is a random sample of size n from the distribution with p.m/d.f. f(x|θ), is

1. weakly consistent, i.e.
$$\Pr\left(\left|\dot\phi(X) - \phi(\theta_0)\right| \le \varepsilon\right) \to 1 \qquad\text{as } n \to \infty$$
however small the value ε > 0, where $\theta_0$ is the true value of the parameter θ;

2. asymptotically most efficient, unbiased and Normally distributed, i.e.
$$\dot\phi(X) \sim N\left(\phi(\theta_0), \frac{\left[\phi'(\theta_0)\right]^2}{I(\theta_0)}\right) \qquad\text{as } n \to \infty,$$
where $I(\theta_0)$ is the sample Fisher information evaluated at $\theta = \theta_0$ and $\phi'(\theta_0)$ is the derivative $\frac{d\phi(\theta)}{d\theta}$ evaluated at $\theta = \theta_0$.


The proof of this theorem can be obtained by adjusting the proof of the previous theorem, or by using (a) the invariance principle of maximum likelihood estimation, (b) the last theorem and (c) the fact that, given a real-valued function φ(θ) of θ, the Fisher information
$$I(\phi) = I(\theta)\bigg/\left[\frac{d\phi}{d\theta}\right]^2.$$
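A quick simulation of the φ(θ) version of the theorem (again the Exponential(θ) model, chosen only for illustration): by the invariance principle the m.l.e. of the mean φ(θ) = 1/θ is $\dot\phi = \bar X$, and the theorem predicts its variance to be approximately $[\phi'(\theta_0)]^2 / I(\theta_0) = 1/(n\theta_0^2)$.

```python
# Sketch: asymptotic variance of phi-dot = x-bar for phi(theta) = 1/theta, Exponential(theta).
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 300, 2000

samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
phi_hat = samples.mean(axis=1)                   # m.l.e. of phi(theta) = 1/theta is x-bar
predicted_sd = 1.0 / (np.sqrt(n) * theta0)       # sqrt([phi'(theta0)]^2 / I(theta0))
print(f"empirical s.d. of phi-hat: {phi_hat.std():.5f}")
print(f"predicted  s.d. of phi-hat: {predicted_sd:.5f}")
```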

4.3 Asymptotic confidence intervals for θ

4.3.1 Asymptotic confidence intervals using m.l.e.

Let $X_1, X_2, \ldots, X_n$ be a random sample from the N(0, θ) distribution with n large. The sample joint p.d.f. is
$$f_X(x|\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta}}\exp\left(-\frac{1}{2\theta}x_i^2\right) = (2\pi\theta)^{-n/2}\exp\left(-\frac{1}{2\theta}\sum_{i=1}^{n}x_i^2\right)$$
and
$$\log f_X(x|\theta) = -\frac{n}{2}\log\theta - \frac{1}{2\theta}\sum_{i=1}^{n}x_i^2 + \text{constant}.$$
Hence
$$S(x) = \frac{\partial}{\partial\theta}\log f_X(x|\theta) = -\frac{n}{2\theta} + \frac{\sum_{i=1}^{n}x_i^2}{2\theta^2}$$
and, equating to zero, we get
$$\dot\theta(x) = \frac{\sum_{i=1}^{n}x_i^2}{n}$$
as the m.l.e. of θ. Further,
$$\frac{\partial^2}{\partial\theta^2}\log f_X(x|\theta) = \frac{n}{2\theta^2} - \frac{\sum_{i=1}^{n}x_i^2}{\theta^3}$$


so that
$$I(\theta) = -E\left(\frac{\partial^2}{\partial\theta^2}\log f_X(X|\theta)\right) = E\left(-\frac{n}{2\theta^2} + \frac{\sum_{i=1}^{n}X_i^2}{\theta^3}\right) = -\frac{n}{2\theta^2} + \frac{n\,E(X_1^2)}{\theta^3}$$
and, since $E(X_1) = 0$ so that $E(X_1^2) = \operatorname{Var}(X_1) = \theta$,
$$I(\theta) = -\frac{n}{2\theta^2} + \frac{n\theta}{\theta^3} = \frac{n}{2\theta^2}.$$
Since n is assumed to be large, we therefore have from theorem 4.2.1 that the m.l. estimator
$$\frac{1}{n}\sum_{i=1}^{n}X_i^2 \sim N\left(\theta, \frac{1}{I(\theta)}\right) = N\left(\theta, \frac{2\theta^2}{n}\right)$$
asymptotically. Hence, approximately,
$$\Pr\left(-1.96 \le \frac{\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \theta}{\theta\sqrt{\frac{2}{n}}} \le 1.96\right) = 0.95, \qquad (4.9)$$
i.e.
$$\Pr\left(\left(1 - 1.96\sqrt{\tfrac{2}{n}}\right)\theta \le \frac{1}{n}\sum_{i=1}^{n}X_i^2 \le \left(1 + 1.96\sqrt{\tfrac{2}{n}}\right)\theta\right) = 0.95,$$
i.e.
$$\Pr\left(\frac{\frac{1}{n}\sum_{i=1}^{n}X_i^2}{1 + 1.96\sqrt{\frac{2}{n}}} \le \theta \le \frac{\frac{1}{n}\sum_{i=1}^{n}X_i^2}{1 - 1.96\sqrt{\frac{2}{n}}}\right) = 0.95. \qquad (4.10)$$
Equation 4.10 states that the random interval
$$\left(\frac{\frac{1}{n}\sum_{i=1}^{n}X_i^2}{1 + 1.96\sqrt{\frac{2}{n}}},\ \ \frac{\frac{1}{n}\sum_{i=1}^{n}X_i^2}{1 - 1.96\sqrt{\frac{2}{n}}}\right) \qquad (4.11)$$
has probability 95% of positioning itself so that it includes the fixed but unknown value of the parameter θ. It therefore constitutes a 95% confidence interval estimator of θ.
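A minimal numerical sketch of interval 4.11 (simulated N(0, θ) data; all settings are illustrative):

```python
# Sketch: 95% confidence interval (4.11) for theta in the N(0, theta) model.
import numpy as np

rng = np.random.default_rng(7)
theta_true, n = 4.0, 200
x = rng.normal(0.0, np.sqrt(theta_true), size=n)

theta_hat = np.mean(x ** 2)                  # m.l.e. of theta
half = 1.96 * np.sqrt(2.0 / n)
lower, upper = theta_hat / (1 + half), theta_hat / (1 - half)
print(f"m.l.e. = {theta_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```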

Figure 4.5: The percentile $z_{\alpha/2}$ of the standard Normal distribution: the standard Normal p.d.f. with upper-tail area α/2 to the right of $z_{\alpha/2}$ and area 1 − α/2 to its left.

It is instructive to note how this confidence interval was obtained starting from equation 4.9. A probabilistic statement about the standardised m.l.e.
$$\frac{\dot\theta(X) - \theta}{1/\sqrt{I(\theta)}} = \frac{\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \theta}{\theta\sqrt{\frac{2}{n}}}$$
was seen, after some re-arrangement, to define a random interval (i.e. a region whose position is random) which has a certain probability of positioning itself so that it contains within it the unknown value of the parameter θ. Generalising this approach provides us with a means of obtaining confidence intervals, as follows. Since from theorem 4.2.1 we have, approximately for large sample sizes, that $\dot\theta(X) \sim N\left(\theta, \frac{1}{I(\theta)}\right)$, it follows that
$$\frac{\dot\theta(X) - \theta}{1/\sqrt{I(\theta)}} = \left(\dot\theta(X) - \theta\right)\sqrt{I(\theta)} \sim N(0, 1),$$
i.e.
$$\Pr\left(-z_{\alpha/2} \le \left(\dot\theta(X) - \theta\right)\sqrt{I(\theta)} \le z_{\alpha/2}\right) = 1 - \alpha, \qquad (4.12)$$
where $z_{\alpha/2}$ is such that $\Phi(z_{\alpha/2}) = 1 - \alpha/2$, Φ being the distribution function of the N(0, 1) distribution (see figure 4.5).

The inequality
$$-z_{\alpha/2} \le \left(\dot\theta(X) - \theta\right)\sqrt{I(\theta)} \le z_{\alpha/2} \qquad (4.13)$$
in 4.12 defines a random region CR(X) in the parameter space Θ, and equation 4.12 can now be interpreted as
$$\Pr\left(CR(X) \text{ contains } \theta\right) = 1 - \alpha,$$
i.e. the probability that CR(X) positions itself so that it contains within it the value of the parameter θ is 1 − α. Hence CR(X) constitutes a 100(1 − α)% confidence region for the true value of θ. In most cases, as in the example at the start of this section, it will be easy to identify the region CR(X). There will be occasions, however, when it will be difficult to identify the region CR(X) from equation 4.13. If it is not possible to do so, then an approximation to CR(X) can be obtained by evaluating the Fisher information in 4.13 at the m.l.e. $\dot\theta$ instead of at θ. In that event 4.13 becomes
$$-z_{\alpha/2} \le \left(\dot\theta(X) - \theta\right)\sqrt{I(\dot\theta)} \le z_{\alpha/2},$$
i.e.
$$\dot\theta(X) - \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}} \le \theta \le \dot\theta(X) + \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}},$$
i.e. the interval
$$\left(\dot\theta(X) - \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}},\ \ \dot\theta(X) + \frac{z_{\alpha/2}}{\sqrt{I(\dot\theta)}}\right) \qquad (4.14)$$
is an approximate 100(1 − α)% confidence interval estimator of θ.

Example: Continuing with the last example and introducing the approximation $I(\dot\theta)$ in 4.9, we see that the 95% confidence interval for θ obtained in 4.11 can be approximated by the interval
$$\left(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - 1.96\sqrt{\frac{2}{n}}\,\frac{1}{n}\sum_{i=1}^{n}X_i^2,\ \ \frac{1}{n}\sum_{i=1}^{n}X_i^2 + 1.96\sqrt{\frac{2}{n}}\,\frac{1}{n}\sum_{i=1}^{n}X_i^2\right),$$
i.e.
$$\left(\frac{1}{n}\sum_{i=1}^{n}X_i^2\left(1 - 1.96\sqrt{\frac{2}{n}}\right),\ \ \frac{1}{n}\sum_{i=1}^{n}X_i^2\left(1 + 1.96\sqrt{\frac{2}{n}}\right)\right).$$

4.3.2 Asymptotic confidence intervals using the score statistic

We have seen in the corollary to theorem 4.2.1 that for large sample sizes $S(X) \sim N(0, I(\theta))$, i.e.
$$\frac{S(X)}{\sqrt{I(\theta)}} \sim N(0, 1).$$
Hence we have, approximately, that
$$\Pr\left(-z_{\alpha/2} \le \frac{S(X)}{\sqrt{I(\theta)}} \le z_{\alpha/2}\right) = 1 - \alpha. \qquad (4.15)$$
Once again the inequality
$$-z_{\alpha/2} \le \frac{S(X)}{\sqrt{I(\theta)}} \le z_{\alpha/2} \qquad (4.16)$$
defines a region CR(X) in the parameter space Θ whose position is determined by the random vector X; its position is therefore regarded as random. Equation 4.15 is therefore interpreted to say that there is approximately probability 1 − α that the random region CR(X) will position itself so that it contains within it the unknown value of the parameter θ. Hence CR(X) constitutes a 100(1 − α)% confidence interval estimator of the parameter θ. If the region CR(X) is difficult to identify from the inequality 4.16 then it can be found approximately by replacing I(θ) with $I(\dot\theta)$ in 4.16.

Example: Let $X_1, X_2, \ldots, X_n$ be a random sample from the Poisson distribution with parameter θ. Find an approximate 95% confidence interval for θ.


Solution: The joint p.m.f. of the sample is
$$f_X(x|\theta) = \prod_{i=1}^{n}\frac{e^{-\theta}\theta^{x_i}}{x_i!} = \frac{e^{-n\theta}\,\theta^{\sum_{i=1}^{n}x_i}}{\prod_{i=1}^{n}x_i!}.$$
Hence
$$\log f_X(x|\theta) = -n\theta + \sum_{i=1}^{n}x_i\log\theta + \text{constant}$$
and
$$S(X) = \frac{\partial}{\partial\theta}\log f_X(X|\theta) = -n + \frac{\sum_{i=1}^{n}X_i}{\theta}.$$
Further,
$$I(\theta) = E\left(-\frac{\partial^2}{\partial\theta^2}\log f_X(x|\theta)\right) = E\left(\frac{\sum_{i=1}^{n}X_i}{\theta^2}\right) = \frac{n\,E(X_1)}{\theta^2} = \frac{n\theta}{\theta^2} = \frac{n}{\theta}.$$
Thus for large n we have, approximately,
$$\Pr\left(-1.96 \le \frac{\frac{\sum_{i=1}^{n}X_i}{\theta} - n}{\sqrt{\frac{n}{\theta}}} \le 1.96\right) = 0.95, \qquad (4.17)$$
i.e., with $\bar X = \frac{1}{n}\sum_{i=1}^{n}X_i$,
$$\Pr\left(-1.96 \le \frac{\sqrt{n}\left(\bar X - \theta\right)}{\sqrt{\theta}} \le 1.96\right) = 0.95,$$
or equivalently
$$\Pr\left(\frac{n\left(\bar X - \theta\right)^2}{\theta} \le 1.96^2\right) = 0.95,$$
i.e.
$$\Pr\left(n\left(\bar X - \theta\right)^2 \le 1.96^2\,\theta\right) = 0.95,$$
or
$$\Pr\left(n\bar X^2 - \left(2n\bar X + 1.96^2\right)\theta + n\theta^2 \le 0\right) = 0.95.$$

Figure 4.6: The interval between the roots of the quadratic $n\theta^2 - \left(2n\bar X + 1.96^2\right)\theta + n\bar X^2$ is an approximate 95% confidence interval for θ.

Thus there is probability approximately 95% that the quadratic
$$n\theta^2 - \left(2n\bar X + 1.96^2\right)\theta + n\bar X^2,$$
with random coefficients (i.e. with random position), will position itself so that the true value of the parameter θ falls in the region where the quadratic is negative, i.e. there is probability approximately 95% that the quadratic will position itself so that the true value of the parameter θ lies between the roots $\theta_1$ and $\theta_2$ of the quadratic, i.e.
$$\Pr\left(\theta_1(X) \le \theta \le \theta_2(X)\right) = 0.95.$$
Hence $(\theta_1(X), \theta_2(X))$ is an approximate 95% confidence interval for θ. But
$$\theta_{1,2} = \frac{\left(2n\bar X + 1.96^2\right) \mp \sqrt{\left(2n\bar X + 1.96^2\right)^2 - 4n^2\bar X^2}}{2n} = \bar X + \frac{1.96^2}{2n} \mp \frac{1.96}{\sqrt{2n}}\sqrt{2\bar X + \frac{1.96^2}{2n}}.$$
Thus
$$\left(\bar X + \frac{1.96^2}{2n} - \frac{1.96}{\sqrt{2n}}\sqrt{2\bar X + \frac{1.96^2}{2n}},\ \ \bar X + \frac{1.96^2}{2n} + \frac{1.96}{\sqrt{2n}}\sqrt{2\bar X + \frac{1.96^2}{2n}}\right)$$

is a 95% confidence interval for θ.

An approximation to this confidence interval could be obtained by replacing in 4.17 the expression for I(θ) by $I(\dot\theta)$. Since $\dot\theta = \frac{1}{n}\sum_{i=1}^{n}X_i$, an approximation to 4.17 is
$$\Pr\left(-1.96 \le \frac{\frac{\sum_{i=1}^{n}X_i}{\theta} - n}{\sqrt{\frac{n}{\frac{1}{n}\sum_{i=1}^{n}X_i}}} \le 1.96\right) = 0.95,$$
i.e.
$$\Pr\left(n - \frac{1.96\,n}{\sqrt{\sum_{i=1}^{n}X_i}} \le \frac{\sum_{i=1}^{n}X_i}{\theta} \le n + \frac{1.96\,n}{\sqrt{\sum_{i=1}^{n}X_i}}\right) = 0.95,$$
i.e.
$$\Pr\left(\frac{\sum_{i=1}^{n}X_i}{n\left(1 + \frac{1.96}{\sqrt{\sum_{i=1}^{n}X_i}}\right)} \le \theta \le \frac{\sum_{i=1}^{n}X_i}{n\left(1 - \frac{1.96}{\sqrt{\sum_{i=1}^{n}X_i}}\right)}\right) = 0.95,$$
i.e.
$$\Pr\left(\frac{\bar X}{1 + \frac{1.96}{\sqrt{n\bar X}}} \le \theta \le \frac{\bar X}{1 - \frac{1.96}{\sqrt{n\bar X}}}\right) = 0.95,$$
i.e.
$$\Pr\left(\frac{\sqrt{n}\,\bar X^{3/2}}{\sqrt{n\bar X} + 1.96} \le \theta \le \frac{\sqrt{n}\,\bar X^{3/2}}{\sqrt{n\bar X} - 1.96}\right) = 0.95.$$
Hence
$$\left(\frac{\sqrt{n}\,\bar X^{3/2}}{\sqrt{n\bar X} + 1.96},\ \ \frac{\sqrt{n}\,\bar X^{3/2}}{\sqrt{n\bar X} - 1.96}\right)$$
is an approximate 95% confidence interval for θ.
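A small numerical sketch of this score-based interval for simulated Poisson counts (the data and settings are illustrative assumptions, not from the notes):

```python
# Sketch: approximate 95% score confidence interval for a Poisson mean,
# using the closed form derived above with I(theta) evaluated at theta-hat.
import numpy as np

rng = np.random.default_rng(8)
theta_true, n = 3.0, 150
x = rng.poisson(theta_true, size=n)

xbar = x.mean()
s = np.sqrt(n * xbar)                        # sqrt(n * x-bar)
lower = np.sqrt(n) * xbar ** 1.5 / (s + 1.96)
upper = np.sqrt(n) * xbar ** 1.5 / (s - 1.96)
print(f"x-bar = {xbar:.3f}, approximate 95% CI = ({lower:.3f}, {upper:.3f})")
```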