Lecture Notes 4: Convergence (Chapter 5)

1 Random Samples

Let $X_1, \ldots, X_n \sim F$. A statistic is any function $T_n = g(X_1, \ldots, X_n)$. Recall that the sample mean is

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$

and the sample variance is

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

Let $\mu = E(X_i)$ and $\sigma^2 = \mathrm{Var}(X_i)$. Recall that

$$E(\bar{X}_n) = \mu, \qquad \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}, \qquad E(S_n^2) = \sigma^2.$$
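To check these identities numerically, here is a short Python simulation sketch; the Gamma(2, 2) population, the sample size $n = 10$, the number of replications, and the seed are arbitrary illustrative choices.

    import numpy as np

    # Monte Carlo check that E(X_bar) = mu, Var(X_bar) = sigma^2 / n and E(S_n^2) = sigma^2.
    # The Gamma(shape=2, scale=2) population is an arbitrary illustrative choice.
    rng = np.random.default_rng(0)
    shape, scale, n, reps = 2.0, 2.0, 10, 200_000
    mu, sigma2 = shape * scale, shape * scale**2      # mean and variance of the Gamma population

    X = rng.gamma(shape, scale, size=(reps, n))       # reps independent samples of size n
    xbar = X.mean(axis=1)
    s2 = X.var(axis=1, ddof=1)                        # ddof=1 gives the 1/(n-1) version

    print(xbar.mean(), mu)                            # close to mu
    print(xbar.var(), sigma2 / n)                     # close to sigma^2 / n
    print(s2.mean(), sigma2)                          # close to sigma^2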

Theorem 1 If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ then $\bar{X}_n \sim N(\mu, \sigma^2/n)$.

Proof. We know that $M_{X_i}(s) = e^{\mu s + \sigma^2 s^2/2}$. So, using independence,

$$M_{\bar{X}_n}(t) = E\left(e^{t\bar{X}_n}\right) = E\left(e^{\frac{t}{n}\sum_{i=1}^n X_i}\right) = \left(E\, e^{tX_i/n}\right)^n = \left(M_{X_i}(t/n)\right)^n = \left(e^{\mu t/n + \sigma^2 t^2/(2n^2)}\right)^n = \exp\left(\mu t + \frac{\sigma^2 t^2}{2n}\right),$$

which is the mgf of a $N(\mu, \sigma^2/n)$. $\square$

Example 2 Let $X_{(1)}, \ldots, X_{(n)}$ denote the ordered values: $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$. Then $X_{(1)}, \ldots, X_{(n)}$ are called the order statistics, and $T_n = (X_{(1)}, \ldots, X_{(n)})$ is a statistic.
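Here is a quick simulation sketch of Theorem 1: for Normal data the sample mean should match the $N(\mu, \sigma^2/n)$ law exactly, up to Monte Carlo error. The values of $\mu$, $\sigma$, $n$, and the seed are arbitrary choices.

    import numpy as np
    from scipy.stats import kstest

    # Simulation check of Theorem 1: for N(mu, sigma^2) data, X_bar is exactly N(mu, sigma^2/n).
    rng = np.random.default_rng(1)
    mu, sigma, n, reps = 3.0, 2.0, 5, 100_000
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

    # Kolmogorov-Smirnov distance to the claimed N(mu, sigma^2/n) law; it should be tiny.
    print(kstest(xbar, "norm", args=(mu, sigma / np.sqrt(n))))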


2 Convergence

Let $X_1, X_2, \ldots$ be a sequence of random variables and let $X$ be another random variable. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$. We are going to study different types of convergence.

Example: A good example to keep in mind is the following. Let $Y_1, Y_2, \ldots$ be a sequence of iid random variables. Let

$$X_n = \frac{1}{n}\sum_{i=1}^n Y_i$$

be the average of the first $n$ of the $Y_i$'s. This defines a new sequence $X_1, X_2, \ldots$. In other words, the sequence of interest $X_1, X_2, \ldots$ might be a sequence of statistics based on some other sequence of iid random variables. Note that the original sequence $Y_1, Y_2, \ldots$ is iid but the sequence $X_1, X_2, \ldots$ is not iid.

1. $X_n$ converges almost surely to $X$, written $X_n \xrightarrow{a.s.} X$, if, for every $\varepsilon > 0$,
$$P\left(\lim_{n\to\infty} |X_n - X| < \varepsilon\right) = 1. \tag{1}$$
$X_n$ converges almost surely to a constant $c$, written $X_n \xrightarrow{a.s.} c$, if
$$P\left(\lim_{n\to\infty} X_n = c\right) = 1. \tag{2}$$

2. $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\varepsilon > 0$,
$$P(|X_n - X| > \varepsilon) \to 0 \tag{3}$$
as $n \to \infty$. In other words, $X_n - X = o_P(1)$.
$X_n$ converges to $c$ in probability, written $X_n \xrightarrow{P} c$, if, for every $\varepsilon > 0$,
$$P(|X_n - c| > \varepsilon) \to 0 \tag{4}$$
as $n \to \infty$. In other words, $X_n - c = o_P(1)$.

3. $X_n$ converges to $X$ in quadratic mean (also called convergence in $L_2$), written $X_n \xrightarrow{qm} X$, if
$$E(X_n - X)^2 \to 0 \tag{5}$$
as $n \to \infty$.
$X_n$ converges to $c$ in quadratic mean, written $X_n \xrightarrow{qm} c$, if
$$E(X_n - c)^2 \to 0 \tag{6}$$
as $n \to \infty$.

4. $X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if
$$\lim_{n\to\infty} F_n(t) = F(t) \tag{7}$$
at all $t$ for which $F$ is continuous.
$X_n$ converges to $c$ in distribution, written $X_n \rightsquigarrow c$, if
$$\lim_{n\to\infty} F_n(t) = \delta_c(t) \tag{8}$$
at all $t \neq c$, where $\delta_c(t) = 0$ if $t < c$ and $\delta_c(t) = 1$ if $t \geq c$.

Theorem 3 Convergence in probability does not imply almost sure convergence.

Proof. Let $\Omega = [0,1]$ and let $P$ be the uniform distribution on $[0,1]$. We draw $S \sim P$. Let $X(s) = s$ and let
$$X_1 = s + I_{[0,1]}(s), \quad X_2 = s + I_{[0,1/2]}(s), \quad X_3 = s + I_{[1/2,1]}(s),$$
$$X_4 = s + I_{[0,1/3]}(s), \quad X_5 = s + I_{[1/3,2/3]}(s), \quad X_6 = s + I_{[2/3,1]}(s),$$
etc. Then $X_n \xrightarrow{P} X$. But, for each $s$, $X_n(s)$ does not converge to $X(s)$. Hence, $X_n$ does not converge almost surely to $X$. In fact, $P(\{s \in \Omega : \lim_n X_n(s) = X(s)\}) = 0$. $\square$

Example 4 Let $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is concentrating at $0$ so we would like to say that $X_n$ converges to $0$. Let's see if this is true. Let $F$ be the distribution function of a point mass at $0$. Note that $\sqrt{n}\,X_n \sim N(0,1)$. Let $Z$ denote a standard normal random variable. For $t < 0$,
$$F_n(t) = P(X_n \leq t) = P(\sqrt{n}\,X_n \leq \sqrt{n}\,t) = P(Z \leq \sqrt{n}\,t) \to 0$$
since $\sqrt{n}\,t \to -\infty$. For $t > 0$,
$$F_n(t) = P(X_n \leq t) = P(\sqrt{n}\,X_n \leq \sqrt{n}\,t) = P(Z \leq \sqrt{n}\,t) \to 1$$
since $\sqrt{n}\,t \to \infty$. Hence, $F_n(t) \to F(t)$ for all $t \neq 0$ and so $X_n \rightsquigarrow 0$. Notice that $F_n(0) = 1/2 \neq F(0) = 1$, so convergence fails at $t = 0$. That doesn't matter because $t = 0$ is not a continuity point of $F$, and the definition of convergence in distribution only requires convergence at continuity points.

Now consider convergence in probability. For any $\varepsilon > 0$, using Markov's inequality,
$$P(|X_n| > \varepsilon) = P(|X_n|^2 > \varepsilon^2) \leq \frac{E(X_n^2)}{\varepsilon^2} = \frac{1/n}{\varepsilon^2} \to 0$$
as $n \to \infty$. Hence, $X_n \xrightarrow{P} 0$.
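Example 4 can be checked by simulation; here is a sketch. The value $\varepsilon = 0.1$, the grid of sample sizes, and the seed are arbitrary choices.

    import numpy as np

    # Example 4 numerically: X_n ~ N(0, 1/n) concentrates at 0, so P(|X_n| > eps) -> 0.
    rng = np.random.default_rng(2)
    eps, reps = 0.1, 100_000
    for n in (1, 10, 100, 1000):
        xn = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)   # standard deviation sqrt(1/n)
        print(n, np.mean(np.abs(xn) > eps))                  # estimate of P(|X_n| > eps)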

The next theorem gives the relationship between the types of convergence.

Theorem 5 The following relationships hold:
(a) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{P} X$.
(b) $X_n \xrightarrow{P} X$ implies that $X_n \rightsquigarrow X$.
(c) If $X_n \rightsquigarrow X$ and if $P(X = c) = 1$ for some real number $c$, then $X_n \xrightarrow{P} X$.
(d) $X_n \xrightarrow{a.s.} X$ implies that $X_n \xrightarrow{P} X$.
In general, none of the reverse implications hold except the special case in (c).

Proof. We start by proving (a). Suppose that $X_n \xrightarrow{qm} X$. Fix $\varepsilon > 0$. Then, using Markov's inequality,
$$P(|X_n - X| > \varepsilon) = P(|X_n - X|^2 > \varepsilon^2) \leq \frac{E|X_n - X|^2}{\varepsilon^2} \to 0.$$

Proof of (b). Fix $\varepsilon > 0$ and let $x$ be a continuity point of $F$. Then
$$F_n(x) = P(X_n \leq x) = P(X_n \leq x, X \leq x + \varepsilon) + P(X_n \leq x, X > x + \varepsilon) \leq P(X \leq x + \varepsilon) + P(|X_n - X| > \varepsilon) = F(x + \varepsilon) + P(|X_n - X| > \varepsilon).$$
Also,
$$F(x - \varepsilon) = P(X \leq x - \varepsilon) = P(X \leq x - \varepsilon, X_n \leq x) + P(X \leq x - \varepsilon, X_n > x) \leq F_n(x) + P(|X_n - X| > \varepsilon).$$
Hence,
$$F(x - \varepsilon) - P(|X_n - X| > \varepsilon) \leq F_n(x) \leq F(x + \varepsilon) + P(|X_n - X| > \varepsilon).$$
Take the limit as $n \to \infty$ to conclude that
$$F(x - \varepsilon) \leq \liminf_{n\to\infty} F_n(x) \leq \limsup_{n\to\infty} F_n(x) \leq F(x + \varepsilon).$$
This holds for all $\varepsilon > 0$. Take the limit as $\varepsilon \to 0$, use the fact that $F$ is continuous at $x$, and conclude that $\lim_n F_n(x) = F(x)$.

Proof of (c). Fix $\varepsilon > 0$. Since $c - \varepsilon$ and $c + \varepsilon$ are continuity points of $F$ (the cdf of a point mass at $c$), we have
$$P(|X_n - c| > \varepsilon) = P(X_n < c - \varepsilon) + P(X_n > c + \varepsilon) \leq P(X_n \leq c - \varepsilon) + P(X_n > c + \varepsilon) = F_n(c - \varepsilon) + 1 - F_n(c + \varepsilon) \to F(c - \varepsilon) + 1 - F(c + \varepsilon) = 0 + 1 - 1 = 0.$$

Proof of (d). Omitted.

Let us now show that the reverse implications do not hold.

Convergence in probability does not imply convergence in quadratic mean. Let $U \sim \mathrm{Unif}(0,1)$ and let $X_n = \sqrt{n}\, I_{(0,1/n)}(U)$. Then
$$P(|X_n| > \varepsilon) = P(\sqrt{n}\, I_{(0,1/n)}(U) > \varepsilon) = P(0 \leq U < 1/n) = 1/n \to 0.$$
Hence, $X_n \xrightarrow{P} 0$. But $E(X_n^2) = n \int_0^{1/n} du = 1$ for all $n$, so $X_n$ does not converge in quadratic mean.

Convergence in distribution does not imply convergence in probability. Let $X \sim N(0,1)$. Let $X_n = -X$ for $n = 1, 2, 3, \ldots$; hence $X_n \sim N(0,1)$. $X_n$ has the same distribution function as $X$ for all $n$ so, trivially, $\lim_n F_n(x) = F(x)$ for all $x$. Therefore, $X_n \rightsquigarrow X$. But
$$P(|X_n - X| > \varepsilon) = P(|2X| > \varepsilon) = P(|X| > \varepsilon/2) \neq 0.$$
So $X_n$ does not converge to $X$ in probability. $\square$

The relationships between the types of convergence can be summarized as follows:

         q.m.
           ↓
a.s.  →  prob  →  distribution
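Here is a short Python sketch of the first counterexample above (convergence in probability without convergence in quadratic mean); $\varepsilon = 0.5$, the values of $n$, and the seed are arbitrary choices.

    import numpy as np

    # X_n = sqrt(n) * 1{0 < U < 1/n} with U ~ Unif(0,1):
    # P(|X_n| > eps) = 1/n -> 0, yet E(X_n^2) = 1 for every n (no convergence in quadratic mean).
    rng = np.random.default_rng(3)
    eps, reps = 0.5, 500_000
    for n in (10, 100, 1000):
        u = rng.uniform(size=reps)
        xn = np.sqrt(n) * (u < 1.0 / n)
        print(n, np.mean(np.abs(xn) > eps), np.mean(xn**2))  # roughly 1/n and roughly 1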

Example 6 One might conjecture that if $X_n \xrightarrow{P} b$, then $E(X_n) \to b$. This is not true. Let $X_n$ be a random variable defined by $P(X_n = n^2) = 1/n$ and $P(X_n = 0) = 1 - (1/n)$. Now, $P(|X_n| < \varepsilon) = P(X_n = 0) = 1 - (1/n) \to 1$. Hence, $X_n \xrightarrow{P} 0$. However, $E(X_n) = [n^2 \times (1/n)] + [0 \times (1 - (1/n))] = n$. Thus, $E(X_n) \to \infty$.

Example 7 Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0,1)$. Let $X_{(n)} = \max_i X_i$. First we claim that $X_{(n)} \xrightarrow{P} 1$. This follows since
$$P(|X_{(n)} - 1| > \varepsilon) = P(X_{(n)} \leq 1 - \varepsilon) = \prod_i P(X_i \leq 1 - \varepsilon) = (1 - \varepsilon)^n \to 0.$$
Also,
$$P(n(1 - X_{(n)}) \leq t) = P(X_{(n)} \geq 1 - (t/n)) = 1 - (1 - t/n)^n \to 1 - e^{-t}.$$
So $n(1 - X_{(n)}) \rightsquigarrow \mathrm{Exp}(1)$.
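Here is a simulation sketch of Example 7; the sample size $n = 200$, the number of replications, and the seed are arbitrary choices.

    import numpy as np

    # Example 7 numerically: for Uniform(0,1) data, n * (1 - max_i X_i) is approximately Exp(1).
    rng = np.random.default_rng(4)
    n, reps = 200, 50_000
    xmax = rng.uniform(size=(reps, n)).max(axis=1)
    w = n * (1.0 - xmax)
    for t in (0.5, 1.0, 2.0):
        print(t, np.mean(w <= t), 1.0 - np.exp(-t))          # empirical cdf vs 1 - e^{-t}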

Some convergence properties are preserved under transformations.

Theorem 8 Let $X_n$, $X$, $Y_n$, $Y$ be random variables.
(a) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$.
(b) If $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y$, then $X_n + Y_n \xrightarrow{qm} X + Y$.
(c) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$.

In general, $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ does not imply that $X_n + Y_n \rightsquigarrow X + Y$. But there are cases when it does:

Theorem 9 (Slutzky's Theorem) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n + Y_n \rightsquigarrow X + c$. Also, if $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n Y_n \rightsquigarrow cX$.

Theorem 10 (The Continuous Mapping Theorem) Let $X_n$, $X$ be random variables and let $g$ be a continuous function.
(a) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.
(b) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$.

Exercise: Prove the continuous mapping theorem.
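Here is a small simulation sketch of Slutzky's theorem. With exponential(1) data (mean 1, variance 1), the standardized mean converges in distribution to $N(0,1)$ while the sample mean converges in probability to 1, so their sum is approximately $N(1,1)$ and their product approximately $N(0,1)$. The distribution, $n$, and the seed are arbitrary choices.

    import numpy as np
    from scipy.stats import kstest

    # Slutzky's theorem numerically: with exponential(1) data (mean 1, variance 1),
    # A_n = sqrt(n) * (xbar - 1) ~~> N(0, 1) and B_n = xbar -> 1 in probability,
    # so A_n + B_n is approximately N(1, 1) and A_n * B_n is approximately N(0, 1).
    rng = np.random.default_rng(5)
    n, reps = 500, 20_000
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    A = np.sqrt(n) * (xbar - 1.0)
    B = xbar
    print(kstest(A + B, "norm", args=(1.0, 1.0)))   # KS distance is small (it vanishes as n grows)
    print(kstest(A * B, "norm", args=(0.0, 1.0)))   # KS distance is small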

3 The Law of Large Numbers

The law of large numbers (LLN) says that the mean of a large sample is close to the mean of the distribution. For example, the proportion of heads in a large number of tosses of a fair coin is expected to be close to 1/2. We now make this more precise. Let $X_1, X_2, \ldots$ be an iid sample, let $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$. Recall that the sample mean is defined as $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$ and that $E(\bar{X}_n) = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$.

Theorem 11 (The Weak Law of Large Numbers (WLLN)) If $X_1, \ldots, X_n$ are iid, then $\bar{X}_n \xrightarrow{P} \mu$. Thus, $\bar{X}_n - \mu = o_P(1)$.

Interpretation of the WLLN: The distribution of $\bar{X}_n$ becomes more concentrated around $\mu$ as $n$ gets large.

Proof. Assume that $\sigma^2 < \infty$. This is not necessary but it simplifies the proof. Using Chebyshev's inequality,
$$P\left(|\bar{X}_n - \mu| > \varepsilon\right) \leq \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2},$$
which tends to $0$ as $n \to \infty$. $\square$

Theorem 12 (The Strong Law of Large Numbers) Let $X_1, \ldots, X_n$ be iid with mean $\mu$. Then $\bar{X}_n \xrightarrow{a.s.} \mu$. The proof is beyond the scope of this course.
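The WLLN can be seen numerically with a running mean; here is a sketch using Poisson(3) data, an arbitrary choice (so $\mu = 3$).

    import numpy as np

    # WLLN numerically: the running mean of iid Poisson(3) draws settles down near mu = 3.
    rng = np.random.default_rng(6)
    y = rng.poisson(3.0, size=100_000)
    running_mean = np.cumsum(y) / np.arange(1, y.size + 1)
    for n in (10, 100, 10_000, 100_000):
        print(n, running_mean[n - 1])                        # approaches 3 as n grows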


4 The Central Limit Theorem

The law of large numbers says that the distribution of $\bar{X}_n$ piles up near $\mu$. This isn't enough to help us approximate probability statements about $\bar{X}_n$. For this we need the central limit theorem.

Suppose that $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$. The central limit theorem (CLT) says that $\bar{X}_n = n^{-1}\sum_i X_i$ has a distribution which is approximately Normal with mean $\mu$ and variance $\sigma^2/n$. This is remarkable since nothing is assumed about the distribution of $X_i$, except the existence of the mean and variance.

Theorem 13 (The Central Limit Theorem (CLT)) Let $X_1, \ldots, X_n$ be iid with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then
$$Z_n \equiv \frac{\bar{X}_n - \mu}{\sqrt{\mathrm{Var}(\bar{X}_n)}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \rightsquigarrow Z$$
where $Z \sim N(0,1)$. In other words,
$$\lim_{n\to\infty} P(Z_n \leq z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.$$

Interpretation: Probability statements about $\bar{X}_n$ can be approximated using a Normal distribution. It's the probability statements that we are approximating, not the random variable itself.

Remark: We often write
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
as short form for
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \rightsquigarrow N(0,1).$$
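Here is a simulation sketch of the CLT with (skewed) exponential(1) data, comparing the empirical distribution of $Z_n$ with $\Phi$; the sample size $n = 50$ and the seed are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    # CLT numerically: even for skewed exponential(1) data (mu = sigma = 1), the standardized
    # mean Z_n = sqrt(n) * (xbar - mu) / sigma is close to N(0, 1) for moderate n.
    rng = np.random.default_rng(7)
    n, reps = 50, 100_000
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    zn = np.sqrt(n) * (xbar - 1.0) / 1.0
    for z in (-1.0, 0.0, 1.0, 2.0):
        print(z, np.mean(zn <= z), norm.cdf(z))              # empirical P(Z_n <= z) vs Phi(z)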

Recall that if $X$ is a random variable, its moment generating function (mgf) is $\psi_X(t) = E e^{tX}$. Assume in what follows that the mgf is finite in a neighborhood around $t = 0$.

Lemma 14 Let $Z_1, Z_2, \ldots$ be a sequence of random variables. Let $\psi_n$ be the mgf of $Z_n$. Let $Z$ be another random variable and denote its mgf by $\psi$. If $\psi_n(t) \to \psi(t)$ for all $t$ in some open interval around $0$, then $Z_n \rightsquigarrow Z$.

Proof of the central limit theorem. Let $Y_i = (X_i - \mu)/\sigma$. Then $Z_n = n^{-1/2}\sum_i Y_i$. Let $\psi(t)$ be the mgf of $Y_i$. The mgf of $\sum_i Y_i$ is $(\psi(t))^n$ and the mgf of $Z_n$ is $[\psi(t/\sqrt{n})]^n \equiv \xi_n(t)$. Now $\psi'(0) = E(Y_1) = 0$ and $\psi''(0) = E(Y_1^2) = \mathrm{Var}(Y_1) = 1$. So,
$$\psi(t) = \psi(0) + t\psi'(0) + \frac{t^2}{2!}\psi''(0) + \frac{t^3}{3!}\psi'''(0) + \cdots = 1 + 0 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots = 1 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots$$
Now,
$$\xi_n(t) = \left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 + \frac{t^2}{2n} + \frac{t^3}{3!\, n^{3/2}}\psi'''(0) + \cdots\right]^n = \left[1 + \frac{\frac{t^2}{2} + \frac{t^3}{3!\, n^{1/2}}\psi'''(0) + \cdots}{n}\right]^n \to e^{t^2/2},$$
which is the mgf of a $N(0,1)$. The result follows from Lemma 14. In the last step we used the fact that if $a_n \to a$ then
$$\left(1 + \frac{a_n}{n}\right)^n \to e^a. \qquad \square$$

The central limit theorem tells us that $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ is approximately $N(0,1)$. However, we rarely know $\sigma$. We can estimate $\sigma^2$ from $X_1, \ldots, X_n$ by
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
This raises the following question: if we replace $\sigma$ with $S_n$, is the central limit theorem still true? The answer is yes.

Theorem 15 Assume the same conditions as the CLT. Then,
$$T_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{S_n} \rightsquigarrow N(0,1).$$

Proof. Here is a brief proof. We have that $T_n = Z_n W_n$ where
$$Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \quad \text{and} \quad W_n = \frac{\sigma}{S_n}.$$
Now $Z_n \rightsquigarrow N(0,1)$ and $W_n \xrightarrow{P} 1$. The result follows from Slutzky's theorem. $\square$
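Here is a quick simulation check of Theorem 15 (a sketch); the Uniform(0,1) population, $n = 100$, and the seed are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    # Theorem 15 numerically: replacing sigma by S_n still gives an approximate N(0, 1).
    rng = np.random.default_rng(8)
    n, reps = 100, 100_000
    x = rng.uniform(size=(reps, n))                          # Uniform(0,1), so mu = 1/2
    tn = np.sqrt(n) * (x.mean(axis=1) - 0.5) / x.std(axis=1, ddof=1)
    for z in (-1.0, 0.0, 1.0):
        print(z, np.mean(tn <= z), norm.cdf(z))              # empirical P(T_n <= z) vs Phi(z)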

Here is an extended proof.

Step 1. We first show that $R_n^2 \xrightarrow{P} \sigma^2$ where
$$R_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
Note that
$$R_n^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2.$$
Define $Y_i = X_i^2$. Then, using the LLN (law of large numbers),
$$\frac{1}{n}\sum_{i=1}^n X_i^2 = \frac{1}{n}\sum_{i=1}^n Y_i \xrightarrow{P} E(Y_i) = E(X_i^2) = \mu^2 + \sigma^2.$$
Next, by the LLN,
$$\frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu.$$
Since $g(t) = t^2$ is continuous, the continuous mapping theorem implies that
$$\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 \xrightarrow{P} \mu^2.$$
Thus
$$R_n^2 \xrightarrow{P} (\mu^2 + \sigma^2) - \mu^2 = \sigma^2.$$

Step 2. Note that
$$S_n^2 = \left(\frac{n}{n-1}\right) R_n^2.$$
Since $R_n^2 \xrightarrow{P} \sigma^2$ and $n/(n-1) \to 1$, we have that $S_n^2 \xrightarrow{P} \sigma^2$.

Step 3. Since $g(t) = \sqrt{t}$ is continuous (for $t \geq 0$), the continuous mapping theorem implies that $S_n \xrightarrow{P} \sigma$.

Step 4. Since $g(t) = t/\sigma$ is continuous, the continuous mapping theorem implies that $S_n/\sigma \xrightarrow{P} 1$.

Step 5. Since $g(t) = 1/t$ is continuous (for $t > 0$), the continuous mapping theorem implies that $\sigma/S_n \xrightarrow{P} 1$. Since convergence in probability implies convergence in distribution, $\sigma/S_n \rightsquigarrow 1$.

Step 6. Note that
$$T_n = \left(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}\right)\left(\frac{\sigma}{S_n}\right) \equiv V_n W_n.$$
Now $V_n \rightsquigarrow Z$ where $Z \sim N(0,1)$ by the CLT, and we showed that $W_n \rightsquigarrow 1$. By Slutzky's theorem, $T_n = V_n W_n \rightsquigarrow Z \times 1 = Z$. $\square$

The next result is very important. It tells us how close the distribution of $\bar{X}_n$ is to the Normal distribution.

Theorem 16 (Berry-Esseen Theorem) Let $X_1, \ldots, X_n \sim P$. Let $\mu = E[X_i]$ and $\sigma^2 = \mathrm{Var}[X_i]$. Assume that $\mu_3 = E[|X_i - \mu|^3] < \infty$. Let
$$F_n(z) = P\left(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \leq z\right).$$
Then
$$\sup_z |F_n(z) - \Phi(z)| \leq \frac{33}{4}\, \frac{\mu_3}{\sigma^3 \sqrt{n}}.$$
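Here is a sketch that compares the Berry-Esseen bound above with a Monte Carlo estimate of $\sup_z |F_n(z) - \Phi(z)|$ for Bernoulli(0.2) data; the choices of $p$, $n$, the grid, and the seed are arbitrary, and the bound is quite loose in this case.

    import numpy as np
    from scipy.stats import norm

    # Berry-Esseen numerically: for Bernoulli(p) data, compare a Monte Carlo estimate of
    # sup_z |F_n(z) - Phi(z)| with the bound (33/4) * mu3 / (sigma^3 * sqrt(n)).
    rng = np.random.default_rng(9)
    p, n, reps = 0.2, 100, 50_000
    mu, sigma = p, np.sqrt(p * (1 - p))
    mu3 = p * (1 - p) ** 3 + (1 - p) * p ** 3                # E|X_i - mu|^3 for Bernoulli(p)

    zn = np.sqrt(n) * (rng.binomial(1, p, size=(reps, n)).mean(axis=1) - mu) / sigma
    grid = np.linspace(-4.0, 4.0, 801)
    emp = np.searchsorted(np.sort(zn), grid, side="right") / reps   # estimate of F_n on a grid
    print(np.max(np.abs(emp - norm.cdf(grid))))              # estimated sup distance
    print(33 / 4 * mu3 / (sigma**3 * np.sqrt(n)))            # Berry-Esseen bound (loose here)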

There is also a multivariate version of the central limit theorem. Recall that $X = (X_1, \ldots, X_k)^T$ has a multivariate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ if
$$f(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$
In this case we write $X \sim N(\mu, \Sigma)$.

Theorem 17 (Multivariate Central Limit Theorem) Let $X_1, \ldots, X_n$ be iid random vectors where $X_i = (X_{1i}, \ldots, X_{ki})^T$ with mean $\mu = (\mu_1, \ldots, \mu_k)^T$ and covariance matrix $\Sigma$. Let $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_k)^T$ where $\bar{X}_j = n^{-1}\sum_{i=1}^n X_{ji}$. Then,
$$\sqrt{n}(\bar{X} - \mu) \rightsquigarrow N(0, \Sigma).$$

Remark: There is also a multivariate version of the Berry-Esseen theorem, but it is more complicated than the one-dimensional version.

5 The Delta Method

If $Y_n$ has a limiting Normal distribution, then the delta method allows us to find the limiting distribution of $g(Y_n)$ where $g$ is any smooth function.

Theorem 18 (The Delta Method) Suppose that
$$\frac{\sqrt{n}(Y_n - \mu)}{\sigma} \rightsquigarrow N(0,1)$$
and that $g$ is a differentiable function such that $g'(\mu) \neq 0$. Then
$$\frac{\sqrt{n}(g(Y_n) - g(\mu))}{|g'(\mu)|\,\sigma} \rightsquigarrow N(0,1).$$
In other words,
$$Y_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{implies that} \quad g(Y_n) \approx N\left(g(\mu), (g'(\mu))^2\, \frac{\sigma^2}{n}\right).$$
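Here is a simulation sketch of the delta method for $g(s) = e^s$ (the case worked out in Example 19 below); the $N(0.5, 1)$ population, $n = 200$, and the seed are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    # Delta method numerically: with N(mu, sigma^2) data, W_n = exp(xbar) should be
    # approximately N(exp(mu), exp(2*mu) * sigma^2 / n).
    rng = np.random.default_rng(10)
    mu, sigma, n, reps = 0.5, 1.0, 200, 50_000
    wn = np.exp(rng.normal(mu, sigma, size=(reps, n)).mean(axis=1))
    approx_mean, approx_sd = np.exp(mu), np.exp(mu) * sigma / np.sqrt(n)
    for q in (0.1, 0.5, 0.9):
        print(q, np.quantile(wn, q), norm.ppf(q, approx_mean, approx_sd))   # close quantiles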

Example 19 Let $X_1, \ldots, X_n$ be iid with finite mean $\mu$ and finite variance $\sigma^2$. By the central limit theorem, $\sqrt{n}(\bar{X}_n - \mu)/\sigma \rightsquigarrow N(0,1)$. Let $W_n = e^{\bar{X}_n}$. Thus, $W_n = g(\bar{X}_n)$ where $g(s) = e^s$. Since $g'(s) = e^s$, the delta method implies that $W_n \approx N(e^{\mu}, e^{2\mu}\sigma^2/n)$.

There is also a multivariate version of the delta method.

Theorem 20 (The Multivariate Delta Method) Suppose that $Y_n = (Y_{n1}, \ldots, Y_{nk})$ is a sequence of random vectors such that
$$\sqrt{n}(Y_n - \mu) \rightsquigarrow N(0, \Sigma).$$
Let $g : \mathbb{R}^k \to \mathbb{R}$ and let
$$\nabla g(y) = \begin{pmatrix} \frac{\partial g}{\partial y_1} \\ \vdots \\ \frac{\partial g}{\partial y_k} \end{pmatrix}.$$
Let $\nabla_\mu$ denote $\nabla g(y)$ evaluated at $y = \mu$ and assume that the elements of $\nabla_\mu$ are nonzero. Then
$$\sqrt{n}\left(g(Y_n) - g(\mu)\right) \rightsquigarrow N\left(0, \nabla_\mu^T \Sigma \nabla_\mu\right).$$

Example 21 Let
$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}$$
be iid random vectors with mean $\mu = (\mu_1, \mu_2)^T$ and covariance matrix $\Sigma$. Let
$$\bar{X}_1 = \frac{1}{n}\sum_{i=1}^n X_{1i}, \qquad \bar{X}_2 = \frac{1}{n}\sum_{i=1}^n X_{2i}$$
and define $Y_n = \bar{X}_1 \bar{X}_2$. Thus, $Y_n = g(\bar{X}_1, \bar{X}_2)$ where $g(s_1, s_2) = s_1 s_2$. By the central limit theorem,
$$\sqrt{n}\begin{pmatrix} \bar{X}_1 - \mu_1 \\ \bar{X}_2 - \mu_2 \end{pmatrix} \rightsquigarrow N(0, \Sigma).$$

Now
$$\nabla g(s) = \begin{pmatrix} \frac{\partial g}{\partial s_1} \\ \frac{\partial g}{\partial s_2} \end{pmatrix} = \begin{pmatrix} s_2 \\ s_1 \end{pmatrix}$$
and so
$$\nabla_\mu^T \Sigma \nabla_\mu = \begin{pmatrix} \mu_2 & \mu_1 \end{pmatrix} \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \begin{pmatrix} \mu_2 \\ \mu_1 \end{pmatrix} = \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}.$$
Therefore,
$$\sqrt{n}\left(\bar{X}_1 \bar{X}_2 - \mu_1\mu_2\right) \rightsquigarrow N\left(0,\; \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}\right).$$
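Here is a simulation sketch of Example 21: the product of the two sample means should be approximately Normal with the asymptotic variance computed above. The bivariate Normal population, its parameters, $n = 200$, and the seed are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    # Example 21 numerically: X1bar * X2bar is approximately
    # N(mu1*mu2, (mu2^2*s11 + 2*mu1*mu2*s12 + mu1^2*s22) / n).
    rng = np.random.default_rng(11)
    mu_vec = np.array([1.0, 2.0])
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    n, reps = 200, 20_000

    X = rng.multivariate_normal(mu_vec, Sigma, size=(reps, n))   # shape (reps, n, 2)
    yn = X.mean(axis=1).prod(axis=1)                             # X1bar * X2bar in each replication

    mu1, mu2 = mu_vec
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    avar = mu2**2 * s11 + 2 * mu1 * mu2 * s12 + mu1**2 * s22
    print(yn.mean(), mu1 * mu2)                                  # close to mu1 * mu2
    print(n * yn.var(), avar)                                    # close to the asymptotic variance
    print(np.mean(np.sqrt(n) * (yn - mu1 * mu2) <= 1.0), norm.cdf(1.0 / np.sqrt(avar)))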
