MATH 526 LECTURE NOTES

TERRY SOO

Abstract. These notes are for Math 526 (a first course in probability and statistics for students with the power of first year Calculus), following the textbook Probability and Statistics for Engineers and Scientists, by Walpole, Myers, Myers, and Ye. They are not a substitute for reading the book or attending class.

1. Chapter 1

1.1. Introduction. We will try to get an idea of what statistics is about with the following examples. A population is a collection of individuals or items of a particular type. For example, we could be interested in the population of KU students or the population of KState students. Suppose we are interested in knowing which students are taller. The population mean of the heights of a population is given by summing all the heights and then dividing by the number of students. One way of interpreting this question is to compare the population means of the heights of the two populations. However, it may be too hard or undesirable to actually measure the heights of all the students of both populations; the best we may be able to do is take a sample; that is, just measure the heights of a subset of the students. What subsets should one take? A simple way to sample is to take a simple random sample, where each individual is equally likely to be sampled. So say we sample n students from each of the populations and obtain the height data x_1, ..., x_n and y_1, ..., y_n, giving the heights of the sampled KU and KState students. We compute the sample mean of the heights of the sampled KU students, given by
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \]
and we similarly compute ȳ. But how close are these sample means to the population means? How certain can we be that these estimators for the population means tell us anything about them? My favourite examples in probability and statistics usually involve flipping coins and rolling dice. Suppose we are given a coin and want to determine whether or not it is fair.


Let's assign a value of 1 to heads and 0 to tails. In this case, we imagine the population to be the set of all coin flips that is generated by the coin, and a sample is given by flipping the coin a finite number of times. Suppose we obtain the values x_1, ..., x_n. Again, we can take the sample mean x̄ and hope that this tells us something about the true probability that a coin flip comes up heads. By the end of this course we will have a quantitative way of saying how good these estimators are.

1.2. Standard deviation. Let's return to the heights of the KU students. What if you for some reason knew that most KU students are roughly the same height? Would you be more confident that the sample average would be a good estimator for the population mean? One way to measure the spread of sample numerical values x_1, ..., x_n is given by the sample standard deviation. The sample variance is given by
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \]
where x̄ is the sample mean, and the sample standard deviation is given by s = \sqrt{s^2}. Hopefully, we will discuss why n − 1 appears instead of n later in the course.

Example done in class 1.1. Let L(z) = az + b. Given data x_1, ..., x_n, consider the transformed data y_1, ..., y_n, where y_i = L(x_i). Think of converting from Celsius to Fahrenheit for a concrete example. Find the relation between s_x, the sample standard deviation of x_1, ..., x_n, and s_y, the sample standard deviation of y_1, ..., y_n.

Example done in class 1.2. Let f(z) := \sum_{i=1}^{n} (z - x_i)^2. Find the minimum of f. Use the power of Calculus.

Exercise 1.3. Prove the so-called short-cut formula:
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Big( \sum_{i=1}^{n} x_i \Big)^2. \]
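(The following short Python sketch is not part of the original notes; the height values are invented purely for illustration. It computes the sample mean and the sample standard deviation with the n − 1 denominator, and checks the short-cut formula of Exercise 1.3 numerically.)

```python
# Sample mean, sample variance (with the n - 1 denominator), and a numerical
# check of the short-cut formula.  The height data below is made up.
import math

heights = [170.0, 182.5, 165.0, 175.5, 168.0, 179.0]   # hypothetical data
n = len(heights)

xbar = sum(heights) / n                                # sample mean
s2 = sum((x - xbar) ** 2 for x in heights) / (n - 1)   # sample variance
s = math.sqrt(s2)                                      # sample standard deviation

# Short-cut formula: sum (x_i - xbar)^2 = sum x_i^2 - (1/n)(sum x_i)^2
lhs = sum((x - xbar) ** 2 for x in heights)
rhs = sum(x ** 2 for x in heights) - (sum(heights) ** 2) / n

print(xbar, s2, s)
print(abs(lhs - rhs) < 1e-6)   # should print True (up to rounding error)
```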

Exercise 1.4. Consider m := min{x_1, ..., x_n} and M := max{x_1, ..., x_n}, so that m ∈ {x_1, ..., x_n} and m ≤ x_i for all i = 1, 2, ..., n, and M ∈ {x_1, ..., x_n} and M ≥ x_i for all i = 1, 2, ..., n. (For example, min{2, 6, 5} = 2 and max{2, 6, 5} = 6.)
(a) Show that m ≤ x̄ ≤ M.


(b) Show that
\[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \le (M - m)^2. \]

Exercise 1.5 (Markov's inequality for sample data). Let x_1, ..., x_n be nonnegative real numbers. Let a ≥ 0. Let
\[ \mathrm{Prop}(x \ge a) = \frac{\text{the number of } x_i \ge a}{n}. \]
Show that a · Prop(x ≥ a) ≤ x̄.

Exercise 1.6 (Chebyshev's inequality for sample data). Let x_1, ..., x_n ∈ R. Let a ≥ 0. Set
\[ \mathrm{Prop}(|x - \bar{x}| \ge a) = \frac{\text{the number of } |x_i - \bar{x}| \ge a}{n}. \]
Show that a² · Prop(|x − x̄| ≥ a) ≤ s². Let k ≥ 0. By choosing a = sk, show that
\[ \mathrm{Prop}(|x - \bar{x}| \ge ks) \le \frac{1}{k^2}. \]

1.3. Method of least squares.

Exercise 1.7 (Some definitions). Let x_1, ..., x_n, y_1, ..., y_n ∈ R. Define
\[ \overline{xy} = \frac{1}{n} \sum_{i=1}^{n} x_i y_i, \qquad \overline{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2, \]
\[ \mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \]
Note that in the definitions of var and cov we do use the denominator n.
(a) Use Exercise 1.3 to show that
\[ \mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - (\bar{x})^2 = \overline{x^2} - (\bar{x})^2. \]


(b) Prove a short-cut type formula to obtain that cov(x, y) = \overline{xy} − x̄ · ȳ.

Exercise 1.8 (Equation of a line). Find the equation of a line in R² that passes through the points (1, 7) and (2, 9).

Exercise 1.9. Let x_1, ..., x_n, y_1, ..., y_n ∈ R. Consider the line given by f(t) = mt + b. Let
\[ g(m, b) = \sum_{i=1}^{n} (f(x_i) - y_i)^2 = \sum_{i=1}^{n} (m x_i + b - y_i)^2. \]
Show that the function g is minimized at the point (\hat{m}, \hat{b}), where \hat{m} = cov(x, y)/var(x) and \hat{b} = ȳ − \hat{m} x̄.
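(A small Python sketch, not in the original notes, that evaluates the least-squares formulas m̂ = cov(x, y)/var(x) and b̂ = ȳ − m̂x̄ from Exercise 1.9 on invented data.)

```python
# Least-squares slope and intercept from Exercise 1.9.
# The (x, y) pairs below are invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n
var_x = sum((x - xbar) ** 2 for x in xs) / n                     # denominator n, as in Exercise 1.7
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n

m_hat = cov_xy / var_x          # slope of the least-squares line
b_hat = ybar - m_hat * xbar     # intercept of the least-squares line
print(m_hat, b_hat)
```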

2. Chapter 2.1, 2.2

2.1. Sample spaces. We will now build a mathematical framework that will model random events. By an experiment we mean any process that generates a set of data. Call the set of all possible outcomes of a statistical experiment a sample space; sometimes it is denoted by S, 𝒮, or Ω. In the simple experiment where we toss a coin, we can take S = {H, T} or S = {0, 1}. In another experiment, where we toss a coin until it lands heads, we can take S = {H, TH, TTH, TTTH, ...}. If we are wondering how much longer it will take the bus to arrive, we can take S = {t ∈ R : t ≥ 0}.

An event is a subset of a sample space. That is, E is an event of a sample space S if every element of E is an element of S; in this case, we write E ⊆ S. Note that the whole set S and the empty or null set ∅ are always events. In the simple experiment where we roll a dice and then toss a coin, we can take S = {(i, j) : i = 1, 2, 3, 4, 5, 6, j = 0, 1}, and examples of events are given by {(6, 0), (6, 1)} and {(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)}; in words, these are the events that the dice rolled 6, and that the coin came up tails, respectively.


2.2. Set operations. In dealing with events, it will be necessary to acquaint ourselves with set notions. By x ∈ E, we mean that x is a member of the set E (sometimes we also say that x is an element of E). Two sets E and F are equal whenever they have exactly the same elements. Thus if E = {1, 2, 3} and F = {3, 2, 1}, then F = E. Let's assume we are working in the sample space S. If A ⊆ S, then by the complement of A with respect to S, we mean the set A′ = {x ∈ S : x ∉ A}; this is the set of all elements of S that are not in A, and it is also sometimes denoted by A^c. The intersection of two events A and B is the set A ∩ B = {x ∈ S : x ∈ A and x ∈ B}. The union of two events A and B is the set A ∪ B = {x ∈ S : x ∈ A or x ∈ B}.

Example done in class 2.1. If S_1 and S_2 are sample spaces, we can create another sample space by taking S_1 × S_2 = {(s_1, s_2) : s_1 ∈ S_1, s_2 ∈ S_2}. Use this notation to write sample spaces for the following experiments: flip a coin three times; flip a coin and then roll a dice.

Example done in class 2.2. Let A and B be events of a sample space S, where A ⊆ B ⊆ S. Show that B′ ⊆ A′.

We will now verify one of De Morgan's laws.

Example done in class 2.3 (De Morgan). Let A and B be events of a sample space S. Show that
(1) (A ∩ B)′ = A′ ∪ B′ and
(2) (A ∪ B)′ = A′ ∩ B′.

Solution. We will verify (half of) the first identity (1). Note that in order to verify a set identity such as E = F, it suffices to check that E ⊆ F and F ⊆ E. Let x ∈ A′ ∪ B′; then x is in A′ or B′. Note that A ∩ B ⊆ A, thus Exercise 2.2 gives that A′ ⊆ (A ∩ B)′. If x ∈ A′, then we have x ∈ (A ∩ B)′. Similarly, if x ∈ B′, then x ∈ (A ∩ B)′. Thus in both cases, x ∈ (A ∩ B)′; hence, A′ ∪ B′ ⊆ (A ∩ B)′. It remains to verify that (A ∩ B)′ ⊆ A′ ∪ B′. Do this for homework.

Exercise 2.4. Let S = {a, b, c, d}. List all the subsets of S that have exactly three elements.

3. Chapter 2.3

3.1. Counting techniques. Often we want to say that all the outcomes of an experiment are equally likely to happen. In order to do this, we need to know how to count the number of elements in a sample space.


The most basic rule is the multiplication rule, which states that if A and B are finite sets with cardinalities |A| = n and |B| = m, then |A × B| = nm. Another way of stating this is that if the first element of an ordered pair can be chosen in n ways, and the second element of an ordered pair can be chosen in m ways, then the total number of ordered pairs is nm.

Given, say, 3 distinct objects {a_1, a_2, a_3}, a permutation is an ordered arrangement of them; for example, (a_1, a_2, a_3) or (a_2, a_1, a_3).

Example done in class 3.1. List all the possible permutations of the above example.

We can use the multiplication rule to count the number of possible permutations of n objects; there are n choices for the first entry, then there are n − 1 choices for the second entry, and so on, so that there are n! = n(n − 1)(n − 2) ··· 2 · 1 permutations. We take by definition that 0! = 1.

Given n distinct objects {a_1, ..., a_n}, suppose we want to partition them into 2 sets of size k and n − k (where we don't care about order). We can think of this as choosing k objects from n without regard to order; such a selection is called a combination. We can count the number of ways to do this via the following argument. Every permutation, say (a_1, ..., a_k, a_{k+1}, ..., a_n) or (a_2, a_1, ..., a_k, a_{k+1}, ..., a_n), yields a partition into 2 sets, by putting the first k coordinates into one set and putting the last n − k elements into another. However, in the examples above, some of them give the same partition. For each partition, there are exactly k! · (n − k)! permutations that give that same partition; thus the answer is
\[ \binom{n}{k} = \binom{n}{k, n-k} = \frac{n!}{k!(n-k)!}. \]
Similarly, we can derive a formula to count the number of partitions of n distinct objects into r sets of sizes n_1, ..., n_r, where n_1 + ··· + n_r = n. We obtain that the number is
\[ \binom{n}{n_1, \ldots, n_r} = \frac{n!}{n_1! \cdots n_r!}. \]

Example done in class 3.2. In Powerball, a lottery, 5 white balls are drawn from 59 numbered balls, and 1 red ball is drawn from 35 numbered red balls. The jackpot is won by guessing correctly all the balls that are drawn (you don’t have to guess the order). How many different choices are there for the player?
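(A quick numerical aside, not in the original notes: the count in Example 3.2 is a product of two combinations, which Python's math.comb can evaluate.)

```python
# Counting the Powerball choices from Example 3.2: choose 5 of the 59 white
# balls (order does not matter) and 1 of the 35 red balls.
from math import comb

n_choices = comb(59, 5) * comb(35, 1)
print(n_choices)   # 175,223,510
```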


Example done in class 3.3 (Binomial formula). Show that
\[ (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}, \]
and then use this to show that a set of size n has 2^n subsets.

Exercise 3.4 (Pascal's triangle identity). Let 0 ≤ k ≤ n. Let A = {1, ..., n} be a set of size n.
(a) How many subsets of A with size k are there?
(b) How many subsets of A with size k are there that contain the element 1?
(c) How many subsets of A with size k are there that do not contain the element 1?
(d) Show that
\[ \binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}. \]

4. Chapter 2.3, 2.4

Exercise 4.1. What is wrong with the following argument? Let A and B be disjoint sets, both of size 5. The set A has \binom{5}{3} subsets of size 3, and ditto for B. So, the set A ∪ B has size 10, and has \binom{5}{3} × \binom{5}{3} subsets of size 6. Thus \binom{5}{3}^2 = \binom{10}{6}.

4.1. Poker hands. A poker hand is a set of five cards from the deck of 52 standard playing cards. In a standard deck of playing cards (sometimes called a French deck), there are 13 ranks (A, 2, 3, ..., 10, J, Q, K), and for each rank there are four suits: ♦-diamond, ♥-heart, ♣-club, and ♠-spade.

Example done in class 4.2 (2 pair). A two-pair is of the form (aa)(bb)c, where a, b, c are cards of distinct rank. Thus we do not count a four-of-a-kind as a two-pair! An example of a two-pair would be {4♦, 4♥, K♣, K♠, A♥}.
(1) What is the total number of poker hands?
(2) What is the total number of two-pairs?

Solution. There are \binom{52}{5} poker hands. To count the number of two-pairs, we first choose two ranks from the 13 ranks; then for each of the two ranks, we need to choose 2 suits from the 4 possible suits; then we still need to pick one card to be the non-paired card. To do this, we note that there are 11 remaining ranks to choose from, and for each rank


there are four possible suits. Thus we have that the total number of two-pairs is given by
\[ \binom{13}{2} \binom{4}{2} \binom{4}{2} (11)(4). \]

Exercise 4.3 (1 pair). How many one-pair poker hands are there? Two-pairs, three-of-a-kinds, etc., do not count as a one-pair. A one-pair is of the form (aa)(bcd), where a, b, c, d are of distinct rank.

Exercise 4.4 (3-of-a-kind, a triple). How many 3-of-a-kind poker hands are there? A three-of-a-kind is of the form (aaa)(bc), where a, b, c are of distinct rank.

Exercise 4.5 (Straights, including straight flushes, and royal flushes). Let us say that a straight is of the form abcde, where all the cards are of distinct rank, and can be arranged in increasing order. For the purposes of order, an ace can count as a 1 or a 14, and J = 11, Q = 12, and K = 13. How many straights are there? (Sometimes by a straight, we mean to exclude the cases where the cards abcde all have the same suit, but we will allow these for this exercise.)

4.2. The axioms of probability. In order to do probability on a sample space S, we have to assign a number in [0, 1] to events or subsets of S. We require the following rules on a set function P; these are Kolmogorov's axioms for probability theory. For all events A, we require P(A) ∈ [0, 1], and we also require P(∅) = 0 and P(S) = 1. Another reasonable requirement on P is that if two events A and B are such that A ∩ B = ∅ (disjoint), then P(A) + P(B) = P(A ∪ B). In fact, we will need the stronger requirement that if A_1, A_2, ... is a sequence of mutually exclusive events (that is, A_i ∩ A_j = ∅ if i ≠ j), then
\[ P\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} P(A_i). \]
(This last axiom is referred to as countable additivity.) Sometimes a sample space, together with its events and a set function P, is called a probability space. In many ways the P function behaves like an area function that gives the standard Euclidean area of subsets of a unit square.

4.3. The case of equal probabilities. One way to define such a set function P on a finite sample space S is to assign equal probabilities to each element of the sample space; this then leads us to set P(E) = |E|/|S| for an event E, where |E| is the number of elements in E.
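(Another numerical aside, not in the original notes: combining the two-pair count of Example 4.2 with the equal-probability rule P(E) = |E|/|S| gives the probability of being dealt a two-pair.)

```python
# Probability of a two-pair poker hand: |E| / |S| with the count from Example 4.2.
from math import comb

two_pairs = comb(13, 2) * comb(4, 2) * comb(4, 2) * 11 * 4   # |E|
all_hands = comb(52, 5)                                       # |S|
print(two_pairs, all_hands, two_pairs / all_hands)            # probability ≈ 0.0475
```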


Example done in class 4.6. Find the probability that when two fair dice are rolled the sum will be seven. How about seven or eight?

Solution. Let S = {1, 2, 3, 4, 5, 6}², and let P assign equal probabilities to each of the outcomes. Note that |S| = 6² = 36. The event A = {(1, 6), (6, 1), (3, 4), (4, 3), (2, 5), (5, 2)} is exactly when the sum of the two dice is 7. Since |A| = 6, we have that P(A) = 1/6. The event B = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} is exactly when the sum of the two dice is 8. Clearly A ∩ B = ∅, thus P(A ∪ B) = P(A) + P(B) = 6/36 + 5/36 = 11/36.

Exercise 4.7. In a well-mixed deck of standard playing cards, what is the probability that I am dealt a 4-of-a-kind, when I am dealt 5 cards? A 4-of-a-kind is a hand of the form (aaaa)b, where a and b are of distinct rank. See Section 4.1 for terminology.

4.4. Arbitrary probabilities. Consider a sample space with a countable number of elements; that is, the elements of S can be put into a sequence S = {s_1, s_2, ...}. We can check that if (p_i)_{i=1}^∞ is a sequence of non-negative numbers such that ∑_{i=1}^∞ p_i = 1, then for an event E = {s_{i_1}, s_{i_2}, ...}, if we set
\[ P(E) = \sum_{j=1}^{\infty} p_{i_j}, \]
then Kolmogorov's axioms are satisfied.

Example done in class 4.8 (Poisson process). Set S = {0, 1, 2, 3, ...}. Check that if p_i = e^{−1}/i!, then by assigning probabilities p_i to i ∈ S, Kolmogorov's axioms are satisfied; in other words, check that the p_i are positive and sum to 1.

Exercise 4.9. Let p ∈ (0, 1). Consider the probability space for two coin flips given by S = {HH, HT, TH, TT} with P given by
P(HH) = p², P(HT) = P(TH) = p(1 − p), P(TT) = (1 − p)².
Check that the probabilities do indeed sum to 1 (for any p ∈ (0, 1)). Write down the event E that the first flip comes up heads. What is the probability of this event (in the case that p = 3/4)?


5. Chapter 2.5, 2.6

5.1. Some formulas. Given the axioms of probability and standard set identities, we can deduce some useful formulas for computing probabilities. Let P be a probability on a sample space S.

Example done in class 5.1. If A is an event, then P(A) + P(A′) = P(S) = 1.

Example done in class 5.2. Suppose we flip a fair coin three times. What is the probability that there will be at least one head?

Exercise 5.3. For any events A and B, we have A = (A ∩ B) ∪ (A ∩ B′) and A ∪ B = A ∪ (B ∩ A′), where A is disjoint from B ∩ A′.

Example done in class 5.4 (Inclusion-exclusion). If A and B are events, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Solution. We rewrite A ∪ B as a disjoint union, so that A ∪ B = A ∪ (B ∩ A′) and A is disjoint from B ∩ A′. Thus P(A ∪ B) = P(A) + P(B ∩ A′). We also have that P(B ∩ A′) + P(B ∩ A) = P(B), so some algebra yields the required result.

Exercise 5.5 (Monotonicity). For events A, B with A ⊆ B, show that P(A) ≤ P(B).

Solution. Rewrite B = A ∪ (B ∩ A′), so that P(B) = P(A) + P(B ∩ A′).

Example done in class 5.6. Assume that our Math 526 class contains 35 students. If 10 students are enrolled in an English course, 15 students are enrolled in a Spanish course, and 20 students are enrolled in an English or Spanish course, then what is the probability that a randomly selected student will be enrolled in both an English and a Spanish course? (Here we mean that each student is selected with equal probability.)

Exercise 5.7 (Inclusion-exclusion, counting). Suppose that S is a finite set; that is, |S| = N, for some integer N > 0. Also assume that P({s}) = 1/N for all s ∈ S; that is, each element of S has equal weight under P. Use Exercise 5.4 to recover the standard inclusion-exclusion formula:
|A ∪ B| = |A| + |B| − |A ∩ B|.


Exercise 5.8 (Inclusion-exclusion for three sets). Use Exercise 5.4 to prove the following inclusion-exclusion formula in the case of three sets A, B, C ⊆ S:
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

5.2. Conditional probabilities. Suppose I roll a fair dice with values {1, 2, 3, 4, 5, 6}. If I told you that the result was even, then what is the probability that the result was a 4? In some sense, what has happened is that, with the new information, we should consider a reduced sample space, with possibly different probabilities. The new sample space should be {2, 4, 6}, with each outcome having equal probability; another way is to just use the same sample space {1, 2, 3, 4, 5, 6}, but assign zero probability to 1, 3, 5, and equal probabilities to 2, 4, 6. One way to formalize this is to use conditional probabilities. Let A, B be events, with P(B) > 0. Define
\[ P(A|B) = \frac{P(A \cap B)}{P(B)}. \]

Exercise 5.9. Let P(B) > 0. Check that the set function Q(A) = P(A|B) satisfies Kolmogorov's axioms.

Example done in class 5.10. Consider a fair dice roll with values {1, 2, 3, 4, 5, 6}. Let B be the event that the result is even. Let A be the event that the roll is a 2. Compute P(A|B).

Solution. Note that P(B) = 1/2, and A ∩ B = {2}. Thus P(A|B) = (1/6)/(1/2) = 1/3.

Example done in class 5.11. Suppose we do the following two-step experiment. I flip a fair coin with values {h, t}; if it comes up heads, then I roll a fair dice with values {1, 2, 3, 4, 5, 6}; otherwise, if it comes up tails, then I flip another fair coin with values {0, 1}.
(1) What is a sample space for this experiment?
(2) What is the probability that we have a head and a 1?
(3) What is the probability that we have a tail and a 1?
(4) What is the probability that we have a 1?

Solution. Let H be the event that the first coin is heads, let T be the event that it is tails, and let U be the event that the final result is 1.
(1) We can take S = {(h, 1), (h, 2), (h, 3), (h, 4), (h, 5), (h, 6), (t, 0), (t, 1)}.


(2) We know that if the first coin flip comes up heads, then the probability is 1/6; in other words, P(U|H) = 1/6. Thus P(U ∩ H) = P(U|H)P(H) = (1/6)(1/2) = 1/12.
(3) We know that if the first coin flip comes up tails, then the probability is 1/2; in other words, P(U|T) = 1/2. Thus P(U ∩ T) = P(U|T)P(T) = (1/2)(1/2) = 1/4.
(4) Note that P(U) = P(U ∩ H) + P(U ∩ T), since H′ = T. So, P(U) = 1/12 + 3/12 = 4/12 = 1/3.

6. Lecture 6, Chapter 2.6, 2.7

Exercise 6.1. Suppose you are dealt two cards from a standard (well-mixed) 52 card French deck. By counting, compute the probability that you will have at least one ace. Try this problem again using conditional probabilities.

6.1. Stochastic independence. Two events A and B are independent if P(A ∩ B) = P(A)P(B). In the case that P(B) > 0, it is easy to check that this is equivalent to the condition that P(A|B) = P(A). In other words, knowing B does not affect the probability of A.

Example done in class 6.2. Check that the two definitions are equivalent.

Solution. We checked in class that if A and B are independent, then P(A|B) = P(A) (for B with P(B) > 0); it remains to verify the other direction.

Exercise 6.3. Find the probability that when two independent six-sided dice are rolled you get two sixes. Find the probability that you will get doubles.

Exercise 6.4. Suppose two friends Britney and Christina go shopping. Let b_i be the probability that Britney buys i things, and let c_i be the probability that Christina buys i things. Suppose that b_i and c_i are given by: b_0 = 1/10, b_1 = 2/10, b_2 = 3/10, b_3 = 2/10, b_4 = 2/10, and c_0 = 3/10, c_1 = 3/10, c_2 = 2/10, c_3 = 2/10. Find the probability that Britney and Christina buy the same number of things, if their shopping habits are independent of each other.

Solution. Let B_i and C_i be the events that Britney and Christina buy i things, respectively. The event that they both buy i things is B_i ∩ C_i, which has probability P(B_i ∩ C_i) = P(B_i)P(C_i), since the events B_i and C_i are independent.


Let E be the event that they buy the same number of things. Then E is given by the disjoint union:
E = (B_0 ∩ C_0) ∪ (B_1 ∩ C_1) ∪ (B_2 ∩ C_2) ∪ (B_3 ∩ C_3).
So P(E) = b_0 c_0 + b_1 c_1 + b_2 c_2 + b_3 c_3.

Exercise 6.5. Check that if A and B are independent, then A and B′ are also independent.

Let A = {A_1, A_2, ...} be a collection of events. We say that the events are pairwise independent if P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j, and we say that they are independent or mutually independent if for every finite subset of events A_{i_1}, A_{i_2}, ..., A_{i_n} (where the i_j are distinct), we have that
\[ P\Big( \bigcap_{j=1}^{n} A_{i_j} \Big) = \prod_{j=1}^{n} P(A_{i_j}). \]

Exercise 6.6. Give an example of a sample space S with probabilities P, where A_1, A_2, A_3 are events such that they are pairwise independent, but not mutually independent.

Solution. Let S = {0, 1}² and let P assign equal probabilities to each element of S. Let A_1 := {(0, 0), (0, 1)}, A_2 := {(0, 0), (1, 0)}, and A_3 := {(0, 1), (1, 0)}. In words, A_1 is the event that the first flip is tails, A_2 is the event that the second flip is tails, and A_3 is the event that exactly one of the flips came out tails. Clearly, A_1 and A_2 are independent. It is easy to check that A_3 and A_1 are independent and that A_3 and A_2 are independent. However, A_1 ∩ A_2 ∩ A_3 = ∅, thus
0 = P(A_1 ∩ A_2 ∩ A_3) ≠ P(A_1)P(A_2)P(A_3) = (1/2)³.

Example done in class 6.7. Consider the experiment where I flip a coin, that is known to come up heads with probability 2/3, 7 times. Assume that the coin flips are independent.
(a) How many ways are there of getting exactly 3 heads in 7 tosses? (This is just a counting question; there is no probability involved.)
(b) What is the probability that I will get HHHTTTT; that is, 3 consecutive heads, then 4 consecutive tails?
(c) What is the probability that I will get TTTTHHH?
(d) What is the probability that I will get exactly three heads?
(e) To generalize, suppose instead that the coin is known to come up heads with probability p ∈ (0, 1), and I flip it n times. Find an expression for the probability that I will get exactly k heads (for 0 ≤ k ≤ n).


7. Chapter 2.6, 2.7

We will elaborate on the techniques used in Exercise 5.11.

7.1. Total probability. Suppose B_1, B_2, B_3 give a partition of a sample space S, so that B_1, B_2, B_3 are mutually exclusive, and their union is all of S. Given any event A, clearly it is given by the disjoint union
A = (A ∩ B_1) ∪ (A ∩ B_2) ∪ (A ∩ B_3),
thus
P(A) = P(A ∩ B_1) + P(A ∩ B_2) + P(A ∩ B_3).
We also know from the definition of conditional probabilities that if each of the B_i's has non-zero probability, then P(A ∩ B_i) = P(A|B_i)P(B_i), for each i = 1, 2, 3. Thus we obtain that
P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3).
Sometimes this is referred to as the rule of total probability. Sometimes we also want to compute P(B_i|A), and a bit of algebra gives the following formula, in the case i = 3:
\[ P(B_3 | A) = \frac{P(B_3 \cap A)}{P(A)} = \frac{P(A|B_3)P(B_3)}{P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3)}. \]
This is referred to as Bayes' theorem.

Exercise 7.1 (Two-Face). The DC comic book villain Two-Face often uses a coin to decide the fate of his victims. If the result of the flip is tails, then the victim is spared; otherwise the victim is killed. It turns out he actually randomly selects from three coins: a fair one, one that comes up tails 1/3 of the time, and another that comes up tails 1/10 of the time. What is the probability that a victim is spared?

Solution. Let Sp denote the event that the victim is spared, and let C_1 be the event that the fair coin is used, C_2 be the event that the coin that comes up tails 1/3 of the time is used, and C_3 denote the event that the coin that comes up tails 1/10 of the time is used. Then
P(Sp) = P(Sp|C_1)P(C_1) + P(Sp|C_2)P(C_2) + P(Sp|C_3)P(C_3) = (1/2)(1/3) + (1/3)(1/3) + (1/10)(1/3).

Exercise 7.2. In Exercise 7.1, what is the probability that Two-Face used the fair coin, given that the victim was spared?


Solution.
\[ P(C_1 | Sp) = \frac{P(C_1 \cap Sp)}{P(Sp)} = \frac{P(Sp|C_1)P(C_1)}{P(Sp)} = \frac{(1/2)(1/3)}{(1/2)(1/3) + (1/3)(1/3) + (1/10)(1/3)}. \]
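(A short numerical check, not in the original notes, of the Two-Face computations in Exercises 7.1 and 7.2, using the rule of total probability and Bayes' theorem with the values given above.)

```python
# Two-Face: each of the three coins is chosen with probability 1/3.
priors = [1/3, 1/3, 1/3]        # P(C1), P(C2), P(C3)
p_spared = [1/2, 1/3, 1/10]     # P(Sp | C1), P(Sp | C2), P(Sp | C3)

# Rule of total probability.
p_sp = sum(p * q for p, q in zip(p_spared, priors))

# Bayes' theorem: probability the fair coin was used given the victim was spared.
p_c1_given_sp = (p_spared[0] * priors[0]) / p_sp
print(p_sp, p_c1_given_sp)      # roughly 0.3111 and 0.5357
```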

Exercise 7.3 (Monty Hall meets Bayes). In a game show there are three doors, behind each of which is a prize; behind one door is a car, and behind the other two, goats. The car is equally likely to be behind each door. Your goal is to win the car. After you choose a door, the host of the game show, who knows where the car is, chooses at random (as randomly as possible) one of the remaining doors (one which you didn't pick) which contains a goat, and reveals this information to you. You are then offered the choice of staying with your current choice or switching your choice to the other remaining door. What should you do if you want to maximize your chances of winning? Assume for simplicity that you originally picked the first door, and the host revealed that behind the third door is a goat, so that your choice is between staying with the first door or switching to the second door.

Exercise 7.4 (Medical testing). Consider a screening test for a certain disease that has the following reliability. If the patient has the disease, the test will be positive, inconclusive, or negative with probabilities 0.8, 0.05, 0.15, respectively. If the patient does not have the disease, the test will be positive, inconclusive, or negative with probabilities 0.03, 0.1, 0.87, respectively. Suppose that the disease is present in 14 percent of the population of adults who are referred by their doctors to be screened with this test.
(1) An adult is referred by their doctor for testing. What is the probability that the person will be correctly diagnosed?
(2) If an adult has been correctly diagnosed, what is the probability the person does not have the disease?
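(A Monte Carlo sketch, not in the original notes, that you can run to check your reasoning in Exercise 7.3; the function name play and the choice of 100,000 trials are arbitrary.)

```python
# Estimate the winning probabilities of the "stay" and "switch" strategies.
import random

def play(switch: bool) -> bool:
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = 0                        # as in the exercise, you always pick the first door
    # The host opens a goat door that is not your pick.
    host = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        other = [d for d in doors if d != pick and d != host]
        pick = other[0]
    return pick == car

trials = 100_000
print(sum(play(False) for _ in range(trials)) / trials)   # estimated win probability if you stay
print(sum(play(True) for _ in range(trials)) / trials)    # estimated win probability if you switch
```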

8. Exam 1, Answers

Q1: 173cm, 13cm², 3.6056cm; Q3: 0.9; Q4: 1 in 5,153,632.64706; Q5: 1/12, 2/3; Q6: 0.6; Q7: 0.2344; Q8: (See Exercise 2.93 in the textbook); Q9: 5/11 = 0.45454545; Q10: Use the Binomial formula.


9. Chapter 3.1

9.1. Random variables. A (real-valued) random variable is a function which assigns an element of R to each member of a sample space S. Typically, random variables are denoted by capital Roman letters such as X, Y, and Z. In probability, we write
{X = n} = {s ∈ S : X(s) = n} and {X ≤ x} = {s ∈ S : X(s) ≤ x}.
The simplest random variables are constant (real-valued) random variables; that is, a random variable X such that for some b ∈ R, we have P(X = b) = 1.

9.2. Bernoulli random variables. A random variable that only takes the values 0 and 1 is called a Bernoulli random variable. Let p ∈ (0, 1). A Bernoulli random variable X with P(X = 1) = p is called a Bernoulli random variable with parameter p; if X is such a random variable, we often write X ∼ Bern(p).

Example done in class 9.1. Given the sample space S = {1, 2, 3, 4, 5, 6}, where P assigns equal probabilities to each element of S, define a Bernoulli random variable on S. Define a Bernoulli random variable X on S such that the probability of the event {X = 1} is 1/3.

Solution. We can take Y(s) = 1 for all even s ∈ S, and Y(s) = 0 otherwise. We can take X(s) = 1 if s = 1, 2, and X(s) = 0 otherwise.

Example done in class 9.2. Consider the simple experiment where we keep flipping a fair coin until we see a head. Let X be the number of flips. Define a sample space for X. What is the probability that X = 4? Compute P(X ∈ {2, 5}).

Solution. We can let S = {H, TH, TTH, TTTH, TTTTH, ...}, and assign probabilities 1/2, 1/4, 1/8, ... to each one of the elements, respectively. Note these probabilities are a geometric series with unit sum. We have P(X = n) = 2^{−n}. Thus P(X = 4) = 1/16, and P(X ∈ {2, 5}) = 1/4 + 1/32.

Example done in class 9.3. Suppose that I will bet 5 dollars on a toss of a coin, whereby I will win 5 dollars if it comes up heads, and lose 5 dollars otherwise. Suppose that the coin is tossed two times and that I will make a bet each time. Write down a sample space for the coin tosses, and a random variable W defined on the sample space that represents my possible winnings and losses.


Solution. Let S = {TT, TH, HT, HH}. Then W(TT) = −10, W(TH) = 0 = W(HT), and W(HH) = 10.

Example done in class 9.4 (Indicators). Let A be an event of a sample space S. Define a random variable on S via
\[ 1_A(s) = \begin{cases} 1 & \text{if } s \in A, \\ 0 & \text{otherwise.} \end{cases} \]
Let A and B be independent events such that P(A) = P(B) = 2/3. Let X = 1_A + 1_B. Find P(X = x) for all x ∈ R.

Solution. Notice that {X = 2} = A ∩ B, {X = 1} = (A ∩ B′) ∪ (A′ ∩ B), and {X = 0} = A′ ∩ B′, and the events A and B are independent. Thus we have that P(X = 2) = (2/3)(2/3), P(X = 1) = (2)(2/3)(1/3), and P(X = 0) = (1/3)(1/3); for all other x ∈ R, we have P(X = x) = 0.

Example done in class 9.5. Suppose a system consists of 3 components and will work if at least two of the components are functioning. Let C_i be the event that component i is functioning. Let W be the event that the system is functioning.
(a) For any events A and B, check that 1_{A∩B} = 1_A 1_B.
(b) Also check that if A and B are disjoint events, then 1_{A∪B} = 1_A + 1_B.
(c) Express W in terms of the events C_i; that is, write down an expression that looks like W = (C_1 ∩ C_2 ∩ C_3) ∪ ...
(d) Express 1_W in terms of the indicators 1_{C_i}; that is, write down an expression that looks like 1_W = 1_{C_1} 1_{C_2} 1_{C_3} + ···.
(e) Suppose that the events C_i are independent, and that P(C_1) = 0.2, P(C_2) = 0.3, and P(C_3) = 0.6. Compute P(W).

Solution. (a) Notice that both expressions only take values in {0, 1}, and note that 1_A 1_B = 1 if and only if 1_A = 1 and 1_B = 1. And 1_A = 1 and 1_B = 1 if and only if 1_{A∩B} = 1.
(b) Since A and B are disjoint, we have that both expressions only take values in {0, 1}. Note that 1_{A∪B} = 0 if and only if 1_A = 0 and 1_B = 0. And 1_A = 0 and 1_B = 0 if and only if 1_A + 1_B = 0.
(c) W = (C_1 ∩ C_2 ∩ C_3) ∪ (C_1 ∩ C_2 ∩ C_3′) ∪ (C_1 ∩ C_2′ ∩ C_3) ∪ (C_1′ ∩ C_2 ∩ C_3).
(d) 1_W = 1_{C_1} 1_{C_2} 1_{C_3} + 1_{C_1} 1_{C_2} 1_{C_3′} + 1_{C_1} 1_{C_2′} 1_{C_3} + 1_{C_1′} 1_{C_2} 1_{C_3}.


(e) Since W is expressed as a disjoint union, and the events C_i are independent, we have
P(W) = 0.2(0.3)(0.6) + 0.2(0.3)(0.4) + 0.2(0.7)(0.6) + 0.8(0.3)(0.6).

Exercise 9.6 (Inclusion-exclusion again). Let A and B be events of a sample space S. Show that 1_{A∪B} = 1_A + 1_B − 1_{A∩B}.

9.3. Law of large numbers. Let p ∈ (0, 1). Let (A_i)_{i=1}^∞ be a sequence of mutually independent events, all with P(A_i) = P(A_1) = p. For a concrete example, think of an experiment where we flip a coin an infinite number of times, and let A_i be the event that the i-th flip comes up heads. The law of large numbers gives the connection between the mathematical foundations of probability and our intuitive understanding of probability as a limiting relative frequency: it states that on an event of probability one, we have that
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} 1_{A_i} = P(A_1) = p; \]
more precisely, there is an event S′ such that P(S′) = 1, and for all s ∈ S′, we have
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} 1_{A_i}(s) = P(A_1) = p. \]
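(A simulation sketch, not in the original notes, illustrating the law of large numbers: the running relative frequency of heads for a p-coin settles down to p. The value p = 0.3 and the checkpoints are arbitrary choices.)

```python
# Running relative frequency of heads for independent flips of a p-coin.
import random

p = 0.3
n = 100_000
heads = 0
for i in range(1, n + 1):
    heads += random.random() < p       # indicator of the event "flip i is heads"
    if i in (10, 100, 1_000, 10_000, 100_000):
        print(i, heads / i)            # relative frequency after i flips
```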

9.4. Types of random variables. Let X be a random variable. If there is a set C = {v_1, v_2, ...} such that P(X ∈ C) = 1, then X is a discrete random variable; in other words, X only takes values in a countable set. We will discuss continuous random variables in more detail later, which for example might take values in all of the interval [0, 1].

9.5. Uniform random variables. Suppose that U is a random variable taking values in [0, 1] as randomly as possible. Such a random variable should have the property that P(a ≤ U ≤ b) = b − a for all 0 ≤ a < b ≤ 1; that is, the probability that U lies in any interval is exactly the length of the interval. The random variable U is called a uniform random variable, and it can be defined on the sample space [0, 1], by taking U(s) = s for all s ∈ [0, 1], and constructing P (this is actually no easy task) on [0, 1] so that P((a, b)) = b − a for all 0 ≤ a ≤ b ≤ 1.


Exercise 9.7. Let U be a uniform random variable on [0, 1]. Show that P(U = x) = 0 for all x ∈ [0, 1]. What is wrong with the following calculation:
\[ 1 = P(U \in [0, 1]) = \sum_{x \in [0,1]} P(U = x) = 0? \]
Hint: [0, 1] cannot be arranged in a sequence.

Solution. Let ε > 0. Observe that 0 ≤ P(U = x) ≤ P(x − ε ≤ U ≤ x + ε) ≤ 2ε. Since this inequality holds for all ε > 0, we must have that P(U = x) = 0. The calculation is flawed because although the sets {U = x} are clearly disjoint, the set of all x ∈ [0, 1] cannot be put into a sequence; thus we cannot apply Kolmogorov's (countable additivity) axiom.

10. Chapter 3.2

10.1. Probability distributions. Let X be a discrete random variable. The probability mass function (pmf) of X is defined via p(x) = P(X = x) for all x (for which X takes values). (In the case that X is real-valued, it suffices to consider all x ∈ R.) Kolmogorov's axioms give that p(x) ≥ 0 and ∑_x p(x) = 1, and similarly, any function p with these two properties is a probability mass function for a discrete random variable. The cumulative distribution function (cdf) of a real-valued random variable X is defined to be F(x) = P(X ≤ x), for all x ∈ R. In the case that X is also a discrete random variable, we have that
\[ F(x) = \sum_{y : y \le x} p(y). \]
We say that either p or F gives the distribution or law of a discrete real-valued random variable. Often, what we care about in practice is not the actual sample space that a random variable may be defined on, but its distribution.

Exercise 10.1. Let p be a pmf given by p(0) = 1/3, p(1) = 1/9, and p(2) = x (and p(y) = 0 for all other y). What is x?

Exercise 10.2. Find the pmf and cdf of a Bernoulli random variable with parameter 1/3.


Solution. Clearly, p(0) = 2/3, p(1) = 1/3, and p(x) = 0 for all other x ∈ R. As for F, we have F(x) = 0 for all x < 0, F(x) = 2/3 for all x ∈ [0, 1), and F(x) = 1 for all x ∈ [1, ∞).

Exercise 10.3 (Sum of independent Bernoulli random variables). Let p ∈ (0, 1). Suppose (A_i)_{i=1}^n are mutually independent events with P(A_i) = p. Let
\[ X = \sum_{i=1}^{n} 1_{A_i}. \]
Find the pmf for X.

Solution. Let 0 ≤ k ≤ n. One way that X = k is for A_1, ..., A_k to occur and A_{k+1}, ..., A_n not to occur. The mutual independence of the events (A_i) gives that
\[ P(A_1 \cap A_2 \cap \cdots \cap A_k \cap A_{k+1}' \cap \cdots \cap A_n') = p^k (1-p)^{n-k}, \]
and so does every other way; there are exactly \binom{n}{k} ways for k of the events (A_i)_{i=1}^n to occur. Thus
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}. \]
The random variable X in Exercise 10.3 is called a binomial random variable with parameters (n, p), and we write X ∼ Bin(n, p). It can be used to model the number of successes (and failures) in n independent trials that each have a success probability of p.

Exercise 10.4. Suppose I have a standard French deck. Consider the following experiment. I deal myself a poker hand of 5 cards, check the hand, then put the cards back in the deck, mix the deck, and then repeat, for a grand total of 6 times. What is the probability that I get exactly three one-pair poker hands?

Exercise 10.5. Show that log : (0, ∞) → R is an increasing function; that is, if x < y, then log(x) < log(y). Let f : R → (0, ∞). Show that if for some x_M ∈ R we have that log(f(x_M)) ≥ log(f(x)) for all x ∈ R, then f(x_M) ≥ f(x) for all x ∈ R. Also check that if x_c is a critical point of f(x), then x_c is also a critical point of log(f(x)).

Solution. Let x < y. Clearly, log(y) − log(x) = log(y/x). Since x < y, we know that y/x > 1. Also, log(z) > 0 for all z > 1. Thus log(y) − log(x) > 0. In particular, if f(x_M) < f(z) for some z ∈ R, then log(f(x_M)) < log(f(z)).


If g(x) = log(f(x)), note that
\[ g'(x) = \frac{f'(x)}{f(x)}. \]
Thus if x_c is a critical point for f (that is, f'(x_c) = 0 or f'(x_c) does not exist), then the same must hold for g.

Exercise 10.6 (Maximum likelihood estimate). Suppose I have a coin, and I want to figure out the probability p that it lands heads. I flip it n = 10 times, and find that I get k = 3 heads (those two numbers I do know). We don't know p, but we do know that we can model our experiment as a random variable X which counts the number of heads, where X ∼ Bin(n, p). Consider the function f : [0, 1] → [0, 1], given by
\[ f(p) = \binom{n}{k} p^k (1-p)^{n-k} = P(X = k). \]
Find the value for which f is maximized. (You may assume that n = 10 and k = 3.) Hint: by using Exercise 10.5, it might be easier to maximize g(p) = log(f(p)) rather than f(p) directly.

Solution. By Exercise 10.5, we consider
\[ g(p) = \log(f(p)) = \log \binom{n}{k} + k \log(p) + (n-k) \log(1-p). \]
We have that
\[ g'(p) = \frac{k}{p} - \frac{n-k}{1-p}, \]
and setting g'(p) = 0 and solving for p, we obtain that p = k/n. Furthermore, by the 1st or 2nd derivative test, it is easy to argue that g, and hence f, obtains its maximum value at p = k/n.

Exercise 10.7. Suppose F is the cdf for the integer-valued random variable X, and we know that F(0) = 0, F(1) = 1/3, F(2) = 2/3, F(3) = 5/6, F(4) = 11/12, and F(5) = 1. Compute the pmf for X.

Exercise 10.8. Let X be a discrete integer-valued (P(X ∈ Z) = 1) random variable. Let F be the cdf for X. Express the following in terms of F: P(X > a), P(X = a), P(a ≤ X ≤ b), P(a < X < b), P(a < X ≤ b), P(a ≤ X < b). Assume that a, b ∈ Z.
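(Returning to Exercise 10.6: the following sketch, not in the original notes, evaluates the likelihood f(p) = \binom{n}{k} p^k (1 − p)^{n−k} on a grid and locates the maximizer numerically; it should agree with the formula p = k/n.)

```python
# Grid search for the maximizer of the binomial likelihood with n = 10, k = 3.
from math import comb

n, k = 10, 3

def likelihood(p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(p_hat)   # 0.3, i.e. k/n
```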


11. Chapter 3.3 (I)

11.1. Continuous random variables. We will define continuous random variables via a continuous analogue of the pmf. A function f : R → [0, ∞) is a probability density function (pdf) if
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \]

Example done in class 11.1. Let c > 0, and let f : R → [0, ∞) be defined via
\[ f(x) = \begin{cases} c/x^2 & \text{if } x \in [1, \infty), \\ 0 & \text{otherwise.} \end{cases} \]
Find c so that f is a pdf.

We say that X is a continuous random variable if there is a pdf f so that the cdf of X is given via
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du. \]
Note that as a function of x, the cdf F is continuous. In fact, it is differentiable if f is continuous, and by the fundamental theorem of calculus, F'(x) = f(x). (If F is differentiable at x, then we take f(x) = F'(x).)

The standard normal distribution is given by the pdf
\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}. \]
We say that Z is a standard normal random variable if it has pdf f. This is one of the most important densities in probability and statistics. See Figures 1 and 2 for illustrations of the pdf and the cdf. We will verify later that this is indeed a pdf. It turns out that there is no nice formula for the anti-derivative of f, so we can only numerically compute values of F(x). We will discuss in more detail how to handle this important distribution later.

In many ways, integrals and pdfs will function like sums and pmfs. However, it is possible for a pdf to have a value greater than 1. We also have the approximation
\[ (2\varepsilon) f(x) \approx \int_{x-\varepsilon}^{x+\varepsilon} f(u)\,du, \]
and for many purposes it may be helpful to pretend that f(x)dx = P(X = x), but for us this is not a good mathematical statement.


Figure 1. A graph of the pdf for a standard normal random variable.

Example done in class 11.2. Show that if X is a continuous random variable with pdf f, then P(X = b) = 0 for all b ∈ R.

Solution. Let ε > 0. Observe that {X = b} ⊆ {b − ε < X ≤ b}. Hence 0 ≤ P(X = b) ≤ F(b) − F(b − ε). Since F is continuous, taking a limit as ε → 0 gives the required result.

12. Chapter 3.3 (II)

In contrast to discrete random variables, for a continuous random variable X with a pdf f and cdf F, Exercise 11.2 gives that for any a < b, we have


\[ P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b) = \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx = F(b) - F(a) = \int_{a}^{b} f(x)\,dx. \]

Figure 2. A graph of the cdf for a standard normal random variable.

Example done in class 12.1 (Uniform random variables). Find the pdf for a random variable U such that P(U ∈ [0, 1]) = 1, and for any interval [a, b] ⊆ [0, 1], we have P(U ∈ [a, b]) = b − a. We say that U is uniformly distributed in [0, 1].


Solution. Let
\[ f(x) = \begin{cases} 1 & \text{if } x \in [0, 1], \\ 0 & \text{otherwise.} \end{cases} \]

Figure 3. An illustration of P(−2 ≤ Z ≤ 0) for a standard normal random variable Z.

Example done in class 12.2 (Exponential random variables). Let λ > 0. Let X be a continuous random variable with the property that for all x ≥ 0, we have P(X > x) = e^{−λx}. Note that X is positive. We call X an exponential random variable with parameter 1/λ. (We will see why it is 1/λ, instead of just λ, later.) Find the pdf for X (assuming the pdf is piecewise continuous).


Solution. We know that F(x) = \int_{-\infty}^{x} f(t)\,dt. Taking a derivative yields F'(x) = f(x). We know that F(x) = 0 for all x ≤ 0, and F(x) = 1 − e^{−λx} otherwise. Thus
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \in [0, \infty), \\ 0 & \text{otherwise.} \end{cases} \]

Exercise 12.3 (Cauchy distribution). For all x ∈ R, set
\[ f(x) = \frac{1}{\pi(1 + x^2)}. \]
Check that f is a pdf.

Solution. Clearly f(x) ≥ 0. In order to do the integral, make a change of variables x = tan u, so that dx = (sec² u) du, from which we obtain, using the trig identity 1 + tan² u = sec² u, that
\[ \int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi} \int_{-\pi/2}^{\pi/2} 1\,du = 1. \]

Exercise 12.4. Let c > 0. Consider the pdf given by
\[ f(x) = \begin{cases} c x^4 & \text{if } x \in [0, 2), \\ 0 & \text{otherwise.} \end{cases} \]
What is c? Find P(1 ≤ X < ∞).

12.1. Convolutions. We know that we can add random variables to create other random variables, but what operations can we perform on pdfs and pmfs to obtain new ones? Suppose f and g are pdfs. Define
\[ h(z) = (f \star g)(z) = \int_{-\infty}^{\infty} f(x) g(z - x)\,dx. \]
We can check that h is also a pdf. Clearly, h ≥ 0, since both f, g ≥ 0. Also, by interchanging the order of integration (Fubini's theorem), we have
\[ \int_{-\infty}^{\infty} h(z)\,dz = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x) g(z - x)\,dx\,dz = \int_{-\infty}^{\infty} f(x) \Big( \int_{-\infty}^{\infty} g(z - x)\,dz \Big) dx = \int_{-\infty}^{\infty} f(x) \cdot 1\,dx = 1. \]


We can define the convolution of pmfs f, g : Z → R similarly via
\[ (f \star g)(n) = \sum_{i \in \mathbb{Z}} f(i) g(n - i). \]
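(A small sketch, not in the original notes, of the discrete convolution formula: convolving the pmf of a fair die with itself gives the pmf of the sum of two dice; the helper function convolve is ad hoc.)

```python
# Discrete convolution (f ⋆ g)(n) = sum_i f(i) g(n - i), applied to two fair dice.
def convolve(f: dict, g: dict) -> dict:
    h = {}
    for i, fi in f.items():
        for j, gj in g.items():
            h[i + j] = h.get(i + j, 0.0) + fi * gj
    return h

die = {k: 1 / 6 for k in range(1, 7)}     # pmf of one fair die
two_dice = convolve(die, die)             # pmf of the sum of two dice
print(two_dice[7])                        # P(sum = 7) = 6/36, as in Example 4.6
```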

Exercise 12.5 (Convolutions of pmfs). Check that f ⋆ g is also a pmf.

12.2. Constructing (continuous) random variables. Note that in defining continuous random variables, we never actually constructed them! All we did was specify a distribution via a pdf or cdf. Recall that a random variable is defined on a sample space. We never defined the actual sample spaces on which the random variables live, nor did we construct the probability P that agrees with the distribution. It turns out that this is actually beyond the scope of this course. However, if you believe in the existence of uniform random variables, then that is all that is needed to construct other random variables. In order to construct uniform random variables, say a random variable U such that P(U ≤ x) = x for all x ∈ [0, 1], we need to be able to define a set function ℓ on all intervals (a, b) ⊂ [0, 1] such that ℓ(a, b) = b − a. This is not a problem, except we also need to extend this function in a reasonable way (satisfying Kolmogorov's axioms) to include all subsets of [0, 1]. It turns out this is actually, in some sense, impossible; the best you can hope for is to include enough subsets of [0, 1] so that you can actually have a large set of events in which to do probability.

Exercise 12.6. Assume that U is a uniform random variable such that P(U ≤ x) = x for all x ∈ [0, 1].
(1) Let p ∈ (0, 1). Find a function φ so that φ(U) is a Bernoulli random variable with P(φ(U) = 1) = p.
(2) Assume that F is a cdf for a continuous random variable X, and that F is an increasing function, so that if x < y, then F(x) < F(y), and the inverse F^{−1} is well-defined. Check that the random variable defined via F^{−1}(U) has the same cdf as X.

Exercise 12.7 (Hit/miss von Neumann). Let (V_i)_{i∈N} be independent and uniformly distributed in the unit square centered at the origin in R², so that P(V_i ∈ A) = area(A) for subsets A of the unit square. Let D be the disk that is inscribed in the square. Let N denote the first time i such that V_i ∈ D. Show that V_N is uniformly distributed on D, so that
\[ P(V_N \in A) = \frac{\mathrm{area}(A)}{\mathrm{area}(D)}, \]
for subsets A of D.
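(A sketch, not in the original notes, of the recipe in Exercise 12.6(2), taking F to be the exponential cdf F(x) = 1 − e^{−λx} as a concrete choice; λ = 2 is arbitrary.)

```python
# Inverse-transform sampling: if U is uniform on [0, 1], then F^{-1}(U) has cdf F.
# Here F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -log(1 - u) / lam.
import math
import random

lam = 2.0

def sample_exponential() -> float:
    u = random.random()                  # U uniform on [0, 1)
    return -math.log(1.0 - u) / lam      # F^{-1}(U)

xs = [sample_exponential() for _ in range(100_000)]
# Empirical check: P(X > 1) should be close to exp(-lam).
print(sum(x > 1.0 for x in xs) / len(xs), math.exp(-lam))
```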


Exercise 12.8 (Fair flips from a biased coin). Suppose you do not know if a coin is fair or not. How can you and your friend decide, in a random and fair way, who will pay for dinner?

13. Chapter 3.4 (i)

13.1. Joint distributions. Suppose I have a deck of cards and deal myself two cards, which we can represent as a discrete random variable Z = (X_1, X_2), so that X_1 is the first card dealt, and X_2 is the second card dealt. If X_1 is the 2 of spades, then X_2 cannot be the 2 of spades. Sometimes it helps to think of Z as two random variables with a joint distribution rather than as a single random variable taking values as ordered pairs. We say that f = f_{X,Y} is a joint probability mass function for a pair of random variables (X, Y) if f(x, y) ≥ 0 for all (x, y), ∑_x ∑_y f(x, y) = 1, and
P({X = x} ∩ {Y = y}) = P(X = x, Y = y) = f(x, y).
Let f_X and f_Y be the pmfs of X and Y, respectively. Notice that f_X(x) = P(X = x) = ∑_y f(x, y) and f_Y(y) = P(Y = y) = ∑_x f(x, y); in the context of joint distributions, these are called the marginal distributions of X and Y (alone), respectively. The conditional distribution of X given Y = y (provided that P(Y = y) > 0) is given by
\[ f(x|y) = P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]
A similar expression holds for the conditional distribution of Y given X = x.

Example done in class 13.1. Let c ≥ 0. Let the joint probability mass function f of two discrete random variables X and Y be given by the following table:

f(x, y)   y = 0   y = 1   y = 2
x = 0     0.12    0.08    0.2
x = 1     0.2     0.1     0.1
x = 2     0.1     c       0.05

(a) Find c.
(b) Compute P(X = 0).
(c) Compute P(Y = 2).
(d) Compute P(X = 0|Y = 1).

(Thus f(1, 2) = P(X = 1, Y = 2) = 0.2 and f(2, 1) = c.)


Exercise 13.2. Consider the joint pmf of random variables X and Y given by f(0, 0) = 1/12, f(0, 1) = 1/4, f(1, 0) = 5/12, f(1, 1) = 3/12. Find the marginal distributions of X and Y. Find the conditional distribution of X given Y = 1.

Exercise 13.3. Consider the experiment where we roll a fair dice, record its value as the random variable X, and then, if X = x, we flip a biased coin x times and record the number of heads as the random variable Y. Assume that the probability that the coin comes up heads is 2/3. Find the joint pmf of X and Y.

Solution. We know that for 1 ≤ n ≤ 6 and for 0 ≤ k ≤ n, we have by Exercise 10.3 that
\[ P(Y = k | X = n) = \binom{n}{k} \Big(\frac{2}{3}\Big)^{k} \Big(\frac{1}{3}\Big)^{n-k}. \]
We also know that P(X = n) = 1/6 for all n = 1, 2, 3, 4, 5, 6. Hence
\[ P(X = n, Y = k) = P(Y = k | X = n) P(X = n) = \binom{n}{k} \Big(\frac{2}{3}\Big)^{k} \Big(\frac{1}{3}\Big)^{n-k} \cdot \frac{1}{6}, \]
for 0 ≤ k ≤ n and n = 1, 2, 3, 4, 5, 6.

13.2. Independence of random variables. Given two random variables X and Y (discrete or otherwise), we say that they are independent if for all x, y ∈ R, we have
P({X ≤ x} ∩ {Y ≤ y}) = P(X ≤ x)P(Y ≤ y).
Similarly, given a sequence of random variables (X_i), we say that they are independent if for any finite number of the coordinates i_1, ..., i_n (where the i_j are distinct) we have
\[ P\big( \{X_{i_1} \le x_1\} \cap \cdots \cap \{X_{i_n} \le x_n\} \big) = P(X_{i_1} \le x_1) \cdots P(X_{i_n} \le x_n) \]
for all x_1, ..., x_n ∈ R. It is possible to check that this condition is equivalent to the seemingly stronger condition that
\[ P\big( \{X_{i_1} \in A_1\} \cap \cdots \cap \{X_{i_n} \in A_n\} \big) = P(X_{i_1} \in A_1) \cdots P(X_{i_n} \in A_n) \]
for all (reasonable) subsets A_1, ..., A_n ⊆ R. Using this condition, it is possible to check that functions of independent random variables are also independent; that is, if the X_i are independent random variables, and the g_i are (deterministic) functions, then the g_i(X_i) are also independent random variables.


Exercise 13.4. For integer-valued discrete random variables, show that X and Y are independent if and only if for all integers n, m ∈ Z we have P({X = n} ∩ {Y = m}) = P(X = n)P(Y = m), so that if f_{X,Y} is the joint pmf of X and Y, then X and Y are independent if and only if f_{X,Y}(n, m) = f_X(n) f_Y(m).

Example done in class 13.5. Let X and Y be random variables which take values in {1, 2, 3, 4, 5, 6} with equal probability; that is, P(X = i) = 1/6 = P(Y = i) for all i = 1, 2, ..., 6. Also assume that X and Y are independent. Compute P(XY = 12).

Example done in class 13.6. Let λ, µ > 0. Suppose that X and Y are independent random variables such that P(X > x) = e^{−λx} for all x ≥ 0, and P(Y > y) = e^{−µy} for all y ≥ 0. If Z = min{X, Y}, then find the cdf for Z.

Solution. Note that {Z > z} = {X > z} ∩ {Y > z}. Since X and Y are independent, we have that P(Z > z) = P(X > z)P(Y > z) = e^{−(λ+µ)z} for all z ≥ 0. Also note that P(Z > 0) = 1. Thus
\[ F(z) = \begin{cases} 1 - e^{-(\lambda+\mu)z} & \text{if } z \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]

Exercise 13.7 (Sum of independent random variables). Let X and Y be independent discrete integer-valued random variables with pmfs given by f and g, respectively. Show that if Z = X + Y, then the pmf of Z is given by the convolution f ⋆ g. (See Exercise 12.5.)

Solution. Let A_i = {X = i} ∩ {Y = n − i} for each i ∈ Z. Observe that the (A_i)_{i∈Z} are mutually disjoint, and
\[ \{Z = n\} = \bigcup_{i \in \mathbb{Z}} A_i. \]
Since X and Y are independent, we have
\[ P(Z = n) = \sum_{i \in \mathbb{Z}} P(X = i) P(Y = n - i) = (f \star g)(n). \]


13.3. Poisson random variables. Let λ > 0 and consider the pmf given by
\[ p_\lambda(n) = \begin{cases} \dfrac{e^{-\lambda} \lambda^n}{n!} & \text{if } n = 0, 1, 2, 3, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]
If X is a random variable with pmf p_λ, then we say that X is a Poisson random variable with mean or parameter λ, and we write X ∼ Poi(λ).

Exercise 13.8. Check that p_λ is indeed a pmf.

Solution. Since
\[ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}, \]
we have that
\[ \sum_{n=0}^{\infty} \frac{e^{-\lambda} \lambda^n}{n!} = e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^n}{n!} = e^{-\lambda} e^{\lambda} = 1. \]

Exercise 13.9. Suppose X and Y are independent Poisson random variables with parameters λ and µ. Let Z = X + Y.
(1) If λ = 2 and µ = 1, find P(Z = 1).
(2) Find the pmf of Z. (Hint: it is the pmf of a Poisson random variable. Use Exercise 13.7. The binomial formula may also come in handy.)

Solution. (1) Observe that {Z = 1} is given by the disjoint union
\[ \big(\{X = 1\} \cap \{Y = 0\}\big) \cup \big(\{X = 0\} \cap \{Y = 1\}\big). \]
Since X and Y are independent, we have that
\[ P(Z = 1) = p_\lambda(1) p_\mu(0) + p_\lambda(0) p_\mu(1) = e^{-(\lambda+\mu)} (\lambda \cdot 1 + 1 \cdot \mu) = e^{-(\lambda+\mu)} (\lambda + \mu) = p_{\lambda+\mu}(1) = p_3(1) = 3 e^{-3}. \]
(2) By Exercise 13.7, we know that
\[ P(Z = n) = (p_\lambda \star p_\mu)(n) = \sum_{i \in \mathbb{Z}} p_\lambda(i) p_\mu(n - i). \]


We know that P(Z = n) = 0 if n < 0, since X and Y are nonnegative. Let n ≥ 0. Note that p_λ(i) = 0 for all i < 0, and p_µ(n − i) = 0 for all i > n. Thus
\[ \sum_{i \in \mathbb{Z}} p_\lambda(i) p_\mu(n - i) = \sum_{i=0}^{n} p_\lambda(i) p_\mu(n - i) = \sum_{i=0}^{n} \Big( \frac{e^{-\lambda} \lambda^i}{i!} \Big) \Big( \frac{e^{-\mu} \mu^{n-i}}{(n-i)!} \Big). \]
Now recall that
\[ \binom{n}{i} = \frac{n!}{i!(n-i)!}. \]
So
\[ P(Z = n) = \frac{e^{-(\lambda+\mu)}}{n!} \sum_{i=0}^{n} \binom{n}{i} \lambda^i \mu^{n-i}. \]
Now the binomial formula, Exercise 3.3, with x = λ and y = µ gives that
\[ P(Z = n) = e^{-(\lambda+\mu)} \frac{(\lambda + \mu)^n}{n!}. \]
So we obtain that Z is a Poisson random variable with parameter λ + µ.
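(A numerical check, not in the original notes, of the conclusion of Exercise 13.9(2): convolving two Poisson pmfs reproduces the Poisson(λ + µ) pmf; λ = 2 and µ = 1 are the values from part (1).)

```python
# Check numerically that (p_lambda ⋆ p_mu)(n) = p_{lambda + mu}(n) for small n.
from math import exp, factorial

def poisson_pmf(lam: float, n: int) -> float:
    return exp(-lam) * lam**n / factorial(n)

lam, mu = 2.0, 1.0
for n in range(5):
    conv = sum(poisson_pmf(lam, i) * poisson_pmf(mu, n - i) for i in range(n + 1))
    print(n, conv, poisson_pmf(lam + mu, n))   # the two columns should agree
```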

14. Chapter 3.4 (ii)

14.1. Joint distributions for continuous random variables. We say that f : R² → [0, ∞) is a joint probability density function if
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u, v)\,du\,dv = 1. \]
We say that F is the joint distribution for continuous random variables X and Y, with joint pdf f, if for all x, y ∈ R we have
\[ F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du = P(X \le x, Y \le y). \]
This is equivalent to saying that
\[ P((X, Y) \in A) = \iint_{A} f(u, v)\,du\,dv \]
for all (nice) regions A of R². (By nice, I mean anything you can write down or define explicitly, and anything we will use in this course.)


If F is also sufficiently smooth at a point (x, y), then we can recover f(x, y) via
\[ \frac{\partial^2}{\partial x \partial y} F(x, y) = f(x, y). \]
From this we can deduce that if X and Y are continuous random variables, and f is the joint pdf for X and Y, then X and Y are independent if and only if f(x, y) = f_X(x) f_Y(y) for all x, y ∈ R, where f_X and f_Y are the pdfs for X and Y. The marginals are given via
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx. \]
It is trickier to define conditional distributions for continuous random variables, since the event {Y = y} has probability zero when Y is continuous. Let X and Y be continuous random variables with a joint pdf f. Consider the following calculation for small ε > 0, which will be based on the approximation
\[ (2\varepsilon) f(x) \approx \int_{x-\varepsilon}^{x+\varepsilon} f(u)\,du: \]
\begin{align*}
P(X \le x \mid Y \in (y - \varepsilon, y + \varepsilon)) &= \frac{P\big(\{X \le x\} \cap \{Y \in (y - \varepsilon, y + \varepsilon)\}\big)}{P(Y \in (y - \varepsilon, y + \varepsilon))} \\
&\approx \frac{\int_{-\infty}^{x} \int_{y-\varepsilon}^{y+\varepsilon} f(u, v)\,dv\,du}{2\varepsilon f_Y(y)} \\
&\approx \frac{2\varepsilon \int_{-\infty}^{x} f(u, y)\,du}{2\varepsilon f_Y(y)} \\
&= \frac{\int_{-\infty}^{x} f(u, y)\,du}{f_Y(y)}.
\end{align*}
So we define the conditional distribution function of X given Y = y via
\[ F_{X|Y}(x) = \frac{\int_{-\infty}^{x} f(u, y)\,du}{f_Y(y)}, \]
and the conditional density of X given that Y = y via
\[ f_{X|Y}(x|y) = f(x|y) = \frac{f(x, y)}{f_Y(y)}, \]
provided that f_Y(y) > 0. We can similarly define the conditional distribution of Y given that X = x. Note that the condition that f_{X|Y}(x|y) is given by a function of only the x variable, and does not depend on y, is equivalent to X and Y being independent. The analogous statement for f_{Y|X} is also true.


Exercise 14.1. Let c > 0. Consider the joint distribution for continuous random variables X and Y given by the density
\[ f(x, y) = \begin{cases} c\,x e^{-5x} e^{-xy} & \text{if } x, y \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
(a) Find c.
(b) Find the marginal f_X.
(c) Find the marginal f_Y.
(d) Check, directly by integration, that the function defined by
\[ g(y) = \begin{cases} \dfrac{5}{(5+y)^2} & \text{if } y \ge 0, \\ 0 & \text{otherwise,} \end{cases} \]
is indeed a pdf.
(e) Find the density for the conditional distribution of Y given that X = x, for x > 0.

Solution. (a) Clearly f(x, y) ≥ 0. We have the requirement that
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1. \]
We have that
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \int_{0}^{\infty} \int_{0}^{\infty} f(x, y)\,dy\,dx = c \int_{0}^{\infty} e^{-5x} \Big( \int_{0}^{\infty} x e^{-xy}\,dy \Big) dx = c \int_{0}^{\infty} e^{-5x}\,dx = \frac{c}{5}. \]
Thus c = 5.
(b) From our previous calculation, we easily see that f_X(x) = 5 e^{−5x} for all x ≥ 0, and f_X(x) = 0 otherwise.
(c) We will need to do integration by parts. Recall that the product rule gives that
\[ \frac{d}{dx}\big(u(x)v(x)\big) = u'(x)v(x) + u(x)v'(x). \]


Thus integrating gives
∫_a^b u(x)v′(x) dx = [u(x)v(x)]_a^b − ∫_a^b u′(x)v(x) dx.
We know that fY(y) = 0 for y < 0. For y ≥ 0, we are supposed to compute
fY(y) = ∫_0^{∞} 5x e^{−5x} e^{−xy} dx.
Take u(x) = x and v′(x) = 5e^{−x(5+y)}, so take v(x) = (−5/(5 + y)) e^{−x(5+y)}. We find that
fY(y) = [x (−5/(5 + y)) e^{−x(5+y)}]_{x=0}^{x=∞} − [(5/(5 + y)²) e^{−x(5+y)}]_{x=0}^{x=∞}.
Using the fact that xe^{−x} → 0 as x → ∞, we have that
fY(y) = 5/(5 + y)².
(d) Note that
∫_{−∞}^{∞} g(y) dy = [−5/(5 + y)]_0^{∞} = 1.
(e) We have that
f_{Y|X}(y|x) = x e^{−xy} if y ≥ 0, and 0 otherwise.
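As a quick numerical sanity check of parts (a) and (c), not something done in class, here is a Python sketch; the step size and the truncation point of the integral are arbitrary choices made for illustration.

import math

def f(x, y):
    # joint density from Exercise 14.1, with c = 5
    return 5 * x * math.exp(-5 * x - x * y)

def f_Y(y, h=1e-3, xmax=10.0):
    # midpoint rule for the marginal f_Y(y) = integral of f(x, y) dx over x >= 0
    return sum(f((i + 0.5) * h, y) * h for i in range(int(xmax / h)))

for y in [0.0, 1.0, 2.0, 10.0]:
    print(y, round(f_Y(y), 5), "exact:", round(5 / (5 + y) ** 2, 5))

The numerical marginal matches 5/(5 + y)² at each test point, in agreement with part (c).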

Exercise 14.2. Let c > 0. Consider the joint distribution for continuous random variables X and Y given by the density
f(x, y) = c if 0 ≤ y ≤ x ≤ 1, and f(x, y) = 0 otherwise.
(a) Find c. (Hint: it may help to draw a picture of the region of integration.)
(b) Find the marginal fX.
(c) Find the marginal fY.
(d) Check, directly by integration or elementary school math, that the function defined by
g(y) = 2(1 − y) if y ∈ [0, 1], and g(y) = 0 otherwise,
is indeed a pdf.
(e) Find the density for the conditional distribution of X given that Y = y, for y ∈ (0, 1).


(f) Fix some b ∈ [0, 1). Check, directly, by integration or elementary school math, that the function defined by
h(x) = 1/(1 − b) if b ≤ x < 1, and h(x) = 0 otherwise,
is indeed a pdf.

Solution. (a) The region of integration is a triangle with boundaries given by the equations y = 0, x = 1, and y = x, which has area 1/2, so c = 2; or we can do the integration:
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = c ∫_0^1 ∫_0^x 1 dy dx
 = c ∫_0^1 x dx
 = c [x²/2]_0^1
 = c/2.
Thus c = 2.
(b) Clearly, fX(x) = 0 for all x < 0 and all x > 1. For x ∈ [0, 1], we have that
fX(x) = ∫_0^x 2 dy = 2x.
(c) Clearly, fY(y) = 0 for all y < 0 and all y > 1. For all y ∈ [0, 1], we have
fY(y) = ∫_y^1 2 dx = 2(1 − y).
(d) The required integral of g is given by the area of a right-angle triangle with height 2 and base 1, which has unit area.
(e) The required conditional density is given by
f_{X|Y}(x|y) = 1/(1 − y) if y ≤ x ≤ 1, and 0 otherwise.
(f) The required integral of h is given by the area of a rectangle with height 1/(1 − b) and length 1 − b, which has unit area.


14.2. Normal distributions. Let ρ ∈ (−1, 1). Let f : R² → [0, ∞) be given by
f(x, y) = (1/(2π√(1 − ρ²))) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ).
This is called a standard bivariate normal distribution. We will first consider the case where ρ = 0. If g : R → [0, ∞) is given by
g(x) = (1/√(2π)) e^{−x²/2},
then g is the pdf for the standard normal distribution. A random variable with pdf g is a standard normal random variable.

Example done in class 14.3. (1) In the case that ρ = 0, check that the bivariate normal distribution defined above is indeed a joint distribution; that is, it is nonnegative, and its double integral over the entire real plane is one. (2) In the case ρ = 0, also rewrite f(x, y) = f₁(x)f₂(y) as a product of functions in x and y alone. (3) Show that the pdf for the standard normal distribution is indeed a pdf.

Solution. (1) We use polar coordinates and the change of coordinates x = r cos θ, y = r sin θ, dx dy = r dr dθ:
(1/(2π)) ∫∫_{R²} e^{−(x²+y²)/2} dx dy = (1/(2π)) ∫_0^{∞} ∫_0^{2π} r e^{−r²/2} dθ dr
 = ∫_0^{∞} r e^{−r²/2} dr
 = [−e^{−r²/2}]_0^{∞}
 = 1.
(2) Write f(x, y) = g(x)g(y), where g is the pdf for the standard normal distribution.
(3) We know that 1 = ∫∫_{R²} f(x, y) dx dy = (∫_R g(x) dx)², so the result follows.
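Part (3) can also be checked numerically. The following is a minimal Python sketch, not part of the course material; the truncation to [−10, 10] and the step size are arbitrary choices.

import math

def g(x):
    # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# midpoint rule on [-10, 10]; the tails outside contribute essentially nothing
h = 1e-3
total = sum(g(-10 + (i + 0.5) * h) * h for i in range(int(20 / h)))
print(total)   # prints a number extremely close to 1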


15. Chapter 3.4 (III)

15.1. More on normal distributions. Let µ ∈ R and σ > 0. The density for a normal random variable X with parameters (µ, σ) is given by
n(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)};
in which case we also write X ∼ N(µ, σ²).

Example done in class 15.1. Check that for every µ ∈ R and σ > 0, we indeed have that n(·; µ, σ) is a density.

Exercise 15.2 (Transformation of the standard normal). (1) Let X be a standard normal random variable, so that X ∼ N(0, 1). Let σ > 0 and µ ∈ R. Find the pdf for the random variable Y = σX + µ.
(2) Find the second marginal for the bivariate normal distribution f, without assuming ρ = 0; that is, compute ∫_{−∞}^{∞} f(x, y) dx. (Hint: complete the square x² − 2ρxy + y² = (x − ρy)² + (1 − ρ²)y². You will get the density for a standard normal random variable.)
(3) Verify that ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

Solution. (1) We have
P(Y ≤ x) = P(σX + µ ≤ x) = P(X ≤ (x − µ)/σ) = (1/√(2π)) ∫_{−∞}^{(x−µ)/σ} e^{−u²/2} du.
By the change of variables u = (t − µ)/σ we have that
FY(x) = P(Y ≤ x) = (1/(σ√(2π))) ∫_{−∞}^{x} e^{−(t−µ)²/(2σ²)} dt.
By taking a derivative we obtain the required pdf
FY′(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.
(2) After completing the square, we are to do the integral:
∫_{−∞}^{∞} (1/(2π√(1 − ρ²))) exp( −((x − ρy)² + (1 − ρ²)y²)/(2(1 − ρ²)) ) dx
 = (e^{−y²/2}/√(2π)) ∫_{−∞}^{∞} (1/(√(2π)√(1 − ρ²))) exp( −(x − ρy)²/(2(1 − ρ²)) ) dx.


From the previous exercise, with σ² = 1 − ρ² and µ = ρy, we know that
∫_{−∞}^{∞} (1/(√(2π)√(1 − ρ²))) exp( −(x − ρy)²/(2(1 − ρ²)) ) dx = 1.
Thus we are left with
∫_{−∞}^{∞} f(x, y) dx = (1/√(2π)) e^{−y²/2},
which is exactly the pdf for a standard normal random variable.
(3) We just need to integrate both sides of the above equation; we already know the answer for this integral, since the integrand is the pdf for a standard normal random variable.

Exercise 15.3 (Sum of continuous random variables). Let X and Y be continuous random variables with a joint pdf f, and pdfs given by fX and fY, respectively. Let Z = X + Y. Let z ∈ R.
(1) Show that the pdf for Z is given by
f_{X+Y}(z) = ∫_{−∞}^{∞} f(x, z − x) dx.
(2) If X and Y are independent, show that the pdf for Z is given by fX ⋆ fY.

Solution. (1) Let R = {(x, y) ∈ R² : x + y ≤ z}. We have
P(Z ≤ z) = ∫∫_R f(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{z−x} f(x, y) dy dx.
By a change of variables y = u − x (here x is a constant), we have
P(Z ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z} f(x, u − x) du dx = ∫_{−∞}^{z} ( ∫_{−∞}^{∞} f(x, u − x) dx ) du.
Differentiating gives the result.
(2) If X and Y are independent, we have that f(x, z − x) = fX(x)fY(z − x), from which the result follows.
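Exercise 15.3 can be illustrated numerically. In the hypothetical Python sketch below, X is exponential with mean 1 and Y is uniform on [0, 1]; these distributions, the seed, and the grid are arbitrary choices for illustration. The numerical convolution is compared with a Monte Carlo estimate of the density of X + Y.

import math, random

def fX(x):   # Exp(1) density
    return math.exp(-x) if x >= 0 else 0.0

def fY(y):   # Unif[0,1] density
    return 1.0 if 0.0 <= y <= 1.0 else 0.0

def f_sum(z, h=1e-3, lo=-2.0, hi=10.0):
    # numerical version of f_{X+Y}(z) = integral of fX(x) fY(z - x) dx
    return sum(fX(lo + (i + 0.5) * h) * fY(z - (lo + (i + 0.5) * h)) * h
               for i in range(int((hi - lo) / h)))

rng = random.Random(1)
n = 200_000
samples = [-math.log(1.0 - rng.random()) + rng.random() for _ in range(n)]  # X + Y

for z in [0.5, 1.0, 2.0]:
    mc = sum(1 for s in samples if z - 0.05 < s < z + 0.05) / (n * 0.1)
    print(z, round(f_sum(z), 4), round(mc, 4))

The two estimates of the density of X + Y agree, which is exactly the content of part (1).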


Exercise 15.4. Let X and Y have the joint density given by the standard bivariate normal distribution. Compute the conditional distribution of X given that Y = y. When are X and Y independent?

Solution. From our previous calculations in Exercise 15.2, it is clear that
f(x|y) = (1/(√(2π)√(1 − ρ²))) exp( −(x − ρy)²/(2(1 − ρ²)) ).
We see that X and Y are independent if and only if ρ = 0.

16. Chapter 4.1, 4.3 (I)

16.1. Mean of a random variable. Let X be a real-valued discrete random variable with pmf f; then the expected value of X is defined via:
EX = Σ_x x f(x).
If X is a continuous random variable with pdf f, then its expected value is defined similarly:
EX = ∫_{−∞}^{∞} x f(x) dx.
We often also write µ = EX. We also call EX the mean of a random variable; the justification for this is given by a version of the law of large numbers, which states that if (Xi)_{i=1}^{∞} is a sequence of independent random variables with the same distribution such that E|X1| < ∞, then on an event of probability one, we have
lim_{n→∞} (1/n) Σ_{i=1}^{n} Xi = EX1.
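The law of large numbers is easy to see in a simulation. Here is a minimal Python sketch for coin flips; the seed and the choice p = 0.5 are arbitrary and are not part of the notes.

import random

rng = random.Random(2)
p = 0.5   # true probability of heads, chosen for this illustration
for n in [10, 100, 10_000, 1_000_000]:
    xbar = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
    print(n, xbar)

The printed sample means settle down near p as n grows, as the theorem predicts.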

Example done in class 16.1. Let X be a Bernoulli random variable with parameter p, so that P(X = 1) = p. Show that EX = p. Let A be an event; show that E1_A = P(A).

Exercise 16.2. Let X be a Binomial random variable with parameters (n, p), where p ∈ (0, 1), so that it has a pmf given by
P(X = k) = (n choose k) p^k (1 − p)^{n−k}
for all integers k with 0 ≤ k ≤ n. Show directly using the definition of expectation that EX = np.


Solution.
EX = Σ_{k=0}^{n} k (n choose k) p^k (1 − p)^{n−k}
 = Σ_{k=1}^{n} k (n choose k) p^k (1 − p)^{n−k}
 = p Σ_{k=1}^{n} k [n!/(k!(n − k)!)] p^{k−1} (1 − p)^{n−k}
 = pn Σ_{k=1}^{n} [(n − 1)!/((k − 1)!(n − k)!)] p^{k−1} (1 − p)^{n−k}
 = pn Σ_{k=0}^{n−1} [(n − 1)!/(k!(n − 1 − k)!)] p^{k} (1 − p)^{n−1−k}
 = pn Σ_{k=0}^{n−1} (n−1 choose k) p^k (1 − p)^{n−1−k}
 = pn(1) = pn.

Example done in class 16.3 (N-sided fair dice). Suppose that X is a random variable which takes values in R = {1, 2, 3, . . . , n} with equal probability; that is, P(X = i) = 1/n for all i ∈ R. Let Y =_d X; that is, P(Y = z) = P(X = z) for all z ∈ R. Also assume that X and Y are independent. Compute P(Y = X) and EX.

Exercise 16.4 (Geometric random variables). We say that T is a geometric random variable with parameter p ∈ (0, 1) if
P(T = k) = (1 − p)^{k−1} p for all k = 1, 2, 3, . . .
Compute ET. If you flip a fair coin, how many times on average do you have to flip it in order to get a head? Hint: recall from Calculus that for all r ∈ (0, 1) we have that if
f(r) = 1/(1 − r) = Σ_{n=0}^{∞} r^n,


then
f′(r) = 1/(1 − r)² = Σ_{n=1}^{∞} n r^{n−1}.

Example done in class 16.5. Let U be uniformly distributed in [0, 1]. Find EU.

Exercise 16.6. Let σ > 0 and µ ∈ R. Let X be a continuous random variable with pdf given by
f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},
so that X ∼ N(µ, σ²). Find EX.

Exercise 16.7. Let λ > 0. Let X be an exponential random variable with parameter 1/λ. Find EX. (See Exercise 12.2.)

Solution. The pdf for X is given by
f(x) = λe^{−λx} if x ∈ [0, ∞), and f(x) = 0 otherwise.
Thus
EX = ∫_0^{∞} x f(x) dx.
We will need to integrate by parts. Recall that:
∫_a^b u(x)v′(x) dx = [u(x)v(x)]_a^b − ∫_a^b u′(x)v(x) dx.
Set v′(x) = f(x) and u(x) = x; thus we take v(x) = −e^{−λx}, and have u′(x) = 1. So, we have
EX = [−xe^{−λx}]_0^{∞} + ∫_0^{∞} e^{−λx} dx = 0 + [−(1/λ)e^{−λx}]_0^{∞} = 1/λ.
(Recall that xe^{−x} → 0 as x → ∞.)

Example done in class 16.8. Let X be a continuous random variable with probability density function f. If f is even (f(x) = f(−x) for all x ∈ R), then show that EX = 0 and P(X ≥ x) = P(X ≤ −x).


Let α ∈ (0, 1). Show that P(−z ≤ X ≤ z) = 1 − α if and only if P(X > z) = α/2. We should assume that
∫ |x| f(x) dx < ∞
to avoid any problems with ∞ − ∞.

Solution. Since f is even, by the change of variables x = −u we have that
∫_{−∞}^{0} x f(x) dx = ∫_0^{∞} −u f(−u) du = ∫_0^{∞} −u f(u) du.
Hence
∫_{−∞}^{∞} x f(x) dx = ∫_{−∞}^{0} x f(x) dx + ∫_0^{∞} x f(x) dx = −∫_0^{∞} x f(x) dx + ∫_0^{∞} x f(x) dx = 0.
Also, by the change of variables x = −u we have
P(X ≥ z) = ∫_z^{∞} f(x) dx = ∫_{−∞}^{−z} f(x) dx = P(X ≤ −z).
For the final claim, observe that
P(−z ≤ X ≤ z) = P(X ≤ z) − P(X < −z) = 1 − P(X > z) − P(X < −z) = 1 − 2P(X > z).

Exercise 16.9. For a discrete real-valued random variable Z, show that if Z ≥ 0, then EZ ≥ 0.

end of Exam 2 coverage

17. Quiz and review of Quiz and Extra HW

18. Chapter 4.1, 4.3 (II)

18.1. Functions of random variables.

Example done in class 18.1. Suppose X is uniformly distributed on {−3, −2, −1, 0, 1, 2, 3}; that is, P(X = i) = 1/7 for all integers −3 ≤ i ≤ 3. Let g(x) = x². Compute Eg(X).


Rather than computing the pmf of g(X), it turns out that there is often an easier way. Let X be a discrete random variable taking values on D. Given a function g : D → R such that E|g(X)| < ∞, we have that
Eg(X) = Σ_{x∈D} g(x) P(X = x).
Similarly, if f is the pdf for a continuous random variable X, and g : R → R is a function such that g(X) is a continuous random variable, then
Eg(X) = ∫_{−∞}^{∞} g(x) f(x) dx.
Sometimes these formulas are called the law of the unconscious statistician; I am not sure why.

Exercise 18.2. Let λ > 0. Let X be a random variable with the property that for all x ≥ 0, we have P(X > x) = e^{−λx}. Find E(X²).

Solution. The pdf for X is given by
f(x) = λe^{−λx} if x ∈ [0, ∞), and f(x) = 0 otherwise.
Thus
E(X²) = ∫_0^{∞} x² f(x) dx.
Again we will need to integrate by parts. Recall that:
∫_a^b u(x)v′(x) dx = [u(x)v(x)]_a^b − ∫_a^b u′(x)v(x) dx.
Choose u(x) = x², and v′(x) = f(x). Thus we take v(x) = −e^{−λx}, and have u′(x) = 2x. So we have
E(X²) = [−x²e^{−λx}]_0^{∞} + ∫_0^{∞} 2xe^{−λx} dx.
The first term on the right-hand side vanishes since x²e^{−x} → 0 as x → ∞. As for the second term, we already know how to deal with that from Exercise 16.7, since from that exercise, we know
EX = ∫_0^{∞} xλe^{−λx} dx = 1/λ;
hence,
∫_0^{∞} 2xe^{−λx} dx = 2/λ².
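The value 2/λ² can also be checked by simulation, since by the law of large numbers the average of the Xi² should be close to E(X²). A short Python sketch follows; the seed and the choice λ = 2 are arbitrary and only for illustration.

import math, random

rng = random.Random(3)
lam, n = 2.0, 500_000
# exponential samples via inverse transform: X = -log(U)/lam
xs = [-math.log(1.0 - rng.random()) / lam for _ in range(n)]
print(sum(x * x for x in xs) / n, "vs", 2 / lam**2)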


We will give a proof of the law of the unconscious statistician for the case of discrete random variables. We need to show that
Σ_y y P(g(X) = y) = Σ_x g(x) P(X = x).
In order to do this, we will re-arrange the sum on the right-hand side. For each y, let g^{−1}(y) = {x : g(x) = y}. Note that for each y,
P(g(X) = y) = Σ_{x ∈ g^{−1}(y)} P(X = x),
so that
y P(g(X) = y) = y Σ_{x ∈ g^{−1}(y)} P(X = x) = Σ_{x ∈ g^{−1}(y)} g(x) P(X = x).
Summing both sides over y gives
Σ_y y P(g(X) = y) = Σ_y Σ_{x ∈ g^{−1}(y)} g(x) P(X = x).
Since the union over all y of g^{−1}(y) gives all possible values of x, the right-hand side is the same as
Σ_x g(x) P(X = x).

Similar formulas hold for joint distributions. In particular, the version for the discrete joint distribution is a consequence of the regular one, since given two discrete random variables X, Y, the random variable Z = (X, Y) is also a discrete random variable: let g be a function of (X, Y); then
Eg(X, Y) = Σ_x Σ_y g(x, y) P(X = x, Y = y).
For continuous random variables, if f is the joint pdf of continuous random variables X and Y, we have
Eg(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy.

Example done in class 18.3. Let U1 and U2 be independent random variables uniformly distributed in [0, 1]. Compute E(U1 U2).


18.2. Linearity and independent products. It turns out the law of the unconscious statistician for (joint) distributions can be very useful for deriving other useful properties of the expectation operator E.

Exercise 18.4. If X and Y are independent real-valued random variables, then E(XY) = (EX)(EY). Show this for discrete random variables using the law of the unconscious statistician with the function g(x, y) = xy.

Solution.
E(XY) = Eg(X, Y)
 = Σ_x Σ_y xy P(X = x, Y = y)
 = Σ_x Σ_y xy P(X = x) P(Y = y)   (since X and Y are independent)
 = Σ_x x P(X = x) Σ_y y P(Y = y)
 = (EX)(EY).

Exercise 18.5. Find two random variables X and Y such that E(XY) = (EX)(EY), but X and Y are not independent.

Example done in class 18.6. Let a, b ∈ R. If X is a real-valued random variable, then E(aX + b) = aEX + b. Show this for a continuous random variable X with pdf f, using the law of the unconscious statistician with the function g(x) = ax + b.

Solution.
E(aX + b) = Eg(X)
 = ∫_{−∞}^{∞} (ax + b) f(x) dx
 = ∫_{−∞}^{∞} ax f(x) dx + ∫_{−∞}^{∞} b f(x) dx
 = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx
 = aEX + b.

Example done in class 18.7. If X and Y are real-valued random variables (not necessarily independent), then E(X + Y) = EX + EY (provided that E|X| + E|Y| < ∞). Show this for discrete random variables using the law of the unconscious statistician with the function g(x, y) = x + y.


Solution.
E(X + Y) = E(g(X, Y))
 = Σ_x Σ_y (x + y) P(X = x, Y = y)
 = Σ_x Σ_y ( x P(X = x, Y = y) + y P(X = x, Y = y) )
 = Σ_x x Σ_y P(X = x, Y = y) + Σ_y y Σ_x P(X = x, Y = y)
 = Σ_x x P(X = x) + Σ_y y P(Y = y)
 = EX + EY.

Example done in class 18.8. Use the linearity of expectation to find the expectation of a Binomial random variable with parameter (n, p).

Exercise 18.9. Let X be a nonnegative integer-valued random variable. Convince yourself that
X = Σ_{i=0}^{∞} i 1{X=i},
and
X = Σ_{i=1}^{∞} 1{X≥i}.
Conclude from the latter that
EX = Σ_{i=1}^{∞} P(X ≥ i).

18.3. Information theory. Let X be a discrete random variable taking the values {a1, . . . , an}, where P(X = ai) ≠ 0 for all i = 1, 2, . . . , n. The information function is another random variable associated with X, defined via:
I = I_X = − Σ_{i=1}^{n} log(P(X = ai)) 1{X=ai};
in other words, if X = ai, then I = − log(P(X = ai)). Notice that I ≥ 0. The log here is the usual natural logarithm; other popular choices are to use log₂, especially in computer science.
The information function is a measure of how much information you gain, or how surprised you are, when you see the outcome of X. For example,


if X is a Bernoulli random variable with parameter 0.99, we have that if X = 1, then I = − log(0.99) ≈ 0, but if X = 0, then I = − log(0.01), which is a large number.
The entropy of X is denoted by H(X) and is given by H(X) = EI_X; thus we can think of entropy as the expected information given by a random variable. Entropy plays an important role in coding theory, and we will touch upon some of its basic properties in the next exercises.

Exercise 18.10. Let X be a discrete random variable taking values in {a1, . . . , an}, where P(X = ai) ≠ 0 for all i = 1, . . . , n. Show that
H(X) = − Σ_{i=1}^{n} P(X = ai) log(P(X = ai)).
(Hint: take the expectation of I_X and use the linearity of expectation, and remember what the expectation of an indicator is.)

Solution. Recall that E1_A = P(A). By the linearity of expectation, we have
H(X) = EI_X = − Σ_{i=1}^{n} log(P(X = ai)) E(1{X=ai}) = − Σ_{i=1}^{n} log(P(X = ai)) P(X = ai).

Exercise 18.11. Let X be a Bernoulli random variable with parameter p ∈ (0, 1). Compute H(X). Find the value p which maximizes H(X). (Hint: After you find a formula for H(X) that depends on p, you might need to use calculus.)

Solution. Let f(p) = −p log p − (1 − p) log(1 − p). Clearly, H(X) = f(p). Note that technically f is only defined for p ∈ (0, 1). However, lim_{p→0} f(p) = lim_{p→1} f(p) = 0, so it makes sense to set f(0) = f(1) = 0. We want to find the p which maximizes f(p) on [0, 1]. We find the derivative of f and look for critical points; that is, points c for which f′(c) = 0. Then we need to check which of those maximizes f on the interval [0, 1]. We have that
f′(p) = −log p − 1 + log(1 − p) + 1 = log(1 − p) − log(p).


Thus we are left to solve for p in log(1 − p) = log(p), or 1 − p = p, from which we deduce that p = 1/2. Since f is positive on (0, 1), f(1) = f(0) = 0, and p = 1/2 is the only critical point, f(1/2) is the maximum value of f on [0, 1]. (We can also argue using the first derivative test.)

18.3.1. Joint entropy. If X and Y are discrete random variables with joint pmf f, and marginals fX and fY, then we also define their joint entropy to be given by
H(X, Y) = − Σ_x Σ_y f(x, y) log(f(x, y));
(in this notation we take 0 log(0) = 0).

Exercise 18.12. Show that if X and Y are independent, then H(X, Y) = H(X) + H(Y).

For a fixed value of x such that fX(x) > 0, we define
H(Y | X = x) = − Σ_y f_{Y|X}(y|x) log(f_{Y|X}(y|x)),
and the conditional entropy of Y given X to be
H(Y | X) = Σ_x fX(x) H(Y | X = x).

Exercise 18.13. Show that
H(Y | X) = − Σ_x Σ_y f(x, y) log(f_{Y|X}(y|x)).

Exercise 18.14. Show that H(X, Y) = H(X) + H(Y | X) and conclude that if X and Y are independent, then H(Y | X) = H(Y).

19. Chapter 4.2, 4.3 (I)

19.1. Variance. The variance of a random variable X is defined via:
Var(X) = E((X − EX)²).
We often write Var(X) = σ², and we say that √Var(X) = σ is the standard deviation of the random variable X.
The law of large numbers relates the variance to the sample variance in the following way. If (Xi)_{i∈N} is a sequence of independent random


variables, all with the same distribution with EX1 = µ and Var(X1) = σ², then
lim_{n→∞} (1/n) Σ_{i=1}^{n} (Xi − µ)² = E(X1 − µ)² = σ².
(We cheated slightly here since we have the difference of Xi with µ, instead of with the sample mean. We will do a more detailed calculation later.)
For discrete and continuous random variables, using the law of the unconscious statistician with the function g(x) = (x − µ)², we can obtain formulas for computing the variance. For a discrete random variable X we have
Var(X) = Σ_x (x − µ)² P(X = x),
and for a continuous random variable X with pdf f, we have
Var(X) = ∫_{−∞}^{∞} (x − µ)² f(x) dx.

Example done in class 19.1 (Variance of a Bernoulli random variable). Let X be a random variable with P(X = 1) = p and P(X = 0) = 1 − p. Find Var(X).

Solution. We know that EX = p. Using the above formula we have:
Var(X) = (0 − p)²(1 − p) + (1 − p)²p = (1 − p)(p² + (1 − p)p) = p(1 − p).

Exercise 19.2 (Variance of standard normal). Let X be a continuous random variable with pdf given by
f(x) = (1/√(2π)) e^{−x²/2}.
Find the variance of X.

Solution. It is easy to see that EX = 0, thus Var(X) = E(X²). Again we will need to integrate by parts to compute E(X²). Recall that:
∫_a^b u(x)v′(x) dx = [u(x)v(x)]_a^b − ∫_a^b u′(x)v(x) dx.
Choose u(x) = x and v′(x) = xf(x), so that v(x) = −f(x). Integration by parts gives:
∫_{−∞}^{∞} x² f(x) dx = [−x (1/√(2π)) e^{−x²/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx.
The second term equals 1 and the first term is zero, since it follows from l'Hôpital's rule that the exponential goes to zero faster than any polynomial goes to infinity.


Example done in class 19.3 (Short-cut formula). Show that Var(X) = E(X²) − (EX)².

Solution. Var(X) = E(X − EX)² = E(X² − 2XEX + (EX)²). Applying the linearity of expectation, we have
Var(X) = E(X²) − 2(EX)² + (EX)² = E(X²) − (EX)².

Exercise 19.4. Let X be a Poisson random variable with parameter λ > 0, so that
P(X = k) = e^{−λ} λ^k / k!
for k = 0, 1, 2, . . .. Show that EX = λ = Var(X).

Solution. First we show that EX = λ:
EX = Σ_{k=0}^{∞} k P(X = k) = e^{−λ} Σ_{k=0}^{∞} k λ^k/k! = e^{−λ} λ Σ_{k=1}^{∞} λ^{k−1}/(k − 1)! = λe^{−λ} Σ_{k=0}^{∞} λ^k/k! = λ.
With the short-cut formula, it remains to compute E(X²):
E(X²) = e^{−λ} Σ_{k=0}^{∞} k² λ^k/k!
 = e^{−λ} λ Σ_{k=1}^{∞} k λ^{k−1}/(k − 1)!
 = e^{−λ} λ Σ_{k=0}^{∞} (k + 1) λ^k/k!
 = e^{−λ} λ ( Σ_{k=0}^{∞} k λ^k/k! + Σ_{k=0}^{∞} λ^k/k! )
 = λ(EX + 1) = λ² + λ.
Thus Var(X) = λ² + λ − λ² = λ.

Exercise 19.5. Let λ > 0. Let X be an exponential random variable with parameter 1/λ. Find Var(X).

Solution. By the short-cut formula, we have that
Var(X) = E(X²) − (EX)².


By Exercises 16.7 and 18.2, we have that
Var(X) = 2/λ² − 1/λ² = 1/λ².
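Both variance computations can be checked by simulation. The Python sketch below is only an illustration; the samplers, the seed, and the choice λ = 3 are arbitrary and not part of the notes.

import math, random

rng = random.Random(4)
lam, n = 3.0, 300_000

def poisson(lam):
    # inverse-transform sampler for a Poisson(lam) variable
    u, k, p, c = rng.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > c:
        k += 1; p *= lam / k; c += p
    return k

pois = [poisson(lam) for _ in range(n)]
expo = [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print("Poisson:", round(var(pois), 3), "vs", lam)
print("Exponential:", round(var(expo), 4), "vs", round(1 / lam**2, 4))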

20. Test 2

Q1: a) 0.43478, b) 0.42, c) 1.8; Q2: a) 4, c) 1, d) 0, e) 0.05859, f) 4/5; Q3: a) 1/5, b) 1/5[g(10) − g(5)]; Q4: a) 0.34956, b) 0.04637; Q5: a) 2, b) FX(4) − FX(3).

21. Chapter 4.2, 4.3 (II)

21.1. Sample variance and variance. Let (Xi)_{i=1}^{∞} be a sequence of independent random variables such that EXi = EX1 and EXi² = EX1² for all i ∈ N. Let
X̄n = (1/n) Σ_{i=1}^{n} Xi.
Notice by the short-cut formula, Exercise 1.3, we have that
(1/n) Σ_{i=1}^{n} (Xi − X̄n)² = (1/n) Σ_{i=1}^{n} Xi² − ( (1/n) Σ_{i=1}^{n} Xi )².
Taking a limit as n → ∞, by the law of large numbers, we have, on an event of probability one, that
lim_{n→∞} (1/n) Σ_{i=1}^{n} (Xi − X̄n)² = EX1² − (EX1)² = Var(X1),
by the other short-cut formula, Exercise 19.3.

21.2. Variance for sums of random variables.

Example done in class 21.1. Let X be a random variable and a, b ∈ R. Find Var(aX) and Var(X + b).

Solution. We have
Var(aX) = E(aX − E(aX))² = E(aX − aEX)² = a²E(X − EX)².
Thus Var(aX) = a²Var(X). We have
Var(X + b) = E((X + b) − E(X + b))² = E(X + b − EX − b)² = E(X − EX)².
Thus Var(X + b) = Var(X).


Example done in class 21.2. Recall that if X and Y are independent random variables, then E(XY) = (EX)(EY). Use this fact to show that if X and Y are independent random variables, then Var(X + Y) = Var(X) + Var(Y).

Solution.
Var(X + Y) = E(X + Y)² − (E(X + Y))²
 = EX² + 2E(XY) + EY² − ((EX) + (EY))²
 = 2E(XY) + EX² + EY² − 2(EX)(EY) − (EX)² − (EY)²
 = 2(EX)(EY) + EX² + EY² − 2(EX)(EY) − (EX)² − (EY)²
 = E(X²) − (EX)² + E(Y²) − (EY)²
 = Var(X) + Var(Y).
In the first equation, we used the short-cut formula. In the second equation, we used the linearity of expectation (E(aX + bY) = aEX + bEY). In the fourth equation, we used the independence of X and Y. Finally, in the last equation, we used the short-cut formula again.

Example done in class 21.3. Let a, b ∈ R. Show that if X and Y are independent random variables, then aX and bY are also independent random variables. (See also Section 13.2.)

Example done in class 21.4. Let a, b ∈ R. Let X and Y be independent random variables. Show that Var(aX + bY) = a²Var(X) + b²Var(Y); thus in particular, Var(X − Y) = Var(X) + Var(Y).

Example done in class 21.5. Let X and Y be random variables that are not necessarily independent. Show that Var(X + Y) = Var(X) + Var(Y) + 2E(XY) − 2(EX)(EY).

Example done in class 21.6 (Variance for Binomial random variables). Let (Xi)_{i=1}^{n} be independent Bernoulli random variables with parameter p. Then X = Σ_{i=1}^{n} Xi is a Binomial random variable with parameter (n, p). Find the variance of X.

Solution. By Exercise 19.1, we have that Var(Xi) = p(1 − p). Since X is a sum of independent random variables, we know that
Var(X) = Var(Σ_{i=1}^{n} Xi) = np(1 − p).

Exercise 21.7. Let X1, X2, X3, . . . be independent random variables with the same distribution; that is, all the Xi have the same cdf. Let EX1 = µ and Var(X1) = σ² > 0. Let Sn = Σ_{i=1}^{n} Xi. Find (formulas,


in terms of n, µ, and σ 2 for) the mean the variance of Sn ; determine the mean and variance (by applying the formula) for the specific case n = 25, µ = 3 , and σ 2 = 7. Solution. The linearity of expectation gives, ESn = nEX1 = nµ, and the independence of the Xi gives, Var(Sn ) = n Var(X1 ) = nσ 2 . In the specific case n = 25, µ = 3 , and σ 2 = 7, we obtain that ESn = 25(3) = 75, and Var(Sn ) = 25(7) = 175. Exercise 21.8 (Standard Version). Let X be a random variable. Sometimes it makes things nice to standardize random variables by defining: X − EX Z=p . Var(X) Check that Z has mean 0 and unit variance; that is, Var(Z) = 1. (Assume that 0 < Var(X) < ∞.) Solution. Let EX = µ and V ar(X) = σ 2 , to bring out the fact that they are just constants. X − µ 1 1 = E(X − µ) = (EX − µ) = 0. E σ σ σ X − µ 1 1 Var = 2 Var(X − µ) = 2 Var(X) = 1. σ σ σ Example done in class 21.9. Let c ∈ R. Suppose that X is a random variable such that P(X = c) = 1. Show that Var(X) = 0. (In fact, the other direction is also true. That is, if Var(X) = 0, then there exists c ∈ R, such that P(X = c) = 1; moreover c = EX. 21.3. Covariance. Given two random variables X and Y , we define the covariance of X and Y to be given by   Cov(X, Y ) = E (X − EX)(Y − EY ) . Whenever Var(X), Var(Y ) > 0, the correlation coefficient is given by Cov(X, Y ) ρ(X, Y ) = p . Var(X)Var(Y ) Exercise 21.10. Let Y = aX + b. Find ρ(X, Y ). Assume a 6= 0, and Var(X) 6= 0. (You may have to take cases depending on whether a is negative or positive.)


Solution.   Cov(X, Y ) = E (X − EX)(Y − EY )   = E (X − EX)(aX + b − E(aX + b)   = E (X − EX)(aX − aEX)   = aE (X − EX)(X − EX) = aVar(X). 2 We also √ have that Var(Y ) = Var(aX + b) =a Var(aX) = a Var(X). Since, a2 = |a|, we have that ρ(X, Y ) = |a| . Thus if a > 0, then ρ(X, Y ) = 1, and if a < 0, then ρ(X, Y ) = −1.

Exercise 21.11. Prove the following short-cut formula. Show that Cov(X, Y ) = E(XY ) − (EX)(EY ). Exercise 21.12. Suppose that (X, Y ) are standard normal random variables with joint density function f given by the standard bivariate normal distribution with parameter ρ ∈ (−1, 1) so that   1 1 2 2 f (x, y) = p (x − 2ρxy + y ) . exp − 2(1 − ρ2 ) 2π 1 − ρ2 Find the covariance of X and Y . Hint use the law of the unconscious statistician with the function g(x, y) = xy. Chapter 4.4–Postponed, please read 21.4. Chebyshev’s inequality. Let X be a random variable with EX 2 < ∞. Let a ≥ 0. Chebyshev’s inequality states that a2 P(|X| > a) ≤ EX 2 . To see this consider the random variable Y defined by Y = a1{|X|>a} . Notice that Y ≤ |X|, since Y = 0 when |X| ≤ a and Y simply takes the value a when |X| > a. Since Y ≤ |X|, and Y is nonnegative, we have that Y 2 ≤ X 2 . Taking expectations on both sides, we obtain the required result. Actually, we cheated a bit, we need to the following result to show that if X ≤ Y , then EX ≤ EY : Exercise 21.13. Show that if Z is real-valued random variable (discrete or continuous) such that P(Z ≥ 0) = 1, then EZ ≥ 0.


Exercise 21.14. Use Chebyshev's theorem to obtain the version in the textbook. Let X be a real-valued random variable with EX = µ and Var(X) = σ². Show that for any k > 0, we have
P(|X − µ| ≤ kσ) ≥ 1 − 1/k².

Exercise 21.15 (Convergence in probability). Let Xn be a sequence of random variables such that for any ε > 0 we have that
lim_{n→∞} P(|Xn| > ε) = 0;
in this case we say that Xn converges to 0 in probability. Show that if
lim_{n→∞} EXn² = 0,
then Xn converges to 0 in probability.

Solution. Let ε > 0. By Chebyshev's theorem, we have that
0 ≤ ε² P(|Xn| > ε) ≤ EXn².
Taking limits on both sides, we obtain that lim_{n→∞} P(|Xn| > ε) = 0, as required.

21.5. Weak law of large numbers. Using Chebyshev's inequality and Exercise 21.15 we can prove a version of the weak law of large numbers.

Exercise 21.16. Let Xi be a sequence of independent random variables, all with the same mean EX1 = 0 and bounded variance Var(Xi) ≤ C < ∞ (for some C). Let Sn = X1 + · · · + Xn. Show that
lim_{n→∞} E(Sn/n)² = 0
and that Sn/n converges to 0 in probability.

Solution. Observe that ESn = 0, so that Var(Sn) = ESn². By using Exercise 21.2, we have that
0 ≤ E(Sn/n)² = (1/n²) ESn² = (1/n²) Var(Sn) = (1/n²) Σ_{i=1}^{n} Var(Xi) ≤ (1/n²) Σ_{i=1}^{n} C = C/n.
Taking limits on both sides we obtain the first required result, and applying Exercise 21.15, we obtain the second result.
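Chebyshev's bound from Exercise 21.14 can be compared with the actual probabilities for a concrete distribution. The Python sketch below uses Exp(1) samples, for which µ = σ = 1; the distribution, the seed, and the values of k are arbitrary choices for illustration.

import math, random

rng = random.Random(5)
n = 200_000
xs = [-math.log(1.0 - rng.random()) for _ in range(n)]   # Exp(1): mu = sigma = 1
mu = sigma = 1.0
for k in [1.5, 2.0, 3.0]:
    p = sum(1 for x in xs if abs(x - mu) <= k * sigma) / n
    print(k, "observed:", round(p, 4), "Chebyshev lower bound:", round(1 - 1 / k**2, 4))

The observed probabilities sit above the Chebyshev lower bound, as the inequality guarantees.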


22. Chapter 5.1, 5.2, 5.6

22.1. Some named discrete random variables. We collect here some of the discrete random variables that we have discussed so far.

22.1.1. Bernoulli random variables. Let p ∈ (0, 1). We say that X ∼ Bern(p) if P(X = 1) = p and P(X = 0) = 1 − p. We have computed that EX = p and Var(X) = p(1 − p).

22.1.2. Binomial random variables. Let n ≥ 0. Let p ∈ (0, 1). We say that X ∼ Bin(n, p) if for all 0 ≤ k ≤ n, we have
P(X = k) = (n choose k) p^k (1 − p)^{n−k}.
Note that if (Xi)_{i=1}^{n} are independent Bernoulli random variables with parameter p, then
Σ_{i=1}^{n} Xi ∼ Bin(n, p).
Using this, we have computed that EX = np and Var(X) = np(1 − p).

22.1.3. Using your calculator to help compute binomial probabilities. Your calculator has the pmf of binomial random variables built into it. For example, if Y ∼ Bin(8, 0.345) and we want to know P(Y = 2), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option A,
(iv) enter: binompdf(8,0.345,2)
(v) which gives 0.2631746302
More importantly, your calculator allows you to compute the cdf of a Binomial random variable. Let X ∼ Bin(23, 0.345). For example, it would be a very tedious task to compute
P(X ≤ 8) = P(X = 0) + P(X = 1) + · · · + P(X = 8).
Your book has tables of the cdf values of Binomial random variables, but only for certain values of p. Thirty years ago, we might learn to use these tables, but we will instead opt to use the calculator. For example, to compute P(X ≤ 8), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option B,
(iv) enter: binomcdf(23,0.345,8)
(v) which gives 0.6058681469
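If you do not have the calculator at hand, the same pmf and cdf values can be reproduced in a few lines of Python. The function names below simply mimic the calculator commands used here and in the Poisson section that follows; they are written from scratch for this illustration and are not part of any standard library.

import math

def binompdf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    return sum(binompdf(n, p, j) for j in range(k + 1))

def poissonpdf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def poissoncdf(lam, k):
    return sum(poissonpdf(lam, j) for j in range(k + 1))

print(binompdf(8, 0.345, 2))    # about 0.2631746
print(binomcdf(23, 0.345, 8))   # about 0.6058681
print(poissonpdf(5, 4))         # about 0.1754674
print(poissoncdf(5, 4))         # about 0.4404933

The printed values match the calculator outputs quoted in the text.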


Exercise 22.1. Let X ∼ Bin(15, 0.3457).
(a) Find P(X = 7).
(b) Find P(X ≤ 7).
(c) Find P(X < 7).
(d) Find P(X > 8).
(e) Find P(2 ≤ X ≤ 5).

Solution. The calculator gives:
(a) P(X = 7) = 0.12754386.
(b) P(X ≤ 7) = 0.8936125.
(c) P(X < 7) = P(X ≤ 7) − P(X = 7) = 0.7661.
(d) P(X > 8) = 1 − P(X ≤ 8) = 1 − 0.9610004418 = 0.0389995582.
(e) P(2 ≤ X ≤ 5) = P(X ≤ 5) − P(X < 2) = P(X ≤ 5) − P(X ≤ 1) = 0.5783131724 − 0.0153912842 = 0.5629218882.

22.2. Poisson random variables. Let λ > 0. We say that X is a Poisson random variable with parameter λ > 0 and write X ∼ Poi(λ) if
P(X = k) = e^{−λ} λ^k / k!
for k = 0, 1, 2, . . .. We showed that EX = λ = Var(X).

22.2.1. Using your calculator to compute Poisson probabilities. Your calculator also has the Poisson pmf and cdf built in. For example, let X ∼ Poi(5), so that X is a Poisson random variable with mean 5. To compute P(X = 4), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option C,
(iv) enter: poissonpdf(5,4)
(v) which gives 0.1754673698
To compute P(X ≤ 4), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option D,
(iv) enter: poissoncdf(5,4)
(v) which gives 0.4404932851

Exercise 22.2. Let X ∼ Poi(7).
(a) Find P(X = 7).
(b) Find P(X ≤ 7).
(c) Find P(X < 7).
(d) Find P(X > 8).


(e) Find P(2 ≤ X ≤ 5).

Solution. The calculator gives:
(a) P(X = 7) = 0.14900.
(b) P(X ≤ 7) = 0.59871.
(c) P(X < 7) = P(X ≤ 7) − P(X = 7) = 0.449711.
(d) P(X > 8) = 1 − P(X ≤ 8) = 1 − 0.7290912682 = 0.2709087318.
(e) P(2 ≤ X ≤ 5) = P(X ≤ 5) − P(X ≤ 1) = 0.3007082762 − 0.0072950557 = 0.2934132204.

22.3. Poisson process on a line. Let N(t) be the number of arrivals of some random process up to time t ≥ 0, where N(0) = 0 and N(s) ≤ N(t) for all s ≤ t. Think of people arriving at an ice cream shop. We say that N is a Poisson process (on the positive real line) of intensity λ > 0 if the following conditions are satisfied:
(i) (Stationarity: the number of arrivals in an interval of time depends only on the length of the interval.) For all t > s we have that N(t) − N(s) =_d N(t − s); that is, P([N(t) − N(s)] = k) = P(N(t − s) = k) for all k = 0, 1, 2, . . .. Thus the distribution of N(t) − N(s) only depends on t − s.
(ii) (Independent increments.) For all t1 < t2 < · · · < tn, the random variables [N(tn) − N(tn−1)], [N(tn−1) − N(tn−2)], . . . , [N(t2) − N(t1)] are independent.
(iii) (Orderliness: two customers do not arrive at the same time.)
lim_{h→0} (1/h) P(N(h) ≥ 2) = 0.
(iv) (In a small interval of time, the probability that a customer arrives is proportional to λ.)
lim_{h→0} (1/h) P(N(h) = 1) = λ > 0.
When these conditions hold, we have that
P(N(t) = k) = e^{−λt} (λt)^k / k!
for all t > 0 and all k = 0, 1, . . .; in other words, for any fixed t, we have that N(t) is a Poisson random variable with parameter λt. Thus EN(t) = λt. Usually λ has units like arrivals per time.
To see why N(t) has anything to do with a Poisson random variable, let t > 0, and partition the interval [0, t] into n intervals of size t/n, where n is large. By condition (iii) and condition (i) we can basically assume that in each interval there is at most one arrival. Let p =


λ(t/n). By conditions (ii) and (iv), we have that probability that there are k arrives is given by   n k n! (λt)k  λt n−k n−k P(N (t) = k) = p (1 − p) = 1− + g(n), k (n − k)!k! nk n where g(n) → 0 as n → ∞. Using the fact that  1 n lim 1 + = e, n→∞ n and n! lim = 1, n→∞ (n − k)!nk we obtain that   k (λt)k  λt n−k n k n! −λt (λt) 1− +g(n) = e . lim p (1−p)n−k = n→∞ k (n − k)!k! nk n k! Exercise 22.3. Dr. Zed stays at school from 9:00 AM to 5:00 PM. Dr. Zed’s office hours are from 9:00 AM to 10:00 AM. The number of students visiting Dr. Zed can be modelled by using two independent Poisson processes. During office hours, the number of students visiting can be modelled by a Poisson process of rate 5 per hour. Outside of office hours, the number of students visiting can be modelled using a Poisson process of rate 1/2 per hour. (a) What is the probability that no students show up during office hours? (R) (b) What is the probability that exactly 2 students show up during a 45 minute period during office hours? (R) (c) What is the probability that no students show up during the entire day at school? Hint: Let A be the event that no student shows up during the office hours and let B be the event that no students show up outside of office hours. You want to compute P(A ∩ B). (d) What is the probability that exactly one student shows up during the entire day at school? Solution. There is one hour of office hours and 7 hours of non-office hours. Let N1 and N2 be the number of students that come during office hours and non-office hours respectively. Then Let N1 ∼ P oi(5(1)) and N2 ∼ P oi((1/2)(7)). Assume that N1 and N2 are independent. (a) We need to compute P(N1 = 0) = e−5 50 /0! ≈ 0.006738. (b) Let N3 ∼ P oi((45/60)(5)). We need to compute P(N3 = 2) ≈ e−3.75 (3.75)2 /2! ≈ 0.165359.


(c) We need to compute P(N1 = 0, N2 = 0) = P(N1 = 0)P(N2 = 0) = e−5 50 /0! × e−7/2 (7/2)0 /0! ≈ 0.03694. (d) We need to compute P(N1 = 1, N2 = 0) + P(N1 = 0, N2 = 1) = e−5 51 /1! × e−7/2 (7/2)0 /0! + e−5 50 /0! × e−7/2 (7/2)1 /1! ≈ 0.001729. Exercise 22.4. Suppose domestic and international passengers arrive at a security check point independently of each other at rate of 10 passengers per hour and 3 passengers per hour, respectively. (a) What is the probability that within a 15 minute interval, exactly 2 domestic and 3 international passengers arrive the check point? (b) What is the probability within an hour interval exactly 10 passengers (regardless of type) arrive at the checkpoint? (Hint: Use Exercise 13.9) (c) What is the probability that with an hour interval between five and ten passengers, inclusive, arrive at the checkpoint? Exercise 22.5. Let N (x) be a Poisson process of intensity λ > 0. Show that P(N (x) = 1 | N (1) = 1) = x for all x ∈ [0, 1]. Thus if you know that there is one arrival in [0, 1], then the time of that arrival is uniformly distributed in [0, 1]. 22.4. Poisson processes in higher dimensions. Let N (A) be the number of items (of some random process) in a set A area (or volume) L(A) > 0, where N (A) = 0 if L(A) = 0 and N (A) ≤ N (B) if A ⊆ B. (Think of the number of trees in some patch of the forest or stars in some region of outer space.) We say that N is a (spatial) Poisson process of intensity λ on a set D, if for all A ⊆ D (λL(A))k (1) k! for all k = 0, 1, 2, . . . ; in other words N (A) is a Poisson random variable with parameter λL(A). Thus EN (A) = λL(A). Usually λ has units like items per area or items per volume. Similar to the case of one dimension, the following conditions on N motivate definition (1): P(N (A) = k) = e−λL(A)

d

(i) (Stationarity.) For all A, B ⊆ S, if L(A) = L(B), then N (A) = N (B); that is, P(N (A) = k) = P(N (B) = k) for all k = 0, 1, 2, . . . . Thus the distribution of N (A) only depends on the area (or volume) of A. (ii) (Independence) If A1 , . . . , An ⊆ D are disjoint sets, then the random variables N (A1 ), . . . N (An ) are independent.


(iii) If D ⊇ A1 ⊇ A2 ⊇ A3 , . . . is a sequence of sets such that L(An ) → 0 as n → ∞, then 1 lim P(N (An )) ≥ 2) = 0 n→∞ L(An ) and (iv) 1 P(N (An ) = 1) = λ > 0. lim n→∞ L(An ) Exercise 22.6. Suppose the number of trees in a forest can be modelled as a Poisson process, and on average there are 25 trees per square mile. (a) In a 2 square mile plot of land, what is the probability that there are 40 trees? (b) In a 2 square mile plot of land, what is the probability that there are (strictly) less than 40 trees? Solution. Let X be the number of trees in a 2 square mile plot of land. We have that X ∼ P oi(2(25)). (a) We compute P(X = 40) = e−50 5040 /40! ≈ 0.021499. (b) We are asked to find P(X < 40). Since X is a discrete integervalued random variable, we P(X < 40) = P(X ≤ 39). Our calculator easily gives that P(X ≤ 39) ≈ 0.0645703689. 23. Chapter 6.1, 6.6, 6.2, 6.3 23.1. Some named continuous random variables. We collect here some of the continuous random variables that we have discussed so far. 23.1.1. Uniform random variables. Let a < b. We say that X is uniformly distributed in the interval [a, b] (and write X ∼ U nif [a, b] or X ∼ U [a, b]) if it has the pdf given by ( 1 if x ∈ [a, b], f (x) = b−a 0 otherwise. Exercise 23.1. Find the mean and variance of random variable that is uniformly distributed in the interval [a, b]. Solution. Let X ∼ U nif [a, b], and let f be the pdf for X. Recall from high-school that b2 − a2 = (b − a)(b + a), and b3 − a3 = (b − a)(b2 + ab + a2 ).


We have that
EX = (1/(b − a)) ∫_a^b x dx = [x²/(2(b − a))]_a^b = (b + a)/2,
and
EX² = (1/(b − a)) ∫_a^b x² dx = [x³/(3(b − a))]_a^b = (b² + ab + a²)/3.
Hence by the short-cut formula we have that
Var(X) = EX² − (EX)² = (b² − 2ab + a²)/12 = (b − a)²/12.

Exercise 23.2. Let U1 and U2 be independent random variables that are uniformly distributed in [0, 1]. Let V = U1 + U2. Let W be uniformly distributed in [0, 2].
(a) Compute the variance of V.
(b) Compute the variance of W.
(c) Find the pdf for V. (Hint: use Exercise 15.3.)

Solution. Note that we know from Exercise 23.1 that Var(U1) = Var(U2) = 1/12.
(a) Since U1 and U2 are independent, we have that Var(V) = Var(U1) + Var(U2) = 1/6.
(b) Again, from Exercise 23.1, we have that Var(W) = (2 − 0)²/12 = 1/3.
(c) Let f be the pdf for U1; thus it is also the pdf for U2. Let x ∈ [0, 1]. Let z ∈ R. Note that f(z − x) = 1 if z − x ∈ [0, 1] and 0 otherwise. By Exercise 15.3, we have that
fV(z) = ∫_0^1 f(z − x) dx.
Thus if z ∈ [0, 1], then
fV(z) = ∫_0^z 1 dx = z,
and if z ∈ [1, 2], then
fV(z) = ∫_{z−1}^1 1 dx = 2 − z.

23.1.2. Exponential random variables. Let λ > 0. Recall that X is an exponential random variable with parameter 1/λ if it has pdf given by
f(x) = λe^{−λx} if x ∈ [0, ∞), and f(x) = 0 otherwise.


We know that an exponential random variable with parameter 1/λ has mean 1/λ and variance 1/λ2 . We also know that if X is an exponential random variable with parameter 1/λ, then P(X > x) = e−λx for all x ≥ 0. Exponential random variables are connected to Poisson processes in the following way. Suppose N is a Poisson process on positive real line with intensity λ. Consider T the time of the first arrival; that is, the smallest time t such that P(N (t) = 1). We have that P(T > t) = P(N (t) = 0) = e−λt . Thus T is an exponential random variable with parameter 1/λ. Another important property of the exponential random variable is the memoryless property, which states that for t, s ≥ 0, we have  P {X > t + s} ∩ {X > s} P(X > t + s | X > s) = P(X > s) P(X > t + s) = P(X > s) −λ(t+s) e = e−λs −λt = e = P(X > t). Exercise 23.3. Suppose that the lifetime, in years, of a lightbulb is given by an exponential random variable with mean 5 years. (a) What is the probability that the light bulb last longer than 5 years? (b) Suppose I used the light bulb for 5 years, what is the probability that it will last another 5 years? Exercise 23.4. Suppose type-A light bulbs have a lifetime, in years, that can be modelled by exponential random variables with mean 5, and that type-B light bulbs can similarly be modelled by exponential random variables with mean 7. (a) Suppose that a room is lit by one type-A and one type-B light bulb, which operate independently of each other. What is the probability that the room will stay lit for at least 10 years? (Hint: you may need to use the inclusion-exclusion formula. Exercise 13.6 may also be helpful.) (H) (b) What is the expected time the room will stay lit? Exercise 23.5. Let X and Y be independent exponential random variables with mean 1. Find the pdf for W = X + Y .
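The memoryless property can be seen in a simulation: among simulated lifetimes that exceed 5 years, the fraction that exceed 10 years is again about P(X > 5). The following Python sketch is only an illustration; the sampler and seed are made up, and the mean of 5 years matches the lightbulb of Exercise 23.3.

import math, random

rng = random.Random(6)
mean = 5.0   # lifetime in years, as in Exercise 23.3
xs = [-math.log(1.0 - rng.random()) * mean for _ in range(500_000)]

p_survive_5 = sum(1 for x in xs if x > 5) / len(xs)
survivors = [x for x in xs if x > 5]
p_another_5 = sum(1 for x in survivors if x > 10) / len(survivors)
print(round(p_survive_5, 4), round(p_another_5, 4), "both close to", round(math.exp(-1), 4))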


23.2. Normal random variables. Let µ ∈ R and σ > 0. We say that X is a normal random variable with mean µ and variance σ² and write X ∼ N(µ, σ²) if it has pdf given by
n(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.
We know that if Z ∼ N(0, 1), and if X = σZ + µ, then X ∼ N(µ, σ²). We also know that EZ = 0 and Var(Z) = 1, from which we can deduce that EX = µ and Var(X) = σ². Similarly, if X ∼ N(µ, σ²) and Z = (X − µ)/σ, then Z ∼ N(0, 1).

23.2.1. Using the tables. Let Z ∼ N(0, 1) be a standard normal random variable. Let f : R → [0, ∞) be the pdf for Z; thus
f(x) = (1/√(2π)) exp(−x²/2).

However, f does not have an elementary anti-derivative. Tables of values of Φ(z) are contained in Table A.3 of your book. In your book contains the values of Φ(z) for z ∈ [−3.49, 3.49], where the value of z is given to two decimal places, and the value of Φ(z) is given to four decimal places. From the table we can read off that Φ(−1.1) ≈ 0.1357, Φ(−1.17) ≈ 0.1210, Φ(0) = 0.5, Φ(0.01) ≈ 0.5040, Φ(0.1) ≈ 0.5398, and Φ(3.49) ≈ 0.9998. If we want to compute the value of Φ(0.435) we can take an linear of interpolation of the values Φ(0.43) and Φ(0.44) via: Φ(0.435) ≈ [Φ(0.43) + Φ(0.44)]/2. Exercise 23.6. Let Z be a standard normal variable. (a) (b) (c) (d)

Find Find Find Find

P(Z ≤ 2.343). P(Z < 2.343). P(1 ≤ Z ≤ 2). P(Z > 0.3).

Solution. (a) P(Z ≤ 2.343) ≈ Φ(2.34) ≈ 0.9904 (b) Since Z is a continuous random variable, we have that P(Z ≤ 2.343) = P(Z < 2.343).


(c) Observe that P(1 ≤ Z ≤ 2) = P(Z ≤ 2) − P(Z < 1) = P(Z ≤ 2) − P(Z ≤ 1) = Φ(2) − Φ(1) ≈ 0.9772 − 0.8413 = 0.1359. (d) P(Z > 0.3) = 1 − P(Z ≤ 0.3) = 1 − Φ(0.3) ≈ 1 − 0.6179 = 0.3821. Some tables actually only gives the values of Φ(z) for z ≥ 0. The symmetries of the normal distribution can be used to deduce the other values. Let Z ∼ N (0, 1). For z ≥ 0, we have that P(Z ≥ −z) = P(Z ≤ z), and P(Z ≤ −z) = Φ(−z) = 1 − Φ(z) = 1 − P(Z ≤ z); Similarly, we have that P(−z ≤ Z ≤ z) = P(Z ≤ z)−P(Z ≤ −z) = 2P(Z ≤ z)−1 = 2Φ(z)−1. Exercise 23.7. Let Z ∼ N (0, 1). (a) Find z ∈ R so that P(Z ≤ z) = 0.567 (b) Find z ≥ 0, so that P(−z ≤ Z ≤ z) = 0.5. Solution. (a) We see that Φ(0.16) ≈ 0.5636 and Φ(0.17) ≈ 0.5675. So by taking z = 0.17, we have that P(Z ≤ z) ≈ 0.567. (b) We know that P(−z ≤ Z ≤ z) = 2Φ(z) − 1. Thus we want to solve for z in 2Φ(z) − 1 = 0.5, which leads to the equation Φ(z) = 0.75. We find on the tables that Φ(0.67) = 0.7486 and Φ(0.68) = 0.7517. So we have that P(−0.675 ≤ Z ≤ 0.675) ≈ 0.5. 24. Chapter 6.3, 6.4 24.1. Critical values and z-alpha notation. Let Z ∼ N (0, 1). Let α ∈ (0, 1). The number zα is the number such that P(Z ≥ zα ) = α. The numbers zα are sometimes called z-critical values. Exercise 24.1. (a) Find zα for α = 0.1, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005 (b) Show that P(−zα/2 ≤ Z ≤ zα/2 ) = 1 − α. Solution.


(a) Note that α = P(Z ≥ zα ) = 1 − Φ(zα ). Thus we want to find zα , so that Φ(zα ) = 1 − α. For α = 0.1, we find that Φ(1.28) ≈ 0.8897 and Φ(1.29) ≈ 0.9015. So we take z0.1 ≈ 1.28. For α = 0.05, we find that Φ(1.64) ≈ 0.9495 and Φ(1.65) ≈ 0.9505. So we take z0.05 ≈ 1.645. We similarly find the other values: z0.025 ≈ 1.96, z0.01 ≈ 2.33, z0.005 ≈ 2.58, z0.001 ≈ 3.08, and z0.0005 ≈ 3.27. (b) We have that P(−zα/2 ≤ Z ≤ zα/2 ) = 2P(Z ≤ zα/2 ) − 1 = 2(1 − α/2) − 1 = 1 − α. 24.2. Non-standard normals. Let µ ∈ R and σ > 0. Suppose X ∼ N (µ, σ 2 ). If we set Z=

X −µ , σ

the Z ∼ N (0, 1). This relation allows us to compute probabilities regarding X using the standard normal tables. Exercise 24.2. Let X ∼ N (0.5, 22 ). (a) Find P(X ≤ 2) (b) Find P(X ≥ −1) (c) Find P(−1 ≤ X ≤ 2). Solution. (a) Note that Z = (X − 0.5)/2 is a standard normal random variable. We have that P(X ≤ 2) = P(X − 0.5 ≤ 2 − 0.5)   = P (X − 0.5)/2 ≤ (2 − 0.5)/2 = P(Z ≤ 0.75). From the tables we have that P(Z ≤ 0.75) ≈ 0.7734. (b) Similarly, we find that P(X ≥ −1) = P(Z ≥ (−1 − .5)/2) = P(Z ≥ −0.75). We know that 1 − P(Z ≤ −0.75) = P(Z ≥ −0.75). From the tables, we have that P(Z ≥ −0.75) = P(Z ≤ 0.75) ≈ 0.7734. (c) We have that P(−1 ≤ X ≤ 2) = P(−0.75 ≤ Z ≤ 0.75) = 2P(Z ≤ 0.75) − 1 ≈ 0.5468


24.2.1. Using your calculator. Your calculator has a even more accurate and powerful normal table built in. For example, if X ∼ N (0.5, 22 ), your calculator can easily give the values of P(−1 ≤ X ≤ 2). All that needs to be done is the following: (i) Press the blue 2nd button, (ii) then press DISTR (VARS), (iii) then press 2 for normalcdf, (iv) enter: normalcdf(-1,2,0.5,2) (make sure you use the negative sign and not the minus) (v) which gives 0.5467454411 Note that if X ∼ N (µ, σ 2 ), then normalcdf (a, b, µ, σ) ≈ P(a ≤ X ≤ b). In order to (approximately) obtain P(X ≤ b), we can set a to be something like −9999. The calculator also has the ability to do inverse look-up. Suppose X ∼ N (1, 22 ), and we want to know the value a for which P(X ≤ a) = 0.65. (i) Press the blue 2nd button, (ii) then press DISTR (VARS), (iii) then press 3 for invNorm, (iv) enter: invNorm(0.65,1,2) (v) which gives 1.770640945 24.3. The class policy on the calculators on the normal tables. You are free to use to the calculator. However, as a general policy, in a question that contains normal random variables, part of showing your work means that you should be able to obtain your answer from using the standard normal tables provided in your book. You may use the calculator instead of the tables, and I encourage you to use the full power of your calculator to double-check your answers. However, as far as obtaining your final answer, you should pretend that the power of your calculator is limited to that of looking up standard normal values that are available in the tables; so that in presenting your final answers, you are only looking up the values Φ(z) for z ∈ R. For example, you are free to use normalcdf (−9999, z, 0, 1) ≈ Φ(z). Similarly, you are free to use the invNorm function. Please make it clear, in your work at the point you use the tables or your calculator. Exercise 24.3. Let X ∼ N (5, σ 2 ). If P(X ≤ 6 | X ≥ 5) = 0.383, then what is σ > 0?
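If you prefer not to rely on the calculator, Φ can be written in terms of the error function, and the inverse look-up can be done by bisection. The Python sketch below is only a rough stand-in: the function names mimic the calculator commands, and the bisection bracket of ten standard deviations is an arbitrary choice.

import math

def normalcdf(a, b, mu=0.0, sigma=1.0):
    # P(a <= X <= b) for X ~ N(mu, sigma^2), using Phi(z) = (1 + erf(z / sqrt(2))) / 2
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

def invNorm(area, mu=0.0, sigma=1.0):
    # find a with P(X <= a) = area, by bisection
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        if normalcdf(-1e9, mid, mu, sigma) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(normalcdf(-1, 2, 0.5, 2))   # about 0.5467, matching the calculator above
print(invNorm(0.65, 1, 2))        # about 1.7706, matching the calculator above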


Solution. Observe that 0.383 = P(X ≤ 6 | X ≥ 5) = P(5 ≤ X ≤ 6) P(X ≤ 6) − P(X ≤ 5) = P(X ≥ 5) P(X ≥ 5) Clearly, P(X ≥ 5) = 1/2 = P (X ≤ 5) (why?). Also, Z = (X −5)/σ ∼ N (0, 1). Thus  1 . P(X ≤ 6) = P Z ≤ σ Hence  1 . 0.6915 = P Z ≤ σ Thus we conclude from the tables, that σ = 2. Exercise 24.4. Find the value of the following integral, without using your calculator. You will need to use the normal tables. (H) Z 0.5 2 e−x dx. −0.25

Exercise 24.5. Suppose the height of a randomly selected women (aged 18-64) is normally distributed with mean µ = 163cm and variance σ 2 = 102 cm2 . (a) Given that a randomly selected women is taller than average, what is the probability that she is taller than 170cm? (b) Given a random sample of 10 women (that is 10 women whose heights are independent of each other, and is given by the above distribution), what is the probability that exactly 3 of them will be taller than 165cm? (c) Same set-up as part (b), what is the probability that no more than 3 women will be taller than 165cm. Solution. Let X ∼ N (163, 102 ). Let Z = (X − 163)/10, so that Z ∼ N (0, 1). (a) We are asked to compute P(X > 170 | X > 163) =

P(X > 170) P(Z > 0.7) = ≈table 0.484. P(X > 163) 1/2

(b) Set p = P(X > 165) = P(Z >

2 ) = 1−P(Z ≤ 0.2) ≈table 1−0.5793 = 0.4207. 10


Let S ∼ Bin(10, p). We are asked to compute   10 3 P(S = 3) = p (1 − p)7 ≈ 0.1956. 3 (c) We are asked to compute P(S ≤ 3) ≈ 0.3318. 25. Sums of independent normal random variables 25.1. Sums (and differences) of independent normal random variables. Let µ1 , µ2 ∈ R and σ1 , σ2 > 0. Let X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) be independent random variables. Let X = X1 + X2 . It turns out that X is also a normal random variable, but what should its mean and variance be? Notice that E(X1 + X2 ) = µ1 + µ2 (we do not even need independence for this) and Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) = σ12 + σ22 (we do need independence for this). Thus X ∼ N (µ1 + µ2 , σ12 + σ22 ). Example done in class 25.1 (Linear scaling). . Let a, b ∈ R. Let X ∼ N (µ, σ 2 ). Show that if Y = aX + b, then Y ∼ N (aµ + b, a2 σ 2 ). (You may assume that you know that Y is a normal random variable with some mean and variance, that you are trying to figure out.) Example done in class 25.2. Use Exercise 25.1 to show that if X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) are independent random variables, then X = X1 − X2 ∼ N (µ1 − µ2 , σ12 + σ22 ). Exercise 25.3. Let X and Y be independent standard normal random variables. let W = X +Y . Prove using Exercise 15.3 that W ∼ N (0, 2). (H) Exercise 25.4. µ1 , µ2 ∈ R and σ1 , σ2 > 0. f (x) = n(x; µ1 , σ1 ) and g(x) = n(x; µ2 , σ2 ). Recall that we defined convolutions for pdfs as well p 2 as pmfs. Show that (f ? g)(x) = n(x; µ1 + µ2 , σ1 + σ22 ) (the algebra could be quite hard), and use Exercise 15.3 to prove the claim of Section 25.1. Example done in class 25.5. Manufacture of a certain component requires two different machining operations. Machining time for each operation has a normal distribution, and the two times are independent of one another. The mean values are 30 minutes and 20 minutes, respectively, and the standard deviations are 3 minutes and 2 minutes, respectively. What is the probability that the total machining time for (a total) one of these components will be less than 45 minutes?


Solution. Let X1 ∼ N (30, 32 ) and X2 ∼ N (20, 22 ) and assume that X1 and X2 are independent. Thus X = X1 + X2 has the same distribution as the total manufacturing time of one of these components. We are asked to compute P(X ≤ 45). Since X is a sum of independent normals, we have that X is also a normal random variable. It must have mean EX1 + EX2 and variance V ar(X1 ) + V ar(X2 ). Thus X ∼ N (30 + 20, 32 + 22 ). Let Z ∼ N (0, 1), then √ P(X ≤ 45) = P( 13Z + 50 ≤ 45) = P(Z ≤ −1.39). The tables give that P(Z ≤ −1.39) = 0.0823. Exercise 25.6. In Canada, according to Wikipedia, the average height of males (aged 25-44) is 1.76 meters and the average height of females (aged 25-44) is 1.633 meters. Let us assume that the heights of males and females are normally distributed. Let Y be the height of a male. 2 Then Y ∼ N (1.76, σm ), for some σm > 0. Suppose that P(1.4 ≤ Y ≤ 2.12) = 0.99. Similarly, let X be the height of a female so that X ∼ N (1.633, σf2 ). Suppose that P(1.22 ≤ X ≤ 2.046) = 0.99. (a) Find σm . (b) Find σf . (c) If I randomly choose a man and woman, what is the probability that the woman will be taller? Clearly state what assumptions are needed for your calculations. 26. Prelude to the central limit theorem 26.1. Baby central limit theorem. In particular, if Xi ∼ N (µ, σ 2 ) are independent, then we know that Sn = X1 + · · · + Xn ∼ N (nµ, nσ 2 ). Thus, ¯ = Sn ∼ N (µ, σ 2 /n) X n and

¯ −µ X S − nµ √ = n√ ∼ N (0, 1). σ/ n σ n A simple random sample from a distribution is given by independent random variables X1 , . . . , Xn , that all have same distribution; that is, P(Xi ≤ x) = F (x) = P(X1 ≤ x).

This is the basic and common sampling assumption in statistics, we discuss more about sampling later, when we state a version of a central limit theorem that works for any distribution.
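The fact that the sample mean X̄ of a normal simple random sample has mean µ and standard deviation σ/√n is easy to see in a simulation. The Python sketch below is an illustration only; the seed and the choices µ = 0, σ = 1, n = 25 are arbitrary.

import math, random

rng = random.Random(7)
mu, sigma, n, reps = 0.0, 1.0, 25, 10_000
means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]

m = sum(means) / reps
s = math.sqrt(sum((x - m) ** 2 for x in means) / reps)
print("mean of sample means:", round(m, 4))
print("sd of sample means:", round(s, 4), "  sigma/sqrt(n):", sigma / math.sqrt(n))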


Exercise 26.1. Assume that (from experience) if we randomly select a fish from a pond and measure its length, this random (length) variable is normally distributed with mean 13cm and variance 7cm2 . (a) If we sample 100 fish, what is the probability that their sample mean ¯ is less than 12cm? X (b) If we sample 49 fish, what is the probability that their sample mean ¯ is less than 12cm? X ¯ n < 12) for n = 49, 100. Well, Solution. We are asked to compute P(X the game is make this a probability about N (0, 1) random variables, then use the tables. Notice in general: ¯ µ a−µ ¯ < a) = P X − √ < √ . P(X σ/ n σ/ n Thus if Z =

¯ X−µ √ , σ/ n

then Z ∼ N (0, 1), and

(a) − 13  ¯ 100 < 12) = P Z < 12 √ = P(Z < −3.78) ≈ 0, P(X 7/10 and (b) − 13 ¯ 49 < 12) = P(Z < 12 √ P(X ) = P(Z < −2.65) ≈ 0.0040. 7/7 Exercise 26.2. According to Wikipedia, the SAT’s (a standardized test in the United States) have a traditional range of 200 − 800 is based on a normal distribution with mean 500 and standard deviation 100. Let X ∼ N (500, 1002 ). (a) Find P(200 ≤ X ≤ 800). (b) What is the probability that the average score of 30 randomly selected people (who wrote the test) will be greater than 500? (c) What is the probability that the average score of 30 randomly selected people (who wrote the test) will be greater than 530? 27. Chapter 8.1-8.3 (KU custom edition 7.1-7.3) 27.1. Independent random variables and simple random sampling. We will model the random outcome of an experiment using random variables. For example, before we flip a coin and see the result, we think of it as a Bernoulli random variable with some parameter p ∈ (0, 1). Thus when we sample from a population, before we actually observe the outcomes of the sample, we think of the random sample as a sequence of random variables X1 , . . . , Xn . For example, before we


measure the heights of say 10 KU students, we think of their heights as random variables X1, . . . , X10. We say that two random variables have the same distribution if they have the same cdf (or same pmf or pdf). Often we assume that random variables all have the same distribution and are independent. Often in practice these may not be realistic assumptions! One of the main goals in this course is to develop methods of statistical inference; that is, to infer something about the population from the sample. We will see that treating random samples as random variables will allow us to develop quantitative methods to analyze the effectiveness of a statistical inference.
By a (simple) random sample of size n from a population or distribution we mean a sequence of independent random variables X1, . . . , Xn which all have the same distribution. In the language of sampling, the expectation and variance of X1 are sometimes referred to as the population mean and population variance, or the true mean and true variance.
27.2. Statistics. Suppose we are sampling from a normal distribution in order to determine the population mean. As we discussed at the beginning of the course, one way would be to estimate it using the sample mean:
X̄ = (1/n) Σ_{i=1}^{n} Xi.
Note that we call both the random variable X̄ (before we actually observe the values of the Xi) and its observed value x̄ the sample mean. We call any function of a random sample (for example, the sample mean) a statistic. Thus a statistic is also a random variable. Another statistic that will play an important role in our discussion is the sample variance:
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

Again, when we use the notation capital 'S' we are referring to the random variable, and when we use the notation little 's' we are referring to the observed or measured value of S.
Exercise 27.1. Let X1, . . . , Xn be a simple random sample. Let EX1 = µ and Var(X1) = σ². Show that EX̄ = µ. What is Var(X̄)? (R)


Solution. We have
EX̄ = E( (1/n) Σ_{i=1}^{n} Xi ) = (1/n) Σ_{i=1}^{n} EXi = (n/n) µ = µ.
Since the Xi are independent in a simple random sample,
Var(X̄) = Var( (1/n) Σ_{i=1}^{n} Xi ) = (1/n²) Var( Σ_{i=1}^{n} Xi ) = (1/n²) Σ_{i=1}^{n} Var(Xi) = nσ²/n² = σ²/n.
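If you would like to see Exercise 27.1 in action, here is a short simulation sketch in Python (using numpy; the values µ = 13, σ² = 7 and n = 25 are just illustrative choices, and Python is entirely optional in this course):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 13.0, 7.0 ** 0.5, 25, 100_000

    # Each row is one simple random sample of size n; each row mean is one draw of the sample mean.
    samples = rng.normal(mu, sigma, size=(trials, n))
    xbars = samples.mean(axis=1)

    print(xbars.mean())   # close to mu = 13
    print(xbars.var())    # close to sigma^2 / n = 7/25 = 0.28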

Exercise 27.2. Let X1 and X2 be independent Bernoulli random variables with parameter p ∈ (0, 1). Let X = max(X1, X2). Compute EX, in the case that p = 1/2.
Solution. Note that X only takes values in {0, 1}. We have that
P(X = 0) = P(X1 = 0, X2 = 0) = P(X1 = 0)P(X2 = 0),
since X1 and X2 are independent. Thus P(X = 0) = (1 − p)² and P(X = 1) = 1 − (1 − p)² = 2p − p². Hence
EX = 0 · (1 − p)² + 1 · (2p − p²) = p(2 − p).
In the case p = 1/2, this gives EX = (1/2)(3/2) = 3/4.
Exercise 27.3 (The reason for the n − 1). Let X1, . . . , Xn be a simple random sample. Suppose that Var(X1) = σ². Show that E(S²) = σ².
Solution. Hint: Also let EX1 = µ. Note that
(Xi − X̄)² = Xi² − 2Xi X̄ + X̄² = Xi² − (2/n)Xi² − (2/n) Σ_{j=1, j≠i}^{n} Xi Xj + X̄².
Note that the Xi's are independent, thus EXi Xj = µ² when i ≠ j. Also recall that the short-cut formula gives that EXi² = Var(Xi) + (EXi)² = σ² + µ².
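To see numerically why dividing by n − 1 (rather than n) gives an unbiased estimator, one can simulate many samples and average the two versions of the sample variance; a Python sketch (numpy, with illustrative parameter values) follows:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 0.0, 2.0, 5, 200_000

    samples = rng.normal(mu, sigma, size=(trials, n))
    s2 = samples.var(axis=1, ddof=1)         # divides by n - 1 (the sample variance S^2)
    s2_biased = samples.var(axis=1, ddof=0)  # divides by n

    print(s2.mean())         # close to sigma^2 = 4
    print(s2_biased.mean())  # close to (n - 1)/n * sigma^2 = 3.2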


28. Chapter 8.4 (KU custom edition 7.4)

28.1. The central limit theorem. Let X1, X2, X3, . . . be independent random variables with the same distribution. Let EX1 = µ and Var(X1) = σ² > 0. Let Z ∼ N(0, 1) and Φ(x) = P(Z ≤ x). Let
Sn = Σ_{i=1}^{n} Xi
and
Zn = (Sn − nµ)/(σ√n).
(Note that we know that in the special case where the Xi are normal random variables we have that Zn ∼ N(0, 1).) The CLT states that for all x ∈ R we have that as n → ∞,
Fn(x) = P(Zn ≤ x) → Φ(x).
A standard rule of thumb is that Fn(x) ≈ Φ(x) if n > 30. This amazing theorem tells us that we can estimate the probabilities associated with Zn by pretending Zn is Z, without knowing anything about the underlying distribution of the Xi. It is this theorem that makes much of statistics possible, since in practice we may not know the underlying distribution of a population. In the context of statistics courses, it is often useful to observe that
Zn = (X̄ − µ)/(σ/√n).
In the context of random sampling, note that X1, . . . , Xn is a simple random sample from a distribution with population mean µ and variance σ².

28.2. Using the CLT.
Exercise 28.1. Let X1, . . . , X100 denote the actual net weights of 100 randomly selected bags of concrete mix, where each bag is marked 50lbs, but in fact the weight is random. We assume that the Xi are independent and have the same distribution.
(a) If the expected weight of each bag is 50lbs and the variance is 1lbs², calculate (approximately) using the CLT
P(49.9 ≤ X̄ ≤ 50.1),
where X̄ is the usual sample mean given by


X̄ = (1/100) Σ_{i=1}^{100} Xi.
(b) If the expected weight of each bag is 49.8 rather than 50, calculate P(49.9 ≤ X̄ ≤ 50.1).
28.3. Chapter 6.5, Normal approximations to the binomial. The central limit theorem was first proved for Bernoulli random variables, and can be used to approximate a Binomial distribution. Such approximations were especially important before modern computers were readily available. Let (Xi)_{i=1}^{∞} be independent Bernoulli random variables with parameter p ∈ (0, 1). Recall that EX1 = p and Var(X1) = p(1 − p). Let Sn = X1 + · · · + Xn. Recall that Sn ∼ Bin(n, p). Let
Zn = (Sn − np)/√(np(1 − p)).
The central limit theorem implies that Zn is approximately normal when n is large.
Exercise 28.2. Let X ∼ Bin(100, 1/2). Use your calculator to compute P(X ≤ 45), and use the central limit theorem to approximate P(X ≤ 45).
Solution. Let Z' = (X − 50)/5, and let Z ∼ N(0, 1). We have that
P(X ≤ 45) = P(X − 50 ≤ −5) = P(Z' ≤ −1) ≈CLT P(Z ≤ −1) =table 0.1587.
The calculator gives 0.1841.
If the previous exercise seems unsatisfying, we can apply the so-called continuity correction to improve the approximation. The following adjusts for the fact that we are trying to approximate something discrete with something continuous. Let X ∼ Bin(n, p); then
P(X ≤ x) ≈ P(Z ≤ (x + 0.5 − np)/√(np(1 − p))).
Notice the extra 0.5 term. Some textbooks state that this approximation is adequate when both np ≥ 10 and n(1 − p) ≥ 10.
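As a software check of Exercise 28.2 (a Python sketch using scipy, which is optional; the numbers printed should match the table value 0.1587 and the calculator value 0.1841):

    from scipy.stats import binom, norm

    n, p, x = 100, 0.5, 45
    mu, sd = n * p, (n * p * (1 - p)) ** 0.5   # mean 50, standard deviation 5

    print(binom.cdf(x, n, p))        # exact value, about 0.1841
    print(norm.cdf((x - mu) / sd))   # CLT approximation, about 0.1587
    # For the continuity correction (Exercise 28.3), replace x by x + 0.5 in the line above.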


Exercise 28.3. Do the previous exercise using the continuity correction. (R)
28.4. Normal approximation to the Poisson. We know from Exercise 13.9 that the sum of two independent Poisson random variables is again a Poisson random variable. In particular, if the Xi are independent Poisson random variables with mean 1, then Sn = X1 + · · · + Xn is a Poisson random variable with mean n and variance n.
Exercise 28.4. Let Y ∼ Poi(50). Apply the central limit theorem to approximate P(Y ≤ 47).
Solution. Let Xi be independent Poisson random variables with mean 1, so that S = X1 + · · · + X50 ∼ Poi(50). Let
Z' = (S − 50)/√50.
The central limit theorem gives that Z' is approximately a standard normal random variable. Let Z ∼ N(0, 1). We have that
P(S ≤ 47) = P(S − 50 ≤ −3) = P(Z' ≤ −0.424) ≈CLT P(Z ≤ −0.424) ≈table 0.3372.
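A similar software check for the Poisson approximation (Python/scipy sketch, optional):

    from scipy.stats import norm, poisson

    lam, y = 50, 47
    print(poisson.cdf(y, lam))               # exact Poisson probability, for comparison
    print(norm.cdf((y - lam) / lam ** 0.5))  # CLT approximation (the table lookup above gives 0.3372)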

Exercise 28.5 (An important observation for the case of two distributions). Later, we will state a version of the CLT for the following set-up. For now, consider the following exercise. Let X1, . . . , Xm, Y1, . . . , Yn be independent random variables, where all the Xi's have the same mean µX and variance σX², and all the Yj's have the same mean µY and variance σY². (For the CLT we will need to also assume that all the Xi's come from a common distribution, and similarly, all the Yj's come from a common distribution.) Define X̄ and Ȳ in the usual way:
X̄ = (1/m) Σ_{i=1}^{m} Xi,
and
Ȳ = (1/n) Σ_{j=1}^{n} Yj.
Consider the random variables
W = X̄ − Ȳ


and
Z = (X̄ − Ȳ − (µX − µY)) / √(σX²/m + σY²/n).
(a) What is the mean of X̄? What is the mean of Ȳ? (R)
(b) What is the variance of X̄? What is the variance of Ȳ? (R)
(c) What is the mean of W? (R)
(d) What is the variance of W? (R)
(e) What is the mean and variance of Z? (R)

29. Super Quiz

Answers based on tables. Q1: a) 0.07, b) 0.285714, c) 2.4331, d) 12.705. Q2: b) 0.130377, d) 0.08564. Q3: a) 0.2643, b) 0.2228, c) 0.7368, d) 0.99375, e) 0.0336.

30. Chapter 8.5

30.1. t-distribution. We will define a one-parameter family of distributions that are closely related to the standard normal distribution and will be very useful later on. Our approach here will be slightly more basic than the approach taken in your textbook. Let (Xi) be a sequence of independent normal random variables with mean µ and variance σ². Assume n ≥ 2. As usual, set
X̄ = (1/n) Σ_{i=1}^{n} Xi
and
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

Consider the random variable T defined via
T = Tn−1 = (X̄ − µ)/(S/√n).
We say that the (continuous) random variable T has a t-distribution with parameter ν = n − 1 (degrees of freedom); similarly, if Y is any random variable with the same distribution (same cdf, same pdf) as T, then we say that Y has a t-distribution with n − 1 degrees of freedom, and write Y ∼ tn−1. Note that if we replaced the random variable S by the number σ in the definition of Tn−1, then Tn−1 ∼ N(0, 1). In fact, as n → ∞, we have P(Tn ≤ x) → Φ(x), where Φ is the cdf for the standard normal distribution. We will need the following properties of Tn−1. The pdf


for Tn−1 has a complicated formula, but we will only need to refer to the following property. We have that ETn−1 = 0; moreover, if gn−1 is the pdf for Tn−1, then gn−1 is an even function. Hence Exercise 16.8 applies.
As with the normal distribution, we may retrieve the cdf for a t-distribution using a table or using a calculator. For example, let T5 be a t-distribution with ν = 5 degrees of freedom. Suppose we want to compute P(−1 ≤ T5 ≤ 2):
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then press 6 for tcdf,
(iv) enter: tcdf(-1,2,5) (make sure you use the negative sign and not the minus)
(v) which gives 0.767421529.
30.1.1. Critical values and t-alpha notation. Let Tν ∼ tν. Let α ∈ (0, 1). The number tα = tα,ν is the number such that P(Tν ≥ tα,ν) = α. The numbers tα,ν are sometimes called t-critical values; recall that the numbers zα are sometimes called z-critical values. The tables at the back of your book give the values of tα,ν for ν = 1, . . . , 30, and α = 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.025. The table also gives values in the case where ν = ∞; by this, they mean tα,∞ = zα (see Section 24.1).
Exercise 30.1. Let T ∼ t10. Compute P(T > 1 | T > 0.5).

31. Chapter 9.1-9.3

31.1. Estimators and Estimates. A point estimator is a statistic that is meant to give us an idea of a parameter of a population, for example, the mean or variance. Thus both the sample mean and the sample variance are point estimators. We call the observed values of point estimators point estimates. How do we measure how good a point estimator is? The following is a very basic way we can classify point estimators.
31.2. Unbiased estimators. Suppose that θ is a parameter for a population. An estimator X for θ is unbiased if EX = θ.
Example done in class 31.1. Show that X̄, the sample mean, is an unbiased estimator for the population mean.
Exercise 31.2. Show that S², the sample variance, is an unbiased estimator for the population variance. (Use Exercise 27.3)
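The t-distribution values quoted above can also be checked in software; here is a short Python/scipy sketch (optional, and not a substitute for the tables on exams):

    from scipy.stats import t

    print(t.cdf(2, df=5) - t.cdf(-1, df=5))   # P(-1 <= T5 <= 2), about 0.7674
    print(t.ppf(1 - 0.025, df=7))             # t_{0.025, 7}, about 2.365 (a value we will use later)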


Exercise 31.3. Suppose you have a random sample X1, . . . , Xn, where each Xi ∼ Bern(p). We know that the variance is given by σ² = Var(Xi) = p(1 − p). We want to estimate σ². Of course we can use S², however it seems more natural to use
Y = X̄(1 − X̄),
since X̄ is an estimator for p. Find EY.
Solution. We have that EY = EX̄ − EX̄². Thus by the short-cut formula,
EY = EX̄ − Var(X̄) − (EX̄)² = p − p(1 − p)/n − p² = p(1 − p)(1 − 1/n).
Exercise 31.4 (Continuous German tank problem). Suppose you have a random sample X1, . . . , Xn, where we have Xi ∼ U(0, θ). We do not know θ and we want to estimate it. Consider
M = max{X1, . . . , Xn},
and
Θ̂ = ((n + 1)/n) M.
(a) What is the cdf for X1?
(b) Find the cdf for M. Hint: Observe that
{M ≤ x} = {X1 ≤ x} ∩ · · · ∩ {Xn ≤ x},
and remember that the Xi are assumed to be independent.
(c) Find the pdf for M by using calculus.
(d) Show that EM = (n/(n + 1)) θ.
(e) Show that EΘ̂ = θ.
Solution.
(a) For x < 0, we have P(X1 ≤ x) = 0; for x ≥ θ, we have P(X1 ≤ x) = 1; and for x ∈ [0, θ], we have P(X1 ≤ x) = x/θ.
(b) Since
{M ≤ x} = {X1 ≤ x} ∩ · · · ∩ {Xn ≤ x},
and the Xi are independent, we have by the previous part that
P(M ≤ x) = (x/θ)^n
for all x ∈ [0, θ]; P(M ≤ x) = 1 for all x > θ; and P(M ≤ x) = 0 for all x < 0.


(c) By taking a derivative, we obtain that
fM(x) = 0 if x < 0; fM(x) = (n/θ^n) x^{n−1} if x ∈ [0, θ]; and fM(x) = 0 if x > θ.
(d) We have that
EM = ∫₀^θ x (n/θ^n) x^{n−1} dx = (n/θ^n) · θ^{n+1}/(n + 1) = (n/(n + 1)) θ.
(e) By definition, Θ̂ = ((n + 1)/n) M, so by the previous part,
EΘ̂ = ((n + 1)/n) EM = θ.
Exercise 31.5 (German tank problem). Let 1 ≤ k ≤ N. Let A = {1, . . . , N}. Suppose we choose a subset of size k from A uniformly at random, so that each subset of size k is chosen with probability 1/C(N, k), where C(N, k) denotes the binomial coefficient N choose k. Let M be the maximum value of the subset of size k. For example, in the case k = 3, if {2, 5, 9} is chosen, then M = 9.
(a) Convince yourself that P(M ≥ k) = 1.
(b) Fix m ≥ 1. How many subsets of A of size k have m as their maximum value? (Hint: if m is the maximum value, then you only have k − 1 choices left, and you cannot choose something that is bigger than m, and m is already taken!)
(c) Find the pmf for M.
(d) Argue that for k ≤ m ≤ N, we have
Σ_{m=k}^{N} C(m − 1, k − 1) = C(N, k).
(e) Argue that for k + 1 ≤ m ≤ N + 1, we have
Σ_{m=k+1}^{N+1} C(m − 1, (k + 1) − 1) = C(N + 1, k + 1).
(f) Show that
EM = (k/(k + 1)) (N + 1).
(g) Consider the point estimator N̂ for N defined by
N̂ = ((k + 1)/k) M − 1.
Show that N̂ is an unbiased estimator for N.
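A simulation of the continuous German tank problem (Exercise 31.4) illustrates why the factor (n + 1)/n is needed; the following Python sketch uses numpy with the illustrative choices θ = 10 and n = 5:

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n, trials = 10.0, 5, 200_000

    samples = rng.uniform(0.0, theta, size=(trials, n))
    M = samples.max(axis=1)

    print(M.mean())                  # close to n/(n + 1) * theta = 8.33...
    print(((n + 1) / n * M).mean())  # close to theta = 10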


31.3. Errors. If we are trying to estimate a parameter θ, sometimes we denote its estimator by Θ̂. The estimated standard error of an (unbiased) estimator is given by √Var(Θ̂). In the special case where we are sampling from a normal distribution with mean µ, it is known that the sample mean is an estimator for µ that has the least variance amongst all the other unbiased estimators you can cook up.
Exercise 31.6. Find the estimated standard error for the sample mean.
Exercise 31.7.
(a) Find the estimated standard error for Θ̂ in Exercise 31.4.
(b) Referring again to Exercise 31.4, consider the estimator for θ defined by 2X̄. Show that 2X̄ is also an unbiased estimator for θ.
(c) Which one has a smaller estimated standard error (for large values of n)?

32. Chapter 9.4, 9.10 (I)

The central limit theorem or an assumption of normality will allow us to estimate how good the sample mean X̄ (in simple random sampling) is as a point estimator for the true (population) mean µ (which we do not know). We will start first with a somewhat unrealistic, but instructive, example.
32.1. Baby (exact) confidence intervals (for the population mean): normal population, known population variance. Let Xi be independent normal random variables all with mean µ and variance σ². Here µ is unknown, but σ is known. Let X̄ be the usual sample mean. We know that
Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).
Recall that for α ∈ (0, 1), zα is the number such that P(Z ≥ zα) = α, so that P(−zα/2 ≤ Z ≤ zα/2) = 1 − α. Consider the following calculation:
1 − α = P(−zα/2 ≤ Z ≤ zα/2)
= P(−zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ zα/2)
= P(−zα/2 σ/√n ≤ X̄ − µ ≤ zα/2 σ/√n)
= P( µ ∈ ( X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n ) );


that is, the probability that the true mean µ lies in the random interval given by
( X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n )
is 1 − α.
This calculation motivates the following definition. Suppose we actually observe the values of X1, . . . , Xn, and we find that X̄ = x̄. Then we say that the (deterministic) interval given by
( x̄ − zα/2 σ/√n, x̄ + zα/2 σ/√n )
is a 100(1 − α)% (exact, two-sided) confidence interval for µ. Sometimes a CI is also expressed compactly as
x̄ ± zα/2 σ/√n.
Recall that √Var(X̄) = σ/√n. The term σ/√n is also sometimes called the standard error of X̄.
Let us stress that a confidence interval (CI) for µ is computed after we collect data, and will be a deterministic interval like (134, 151). The population mean µ is a fixed deterministic number as well, like µ = 123 or µ = 143. Thus, once the confidence interval is computed, whether µ belongs to a confidence interval is not a random event. It is incorrect to say that µ will lie in a confidence interval with probability 1 − α; this is only true for the random interval given by our motivating calculation.
Example done in class 32.1. What is the probability that 2.5 ∈ [2, 3]? What is the probability that 2.5 ∈ [3, 5]?
Example done in class 32.2. Suppose I know that the heights of men in Lawrence are normally distributed with some unknown mean µ, and I know the variance is σ² = 16cm². Suppose we collect a random sample of 8 men's heights, in cm, in Lawrence, and have the following data:
178, 173, 176, 156, 190, 175, 170, 190.
Compute a 95 percent CI for µ.
Solution. We have that the CI is given by
( x̄ − zα/2 σ/√n, x̄ + zα/2 σ/√n ),
where n = 8, σ = 4, and α = 0.05. We need to compute x̄; this can be done with your calculator:
(i) Press the blue 2nd button,

(ii) then press DISTR (LIST),
(iii) side scroll to option MATH and choose mean,
(iv) enter: mean({178, 173, 176, 156, 190, 175, 170, 190})
(v) which gives 176.

We need to determine the value of z0.05/2. From Exercise 24.1, we have that z0.05/2 ≈ 1.96. Thus the CI is given by (173.228, 178.772).

33. Chapter 9.4, 9.10 (II)

What about when the variance σ² is unknown? Here we use point estimators for σ². For example,
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².
In the case of coin flips where the Xi take the values 1 and 0, we can also use the random variable given by
X̄(1 − X̄).
33.1. Large sample size (CLT approximate) confidence intervals (for the population mean): unknown population variance. Let X1, . . . , Xn be a simple random sample (where the Xi's may not necessarily be normal). Then for n large,
Zn = (X̄ − µ)/(S/√n) ≈ N(0, 1);
that is, P(Zn ≤ x) is approximately given by P(Z ≤ x) when n is large (textbooks sometimes state n ≥ 40). A 100(1 − α)% (approximate two-sided) confidence interval can be derived in a similar way by replacing σ with a point estimator in our earlier discussion in Section 32.1. We obtain that the random interval given by
[ X̄ − zα/2 S/√n, X̄ + zα/2 S/√n ]
contains the population mean µ with probability approximately 1 − α. Here, the probability is approximate since we appealed to a version of the central limit theorem. (Sometimes the term S/√n is called the estimated standard error.)


Suppose we actually observe the values of X1, . . . , Xn, and we find that X̄ = x̄ and S = s. Then we say that the (deterministic) interval given by
( x̄ − zα/2 s/√n, x̄ + zα/2 s/√n )
is a 100(1 − α)% (approximate two-sided) confidence interval for µ.
Exercise 33.1. Suppose that in a random sample of 50 kitchens with gas cooking appliances we monitor the CO2 levels for a one-week period and find that the sample mean was 654.16 (ppm) and the standard deviation was 164.43.
(a) Calculate (approximately) a 95 percent (two-sided) confidence interval for µ, the true average CO2 level in the population of all homes from which the sample was selected.
(b) Suppose that we assume that the sample standard deviation will be no greater than 175. What sample size would be necessary to obtain an interval width no greater than 50 (ppm) for a confidence level of 95 percent?
Hint: you may use that if Z ∼ N(0, 1), then P(−1.96 ≤ Z ≤ 1.96) = 0.95 and P(Z ≤ 1.96) = 0.975.
Solution. Since the sample size n = 50 > 40 is large, we may use the large sample size confidence interval approximations that are based on the CLT. Thus we use the formula
x̄ ± zα/2 s/√n.
So the required confidence interval is (608.58, 699.74).
For the second part of the question, note that the width is given by
2(1.96) s/√n.
Since we know that s ≤ 175, the width is no greater than
2(1.96) · 175/√n.
Thus we need to solve for n in the inequality
50 ≥ 2(1.96) · 175/√n.
We easily obtain that n ≥ 188.2384. So take n = 189.
In practice, it may be difficult to obtain large sample sizes. In the case where we sample from a normal distribution, it is not necessary to have a large sample size.
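If you want to check the arithmetic in Exercise 33.1 with software (optional; in this Python sketch, scipy is used only to produce z0.025):

    from math import ceil, sqrt
    from scipy.stats import norm

    xbar, s, n = 654.16, 164.43, 50
    z = norm.ppf(0.975)                    # about 1.96
    half = z * s / sqrt(n)
    print((xbar - half, xbar + half))      # about (608.6, 699.7)

    # Part (b): we need 2 * z * 175 / sqrt(n) <= 50, so n >= (2 * z * 175 / 50)^2.
    print(ceil((2 * z * 175 / 50) ** 2))   # 189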


33.2. Exact confidence intervals (for the population mean): normal population, unknown population variance. Let X1, . . . , Xn be a simple random sample where the Xi are normal random variables with mean µ and unknown variance σ². Then for all n ≥ 2, we have
Zn = (X̄ − µ)/(S/√n) ∼ tn−1;
in other words, Zn has a t-distribution with n − 1 degrees of freedom. A 100(1 − α)% (exact two-sided) confidence interval can be derived in a similar way as before. Instead of appealing to the standard normal distribution, we appeal to the t-distribution. Recall that if Tν ∼ tν, then the numbers tα,ν are such that P(Tν ≥ tα,ν) = α. We obtain that the random interval given by
[ X̄ − tα/2 S/√n, X̄ + tα/2 S/√n ]
contains the population mean µ with probability 1 − α. Suppose we actually observe the values of X1, . . . , Xn, and we find that X̄ = x̄ and S = s. Then we say that the (deterministic) interval given by
( x̄ − tα/2 s/√n, x̄ + tα/2 s/√n )
is a 100(1 − α)% (two-sided) confidence interval for µ.
Example done in class 33.2. From experience, we know that test scores for a certain standardized test can be modelled using the normal distribution; that is, if X is the test score of a randomly sampled student, then X is normally distributed with mean µ and variance σ², for some µ and some (unknown) σ > 0. Suppose that in a simple random sample of 8 students we have the following test scores:
500, 620, 520, 700, 562, 658, 550, 656.
Find a 95 percent confidence interval for the true mean µ.
Solution. We need to compute x̄ and s; this can be done with your calculator. We obtain that x̄ = 595.75. To obtain s, we can proceed as follows:
(i) Press the blue 2nd button,
(ii) then press DISTR (LIST),
(iii) side scroll to option MATH and choose stdDev,
(iv) enter: stdDev({500, 620, 520, 700, 562, 658, 550, 656})
(v) which gives 72.8001.


From the tables, we have that t0.05/2,7 = 2.365. So the required CI is given by 595.75 ± 60.87, or (534.88, 656.62).
33.3. Using your calculator. Your calculator is capable of computing CIs. For example, we can do Example 33.2 in the following way:
(i) Press the STAT button,
(ii) then side scroll to TESTS,
(iii) choose option 8-TInterval,
(iv) now you can enter raw data or computed statistics:
(v) we can enter x̄ = 595.75, Sx = 72.8001, n = 8, and C-Level=0.95.
Similarly, Example 32.2 and Exercise 33.1 can be easily treated using option 7-ZInterval. I encourage you to explore the functions of your calculator or other software. However, simply using your calculator to compute a CI will not be sufficient to receive any marks on tests and assignments in this course. In a question, if you just write the required CI from your calculator, it is very likely that you will not receive any partial credit. You are welcome to use the functions of your calculator to check your work; however, I expect you to manually compute the CI as I have in the exercises. The next exercise can be treated using option A-1-PropZInt.
33.4. Large sample size (CLT approximate) confidence intervals (for the population mean): proportions. Let X1, . . . , Xn be a simple random sample where the Xi are Bernoulli random variables with parameter p ∈ (0, 1). Note that EXi = µ = p. Note that X̄ is the proportion of 'successes' or ones that occur in n trials. In this context, we often write X̄ = P̂. The observed value of P̂ is often denoted by p̂. A version of the central limit theorem implies that
Zn = (P̂ − p)/√(P̂(1 − P̂)/n) ≈ N(0, 1),
where the approximation is good if both np, n(1 − p) ≥ 10. A 100(1 − α)% (approximate two-sided) confidence interval can be derived in a similar way as before. We obtain that the random interval given by
[ X̄ − zα/2 √(P̂(1 − P̂)/n), X̄ + zα/2 √(P̂(1 − P̂)/n) ]


contains the population proportion p with probability 1 − α. Suppose we actually observe the values of X1, . . . , Xn, and we find that P̂ = p̂. Then we say that the (deterministic) interval given by
( p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n) )
is a 100(1 − α)% (approximate two-sided) confidence interval for p. Note that in order to ensure that the approximate CI is reliable, we check that np̂, n(1 − p̂) ≥ 10.
Example done in class 33.3. We wish to investigate p, the proportion of people with the disease Deadly-Virus who die within three years after receiving a newly discovered treatment. We have taken a random sample of 187 people with Deadly-Virus and gave them the new treatment. Three years later, 170 of these patients have died.
(a) Find a 97 percent CI for p.
(b) Suppose that in the future we wish to carry out a second study in which we will construct a 99 percent confidence interval which will estimate p to within 0.001. Using the data given in the setup as a pilot study, estimate the sample size needed.
Solution.
(a) We have that n = 187 and p̂ = 170/187. Note that np̂, n(1 − p̂) ≥ 10. We need to find z0.03/2. From the tables, we have that P(Z ≤ 2.17) ≈ 0.985. The inverse normal function on your calculator gives P(Z ≤ 2.170090375) ≈ 0.985. Thus we obtain that the CI is given by 0.9090 ± 0.045619.
(b) To estimate the size n, we solve for n in the equation
0.001 = z0.01/2 σ/√n.
We have that the tables give z0.01/2 ≈ 2.576, and since we are supposed to use the data as a pilot study, we estimate that σ ≈ √(p̂(1 − p̂)), where p̂ = 170/187. Some algebra yields n ≈ 548411.
Exercise 33.4. Let X ∼ Bern(p). Consider f(p) = Var(X). What is the maximum value of f?
Exercise 33.5. We have a magician's coin, and we want to investigate p ∈ (0, 1), the probability that on a flip of the coin it comes up heads. We want to construct a 95 percent confidence interval for p.


(a) Estimate how many flips we need to do to estimate p to within 0.001.
(b) Suppose I flipped the coin 100 times and got 23 heads and 77 tails. What is the CI?

34. Chapter 10-10.4, 10.8 (I)

34.1. Introduction to hypothesis testing. Suppose I want to determine whether a coin is unfair. What if I flipped the coin 30 times, and found that I got 29 heads? Would this be enough to convince you that the coin was unfair? What if I only got 28 heads?
Consider the following set-up, called hypothesis testing. It seems reasonable that without any evidence to the contrary, we should assume that the coin is fair; this is called the null hypothesis, and is often denoted by H0. The hypothesis that the coin is not fair is called the alternate hypothesis, and is often denoted by Ha or H1. Suppose we set the following criterion: if we flip the coin 30 times and we see less than or equal to 2 heads, or more than or equal to 28 heads, we deem (or guess) the coin to be unfair, since if the coin were fair, we would have witnessed an unlikely/extreme event; this is called the critical range or rejection range. (Of course there is nothing stopping a coin from coming up heads 100 times in a row; in fact, it would not be a fair coin if this could not occur.)
Let p ∈ (0, 1) be the true probability that a flip comes up heads. Let X ∼ Bin(30, p), so that X is the number of heads in 30 flips. Notice that if the coin is fair, then p = 1/2, and we can exactly compute all the probabilities associated with X. We refer to X as the test statistic, and we can express the critical range as the union of the events X ≥ 28 and X ≤ 2.
We can easily compute P(X ≤ 2 or X ≥ 28) ≈ 8.6799 × 10^{−7}; this is exactly the probability that the null hypothesis is correct, but rejected in favour of the alternate hypothesis; when this occurs, this is called a type I error, and the probability of committing a type I error is referred to as the level of significance of the test and is often denoted by the Greek letter α. A type II error occurs when we do not reject the null hypothesis, and it is false; the probability of this event is denoted by the Greek letter β, and may be difficult to estimate or compute, since if the null hypothesis is false, we need to know p.
Notice that if X does not fall into the rejection range, then that is hardly evidence that p = 1/2. Notice also that the calculation that P(X ≤ 2 or X ≥ 28) is small is also a calculation that assumes p = 1/2. In


hypothesis testing, we can only reject the null hypothesis, but we do not prove it; this is akin to criminal proceedings in the United States, where people are declared not guilty, but hardly ever is a verdict of innocent given.
After we collect data, and have that X = xobs, we will easily see if X falls in the rejection range and whether we reject the null hypothesis or not; if it falls into the rejection range, then we say that the observed test statistic is significant. A slightly different, and perhaps better, approach is to compute the P-value. The P-value is the lowest level of significance at which the observed value of the test statistic is significant. Suppose that xobs = 3; then we have that
P-value = P(X ≤ 3 or X ≥ 27) = 2P(X ≤ 3),
where we do the computation assuming p = 1/2. Notice that by definition of the P-value, we reject the null hypothesis with significance level α if and only if P-value ≤ α. Note that the P-value is defined without preselecting a significance level. Thus the P-value can be used to decide whether a null hypothesis would be rejected with different significance levels.
For proportional data we know that the underlying distribution is Bernoulli, and thus a natural test statistic is the sample mean, which (up to the factor n) is a Binomial random variable. Note that in the previous discussion, if we did not have the binomial tables or our calculator, we could still appeal to the central limit theorem, since under the null hypothesis that p = 1/2, we have that the distribution of
Z' = (X̄ − 1/2)/(0.5/√30),
where X̄ = X/30 is the sample proportion of heads, can be approximated by the distribution of a standard normal random variable Z. We can use Z' as our test statistic, and define the rejection range based on Z'. For example, if we wanted a test with a level of significance α, we know that
P(−zα/2 ≤ Z' ≤ zα/2) ≈CLT P(−zα/2 ≤ Z ≤ zα/2) = 1 − α;
thus we could define a rejection range as the union of {Z' < −zα/2} and {Z' > zα/2}.
34.2. Two-sided test on proportions, CLT based. We wish to investigate p, the proportion of people who will pass a certain math course using a new textbook. We are interested in testing with a significance of 0.05 whether the new textbook will affect the pass-rate. Using the old textbook, it is known that 70 percent of students will


pass the course. Suppose we signed up 123 students to take this math course using the new textbook.
(a) What should H0 be?
(b) What should Ha be?
(c) Appealing to the central limit theorem, what is a suitable test statistic?
(d) What is the rejection range?
Suppose that the course is over and 99 students have passed the course.
(i) What is the observed value of the test statistic?
(ii) What is the P-value?
(iii) What are your conclusions?
Solution.
(a) We take H0 : p = 0.7.
(b) We take Ha : p ≠ 0.7.
(c) Let Xi be independent Bernoulli random variables with parameter p, so that if Xi = 1, then the ith student has passed the course. Set X̄ to be the usual sample mean, and consider the test statistic given by
Z' = (X̄ − 0.7)/√(0.7(0.3)/123).
Note that under H0, each Xi has mean 0.7 and variance 0.7(0.3), and Z' ≈ N(0, 1).
(d) Notice that if Z ∼ N(0, 1), then
P(−zα/2 ≤ Z' ≤ zα/2) ≈CLT P(−zα/2 ≤ Z ≤ zα/2) = 1 − α.
Thus with α = 0.05, we take the rejection range to be the union of {Z' > 1.960} and {Z' < −1.960}.
(i) Z' = 2.53821.
(ii) P-value ≈CLT 2P(Z > 2.54) = 2(1 − 0.9945) = 0.011.
(iii) We reject H0 in favour of Ha.
34.3. One-sided test on proportions. We wish to investigate p, the proportion of people with the disease Deadly-Virus who die within three years after receiving a newly discovered treatment. We are interested in testing, with a level of significance of 0.01, whether the new treatment will improve life expectancy. Without the treatment, it is known that 95 percent of the patients die within three years. Suppose that we signed up 187 patients to take the new drug.
(a) What should H0 be?
(b) What should Ha be?


(c) Appealing to the central limit theorem, what is an appropriate test statistic?
(d) What is the rejection range?
Suppose that after three years 170 of the patients have died.
(i) What is the observed value of the test statistic?
(ii) What is the P-value?
(iii) What are your conclusions?
Solution.
(a) We take H0 : p = 0.95.
(b) We take Ha : p < 0.95. Here we do not take Ha : p ≠ 0.95, since we do not care if p > 0.95; who wants a less effective drug?
(c) Let Xi be independent Bernoulli random variables with parameter p, so that if Xi = 1, then the ith person has died within three years. Set X̄ to be the usual sample mean, and consider the test statistic given by
Z' = (X̄ − 0.95)/√(0.95(0.05)/187).
Note that under H0, each Xi has mean 0.95 and variance 0.95(0.05), and Z' ≈ N(0, 1).
(d) Notice that if Z ∼ N(0, 1), then
P(Z' < −zα) ≈CLT P(Z < −zα) = α.
Thus with α = 0.01, we take the rejection range to be {Z' < −2.326}.
(i) Z' = −2.5668.
(ii) P-value ≈CLT P(Z < −2.57) =table 1 − 0.9949 = 0.0051.
(iii) We reject H0 in favour of Ha.
Exercise 34.1. We wish to investigate the mean lifetime, µ, of a new expensive type of light bulb. We know that the lifetimes are exponentially distributed, and we want to test, with a level of significance of 0.05, to see if they are better than traditional lightbulbs, which are known to have a mean lifetime of 5 years. Suppose we are given a random sample of 100 of these new lightbulbs.
(a) What should H0 be?
(b) What should Ha be?
(c) What is an appropriate test statistic?
(d) What is the rejection range?
Suppose that after a long wait, we find that the sample mean of the lifetime of the new light bulbs is 5.2 years.
(i) What is the observed value of your test statistic?
(ii) What is the P-value?


(iii) What are your conclusions?
(iv) What is the smallest value of the observed sample mean that would allow you to reject H0 in favour of Ha, at a significance level of 0.05?
Solution.
(a) We take H0 : µ = 5.
(b) We take Ha : µ > 5. Here we do not take Ha : µ ≠ 5, since we do not care if µ < 5; who wants a more expensive lightbulb that does not last as long?
(c) Let Xi be independent exponential random variables with mean µ, so that Xi is the lifetime of the ith lightbulb. Set X̄ to be the usual sample mean, and consider the test statistic given by
Z' = (X̄ − 5)/(5/10).
Note that under H0, the mean of each Xi is 5 and the variance is 5², since the Xi are assumed to be exponential. Thus, under H0, the central limit theorem gives that Z' ≈ N(0, 1).
(d) Notice that if Z ∼ N(0, 1), then
P(Z' > zα) ≈CLT P(Z > zα) = α.
Thus with α = 0.05, we take the rejection range to be {Z' > 1.645}.
(i) We have that Z' = 0.4.
(ii) We have that
P-value = P(Z' > 0.4) ≈CLT P(Z > 0.4) =table 1 − 0.6554 = 0.3446.
(iii) We do not reject H0.
(iv) We solve for x̄ in
1.645 = (x̄ − 5)/(5/10),
and obtain that x̄ = 5.8225; thus any sample mean bigger than this number will do.
(v) One might argue that we should take H0 to be µ ≤ 5. However, note that the data that would lead to a rejection of H0 : µ = 5 would also lead to a rejection of H0 : µ = 4.96, and if we rejected H0 : µ = 4.96, this would only allow us to conclude that µ > 4.96.
Let us remark that we have defined the rejection range of a test statistic Z to be an event; in particular, it may be something of the form {Z > zα} for some zα ∈ R. In your textbook, the same rejection range is defined to be a set of real numbers {z ∈ R : z > zα}.
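A software check of Exercise 34.1 (Python/scipy sketch; the observed sample mean 5.2 is the one given above):

    from scipy.stats import norm

    mu0, sigma, n, xbar = 5.0, 5.0, 100, 5.2   # under H0 the mean is 5, so the exponential sd is also 5
    z_obs = (xbar - mu0) / (sigma / n ** 0.5)
    p_value = 1 - norm.cdf(z_obs)

    print(z_obs)                                     # 0.4
    print(p_value)                                   # about 0.3446, so we do not reject H0 at 0.05
    print(mu0 + norm.ppf(0.95) * sigma / n ** 0.5)   # smallest sample mean that rejects, about 5.82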


35. Test 3

Answers based on tables. Q1: a) 0.6915, b) 3.92, c) 53, d) 25, e) 77, f) 0.7088; Q2: a) 1.661, c) 0, d) 0.37577; Q3: b) 0.105989, c) 0.46445, d) 0.28735; Q4: a) 5/12, c) 35/36, d) 0.5948; Q5: a) (47.222, 49.278), b) FTTT.

36. Debrief

37. Chapter 10-10.4, 10.8 (II)

In what follows, we will treat the cases where we do not assume anything about the underlying distribution, and where we know that the underlying distribution is normal.
37.1. CLT based. We wish to investigate µ, the true mean price of a burger in Kansas City. It is known that in the United States the mean price of a burger is 12 dollars, and we want to test with a level of significance 0.05 to see whether the price in KC is different. Suppose we will take a random sample of 81 restaurants in KC.
(a) What should H0 be?
(b) What should Ha be?
(c) What is an appropriate test statistic?
(d) What is the rejection range?
Suppose that we found that the sample mean price was 13 dollars with a sample standard deviation of 3 dollars.
(i) What is the observed value of the test statistic?
(ii) What is the P-value?
(iii) What are your conclusions?
Solution.
(a) We take H0 : µ = 12.
(b) We take Ha : µ ≠ 12.
(c) Let (Xi) be a random sample, where EXi = µ, so that Xi is the price of a burger at the ith restaurant. Set X̄ to be the usual sample mean, and S² to be the usual sample variance, and consider the test statistic given by
Z' = (X̄ − 12)/(S/9).
Note that under H0, the mean of each Xi is 12. Thus, under H0, a version of the central limit theorem gives that Z' ≈ N(0, 1).
(d) Notice that if Z ∼ N(0, 1), then
P(−zα/2 ≤ Z' ≤ zα/2) ≈CLT P(−zα/2 ≤ Z ≤ zα/2) = 1 − α.


Thus with α = 0.05, we take the rejection range to be the union of {Z' > 1.960} and {Z' < −1.960}.
(i) We have that Z' = 3.
(ii) P-value ≈CLT 2P(Z > 3) = 0.0026.
(iii) We reject H0 in favour of Ha.
Exercise 37.1. Do the previous example with the following change: we want to test to see if a burger in KC is cheaper.
37.2. Normal population, unknown variance.

38. Chapter 10-10.4, 10.8 (III)

We wish to investigate µ, the true mean height (in cm) of women (aged 18-55) in Lawrence. It is known that in the United States the mean height of women is 165 cm, and we want to test with a level of significance 0.001 to see whether the height in Lawrence is different. Suppose we will take a random sample of 25 women in Lawrence, and assume that their heights are normally distributed with the same mean and variance.
(a) What should H0 be?
(b) What should Ha be?
(c) What is an appropriate test statistic?
(d) What is the rejection range?
Suppose that we found that the sample mean height was 171 cm with a sample standard deviation of 10 cm.
(i) What is the observed value of the test statistic?
(ii) What is the P-value?
(iii) What are your conclusions?
(iv) Would you reject H0 with a significance level of 0.005?
(v) Construct a 99.9 percent CI for µ.
Solution.
(a) We take H0 : µ = 165.
(b) We take Ha : µ ≠ 165.
(c) Let Xi be independent normal random variables all with mean µ, so that Xi is the height of the ith woman. Set X̄ to be the usual sample mean, and S² to be the usual sample variance, and consider the test statistic given by
T = (X̄ − 165)/(S/5).
Note that under H0, the mean of each Xi is 165. Thus, under H0, we have that T ∼ t24.


(d) Notice that if T ∼ t24, then
P(−tα/2,24 ≤ T ≤ tα/2,24) = 1 − α.
Thus with α = 0.001, we take the rejection range to be the union of {T > 3.745} and {T < −3.745}.
(i) We have that T = 3.
(ii) Note that P(T > 2.797) =table 0.005 and P(T > 3.091) =table 0.0025, so that P(T > 3) lies between 0.0025 and 0.005; thus the P-value = 2P(T > 3) lies between 0.005 and 0.01. The calculator gives P(T > 3) =calc 0.00310, so that P-value =calc 0.00621.
(iii) We do not reject H0.
(iv) No, since P-value ≥ 0.005.
(v) We know that the CI is given by
x̄ ± (s/√n) tα/2,n−1,
where α = 0.001, x̄ = 171, s = 10, n = 25, and tα/2,n−1 =table 3.745. Thus the CI is (163.51, 178.49).
Exercise 38.1. Treat the example in Section 37.2, assuming we know that the true standard deviation is σ = 10.
Solution. The only difference is that we get the advantage that we can define
Z = (X̄ − 165)/(10/5)
to be the test statistic, and then under H0, we have that Z ∼ N(0, 1). Instead of tα/2 we get to use zα/2. The rejection range is given by the union of {Z > 3.290} and {Z < −3.290}. The observed value of the test statistic remains 3, and we still do not reject the null hypothesis. However, the P-value = 2P(Z > 3) =table 0.0026 is smaller, so much so that we would reject H0 with a significance level of 0.005. The required CI is also smaller, and is given by
x̄ ± (σ/√n) zα/2,
where α = 0.001, x̄ = 171, σ = 10, n = 25, and zα/2 =table 3.290. Thus the CI is (164.42, 177.58).
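The last two solutions can be mirrored in software as follows (a Python/scipy sketch, using the observed statistics from above):

    from scipy.stats import norm, t

    mu0, xbar, s, n = 165.0, 171.0, 10.0, 25
    t_obs = (xbar - mu0) / (s / n ** 0.5)
    p_value = 2 * (1 - t.cdf(t_obs, df=n - 1))
    print(t_obs, p_value)              # 3.0 and about 0.0062

    half = t.ppf(1 - 0.0005, df=n - 1) * s / n ** 0.5
    print((xbar - half, xbar + half))  # about (163.5, 178.5)

    # With the true standard deviation known (Exercise 38.1), t is replaced by the normal distribution.
    print(2 * (1 - norm.cdf(3.0)))     # about 0.0027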


38.1. Using your calculator. Your calculator is capable of carrying out all the tests that we have discussed. For example, we can do the example in Section 37.2 in the following way:
(i) Press the STAT button,
(ii) then side scroll to TESTS,
(iii) choose option T-Test,
(iv) now you can enter raw data or computed statistics.
I encourage you to explore the functions of your calculator or other software. However, simply using your calculator to do the hypothesis testing will not be sufficient to receive any marks on tests and assignments in this course. In a question, if you just copy the output from your calculator, it is very likely that you will not receive any partial credit. You are welcome to use the functions of your calculator to check your work; however, I expect you to manually do the test and show your work.
38.2. The connection between hypothesis testing and confidence intervals. There is a close connection between hypothesis testing and confidence intervals. Consider the case where we are sampling from a normal distribution with unknown variance, and are interested in the true mean µ. Given observed data {x1, . . . , xn}, a 100(1 − α) percent two-sided confidence interval for µ is given by
( x̄ − (s/√n) tα/2,n−1, x̄ + (s/√n) tα/2,n−1 ).
Let µ0 be some number. If we want to test the null hypothesis that µ = µ0 against the alternate hypothesis that µ ≠ µ0, with a significance level of α, then with the test statistic
T = (X̄ − µ0)/(S/√n),
the rejection range is given by the union of {T > tα/2,n−1} and {T < −tα/2,n−1}. Some easy algebra yields that when we observe the value T = t, we have that t > tα/2,n−1 or t < −tα/2,n−1 if and only if
µ0 ∉ ( x̄ − (s/√n) tα/2,n−1, x̄ + (s/√n) tα/2,n−1 ).
In other words, we reject H0 if and only if µ0 does not lie in the corresponding confidence interval.
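The equivalence can be seen numerically. In the sketch below (Python/scipy; the observed values 171, 10 and 25 are reused from the height example only for illustration), the two booleans printed for each µ0 always agree:

    from scipy.stats import t

    def reject_by_test(xbar, s, n, mu0, alpha):
        t_obs = (xbar - mu0) / (s / n ** 0.5)
        return abs(t_obs) > t.ppf(1 - alpha / 2, df=n - 1)

    def outside_ci(xbar, s, n, mu0, alpha):
        half = t.ppf(1 - alpha / 2, df=n - 1) * s / n ** 0.5
        return not (xbar - half <= mu0 <= xbar + half)

    for mu0 in [160, 165, 168, 171, 175, 180]:
        print(mu0, reject_by_test(171, 10, 25, mu0, 0.05), outside_ci(171, 10, 25, mu0, 0.05))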


Exercise 38.2. The connection described in Section 38.2 does not hold in the case of proportional data, where we build a confidence interval and test statistic by appealing to the central limit theorem. Explain.

39. Super Quiz

40. Chapter 9.8, 9.10

40.1. Confidence intervals of the difference of two means: normal population, known variance. Let X1, . . . , Xn and Y1, . . . , Ym all be independent normal random variables, where the Xi have mean µX and variance σX², and the Yi have mean µY and variance σY². Note that
W = X̄ − Ȳ − (µX − µY)
is a normal random variable with mean 0 and variance
Var(W) = Var(X̄) + Var(Ȳ) = σX²/n + σY²/m.
Thus
Z = (X̄ − Ȳ − (µX − µY)) / √(σX²/n + σY²/m)
is a normal random variable with mean 0 and variance 1. Thus by a calculation done in Section 32.1, we have that the probability that µX − µY lies in the random interval
[ (X̄ − Ȳ) − zα/2 √(σX²/n + σY²/m), (X̄ − Ȳ) + zα/2 √(σX²/n + σY²/m) ]
is 1 − α.
With observed data X̄ = x̄ and Ȳ = ȳ, we say that a 100(1 − α) CI for µX − µY is given by the deterministic interval
[ (x̄ − ȳ) − zα/2 √(σX²/n + σY²/m), (x̄ − ȳ) + zα/2 √(σX²/n + σY²/m) ].
As before, we can construct confidence intervals under relaxed assumptions.
40.2. Confidence intervals of the difference of two means: CLT based, large sample size. Let X1, . . . , Xn and Y1, . . . , Ym all be independent random variables, where the Xi all have the same distribution with mean µX and variance σX², and the Yi all have the same distribution with mean µY and variance σY². Versions of the central limit theorem give that


Z' = (X̄ − Ȳ − (µX − µY)) / √(σX²/n + σY²/m)
or
Z' = (X̄ − Ȳ − (µX − µY)) / √(SX²/n + SY²/m)

is approximately a standard normal random variable when both n and m are large. Some textbooks say this approximation is good when both n, m ≥ 40.
Thus with observed data X̄ = x̄ and Ȳ = ȳ, we say that an (approximate) 100(1 − α) CI for µX − µY is given by the deterministic interval
[ (x̄ − ȳ) − zα/2 √(σX²/n + σY²/m), (x̄ − ȳ) + zα/2 √(σX²/n + σY²/m) ],
in the case where σX and σY are known; and in the case where they are not known, we need to also observe SX² = sX² and SY² = sY², and we use the interval given by
[ (x̄ − ȳ) − zα/2 √(sX²/n + sY²/m), (x̄ − ȳ) + zα/2 √(sX²/n + sY²/m) ].
In the case where the Xi and Yi are Bernoulli random variables, instead of using SX and SY as point estimates for the standard deviation, we use
(S'X)² = X̄(1 − X̄)

and (S'Y)² = Ȳ(1 − Ȳ), and we have that
Z' = (X̄ − Ȳ − (µX − µY)) / √((S'X)²/n + (S'Y)²/m)
is approximately a standard normal random variable when n and m are large.
Thus with observed data X̄ = x̄ and Ȳ = ȳ, we say that an (approximate) 100(1 − α) CI for µX − µY is given by the deterministic interval
[ (x̄ − ȳ) − zα/2 √((s'X)²/n + (s'Y)²/m), (x̄ − ȳ) + zα/2 √((s'X)²/n + (s'Y)²/m) ].


Example done in class 40.1. Suppose we have random samples of the heights of KU and KState students of size 60 and 70, respectively, and find that the sample mean height of KU students is 177 cm with a sample standard deviation of 20 cm, and the sample mean height of KState students is 173 cm with a sample standard deviation of 15 cm. Find an approximate 95 percent confidence interval for the difference between µKU and µKState, the true mean heights of KU and KState students respectively.
Solution. The required CI is given by
[ (x̄ − ȳ) − zα/2 √(sX²/n + sY²/m), (x̄ − ȳ) + zα/2 √(sX²/n + sY²/m) ].
We have x̄ = 177, n = 60, sX = 20 and ȳ = 173, m = 70, sY = 15. Note that α = 0.05, and zα/2 = 1.960, so putting all the numbers in we obtain the CI (−2.161, 10.161).
Exercise 40.2. Suppose out of a random sample of 87 KU students we find that 5 of them are math majors, and out of a random sample of 49 KState students we find that 3 of them are math majors. Find an approximate 95 percent CI for the difference between pKU and pKState, the true proportions of KU and KState students who are math majors.
Solution. The required CI is given by
[ (p̂KU − p̂KState) − zα/2 √((s'X)²/n + (s'Y)²/m), (p̂KU − p̂KState) + zα/2 √((s'X)²/n + (s'Y)²/m) ],
where (s'X)² = p̂KU(1 − p̂KU) and (s'Y)² = p̂KState(1 − p̂KState). We have n = 87, p̂KU = 5/87, and m = 49, p̂KState = 3/49. Note that α = 0.05, and zα/2 = 1.960, so putting all the numbers in yields the required CI (−0.0868, 0.0793).
40.3. Confidence intervals for the difference of two means: normal population, unknown variance. As in the case of one mean, the t-distribution plays a role here. Let X1, . . . , Xn and Y1, . . . , Ym all be independent normal random variables, where the Xi have mean µX and variance σX², and the Yi have mean µY and variance σY². Suppose we do not know σX and σY, but we do know they are equal (for some reason); then it is possible to show that with the pooled sample variance

Sp² = ((n − 1)SX² + (m − 1)SY²)/(n + m − 2),


the random variable given by
T = (X̄ − Ȳ − (µX − µY)) / (Sp √(1/n + 1/m))
has a t-distribution with n + m − 2 degrees of freedom. This fact allows us to construct confidence intervals in the usual way. In the case where we do not know that the variances are equal, the situation becomes a bit more complicated. See your textbook for more details.

41. Chapter 9.9

41.1. Paired Observations. Let (X1, Y1), . . . , (Xn, Yn) be random variables. Assume that EXi = µX and EYi = µY. Let Di = Xi − Yi. Suppose that the Di are independent and are normally distributed with the same mean µ = µX − µY and same (unknown) variance. We have that
T = (D̄ − µ)/(SD/√n)
has a t-distribution with n − 1 degrees of freedom, where
SD² = (1/(n − 1)) Σ_{i=1}^{n} (Di − D̄)².
As in the previous examples, if we observe D̄ = d̄ and SD = sd, then we say that the (deterministic) interval given by
( d̄ − tα/2,n−1 sd/√n, d̄ + tα/2,n−1 sd/√n )
is a 100(1 − α)% (two-sided) confidence interval for µ.
Pairing observations occurs naturally in the setting where something (the same something) goes through two different treatments, and we want to compare the effects of the treatments.
Example done in class 41.1. Suppose we have a random sample of 6 pairs of identical twins with the following IQ scores:
(103, 101), (120, 125), (99, 100), (88, 90), (112, 115), (130, 125).
Assume that the differences between the IQ scores of the twins are independent and normally distributed with the same mean µ and variance. Find a 95 percent CI for µ.
Solution. We have that the differences are
2, −5, −1, −2, −3, 5.


So we have d̄ = −2/3 and sd = 3.614784. With α = 0.05 and n = 6, we have tα/2,n−1 = 2.571, so the required CI is given by (−4.46, 3.1268).
41.2. Hypothesis testing in the case of two means. Hypothesis testing in the case of two means proceeds similarly to the case of one mean, except that the value of interest is the difference of the two means.
41.3. The difference of two means: CLT based, large sample size. Let µKU and µKState be the true mean heights of KU and KState students respectively. We wish to test with a significance level of 0.05 to see whether the true means are different. Suppose we will collect random samples of the heights of KU and KState students of size 60 and 70, respectively.
(a) What should H0 be?
(b) What should Ha be?
(c) What is an appropriate test statistic?
(d) What is the rejection range?
Suppose we find that the sample mean height of KU students is 177 cm with a sample standard deviation of 20 cm, and the sample mean height of KState students is 173 cm with a sample standard deviation of 15 cm.
(i) What is the observed value of the test statistic?
(ii) What is the P-value?
(iii) What are your conclusions?
(iv) Would you reject H0 with a significance level of 0.25?
Solution.
(a) Let µ = µKU − µKState. We take H0 : µ = 0.
(b) Ha : µ ≠ 0.
(c) Let Xi be the heights of the KU students and Yi be the heights of the KState students. Take
Z' = (X̄ − Ȳ − (µKU − µKState)) / √(SX²/60 + SY²/70) = (X̄ − Ȳ) / √(SX²/60 + SY²/70),
since under H0, µ = 0. Since n = 60 and m = 70 are large, we have that Z' is approximately a standard normal random variable.


(d) Let Z ∼ N(0, 1). Then
P(−zα/2 ≤ Z' ≤ zα/2) ≈CLT P(−zα/2 ≤ Z ≤ zα/2) = 1 − α.
Thus with α = 0.05, we have zα/2 = 1.960, and the rejection range is given by the union of {Z' < −1.960} and {Z' > 1.960}.
Recall that the sample mean height of KU students is 177 cm with a sample standard deviation of 20 cm, and the sample mean height of KState students is 173 cm with a sample standard deviation of 15 cm.
(i) We have Z' = 1.27.
(ii) We have that P-value = 2P(Z > 1.27) =table 0.204.
(iii) We do not reject H0.
(iv) Yes, since P-value < 0.25.

PˆKU − PˆKState pKU (1−pKU ) 80

+

pKState (1−pKState ) 90

PˆKU − PˆKState = p . p(1 − p)(1/80 + 1/90) We do not know what p is but we can form a pooled estimate for it: 80PˆKU + 90PˆKState Pˆ = 80 + 90 (Another reasonable approach would be to use PˆKU and PˆKState to estimate pKU and pKState separately, as done in the case of confidence intervals, however the pooled approach is better since we take advantage of the fact that under H0 , we have that pKU and

104

TERRY SOO

pKState are equal.) Set PˆKU − PˆKState Z0 = q . ˆ ˆ P (1 − P )(1/80 + 1/90) A version of the central limit theorem gives that Z 0 is approximately a standard normal random variable. (d) Let Z ∼ N (0, 1). Then P(−zα/2 ≤ Z 0 ≤ zα/2 ) ≈CLT P(−zα/2 ≤ Z ≤ zα/2 ) = 1 − α. Thus with α = 0.05, we have zα/2 = 1.960, and the rejection range is given by the union of {Z 0 < −1.960} and {Z 0 > 1.960}. Suppose we find that 20 of the KU students are biology majors, and 25 of the KState students are biology majors. (i) We have that Z 0 = 0.4098. (ii) We do not reject H0 .