Chapter 3 Special Discrete Random Variables

Author: Elizabeth Carr
Section 3.4

Binomial random variable

An experiment that has only two possible outcomes is called a Bernoulli trial, for example, a single coin toss. For the sake of argument, we will call one of the possible outcomes “success”, and the other one “failure”. The probability of a success is p, and the probability of failure is 1 − p. We are interested in studying a sequence of identical and independent Bernoulli trials, and looking at the total number of successes that occur.

Definition. A binomial random variable is the number of successes in n independent and identical Bernoulli trials.

Examples. A fair coin is tossed 100 times and Y, the number of heads, is recorded. Then Y is a binomial random variable with n = 100 and p = 1/2.

Two evenly matched teams play a series of 6 games. The number of wins Y is a binomial random variable with n = 6 and p = 1/2.

An inspector looks at five computers where the chance that each computer is defective is 1/6. The number Y of defective computers that he sees is a binomial random variable with n = 5 and p = 1/6.

If Y is a binomial random variable, then the possible outcomes for Y are obviously 0, 1, . . . , n. In other words, the number of observed successes could be any number between 0 and n. The sample space consists of all strings of length n that consist of S’s and F’s; for example,

$$\overbrace{SSFSFSSSF \cdots SF}^{n \text{ trials}}.$$

Now let us choose a value of 0 ≤ y ≤ n, and look at a couple of typical sample points belonging to the event (Y = y),

$$\overbrace{SSS \cdots S}^{y}\;\overbrace{FFF \cdots F}^{n-y},$$

$$\overbrace{SSS \cdots S}^{y-1}\;\overbrace{FFF \cdots F}^{n-y}\;S,$$

$$\overbrace{SSS \cdots S}^{y-2}\;\overbrace{FFF \cdots F}^{n-y}\;SS.$$

Every sample point in the event (Y = y) is an arrangement of y S’s and n − y F’s, and therefore has probability $p^y (1-p)^{n-y}$. How many such sample points are there? The number of sample points in (Y = y) is the number of distinct arrangements of y S’s and n − y F’s, that is, $\binom{n}{y}$. Putting it together gives the formula for binomial probabilities.

Binomial probabilities. If Y is a binomial random variable with parameters n and p, then

$$P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y}, \qquad y = 0, 1, \ldots, n.$$
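The formula translates almost word for word into code. Here is a minimal Python sketch; the function name `binomial_pmf` is our own choice for illustration, and `math.comb` from the standard library supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) for a binomial random variable with parameters n and p."""
    if not 0 <= y <= n:
        return 0.0  # values outside 0, 1, ..., n are impossible
    # comb(n, y) counts the arrangements of y S's and n - y F's;
    # each arrangement has probability p**y * (1 - p)**(n - y).
    return comb(n, y) * p**y * (1 - p) ** (n - y)

# The probabilities over y = 0, 1, ..., n should add up to one
# (here for the inspector example, n = 5 and p = 1/6).
total = sum(binomial_pmf(y, 5, 1 / 6) for y in range(6))
print(total)
```

Floating point arithmetic is good enough for the values of n used in this section; for very large n you would probably want to work with logarithms instead.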

Example.

Best-of-seven series

In Section 1.6 we figured out that the probability of a best-of-seven series between two evenly matched teams going the full seven games was 20/64. This can also be calculated using binomial probabilities. If you play six games against an equally skilled opponent, and Y is the number of wins, then Y has a binomial distribution with n = 6 and p = 1/2. The series goes seven games if Y = 3, and the chance of that happening is $P(Y = 3) = \binom{6}{3} (1/2)^3 (1/2)^3 = 20/64 = .3125$. So best-of-seven series ought to be seven games long about 30% of the time.

But, in fact, if you look at the Stanley Cup final series for the last fifty years (1946-1995), there were seven-game series only 8 times (1950, 1954, 1955, 1964, 1965, 1971, 1987, 1994), or 16% of the time. This seems to show that a lot of these match-ups were not even, which tends to make the series end sooner. If you are twice as good as your opponent, what is the chance of a full seven games? This time p = 2/3, and so $P(Y = 3) = \binom{6}{3} (2/3)^3 (1/3)^3 = .2195$. This agrees more closely with the actual results, although it’s still a bit high.

Example.

An even split

If I toss a fair coin ten times, what is the chance that I get exactly 5 heads and 5 tails? The answer is $P(Y = 5) = \binom{10}{5} (1/2)^5 (1/2)^5 = .2461$. If I toss a fair coin 100 times, what is the chance of exactly fifty heads? This time the answer is $P(Y = 50) = \binom{100}{50} (1/2)^{50} (1/2)^{50} = .0796$. You may be a bit surprised that this is such an uncommon event. If you flip a coin 100 times the odds are pretty good that you will get about an equal number of heads and tails, but to get exactly one half heads and one half tails gets harder and harder as the sample size increases. Just for fun, here is an approximate formula for the chance of getting exactly n heads in 2n coin tosses: $P(\text{an even split}) \approx \sqrt{(\pi n)^{-1}}$.

Example.

Testing for ESP

In order to test for ESP you draw a card from an ordinary deck and ask the subject what color it is. You repeat this 20 times and the subject is correct 15 times. How likely is it that this is due to chance? If the subject is guessing, then Y, the number of correct readings, follows a binomial distribution with n = 20 and p = 1/2. We want to know the probability that someone can do this well (or better) by guessing. Thus

$$
\begin{aligned}
P(Y \ge 15) &= P(Y = 15) + P(Y = 16) + \cdots + P(Y = 20) \\
&= \binom{20}{15} (1/2)^{15} (1/2)^{5} + \binom{20}{16} (1/2)^{16} (1/2)^{4} + \cdots + \binom{20}{20} (1/2)^{20} (1/2)^{0} \\
&= 21700\,(1/2)^{20} = 0.0207.
\end{aligned}
$$
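Because p = 1/2 here, every string of 20 guesses is equally likely, so the calculation comes down to counting strings. A short Python sketch of that count, using exact fractions (the variable names are ours):

```python
from fractions import Fraction
from math import comb

n = 20
# Number of equally likely S/F strings with 15 or more correct answers.
favourable = sum(comb(n, y) for y in range(15, n + 1))
tail = Fraction(favourable, 2**n)  # each string has probability (1/2)**20

print(favourable)   # 21700
print(float(tail))  # about 0.0207
```

The counting shortcut only works for p = 1/2; for any other p each term needs its own factor $p^y (1-p)^{n-y}$.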

This is a pretty unlikely event but certainly not impossible. What conclusion can we draw?

Example.

Quality control

In mass production manufacturing there is a certain percentage of acceptable loss due to defective units. To check the level of defectives, you take a sample from the day’s production. If the number of defectives is small you continue, but if there are too many defectives you shut down the production line for repairs.

Suppose that 5% defectives is considered acceptable, but 10% defectives is unacceptable. Our strategy is to take a sample of n = 40 units and shut down production if we find 4 or more defectives. Our inspection strategy has two conflicting goals: it is supposed to shut down when p ≥ .10, but continue if p ≤ .05. There are two possible wrong decisions: to continue when p ≥ .10, and to shut down even though p ≤ .05.

How often will we unnecessarily shut down? Suppose that there are acceptably many defectives, and to take the worst case, say there are 5% defectives, so that p = .05. Let Y be the number of observed defective units in the sample. The probability of shutting down production is

$$
\begin{aligned}
P(\text{shut down}) &= P(Y \ge 4) = 1 - P(Y \le 3) \\
&= 1 - P(Y = 0) - P(Y = 1) - P(Y = 2) - P(Y = 3) \\
&= 1 - \binom{40}{0} (.05)^{0} (.95)^{40} - \binom{40}{1} (.05)^{1} (.95)^{39} - \binom{40}{2} (.05)^{2} (.95)^{38} - \binom{40}{3} (.05)^{3} (.95)^{37} \\
&= 1 - .1285 - .2705 - .2777 - .1851 = .1382.
\end{aligned}
$$
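The complement trick used here, subtracting the few small cases from one rather than adding up the many large ones, is easy to package as a function. A minimal Python sketch; the name `prob_at_least` is hypothetical and chosen only for this illustration.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(Y >= k) for Y binomial(n, p), computed through the complement."""
    below = sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(k))
    return 1.0 - below

# Chance of an unnecessary shutdown when 5% of the units are defective.
shut_down = prob_at_least(4, 40, 0.05)
print(shut_down)  # close to the .1382 found by hand
```

Since the hand calculation rounds each term to four decimals, the last digit of the printed value may differ slightly from the hand result.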

On the other hand, how often will we fail to spot an unacceptably high level of defectives? Let us now suppose that there are unacceptably many defectives, and again to take the worst case, let’s say there are 10% defectives, so that p = .10. The chance that the day’s production passes inspection anyway is

$$
\begin{aligned}
P(\text{passes inspection}) &= P(Y \le 3) \\
&= P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) \\
&= \binom{40}{0} (.10)^{0} (.90)^{40} + \binom{40}{1} (.10)^{1} (.90)^{39} + \binom{40}{2} (.10)^{2} (.90)^{38} + \binom{40}{3} (.10)^{3} (.90)^{37} \\
&= .0148 + .0657 + .1423 + .2003 = .4231.
\end{aligned}
$$

We see that this scheme is fairly likely to make errors. If we wanted to be more certain about our decision, we would need to take a larger sample size.

Example.

Multiple choice exams

If a multiple choice exam has 30 questions, each with 5 responses, what is the probability of passing the exam by guessing? If you guess on every question, then Y, the number of correct answers, will be a binomial random variable with n = 30 and p = 1/5. To pass you need 15 or more correct answers, so P(pass the exam) = P(Y ≥ 15) = 0.000231.

Binomial moments. If Y is a binomial random variable with parameters n and p, then

$$E(Y) = np \quad\text{and}\quad \mathrm{VAR}(Y) = np(1-p).$$

Example.

The accuracy of empirical probabilities

If we simulate n random events, where the chance of a success is p, then the number of observed successes Y has a binomial distribution with parameters n and p. The empirical probability is $\hat{p} = Y/n$. Now the binomial moments given above show that $E(\hat{p}) = (np)/n = p$, and $\mathrm{VAR}(\hat{p}) = np(1-p)/n^2 = p(1-p)/n$. By computing the two standard deviation interval, we get some idea about how close $\hat{p}$ is to p. Since the quantity p(1 − p) is maximized when p = 1/2, we find that regardless of the value of p,

$$2\,\mathrm{STD}(\hat{p}) = 2\sqrt{\frac{p(1-p)}{n}} \le \frac{1}{\sqrt{n}}.$$

In most of our examples, the empirical probabilities have been based on n = 1000 repetitions. Thus, our empirical probabilities are typically within ±.03 of the true probabilities. For example, suppose we simulate 1000 throws of five dice, and find that on 71 occasions we get a sum of 14. Then we are fairly certain that the true probability of getting 14 lies somewhere between .041 and .101.
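For the dice example we can check the interval against the truth, since five dice have only 6^5 = 7776 outcomes and a computer can list them all. The following Python sketch does that; the helper name `two_sd_interval` is ours, and it uses the bound 1/√n exactly rather than rounding it to .03, so its interval is slightly wider than the one quoted above.

```python
from itertools import product
from math import sqrt

def two_sd_interval(successes: int, n: int) -> tuple[float, float]:
    """Conservative two standard deviation interval around p_hat = Y/n."""
    p_hat = successes / n
    half_width = 1 / sqrt(n)  # upper bound on 2*STD(p_hat), attained at p = 1/2
    return p_hat - half_width, p_hat + half_width

# Exact probability that five fair dice sum to 14, by listing all 6**5 outcomes.
rolls = list(product(range(1, 7), repeat=5))
p_true = sum(1 for r in rolls if sum(r) == 14) / len(rolls)

low, high = two_sd_interval(71, 1000)
print(low <= p_true <= high)  # True
```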

Section 3.5

Geometric and negative binomial random variables

Like the binomial, the geometric and negative binomial random variables are based on a sequence of independent and identical Bernoulli trials. Instead of fixing the number of trials n and counting up how many successes there are, we fix the number of successes k and count up how many trials it takes to get them.

The geometric random variable is the number of trials until the first success. Given an integer k ≥ 1, the negative binomial random variable is the number of trials until the kth success. You see that a geometric random variable is a negative binomial random variable where k = 1.

On the other hand, note that a negative binomial random variable Y is the sum of k independent geometric random variables. That is, $Y = X_1 + X_2 + \cdots + X_k$, where $X_1$ is the number of trials until the first success, $X_2$ is the number of trials after the first success until the second success, etc. All of these X’s have geometric distributions with parameter p.

If Y is negative binomial, then a typical sample point belonging to (Y = y) looks like FFS · · · FSS, where the first y − 1 symbols in the string contain exactly k − 1 successes and y − k failures, and then the yth symbol is an S. Since there are $\binom{y-1}{k-1}$ such strings, and they all have probability $p^k (1-p)^{y-k}$, we get the following formula.

Negative binomial probabilities. If Y is a negative binomial random variable with parameters k and p, then

$$P(Y = y) = \binom{y-1}{k-1} p^k (1-p)^{y-k}, \qquad y = k, k+1, \ldots.$$

It follows that the geometric distribution is given by $p(y) = p(1-p)^{y-1}$, y = 1, 2, . . . .

Example. The chance of a packet arrival to a distribution hub is 1/10 during each time interval. Let Y be the arrival time of the first packet; it has a geometric distribution with p = .10. The probability that the first packet arrives during the third time interval is $P(Y = 3) = (1/10)^1 (9/10)^2 = .081$. The probability that the first packet arrives on or after the third time interval is P(Y ≥ 3) = 1 − P(Y = 1) − P(Y = 2) = 1 − .10 − (.90)(.10) = .81. If X is the arrival time of the tenth packet, the chance that it arrives on the 99th time interval is $P(X = 99) = \binom{98}{9} (1/10)^{10} (9/10)^{89} = 0.01332$.
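The packet calculations can be redone in a few lines of Python. This is only a sketch under the formula above; the function names `negative_binomial_pmf` and `geometric_pmf` are our own.

```python
from math import comb

def negative_binomial_pmf(y: int, k: int, p: float) -> float:
    """P(Y = y): the k-th success occurs on trial y."""
    if y < k:
        return 0.0  # at least k trials are needed for k successes
    # comb(y - 1, k - 1) arranges k - 1 successes among the first y - 1 trials.
    return comb(y - 1, k - 1) * p**k * (1 - p) ** (y - k)

def geometric_pmf(y: int, p: float) -> float:
    """Geometric case: negative binomial with k = 1."""
    return negative_binomial_pmf(y, 1, p)

first_in_third = geometric_pmf(3, 0.1)                              # about .081
third_or_later = 1 - geometric_pmf(1, 0.1) - geometric_pmf(2, 0.1)  # about .81
tenth_on_99th = negative_binomial_pmf(99, 10, 0.1)                  # about .01332
print(first_in_third, third_or_later, tenth_on_99th)
```

Writing the geometric case as a call to the negative binomial function mirrors the remark that a geometric random variable is a negative binomial with k = 1.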

Example.

The 500 goal club

With only 30 games remaining in the NHL season, veteran winger Flash LaRue is starting to get worried. With a career total of 488 goals, it is not at all certain that he will be able to score his 500th career goal before the end of the season. He will get a big bonus from his team if he manages this feat, but unfortunately Flash only scores at a rate of about once every three games. Is there any hope that he will get his 500th goal before the end of the season?

Let’s try to calculate the moments of a negative binomial random variable. In the array below, each row is a geometric series whose sum is shown on the right, and adding the rows column by column gives the bottom line.

$$
\begin{array}{rcl}
p + p(1-p) + p(1-p)^2 + p(1-p)^3 + \cdots &=& 1 \\
p(1-p) + p(1-p)^2 + p(1-p)^3 + \cdots &=& (1-p) \\
p(1-p)^2 + p(1-p)^3 + \cdots &=& (1-p)^2 \\
p(1-p)^3 + \cdots &=& (1-p)^3 \\
\vdots & & \vdots \\
\hline
p + 2p(1-p) + 3p(1-p)^2 + 4p(1-p)^3 + \cdots &=& 1/p
\end{array}
$$

This sum ought to convince you that the mean of a geometric random variable is 1/p, and the result for negative binomial follows from the equation Y = X1 + X2 + · · · + Xk . Confirming the variance formula is left as an exercise.
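If the algebra does not convince you, a numerical check might. The Python sketch below adds up the first couple of thousand terms of the series p + 2p(1 − p) + 3p(1 − p)² + · · · and compares the result with 1/p; the function name and the cut-off of 2000 terms are arbitrary choices made for illustration.

```python
def geometric_mean_partial_sum(p: float, terms: int) -> float:
    """Sum of y * p * (1 - p)**(y - 1) for y = 1, ..., terms."""
    return sum(y * p * (1 - p) ** (y - 1) for y in range(1, terms + 1))

for p in (0.5, 0.25, 0.1):
    approx = geometric_mean_partial_sum(p, 2000)
    print(p, approx, 1 / p)  # the last two columns should nearly agree
```

For very small p the terms die out slowly, so more terms would be needed before the partial sum settles down.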

Negative binomial moments. If Y is a negative binomial random variable with parameters k and p, then

$$E(Y) = \frac{k}{p} \quad\text{and}\quad \mathrm{VAR}(Y) = \frac{k(1-p)}{p^2}.$$
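As an illustration, these moments can be applied to the 500 goal club example. The sketch below rests on a simplifying assumption that is ours, not the text’s: each game is treated as one Bernoulli trial in which Flash scores with probability 1/3, so he never scores twice in a game. He needs k = 500 − 488 = 12 more goals, so the mean works out to 12/(1/3) = 36 games, which is more than the 30 that remain. The function names are hypothetical.

```python
from math import comb

def negative_binomial_moments(k: int, p: float) -> tuple[float, float]:
    """Return (E(Y), VAR(Y)) for a negative binomial with parameters k and p."""
    return k / p, k * (1 - p) / p**2

def prob_kth_success_by(trial: int, k: int, p: float) -> float:
    """P(Y <= trial): sum the negative binomial probabilities for y = k, ..., trial."""
    return sum(
        comb(y - 1, k - 1) * p**k * (1 - p) ** (y - k)
        for y in range(k, trial + 1)
    )

goals_needed = 500 - 488   # k = 12
p_goal = 1 / 3             # assumed chance of scoring in any one game

mean, variance = negative_binomial_moments(goals_needed, p_goal)
chance_this_season = prob_kth_success_by(30, goals_needed, p_goal)
print(mean, variance, chance_this_season)
```

The last number printed is the chance, under this simplified model, that the 12th goal arrives within the 30 remaining games.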

We note that, as you would expect, the rarer an event is, the longer you will have to wait for it. Taking the geometric case (k = 1), we see that we will wait on average µ = 2 trials to see the first “heads” in a coin tossing experiment, we will wait on average µ = 36 trials to see the first pair of sixes in tossing a pair of dice, and we will buy on average µ = 13,983,816 tickets before we win Lotto 6-49.

We also note that σ decreases from infinity to zero as p ranges from 0 to 1. This says that predicting the first occurrence of an event is difficult for rare events, and easy for common events.

Section 3.7

Hypergeometric random variable

The hypergeometric random variable is the number of successes that arise in sampling without replacement. We suppose that there is a population of size N, of which r are “successes” and the rest “failures”, and that a sample of size n is drawn.

The probability formula below is simply the ratio of the number of samples containing y successes and n − y failures, to the total number of possible samples of size n. The weird looking conditions on y just ensure that you don’t try to find the probability of some impossible event.

Hypergeometric probabilities. If Y is a hypergeometric random variable with parameters n, r, and N, then

$$P(Y = y) = \frac{\binom{r}{y} \binom{N-r}{n-y}}{\binom{N}{n}}, \qquad y = \max(0,\, n - (N - r)), \ldots, \min(n, r).$$
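Here is a minimal Python sketch of this formula, with the conditions on y handled by a separate helper. The function names and the numbers in the usage lines are made up for illustration.

```python
from math import comb

def hypergeometric_support(n: int, r: int, N: int) -> range:
    """Possible values of Y: max(0, n - (N - r)), ..., min(n, r)."""
    return range(max(0, n - (N - r)), min(n, r) + 1)

def hypergeometric_pmf(y: int, n: int, r: int, N: int) -> float:
    """P(Y = y) when n items are drawn without replacement from N, r of them successes."""
    if y not in hypergeometric_support(n, r, N):
        return 0.0  # impossible event
    return comb(r, y) * comb(N - r, n - y) / comb(N, n)

# Illustrative (made-up) numbers: N = 10 items, r = 4 successes, sample of n = 8.
support = hypergeometric_support(8, 4, 10)  # y = 2, 3, 4
total = sum(hypergeometric_pmf(y, 8, 4, 10) for y in support)
print(list(support), total)
```

With only 6 failures in the population, a sample of 8 must contain at least 2 successes, which is exactly what the lower limit max(0, n − (N − r)) expresses.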

Example. A box contains 12 poker chips of which 7 are green and 5 are blue. Eight chips are selected at random without replacement from this box. Let X denote the number of green chips selected. The probability mass function is

$$p(x) = \frac{\binom{7}{x} \binom{5}{8-x}}{\binom{12}{8}}, \qquad x = 3, 4, \ldots, 7.$$

Note that the range of possible x values is restricted by the make-up of the population: there are only 5 blue chips, so at least 3 of the 8 chips selected must be green.

Example.

Lotto 6-49

In Lotto 6-49 you buy a ticket with six numbers chosen from the set {1, 2, . . . , 49}. The draw consists of a random sample drawn without replacement from the same set, and your prize depends on how many “successes” were drawn. Here a “success” is any number that was on your ticket. So Y, the number of matches, follows a hypergeometric distribution with r = 6, n = 6, and N = 49. The probabilities for the different number of matches are obtained using the formula

$$P(Y = y) = \frac{\binom{6}{y} \binom{43}{6-y}}{\binom{49}{6}}, \qquad y = 0, \ldots, 6.$$

To four decimal places, we have

y       0       1       2       3       4       5       6
p(y)  .4360   .4130   .1324   .0177   .0010   .0000   .0000
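The table can be recomputed from exact counts. In the Python sketch below (variable names are ours), `ways` is the number of draws with exactly y matches and `total_draws` is the number of possible draws.

```python
from math import comb

N, r, n = 49, 6, 6  # numbers in the set, numbers on the ticket, numbers drawn

total_draws = comb(N, n)  # 13,983,816 possible draws
table = []
for y in range(n + 1):
    ways = comb(r, y) * comb(N - r, n - y)  # draws with exactly y matches
    table.append(ways / total_draws)
    print(y, ways, f"{ways / total_draws:.4f}")
```

Printing the raw counts alongside the probabilities shows where the rounded zeros come from: there are 258 draws with five matches and only one with all six.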

Hypergeometric moments. If Y is a hypergeometric random variable with parameters n, r, and N, then

$$E(Y) = n\,\frac{r}{N} \quad\text{and}\quad \mathrm{VAR}(Y) = n \left(\frac{r}{N}\right) \left(\frac{N-r}{N}\right) \left(\frac{N-n}{N-1}\right).$$

For example, the average number of green chips drawn in the first problem is µ = (8)(7)/12 = 4.6667. Also, the average number of matches on your Lotto 6-49 ticket is µ = (6)(6)/49 = .73469.

Example.

Capture-tag-recapture

A scientific expedition has captured, tagged, and released eight sea turtles in a particular region. The expedition assumes that the population size in this region is 35, which means that 8 are tagged and 27 not tagged. The expedition will now capture 10 turtles and note how many of them are tagged. If the assumption about the population size is correct, what is the probability that the new sample will have 3 or fewer tagged turtles in it?

$$
\begin{aligned}
P(Y \le 3) &= P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) \\
&= \frac{\binom{8}{0} \binom{27}{10}}{\binom{35}{10}} + \frac{\binom{8}{1} \binom{27}{9}}{\binom{35}{10}} + \frac{\binom{8}{2} \binom{27}{8}}{\binom{35}{10}} + \frac{\binom{8}{3} \binom{27}{7}}{\binom{35}{10}} \\
&= .04595 + .20424 + .33861 + .27089 = .85969.
\end{aligned}
$$

So about 86% of the time we would expect to get three or fewer tagged turtles in the new sample. If the expedition found five tagged turtles, is that evidence that they have over-estimated the population size?

Example.

A political poll

The population of Alberta is around 2,545,000, and let’s suppose that about 70% of these are eligible to vote in the next provincial election. Then the population of eligible voters has N = 1781500 people in it. Suppose that n = 100 people are randomly selected from the eligible voters (without replacement) and asked whether or not they support Ralph Klein. Also suppose, for the sake of argument, that exactly 60%, or 1068900 eligible voters do support Ralph Klein. How accurately will the poll reflect that?

Let Y stand for the number of Klein supporters included in the random sample. Then Y has a hypergeometric distribution with n = 100, r = 1068900, and N = 1781500. The mean and variance of Y are given by

$$\mu = 100 \left(\frac{1068900}{1781500}\right) = 60 \quad\text{and}\quad \sigma^2 = 100 \left(\frac{1068900}{1781500}\right) \left(\frac{712600}{1781500}\right) \left(\frac{1781400}{1781499}\right) = 23.998666.$$
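With numbers this size it is easy to slip a digit, so here is a small Python sketch that evaluates the hypergeometric moment formulas for the poll. The function name `hypergeometric_moments` is our own.

```python
def hypergeometric_moments(n: int, r: int, N: int) -> tuple[float, float]:
    """Return (mean, variance) of a hypergeometric random variable."""
    p = r / N
    correction = (N - n) / (N - 1)  # the factor (N - n)/(N - 1) in the variance
    return n * p, n * p * (1 - p) * correction

n, r, N = 100, 1068900, 1781500
mean, variance = hypergeometric_moments(n, r, N)
print(mean, variance)  # about 60 and 23.9987
```

Keeping the factor (N − n)/(N − 1) in a variable of its own makes it easy to see how close it is to one for a sample this small.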

A two standard deviation interval says that probably between 50 and 70 people in the poll will be Klein supporters.

Note that if the sampling were done with replacement, then Y would follow a binomial distribution with n = 100 and p = .6. In this case, we would have

$$\mu = 100(.6) = 60 \quad\text{and}\quad \sigma^2 = 100(.6)(.4) = 24.$$

Since n is small relative to N, the ratio

$$\frac{N-n}{N-1} = \frac{1781400}{1781499} \approx 1,$$

and the mean and variance of the hypergeometric distribution essentially coincide with the mean and variance of the binomial distribution. The distributions of these two random variables are also essentially the same whenever n is small relative to N.

Section 3.8

Poisson random variable

This probability distribution is named after the French mathematician Poisson, according to whom. . .

“Life is good for only two things, discovering mathematics and teaching mathematics” (Siméon Poisson)

In Recherches sur la probabilité des jugements en matière criminelle et en matière civile, an important work on probability published in 1837, the Poisson distribution first appeared. The Poisson distribution describes the probability that a random event will occur in a time or space interval under the conditions that the probability of the event occurring is very small, but the number of trials is very large so that the event actually occurs a few times.

To illustrate this idea, suppose you are interested in the number of arrivals to a queue in a one day period. You could divide the time interval up into little subintervals, so that for all practical purposes, only one arrival can occur per subinterval. Therefore, for each subinterval of time, we have

$$P(\text{no arrival}) = 1 - p, \qquad P(\text{one arrival}) = p, \qquad P(\text{more than one arrival}) = 0.$$

The total number of arrivals X is the number of subintervals that contain an arrival. This has a binomial distribution, where n is the number of subintervals. The probability of seeing x arrivals during the day is

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}.$$

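Now let’s suppose that you keep on dividing the time interval into smaller and smaller subintervals; increasing n but decreasing p so that the product µ = np remains constant. What happens to P(X = x)?

$$
\begin{aligned}
\binom{n}{x} p^x (1-p)^{n-x} &= \binom{n}{x} \left(\frac{\mu}{n}\right)^{x} \left(1 - \frac{\mu}{n}\right)^{n-x} \\
&= \frac{n(n-1) \cdots (n-x+1)}{x!} \left(\frac{\mu}{n}\right)^{x} \left(1 - \frac{\mu}{n}\right)^{n} \left(1 - \frac{\mu}{n}\right)^{-x} \\
&= \frac{\mu^x}{x!} \left(1 - \frac{\mu}{n}\right)^{n} \left(\frac{n}{n}\right) \left(\frac{n-1}{n}\right) \cdots \left(\frac{n-x+1}{n}\right) \left(1 - \frac{\mu}{n}\right)^{-x}.
\end{aligned}
$$

Now you take the limit as n → ∞, and obtain

$$\left(1 - \frac{\mu}{n}\right)^{n} \to e^{-\mu} \quad\text{and}\quad \left(\frac{n}{n}\right) \left(\frac{n-1}{n}\right) \cdots \left(\frac{n-x+1}{n}\right) \left(1 - \frac{\mu}{n}\right)^{-x} \to 1.$$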

This leads to the following formula.

Poisson probabilities. If X is a Poisson random variable with parameter µ, then

$$P(X = x) = \frac{e^{-\mu} \mu^x}{x!}, \qquad x = 0, 1, \ldots.$$
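You can watch the limit happen numerically. The Python sketch below holds µ = np fixed while n grows and prints the binomial probability next to the Poisson probability; the values µ = 2 and x = 3 are illustrative choices that do not come from the text, and the function names are ours.

```python
from math import comb, exp, factorial

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for a Poisson random variable with parameter mu."""
    return exp(-mu) * mu**x / factorial(x)

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

mu, x = 2.0, 3  # illustrative values, not taken from the text
for n in (10, 100, 1000, 10000):
    # Keep n * p fixed at mu while the subintervals get finer.
    print(n, binomial_pmf(x, n, mu / n), poisson_pmf(x, mu))
```

The two columns should move closer together as n increases, which is the practical meaning of the limit taken above.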

The derivation of the Poisson distribution explains why it is sometimes called the law of rare events. Let’s look at an example involving the rarest event I can think of.

Example.

More Lotto 6-49

The odds of winning the jackpot in Lotto 6-49 are one in 13,983,816, or $p = 7.1511 \times 10^{-8}$. Suppose you play twice a week, every week for 10,000 years. The total number of plays is then n = 2 × 52 × 10000 = 1,040,000. Setting µ = np = .07437 and using the Poisson formula, we see that the chance of hitting zero jackpots during this time is $P(X = 0) = e^{-.07437} (.07437)^0 / 0! = .928327$. After all that time, we still have only about a 7% chance of getting a Lotto 6-49 jackpot. The probability of getting exactly two jackpots during this time is $P(X = 2) = e^{-.07437} (.07437)^2 / 2! = .002567$.

Example.

Hashing

Hashing is a tool for organizing files, where a hashing function transforms a key into an address, which is then the basis for searching for and storing records. Hashing has two important features: 1. With hashing, the addresses generated appear to be random — there is no immediate connection between the key and the location of the record.

Poisson moments. If X is a Poisson random variable with parameter µ, then

$$E(X) = \mu \quad\text{and}\quad \mathrm{VAR}(X) = \mu.$$

Example.

Particle emissions

In 1910, Hans Geiger and Ernest Rutherford conducted a famous experiment in which they counted the number of α-particle emissions during 2608 time intervals of equal length. Their data is as follows.

x            0     1     2     3     4     5     6     7     8     9    10   > 10
intervals   57   203   383   525   532   408   273   139    45    27    10      6

A total of 10097 particles were observed, giving a rate of µ = 10097/2608 = 3.8715 particles per time period. If these particles were following a Poisson distribution, then the number of intervals with no particles should be about

$$2608 \times \frac{e^{-3.8715} (3.8715)^0}{0!} = 54.31,$$

the number of intervals with exactly one particle should be about

$$2608 \times \frac{e^{-3.8715} (3.8715)^1}{1!} = 210.27,$$

and so on. In fact, the frequencies that we would expect to observe are

x              0        1        2        3        4        5        6        7       8       9      10    > 10
expected   54.31   210.27   407.06   525.31   508.44   393.69   254.03   140.50   67.99   29.25   11.32    5.83

By comparing these two tables, you can see that the Poisson distribution seems to describe this phenomenon quite well.
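The expected frequencies are tedious by hand but quick in code. The Python sketch below lists observed and expected counts side by side; it assumes, as the calculation above does, that µ is taken to be 10097/2608, and it obtains the “> 10” entry as whatever is left over after x = 0, . . . , 10.

```python
from math import exp, factorial

observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 6]  # x = 0..10, then > 10
intervals = sum(observed)   # 2608
mu = 10097 / intervals      # particles per interval, as reported in the text

# Expected counts for x = 0, ..., 10 under a Poisson distribution with mean mu.
expected = [intervals * exp(-mu) * mu**x / factorial(x) for x in range(11)]
# Whatever probability is left over belongs to the "> 10" category.
expected.append(intervals - sum(expected))

for label, obs, exp_count in zip([*range(11), "> 10"], observed, expected):
    print(label, obs, round(exp_count, 2))
```

A formal comparison of the two rows would need a goodness-of-fit test, which is beyond what this sketch attempts.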