Data Analysis and Statistical Methods Statistics 651

Author: Cecilia Higgins
Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html

Lecture 8 (MWF) The binomial and introducing the normal distribution

Suhasini Subba Rao

Lecture 8 (MWF) Introducing the normal distribution

The binomial: mean and variance

• Suppose that each individual in the population can have either a success (1) or a failure (0).
• Let p denote the proportion of the entire population who have a success (the proportion of 1s in the population).
• Suppose a random sample of size n is drawn and let Sn be the number of successes (number of 1s) in that sample.
• Sn is a random variable taking values in {0, 1, . . . , n} (e.g. S4 is the number of successes out of 4 and has the outcomes {0, 1, 2, 3, 4}).
• Sn has all the properties of a random variable: we can associate a probability with each outcome (the binomial distribution) and it has a probability plot. Since it has a probability plot, it must have a center and a spread, and therefore a mean and a variance.
  – The mean of a binomial is n × p.
  – The variance of a binomial is n × p × (1 − p); the standard deviation is √(n × p × (1 − p)).
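The two formulas can be checked numerically. The following is a minimal sketch, assuming illustrative values n = 10 and p = 0.3 (not from the lecture) and two hypothetical helper functions; it computes the center and spread directly from the binomial probabilities and compares them with n × p and √(n × p × (1 − p)).

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(Sn = k): probability of exactly k successes in n independent draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def mean_and_sd_from_pmf(n, p):
    """Center and spread computed directly from the probability plot."""
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * prob for k, prob in enumerate(probs))
    variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(probs))
    return mean, sqrt(variance)

# n = 10 and p = 0.3 are illustrative values, not taken from the lecture.
n, p = 10, 0.3
mean, sd = mean_and_sd_from_pmf(n, p)
print(mean, n * p)                  # direct calculation vs the formula n x p
print(sd, sqrt(n * p * (1 - p)))    # direct calculation vs sqrt(n x p x (1 - p))
```

Each printed pair should agree up to floating-point rounding.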

Example 1: n = 4 and p = 0.2

• Suppose there are 4 exam questions; each answer is either wrong or right, and we give it the value 0 or 1. The probability of a success is P (X = 1) = 0.2 (the probability of getting it right by random guessing) and the probability of a failure is P (X = 0) = 0.8.
• We are interested in the number of successes out of 4; this is the random variable S4. Using the arguments we gave earlier we can show that:

\[
\begin{aligned}
P(S_4 = 0) &= (0.8)^4, \\
P(S_4 = 1) &= 4 \times (0.8)^3 \times (0.2)^1, \\
P(S_4 = 2) &= 6 \times (0.8)^2 \times (0.2)^2, \\
P(S_4 = 3) &= 4 \times (0.8)^1 \times (0.2)^3, \\
P(S_4 = 4) &= (0.2)^4.
\end{aligned}
\]

Hence we can plot the histogram, which has a center and a spread. The mean of S4 is 4 × 0.2 = 0.8 and the standard deviation is √(4 × 0.2 × 0.8) = √0.64 = 0.8.
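The five probabilities of Example 1 can be evaluated with a few lines of Python. This is only a sketch: the variable names are my own, and `math.comb` supplies the counts 1, 4, 6, 4, 1 that multiply the powers of 0.8 and 0.2.

```python
from math import comb

n, p = 4, 0.2  # values from Example 1

# P(S4 = k) = (number of orderings) x (0.2)^k x (0.8)^(4 - k)
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(S4 = {k}) = {comb(n, k)} x 0.2^{k} x 0.8^{n - k} = {prob:.4f}")

total = sum(pmf)  # the five probabilities should add up to 1
print(f"total = {total:.4f}")
```

The largest probabilities sit at 0 and 1 correct answers, which is what makes the histogram right skewed.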

The x-axis corresponds to the 5 different possible grades: 0, 1, 2, 3 or 4 correct (in survey terms: none say yes, one says yes, two say yes, three say yes, or all say yes). Since this is a discrete numerical variable, the y-axis can correspond to a probability.

Example 2: n = 50 and p = 0.2

• Suppose there are 50 exam questions. There is a 20% chance of getting each one correct by random guessing. The number of correct answers is a binomial, Bin(50, 0.2).
• The relative frequency histogram will be the binomial distribution. In Statcrunch you can make a plot of it (stat -> calculators -> binomial), use n = 50 and p = 0.2.
• Being completely clueless and answering randomly means that the average number of questions I get right will be 20% of 50, that is 0.2 × 50 = 10 (formula: mean = n × p).
• The variation is measured with a standard deviation. The variation in the score (its standard deviation) is s = √(50 × 0.2 × 0.8) = √8 ≈ 2.83 (using the formula: the standard deviation of a binomial is √(n × p × (1 − p))). Look at the distribution given in Statcrunch, does this look representative of the spread?
• The distribution is not symmetric but it is close to symmetric (locally) about 10 (compare this with an exam with only 15 questions, whose distribution looks ‘less’ symmetric and bell shaped).
• Observe that the probability of, say, scoring exactly 10 out of 50 is quite small, despite this being the average grade if one were to randomly guess. This is because as the sample size grows the chance of any one outcome becomes very small. This is the reason we are interested in the chance of an interval, for example, scoring between 5 and 15.
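The contrast between a single outcome and an interval can be made concrete. The sketch below uses a hypothetical helper `binomial_pmf` (not part of Statcrunch) and takes the interval "between 5 and 15" to include both endpoints, which is my reading of the slide.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Sn = k) for a binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 50, 0.2  # values from Example 2

prob_exactly_10 = binomial_pmf(10, n, p)
# Interval probability: add up the probabilities of 5, 6, ..., 15 (endpoints included).
prob_5_to_15 = sum(binomial_pmf(k, n, p) for k in range(5, 16))

print(f"P(S50 = 10)       = {prob_exactly_10:.3f}")
print(f"P(5 <= S50 <= 15) = {prob_5_to_15:.3f}")
```

The single most likely score still has a fairly small probability, while the interval around the mean captures most of the distribution.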

Example 3: n = 50 and p = 0.5

• Suppose now the exam is a true or false exam. The probability of getting a question correct by randomly guessing is 50%. There are 50 questions on the exam, so the grade in the exam follows a Bin(50, 0.5).
• I am interested in the number correct out of 50, i.e. S50 (the number of successes out of 50). The average number of successes is 50 × 0.5 = 25. Because S50 is a random variable, it has a histogram (distribution) and thus a variance (measure of spread). Its variance is 50 × 0.5 × 0.5 = 12.5. This measures how spread out the distribution is about the mean.
• In Statcrunch, make a plot of the histogram (stat -> calculators -> binomial), use n = 50 and p = 0.5.
• We see that it is symmetric about 25.

Plots in summary

Summary of binomial plots

Suppose that the number of questions in the exam was just 4:

• We showed that if P (X = 1) = 0.2 and P (X = 0) = 0.8, then for n = 4 the mean is 4 × 0.2 = 0.8 (the standard deviation is √(4 × 0.2 × 0.8) = 0.8) and the histogram is right skewed. This means we are more likely to observe small values of Sn (in terms of surveys this means few people say yes).
• On the other hand, if the chance of success is 0.8, i.e. P (X = 1) = 0.8 and P (X = 0) = 0.2, then for n = 4 the mean is 4 × 0.8 = 3.2 (the variance is 4 × 0.8 × 0.2 = 0.64) and the histogram is left skewed. This means we are more likely to observe large values of Sn (a lot of people say yes).
• If P (X = 1) = P (X = 0) = 1/2, then for n = 4 the mean is 4 × 0.5 = 2 and we are most likely to observe values in the middle of the interval [0, 4]. This time the histogram is symmetric (about 2).

• Now suppose the number of people we sample increases (we go from n = 4 to n = 50). What we observe is that around the peak of the histogram there is a symmetry (regardless of whether overall there is symmetry or not). In other words, regardless of the overall skew, about the peak it is close to symmetric, with a similar shape.
• As an exercise, in Statcrunch make plots of the histogram for p = 0.05 and let the sample size grow. See how the histogram looks less and less skewed about the center.
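The exercise can also be tried outside Statcrunch. The sketch below is one possible way to put a number on "less and less skewed": it uses the standardised third moment (a common skewness measure, which the lecture does not define) computed directly from the binomial probabilities. The function name and the sample sizes 20, 100 and 500 are my own choices.

```python
from math import comb, sqrt

def binomial_skewness(n, p):
    """Standardised third moment of Bin(n, p), computed from its probabilities."""
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * prob for k, prob in enumerate(probs))
    sd = sqrt(sum((k - mean) ** 2 * prob for k, prob in enumerate(probs)))
    return sum(((k - mean) / sd) ** 3 * prob for k, prob in enumerate(probs))

p = 0.05
sample_sizes = [20, 100, 500]  # illustrative choices of n
skews = [binomial_skewness(n, p) for n in sample_sizes]

for n, skew in zip(sample_sizes, skews):
    print(f"n = {n:3d}: skewness = {skew:.3f}")
```

A positive value indicates right skew, and the value shrinks toward zero as n grows, matching what the histograms show.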

Binomial p = 0.05 for various n

• In statistics it is usually not the number Sn out of n that we are interested in (for example, the number of people who say they will vote for a party out of a random sample of 500 individuals), but the proportion: the proportion of the sample who say they will vote for a party. This sample proportion p̂ = Sn/n is an estimator of the true proportion p of people who will vote for a party.
• The sample proportion is random (since Sn is random). Therefore it has a distribution. The distribution of p̂ has the same shape as that of Sn.
  – The plots above show that if n is quite large and p and 1 − p are not too small, the distribution of Sn (and p̂) looks ‘bell shaped’.
  – The distribution of Sn is centered about np (the mean of the binomial) with standard deviation √(np(1 − p)). The distribution of p̂ is centered about p with standard deviation √(p(1 − p)/n).
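A small simulation illustrates that the sample proportion is itself random, with center p and spread √(p(1 − p)/n). This is a sketch under assumptions: the "true" proportion p = 0.4, the number of repeated samples and the seed are all made up for illustration; only n = 500 comes from the example above.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(651)  # fixed seed so the run is repeatable

n, p = 500, 0.4        # p = 0.4 is a made-up "true" proportion
replications = 2000    # number of repeated samples (illustrative)

def sample_proportion(n, p):
    """Draw n individuals; each says yes with probability p; return Sn / n."""
    successes = sum(random.random() < p for _ in range(n))
    return successes / n

p_hats = [sample_proportion(n, p) for _ in range(replications)]

print(f"average of p-hat: {mean(p_hats):.4f}  (p = {p})")
print(f"sd of p-hat:      {pstdev(p_hats):.4f}  (formula: {sqrt(p * (1 - p) / n):.4f})")
```

A histogram of `p_hats` would show the bell shape discussed here; the simulated average and standard deviation should land close to the two formulas.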

• In other words, the distribution of the sample proportion becomes more and more bell shaped.
• What you are seeing is the central limit theorem coming into play: the distribution of the sample proportion becomes more bell shaped as the sample size grows.
• Note, it is not the distribution of the sample that gets more bell shaped as the sample size increases, but the distribution of the sample proportion or sample mean (the estimator) that gets more bell shaped. You cannot change the original distribution, it is what it is. Sometimes it is bell shaped, a lot of the time it is not.
• You would have observed something similar in HW1, Q5, when you plotted the averages of M&Ms.

What is that bell shape? The normal distribution

• We often find that the distributions of random variables that arise in nature have a distinctive shape.
• This distinctive bell-shaped curve is called a normal distribution. It arises all over the place:
  – The distribution of bullets when fired at a target.
  – The outcomes of social surveys.
  – Biological data (such as the heights of women).
• The normal distribution is a family of densities which are different but have certain characteristics in common.
• The normal distribution (sometimes called the Gaussian) is the most commonly used distribution in statistics.

The normal distribution (cont.)

• It is completely characterised by two parameters, the mean and variance.
  – The mean µ.
  – The variance σ².

Formally the density function of the normal distribution looks like (you don’t have to remember this!):

\[
y = f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]

This is a symmetric curve which is centered about µ with ‘spread’ σ. See handout: normal distribution introduction.pdf.
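The density formula can be typed in directly to check the two properties claimed for it. In this sketch the function name and the values µ = 10, σ = 2 are illustrative assumptions; it checks symmetry about µ and that the total area under the curve is (approximately) one.

```python
from math import exp, pi, sqrt

def normal_density(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

mu, sigma = 10.0, 2.0  # illustrative parameter values

# Symmetry about mu: equal heights at equal distances either side of the mean.
left = normal_density(mu - 1.5, mu, sigma)
right = normal_density(mu + 1.5, mu, sigma)

# Total area under the curve, approximated by a Riemann sum over mu +/- 8 sigma.
step = 0.001
grid = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step) + 1)]
area = sum(normal_density(x, mu, sigma) for x in grid) * step

print(left, right, area)
```

Changing µ shifts the curve left or right and changing σ stretches or squeezes it, but the area stays at one.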

Different normal distributions

Fit the distribution to female human, male human, gorilla and giraffe.

Calculating probabilities

• What we want to do is use the fact that observations come from a normal distribution to calculate probabilities. For example, suppose female heights are normally distributed with mean 64.5 inches and standard deviation 2.5 inches. I want to use this information to:
  – Calculate things like percentiles. Jane is 71 inches tall, what is her percentile (i.e. the proportion of people who are shorter than 71 inches)?
  – If someone is in the 90th percentile, how tall are they?
• In order to calculate these percentiles (probabilities), we need to utilize the normality of heights.
• But before doing this, we need to introduce the z-transform. This is a transformation which ‘measures’ the number of standard deviations the data is from the mean. For example, 71 inches is (71 − 64.5)/2.5 = 2.6 standard deviations from the mean.
• Once the z-transform has been evaluated, we can use the standard normal to calculate probabilities.
• We start by calculating probabilities in the z-transform world, all of which use the standard normal.
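As a preview of where this is heading, the two height questions can be answered with Python's standard library. This is a sketch rather than the course's method (the lecture uses tables and Statcrunch): `statistics.NormalDist` plays the role of the standard normal table, and the variable names are my own.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5          # female heights, from the example above
standard_normal = NormalDist() # mean 0, standard deviation 1

# Jane: z-transform first, then look up P(Z <= z) under the standard normal.
jane = 71
z = (jane - mu) / sigma
percentile = standard_normal.cdf(z)
print(f"z = {z:.1f}, percentile = {percentile:.4f}")

# 90th percentile: find z with P(Z <= z) = 0.9, then undo the z-transform.
z_90 = standard_normal.inv_cdf(0.90)
height_90 = mu + z_90 * sigma
print(f"90th percentile height = {height_90:.1f} inches")
```

Both calculations pass through the standard normal: the first converts a height to z and then to a probability, the second converts a probability to z and then back to a height.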

The standard normal - page 1090 of Longnecker and Ott

• The normal tables give the probabilities P (Z < z) in the special case Z ∼ N (0, 1) (the so-called standard normal):
  – mean is zero (µ = 0)
  – variance is one (σ² = 1).
• Look at the normal tables. Suppose we want to use them to evaluate P (Z < b). The first column and first row of the table together give b; the inside of the table yields the probability P (Z < b).
• Suppose we want to evaluate P (Z ≤ 1.23). Since 1.23 = 1.2 + 0.03, the first column gives the 1.2 value and the first row gives the 0.03 value. We find the 1.2 and 0.03 values and locate the value inside the table where this row and column intersect.

• This intersection point is the probability, that is P (Z ≤ 1.23) = 0.8907.
• The area under the graph is the probability, which corresponds to the value given in the table.
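To see how the table is laid out, the sketch below rebuilds the single row labelled 1.2 using Python's `statistics.NormalDist`. The layout (ten columns 0.00 to 0.09, entries rounded to four decimals) is an assumption about the usual form of such tables, not something taken from the textbook.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

row = 1.2                                # value read from the first column
columns = [i / 100 for i in range(10)]   # first row: 0.00, 0.01, ..., 0.09

# Each table entry is P(Z <= row + column), rounded to four decimals.
table_row = [round(Z.cdf(row + col), 4) for col in columns]

print(table_row)
print("P(Z <= 1.23) =", table_row[3])    # column 0.03 gives 0.8907
```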

Examples - standard normal

The following exercise will seem dry (because there are not many real life populations which follow a standard normal distribution). However, calculating standard normal probabilities will be very useful in calculating nonstandard normal probabilities (which are widely used).

(a) Evaluate P (0.6 < Z ≤ 1.3).
(b) (i) P (Z ≤ −1.1), (ii) P (Z ≤ 0.6), (iii) P (Z ≤ 3.0), (iv) P (Z ≤ −2.12).
(c) How should we interpret P (Z ≤ −1.1) and P (Z ≤ 3.0)?
(d) (i) P (Z > −1.1), (ii) P (Z > 0.6), (iii) P (Z > 3.0), (iv) P (Z > −2.12).
(e) (i) P (−1.1 < Z ≤ 0.6), (ii) P (−2.12 < Z ≤ 3.0), (iii) P (−2.12 < Z ≤ 0).

Look at the handout http://www.stat.tamu.edu/~suhasini/teaching651/standard_normal_tables.pdf for the solutions.
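The answers can be cross-checked in software. The sketch below is not the handout's working; it applies three rules: a table value is P (Z ≤ z), an upper tail is 1 − P (Z ≤ z), and an interval is the difference of two table values. `Phi` is my own shorthand for the standard normal probability P (Z ≤ z).

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # Phi(z) = P(Z <= z) for the standard normal

values = (-1.1, 0.6, 3.0, -2.12)

a = Phi(1.3) - Phi(0.6)              # (a) P(0.6 < Z <= 1.3)
b = [Phi(z) for z in values]         # (b) P(Z <= z), read straight from the table
d = [1 - Phi(z) for z in values]     # (d) P(Z > z) = 1 - P(Z <= z)
e = [Phi(hi) - Phi(lo)               # (e) P(lo < Z <= hi) = Phi(hi) - Phi(lo)
     for lo, hi in ((-1.1, 0.6), (-2.12, 3.0), (-2.12, 0.0))]

print("(a)", round(a, 4))
print("(b)", [round(x, 4) for x in b])
print("(d)", [round(x, 4) for x in d])
print("(e)", [round(x, 4) for x in e])
```

Small differences in the last decimal place from the table-based answers are possible, because the table entries are rounded before they are subtracted.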

Calculating probabilities using Statcrunch

Probabilities can also be calculated in Statcrunch. Go to Stat → Calculators and select Normal. Put in the correct mean and standard deviation (0 and 1 for the standard normal).

A ‘practical’ example

• Suppose that Z is the weight of an alien species (that has learnt the mystery of zero and negative weight). We will assume that this species of aliens has a weight distribution which follows a standard normal: mean zero and standard deviation (and variance) equal to one. A plot of the density (remember this is similar to a histogram) is given below.

• From the plot you can see that an alien is equally likely to have a positive weight as a negative weight. Indeed the chance of the alien’s weight being positive is 50% (since the density is symmetric about zero).
• Again, by symmetry, the chance of its weight being more than 2 is the same as the chance of it being less than −2.
• Furthermore, the weight of this alien species can take any value from negative infinity to positive infinity.
• However, most of the time (in fact more than 99% of the time) if you randomly select one of these aliens, their weight will be in the interval (−3, 3). We now illustrate why:
• Draw the normal distribution and show what P (−3.0 ≤ Z ≤ 3) is on it. Evaluate this probability.

• We see that P (−3.0 ≤ Z ≤ 3) = P (Z ≤ 3) − P (Z < −3) = 0.9987 − 0.0013 = 0.9974, which is large. Hence we are likely to draw alien weights in this interval.
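Both the symmetry claim and the "more than 99%" claim can be checked numerically. This sketch uses `statistics.NormalDist` in place of the table; the variable names are my own.

```python
from statistics import NormalDist

Z = NormalDist()  # the alien weight distribution: standard normal

# Symmetry: the chance of being above 2 equals the chance of being below -2.
upper_tail = 1 - Z.cdf(2)   # P(Z > 2)
lower_tail = Z.cdf(-2)      # P(Z < -2)

# Interval probability: P(-3 <= Z <= 3) = P(Z <= 3) - P(Z < -3).
within_3 = Z.cdf(3) - Z.cdf(-3)

print(f"P(Z > 2) = {upper_tail:.4f}, P(Z < -2) = {lower_tail:.4f}")
print(f"P(-3 <= Z <= 3) = {within_3:.4f}")
```

The unrounded interval probability is about 0.9973; subtracting the four-decimal table entries 0.9987 and 0.0013 gives a value that differs only in the last digit.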

Given a probability, locating the value on the x-axis

• Suppose Z ∼ N (0, 1) (Z is a random variable with a standard normal distribution).
• Question 1 We want to find the value of t such that P (Z ≤ t) = 0.8. For example, if the weight of a randomly selected person followed a standard normal distribution, then P (Z ≤ t) = 0.8 means that the probability that a randomly selected person has weight less than t (we want to find this t) is equal to 0.8.

• Solution 1 To find t, look inside the normal tables and locate the value closest to 0.8 (it is very, very important that you look inside the table, not on the sides). Then read outwards to the first column and first row; you should get the value 0.84. Hence P (Z ≤ 0.84) ≈ 0.8, and t = 0.84.

• Question 2 We want to find the value of t such that P (Z > t) = 0.02.
• Solution 2 First draw this. We know that if P (Z > t) = 0.02, then P (Z ≤ t) = 1 − 0.02 = 0.98. Look inside the tables to find the value closest to 0.98. You should see approximately 2.06. Hence P (Z ≤ 2.06) ≈ 0.98 and t ≈ 2.06.
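Reading the table "inside out" corresponds to the inverse of P (Z ≤ t). The sketch below uses `NormalDist.inv_cdf` from Python's standard library as a stand-in for that lookup; the table answers are rounded to two decimals, so small differences are expected.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# Question 1: find t with P(Z <= t) = 0.8.
t1 = Z.inv_cdf(0.80)

# Question 2: P(Z > t) = 0.02 is the same as P(Z <= t) = 1 - 0.02 = 0.98.
t2 = Z.inv_cdf(1 - 0.02)

print(f"Question 1: t = {t1:.4f}")
print(f"Question 2: t = {t2:.4f}")

# Going back the other way recovers the probabilities we started from.
print(f"check: P(Z <= t1) = {Z.cdf(t1):.4f}, P(Z > t2) = {1 - Z.cdf(t2):.4f}")
```

The software answers are about 0.84 and 2.05, close to the values read from the table.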
