Objectives. 3.3 Toward statistical inference. Toward statistical inference Sampling variability

Author: Marianna Webb
Objectives 3.3 Toward statistical inference

- Population versus sample (CIS, Chapter 6)
- Toward statistical inference
- Sampling variability
- Further reading: http://onlinestatbook.com/2/estimation/characteristics.html (some of the concepts introduced in this link are beyond this class)

Adapted from authors’ slides © 2012 W.H. Freeman and Company

The inconvenient truth

- So far we have assumed that the mean of a population is known.
- In reality we never observe the whole population, so its mean is unknown.
- Inference means estimating the unknown population mean from a sample that is very small compared with the population.
- We illustrate what is meant by this in the following examples.
- See also the recent journal article from Poultry Science.

Toward statistical inference

- In a survey of 2000 randomly sampled college students, 62% reported that they have encountered some type of harassment.
- Parents are worried: what is the truth about the millions of students who are currently at college?
- Because the sample was taken at random, it seems quite reasonable to suppose that it is representative of the population of college students. This suggests that about 62% of all college students may have encountered some type of harassment.
- 62% is in fact an estimate of the population proportion who have encountered harassment. What is the exact proportion?
- This is the start of statistical inference, where we draw conclusions about the entire population based on a sample. 62% is not the exact value; it will vary from sample to sample, and our objective in the next few lectures is to understand this variability. This will help us to understand the reliability of the estimate.

Refresher: Definitions

- Population: The entire group of individuals in which we are interested but cannot assess or observe directly. Examples: all college students, all calves, etc.
- Often the population is described by a mathematical model.
- A parameter is a number describing a characteristic of the population.
- Sample: The part of the population we actually examine and for which we do have data.
- How well the sample represents the population depends on the sampling method, as well as on the sample size.
- A statistic is a number describing a characteristic of a sample.

[Figure: a sample drawn from a population]

Example: M&M data

- To illustrate what we mean by a population and a sample, let us return to the M&M example.
- Let us suppose that the 170 M&M bags represent the population of M&Ms (in reality we do not observe the population, so this is just an example for illustration).
- The population mean for the number of M&Ms is 13.54.
- A random sample of size 5 is taken. There are 170^5 (about 142 billion) different random samples that can be taken, if we draw with replacement and keep track of the order!
- Note: Examples of random samples are given in homework 1.
- On the next two slides we show how to sample from the distribution.

[Figure: two slides, labelled Sample 1 and Sample 2. Top plot: the distribution of the number of M&Ms in a bag (over 170 bags). Middle plot: one sample of size 5. Lower plot: the average of that sample (the sample mean).]

Observe how the sample mean is different for the two samples.
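The idea of drawing two samples of size 5 and comparing their means can be tried in a few lines of Python. This is only a sketch: the 170 bag counts are not reproduced in these notes, so `population` below is a made-up stand-in (random counts between 5 and 21), and the function name `draw_sample_mean` is ours.

```python
import random
import statistics

# Hypothetical stand-in for the 170 bag counts (the real M&M data is not given here).
rng = random.Random(0)
population = [rng.randint(5, 21) for _ in range(170)]

def draw_sample_mean(population, n, rng):
    """Draw n values with replacement and return their sample mean."""
    sample = rng.choices(population, k=n)
    return statistics.mean(sample)

population_mean = statistics.mean(population)
sample_means = [draw_sample_mean(population, 5, rng) for _ in range(2)]

print("population mean:", round(population_mean, 2))
print("two sample means of size 5:", sample_means)
print("ordered samples of size 5 with replacement:", 170 ** 5)
```

Running the last three lines again with a different seed gives different sample means while the population mean stays fixed, which is the point of the two slides.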

Sampling variability

As illustrated by the previous example, each time we take a sample from a population we are likely to get a different set of individuals and calculate a different value for our statistic (such as the sample mean). This is called sampling variability. At first sight this might suggest that the sample and the statistic contain no information about the population. However, the good news is that if we imagine taking lots of random samples of the same size from a given population, the variation from sample to sample (the sampling distribution) will follow a predictable pattern. All of statistical inference is based on this idea: to see how trustworthy a statistic is, we ask what would happen if we kept repeating the sampling many times.

We measure the quality of a statistic (such as the sample mean) with:

- Accuracy (bias): Random samples provide accurate estimates of a parameter because they are unbiased (or close to unbiased, depending on the random sampling method).
  - This is done by sampling in a good way (i.e. randomly sampling over the population of interest).
  - Using a well-constructed statistic.
  - Typically we will assume an estimator is unbiased.
  - When reading an article, identify the population of interest and the potential biases which may arise.
- Reliability (variability): A reliable estimation method is one that would give similar results if the random sampling were repeated over and over.
  - The less variable a statistic, the more reliable it is.
  - Random sampling enables us to measure the variability of a statistic.
  - We do this with the standard error; in the next slide we define what this means.
  - Important: The larger the sample size, the less variable the corresponding estimator will be.

To understand the above concepts look at the question at the end of this page: http://onlinestatbook.com/2/estimation/characteristics.html

Measuring variability

- We have come across variability before. Recall that in Chapter 3 we used the standard deviation to measure the variability in the sample. The sample standard deviation measures the typical deviation of each observation from the sample mean:

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} $$

- The same criterion is used to measure the variability in the sample mean (and all other estimators). This is called the standard error.
- More precisely, we measure the average spread of the estimator around the true mean.
- Looking back at the M&M example, it would appear that we have to calculate 170^5 sample means!
- This is impossible.
- Remarkably, we can find a very nice expression for the standard error which requires very little effort!
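For a population small enough to list every possible sample, the brute-force calculation described above can actually be carried out. The sketch below uses a made-up population of four values (not the M&M data) and samples of size 2 drawn with replacement, so there are only 4^2 = 16 ordered samples. It also checks the sample standard deviation formula against Python's `statistics.stdev`. The comparison with σ/√n anticipates the simple formula given later in these notes.

```python
import itertools
import math
import statistics

def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor, as in the formula for s."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# Tiny made-up population, chosen so that every sample can be listed.
population = [2, 4, 6, 8]
n = 2

# Every ordered sample of size n drawn with replacement, and its mean.
all_means = [statistics.mean(s) for s in itertools.product(population, repeat=n)]

# Standard error by brute force: spread of all sample means around the true mean.
standard_error = statistics.pstdev(all_means)
sigma = statistics.pstdev(population)  # population standard deviation

print("s for a made-up sample:", sample_sd([12, 15, 9, 18, 14]))
print("brute-force standard error:", standard_error)
print("sigma / sqrt(n):           ", sigma / math.sqrt(n))
```

The two printed standard errors agree, which is why the full enumeration is never needed in practice.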

Population size does not matter

There are about 15 million students in higher education. In the harassment survey about 2000 people were randomly surveyed. This means that the sexual harassment survey interviewed about one in every 7500 students. 62% is an estimate of the true population proportion.

- Question: Would the estimate of the proportion be better if the population size were smaller? For example, 1.5 million students rather than 15 million students.
- Answer: No. Only the size of the sample, in this case n = 2000, has an influence on its reliability, not the size of the population.
- Statistical inference is not based on how close the sample size is to the population size (usually we assume that the population is infinite). It is based on the idea that a simple random sample gives a representative sample of the entire population.
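A small simulation can make the answer plausible. The sketch below assumes a true proportion of exactly 0.62 (in reality this is unknown) and repeatedly draws simple random samples of n = 2000 from populations of 1.5 million and 15 million. The helper name `spread_of_estimates` and the number of repetitions (200) are our choices for illustration.

```python
import random
import statistics

def spread_of_estimates(population_size, true_proportion, n, repeats, rng):
    """Standard deviation of the sample proportion over repeated simple random samples."""
    # People are labelled 0 .. population_size - 1; the first yes_count answer "yes".
    yes_count = round(true_proportion * population_size)
    estimates = []
    for _ in range(repeats):
        sample = rng.sample(range(population_size), n)  # sampling without replacement
        estimates.append(sum(1 for person in sample if person < yes_count) / n)
    return statistics.stdev(estimates)

rng = random.Random(1)
small_population = spread_of_estimates(1_500_000, 0.62, 2000, 200, rng)
large_population = spread_of_estimates(15_000_000, 0.62, 2000, 200, rng)

print("spread of estimates, 1.5 million students:", round(small_population, 4))
print("spread of estimates, 15 million students: ", round(large_population, 4))
```

Both spreads come out at roughly one percentage point, so shrinking the population tenfold does not make the estimate noticeably more reliable.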

Summary and what’s to come

- The techniques of statistics allow us to draw inferences or conclusions about a population using the data from a sample.
- Your estimate of the population parameter is only as good as your sampling design. → Work hard to eliminate biases (design your experiment well).
- Your sample statistic is only an estimate, and if you randomly sampled again you would probably get a somewhat different result (more of this next).
- In the next section we will show:
  - The distribution of the estimates (for much of the course it will be the sample mean) will, if the sample size is large enough, be normally distributed, even if the observations are not normal.
  - The standard error (reliability) has a simple formula!

Objectives 5.1

Sampling distribution of a sample mean (CIS, Chapter 8)

- The mean and standard deviation of x̄
- For normally distributed populations
- The central limit theorem (CIS, Chapter 8 and p103)
- Additional reading: http://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html


Simulation tools used

- To demonstrate the concepts here I will be using an applet in StatCrunch called Sampling Distributions. It is highly recommended that you try this out yourself.
- Applets -> Sampling Distributions.
- Select the distribution (from uniform etc.) or choose the data table (your own data). Press Compute.
- Choose your sample size (this is how large a sample you use).
- 1000 times etc. has NOTHING to do with sample size. It is the number of samples you draw (this part is the thought experiment). You should make this as large as possible (I usually set it to 100,000).
- Press the + sign next to Sampling means to get the QQplot of the distribution of the sample mean. Do not press the + sign next to Samples; this will give you the QQplot of the sample.
- Conceptually, what we will be doing is rather sophisticated and it will take time to precisely understand the ideas behind inference. This is NOT plug and chug.
- Note that you can customize the (parent) distribution from which you sample by left-clicking over the parent distribution and moving the cursor to draw the shape you want the distribution to have.

M&M example

- Look first at the distribution of the total number of M&Ms in a bag. We will treat this as our ‘population’.

Just comparing the histogram with the normal curve we can see that it is not normal. There are two reasons for this: (a) The mix of different types of M&Ms (milk chocolate, peanut and peanut butter) induces multimodality in the distribution. (b) The number of M&Ms is a discrete numerical random variable.

In the following examples we will be drawing M&M bags (numbers) from this distribution. It is analogous to putting all 170 counts in a bag and drawing them out (with replacement). We see that we are most likely to draw the number 18 and least likely to draw 14 (within the range 5-21).

Distribution of average: sample size 5

- Let us now look at the distribution of the sample mean of all samples of size 5. That is, we randomly sample 5 values from the population and take the sample mean.

QQplot of average: sample size 5

- Let us now look at the QQplot of the sample mean of all samples of size 5 (corresponding to the histogram on the previous page).

Observations:
1. The histogram of the sample mean is more bell-shaped than the original distribution. However, it is certainly not normal (the spikes we see are due to taking the average of only 5 numbers, which is not continuous enough).
2. There is less spread in the distribution of the averages than in the original histogram.
3. The QQplot shows a large deviation from normality in the tails.

Distribution of average: sample size 10

- Let us now look at the distribution of the sample mean of all samples of size 10. That is, we randomly sample 10 values from the population and take the sample mean.

QQplot of average: sample size 10

- Let us now look at the QQplot of the sample mean of all samples of size 10 (corresponding to the histogram on the previous page).

Observations:
1. The histogram of the sample mean is a lot more bell-shaped than the original distribution. The spikes that were seen for sample size 5 have gone (the bumps you see on the histogram are due to the bin width).
2. There is even less spread in the distribution of the averages than in the original histogram.
3. The QQplot shows only a small deviation from normality in the top tail of the distribution.

Distribution of average: sample size 20

- Let us now look at the distribution of the sample mean of all samples of size 20. That is, we randomly sample 20 values from the population and take the sample mean.

QQplot of average: sample size 20

- Let us now look at the QQplot of the sample mean of all samples of size 20 (corresponding to the histogram on the previous page).

Observations:
1. The histogram of the sample mean is pretty much normal.
2. There is even less spread in the distribution of the averages than in the original histogram.
3. The QQplot shows only a very tiny deviation from normality in the tails of the distribution.

Distribution of average: sample size 40

- Let us now look at the distribution of the sample mean of all samples of size 40. That is, we randomly sample 40 values from the population and take the sample mean.

QQplot of average: sample size 40

- Let us now look at the QQplot of the sample mean of all samples of size 40 (corresponding to the histogram on the previous page).

Observations:
1. The histogram of the sample mean is almost normal.
2. There is even less spread in the distribution of the averages than in the original histogram.
3. The QQplot is very close to the x=y line.

Summary: Sampling distribution of M&Ms

Summary of averages of M&Ms

Sample size   Mean    Standard error   Formula     Comment
original      13.54   4.64             4.64/√1     Not normal
5             13.54   2.07             4.64/√5     More unimodal
10            13.54   1.466            4.64/√10    Getting normal
20            13.54   1.037            4.64/√20    Mostly there
40            13.54   0.7357           4.64/√40    Pretty much normal

This example illustrates three major insights:
- The distributions of the sample means are centered about the true mean 13.54. This tells us that the sample mean is not biased.
- We see that the spread in the sample means decreases as the sample size used to evaluate them increases. The spread/reliability/variability is measured using the standard error, which has the formula σ/√n (in this case σ = 4.64 and n = 5, 10, 20 or 40).
- The distribution of the sample mean becomes more normal (look at the QQplots) as the sample size grows.
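The formula σ/√n can be checked against the standard errors reported in the summary table. The sketch below simply evaluates 4.64/√n for each sample size; the `observed` values are copied from the table, and small differences are to be expected because the table values come from simulated samples.

```python
import math

sigma = 4.64  # standard deviation of the M&M "population"

# Standard errors reported in the summary table, keyed by sample size.
observed = {5: 2.07, 10: 1.466, 20: 1.037, 40: 0.7357}

def standard_error(sigma, n):
    """Standard error of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)

for n, seen in observed.items():
    print(f"n = {n:2d}: formula {standard_error(sigma, n):.4f}, table {seen}")
```

Notice that quadrupling the sample size (from 5 to 20, or from 10 to 40) halves the standard error.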

Properties: Sample mean for normally distributed data

When a variable in a population is normally distributed, the sampling distribution of x̄ for all possible samples of size n is also normally distributed.

If the population is Normal(µ, σ) then the sample mean’s distribution is Normal(µ, σ/√n). Note that the sample average has less variability than any individual observation.

[Figure: curves labelled Population and Sampling distribution]

Properties: Sample mean of non-normally distributed data

Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, if n is large enough then the sampling distribution of x̄ is approximately normal: x̄ ~ N(µ, σ/√n).

[Figure panels: Population with strongly skewed distribution; Sampling distribution of x̄ for n = 2 observations; Sampling distribution of x̄ for n = 10 observations; Sampling distribution of x̄ for n = 25 observations]
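The effect shown in these panels can be imitated with a short simulation. The sketch below assumes an exponential population with mean 1 and standard deviation 1 as the strongly skewed population (the population behind the slide is not specified), and uses the average cubed z-score as a rough measure of skewness. The function names are ours.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric shape, positive for a right skew."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def simulate_sample_means(n, repeats, rng):
    """Means of `repeats` samples of size n from an exponential population (mean 1, sd 1)."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repeats)]

rng = random.Random(3)
results = {}
for n in (2, 10, 25):
    means = simulate_sample_means(n, 20_000, rng)
    # Store (spread of the sample means, skewness of the sample means).
    results[n] = (statistics.pstdev(means), skewness(means))
    print(f"n = {n:2d}: spread {results[n][0]:.3f}, skewness {results[n][1]:.3f}")
```

As n grows the spread shrinks like 1/√n and the skewness moves toward 0, which is the central limit theorem at work.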



Calculation Practice

In 2010 the combined SAT scores had mean 1016 and standard deviation 212. They also had an approximately normal distribution. The population distribution is Normal(µ = 1016; σ = 212).

- In Chapter 4, we used the normal distribution to show that the probability of a randomly selected student scoring 1100 or higher is 34.5%.
- Now, suppose 50 students are randomly selected and their SAT scores averaged. What is the probability that the average is greater than 1100?

The sampling distribution of the sample average when n = 50 is Normal(µ = 1016; σ/√n = 212/√50 = 29.98). Using these values, the z-score for 1100 is

$$ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{1100 - 1016}{212/\sqrt{50}} = \frac{84}{29.98} = 2.80. $$

In Table A, the area to the right of 2.80 is 0.0025. So there is only a 0.25% chance that the average of 50 randomly sampled students is more than 1100. In this example we do not need the CLT because the original data is assumed normal.
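The same two probabilities can be computed without tables. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are ours. Because the computer does not round z to two decimals, the results differ slightly from the table lookups.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1016, 212, 50

# Distribution of one student's score, and of the average of n students' scores.
single_score = NormalDist(mu, sigma)
average_score = NormalDist(mu, sigma / sqrt(n))

p_single = 1 - single_score.cdf(1100)   # one randomly selected student
p_mean = 1 - average_score.cdf(1100)    # average of 50 randomly selected students
z_mean = (1100 - mu) / (sigma / sqrt(n))

print("P(one score > 1100)      =", round(p_single, 4))
print("z for the average        =", round(z_mean, 2))
print("P(average of 50 > 1100)  =", round(p_mean, 4))
```

The only change between the two calculations is the standard deviation: σ for a single score, σ/√n for the average.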

Calculation Practice

Hypokalemia is diagnosed when blood potassium levels are below 3.5 mEq/dl. Let’s assume that we know a patient whose measured potassium levels vary daily according to the Normal(µ = 3.8, σ = 0.2) distribution. If only one measurement is made, what is the probability that this patient will be misdiagnosed with hypokalemia?

$$ z = \frac{x - \mu}{\sigma} = \frac{3.5 - 3.8}{0.2} = -1.5, \qquad P(z < -1.5) = 0.0668 \approx 7\%. $$

Instead, if measurements are taken on 4 separate days, what is the probability of a misdiagnosis (in this case the sample mean based on 4 measurements is below 3.5)?

$$ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.5 - 3.8}{0.2/\sqrt{4}} = -3, \qquad P(z < -3) = 0.0013 \approx 0.1\%. $$

Note: If the problem is about the sample mean, make sure to standardize (get z) using the standard error for the sample mean.

Calculation Practice: using the CLT

- In Chapter 4 we discussed ACT scores. We argued that because the grades are discrete numerical values over a small range, the grade distribution could not be normally distributed. This means we cannot use the normal distribution to calculate probabilities for one randomly selected person. BUT if the sample size is large enough we can use the normal distribution to calculate probabilities for averages. We recall that the mean ACT score is 20 with standard deviation 5.
- Question: 50 students are randomly selected. Calculate the probability that their average (sample mean) score will be greater than 18.
- Answer: The sample mean has the same mean as the original distribution, which we know is 20. The standard error of the sample mean is s.e. = 5/√50 = 0.707. We use this to make the z-transform

$$ z = \frac{18 - 20}{0.707} = -2.828 $$

Looking up the z-tables using a computer, we see that the probability of being above z = −2.828 is about 99.8%. This means there is a very large chance that the sample mean is greater than 18.

Calculation Practice

- Let us return to the weights of calves at 0.5 weeks. The distribution is shown in the plot.

[Figure: histogram of calf weights at 0.5 weeks with a normal density overlaid]

- Looking at the plot, it seems that a normal density (with mean 90.11 and standard deviation 7.7) is a rough approximation of the underlying distribution of calf weights (see also the QQplot given at the end of Chapter 4).
- Question (a): Using the normal density, calculate the approximate probability that a calf weighs more than 100 pounds.
- Answer: Make a z-transform z = (100 − 90.11)/7.7 = 1.28. Looking this up in the z-tables, the area to the left is 90%. Therefore the approximate probability that a calf weighs more than 100 pounds is 10%.
- Question (b): Let us suppose that the sample mean of 10 calves is taken. Using the normal approximation of the sample mean, what is the probability that the sample mean will be greater than 100 pounds?
- Answer: The mean of the sample mean is the same as the mean weight of calves, which is 90.11. The standard deviation of the sample mean is 7.7/√10 = 2.4. By making the z-transform we have z = (100 − 90.11)/2.4 = 4.12. Looking 4.12 up in the z-tables, we see that it is in the far upper tail, thus the probability is close to 0%.
- The sizes of the probabilities calculated in (a) and (b) are compared in the plots.

- Of the two probabilities calculated above, which is likely to be closest to the true probability?
- Both probabilities were calculated using the normal distribution. But this is only an approximation of the true distribution of calf weights and of the sample mean of calf weights.
- From the histogram two pages back, it appears that the density for the underlying weights of calves is only very approximately normal. Thus it is unlikely that the probability calculated for the weight of one calf is that accurate.
- On the other hand, the Central Limit Theorem tells us that the distribution of the sample mean gets closer to normal as the sample size grows. The second probability we calculated was based on the average weight of 10 calves. The distribution of the average is likely to be more normal than the distribution of individual calf weights. Thus the second probability, based on the average, is likely to be more accurate (closer to the true probability).

Calculation practice

- A farmer wants to use a vehicle to carry 30 calves that are 0.5 weeks old. The vehicle he plans to use can carry a maximum load of 2760 pounds. He knows that the mean weight of a calf is 90.11 pounds and the standard deviation is 7.7. What is the chance the vehicle can carry the calves?
- We need to turn the total weight into the sample mean. Requiring the total weight of 30 calves to be less than 2760 pounds is the same as requiring the sample mean weight of 30 calves to be less than 2760/30 = 92:

$$ \sum_{i=1}^{30} X_i < 2760 \;\Rightarrow\; \bar{X} = \frac{1}{30} \sum_{i=1}^{30} X_i < \frac{2760}{30} = 92 $$

Therefore, we have turned the problem from totals into averages and can apply the CLT to calculate the probability using the normal distribution.

Calculation practice (cont)

- We know from the central limit theorem that the sample mean is close to normally distributed. Thus the distribution of the sample mean is approximately normal with mean 90.11 and standard deviation 7.7/√30 = 1.4.
- We know that for the vehicle to carry the calves, the sample mean has to be less than 92 pounds.
- Calculate the z-transform z = (92 − 90.11)/1.4 = 1.35 and look up the z-tables to get 91.1%.
- Conclusion: There is a 91.1% chance the vehicle can carry the 30 calves.
- In mathematical symbols:

$$ P\left( \sum_{i=1}^{30} X_i < 2760 \right) = P\left( \bar{X} = \frac{1}{30} \sum_{i=1}^{30} X_i < \frac{2760}{30} \right) = P(Z < 1.35) = 0.911 $$
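The step from totals to averages can be written as a small function. This is a sketch under the same assumption as the slide, namely that the CLT makes the sample mean approximately normal; the function name `prob_total_below` is ours. The unrounded standard error is used, so the answer differs from 0.911 only in the third decimal.

```python
from math import sqrt
from statistics import NormalDist

def prob_total_below(limit, n, mu, sigma):
    """Approximate P(total of n observations < limit) via the normal model for the mean."""
    mean_limit = limit / n              # total < limit  is the same as  mean < limit / n
    standard_error = sigma / sqrt(n)    # standard deviation of the sample mean
    return NormalDist(mu, standard_error).cdf(mean_limit)

p_vehicle_ok = prob_total_below(limit=2760, n=30, mu=90.11, sigma=7.7)
print("chance the vehicle can carry the calves:", round(p_vehicle_ok, 3))
```

Changing `limit` or `n` shows how quickly the chance drops when the farmer adds even one or two extra calves.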

How large is a large enough sample size?

It depends on the population distribution. More observations are required if the population distribution has a large standard deviation or if it is far from normal in distribution.

- A sample size of 25 is generally enough to obtain a normal sampling distribution from a population with some skewness or even mild outliers.
- A sample size of 40 will typically be good enough to overcome some skewness and outliers.
- More importantly, n should be large enough to make the standard error sufficiently small; then we can get meaningful and precise inferences. We can check this by using the Sampling Distributions applet.
- In many cases, even n = 40 is not large enough to give sufficiently reliable results when there is a lot at stake. This is why clinical trials, political polls and marketing surveys typically observe 100s or even 1000s of individuals.

The effect of skewness on the CLT

Below we look at the sample mean taken from data with a large right skew.

The corresponding QQplot of the sample mean

Observations:
1. We see that the standard error is 0.756 ≈ 4.7/√40, which is as it should be.
2. However, the QQplot deviates far from normality in the tails. The distribution of the sample mean still has a slight right skew (look back at the QQplots in Chapter 4). This demonstrates that when data is highly skewed, we need a much larger sample size for the CLT to kick in.
3. Calculations based on normality of the average will not be completely correct.

Effect of binary data on the CLT

Binary data arises in several situations: Male or Female, Like or Dislike, or wherever there are two possible outcomes. In this example, we have coded one outcome as 0 and the other as 1 (it does not really matter which way round). We see that the proportion in category 1 is about 20%; this is what is meant by the mean. This data is discrete and clearly skewed.

The corresponding QQplot of the sample mean

Observations:
1. We see that the standard error is 0.0571 ≈ 0.405/√50, which is as it should be.
2. However, the QQplot deviates far from normality in the tails. The horizontal lines demonstrate that the average over 50 observations still takes discrete values (though not integers). We also see a U shape, which shows that the sample mean is still skewed.
3. Calculations based on normality of the average will not be completely correct.
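The discreteness of the average can be seen directly in a simulation. The sketch below assumes that each observation equals 1 with probability exactly 0.2 (the slide only says "about 20%", and its standard deviation of 0.405 suggests a slightly different value), and draws 20,000 samples of size 50. The function name `sample_proportion` is ours.

```python
import random
import statistics

rng = random.Random(2)
p, n, repeats = 0.2, 50, 20_000  # p = 0.2 is an assumption based on "about 20%"

def sample_proportion(p, n, rng):
    """Mean of n zero/one observations, each equal to 1 with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p) / n

means = [sample_proportion(p, n, rng) for _ in range(repeats)]
distinct_values = sorted(set(means))
spread = statistics.pstdev(means)

print("number of distinct sample means:", len(distinct_values))
print("first few distinct values:", distinct_values[:5])
print("spread of the sample means:", round(spread, 4))
```

Every sample mean is a multiple of 1/50 = 0.02, which is what produces the horizontal lines in the QQplot, and the spread is close to √(0.2 × 0.8)/√50.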

Example: Income distribution

Let’s consider the very large database of individual incomes from the Bureau of Labor Statistics as our population. Income is strongly right skewed.

- We take 1000 SRSs of 25 incomes, calculate the sample mean for each, and make a histogram of these 1000 means.
- We also take 1000 SRSs of 100 incomes, calculate the sample mean for each, and make a histogram of these 1000 means.

Which histogram corresponds to samples of size 100? Which to samples of size 25?

So many standard deviations!

In statistics we talk about different kinds of standard deviations, and it can be hard to keep track of them:

- s is the standard deviation of a set (sample) of data. It is a statistic we can compute once we have the data.
- σ is the standard deviation of a population (which is much too big to observe completely). It is a parameter; usually, we will never know its true value.
- σ/√n is the standard deviation of the values of x̄ from all possible random samples of size n. It refers to the sample mean, not to data. It is also called the standard error of x̄.
- s/√n is our estimate of σ/√n, since we do not know the value of σ.

From a survey of students taking statistics, n = 459 responded to the question “How many Facebook friends do you have?” The sample mean was x̄ = 566.9 and the sample standard deviation was s = 589.5. The standard error for the sample mean is s/√n = 589.5/√459 = 27.52.

x̄ is an estimate for µ = mean of the population of all students required to take the class, and s is an estimate for the population standard deviation σ.
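The Facebook numbers make it easy to see how different s and s/√n are. The sketch below recomputes the standard error from the values quoted above and, as an illustration only, also prints the range x̄ ± 2 standard errors, using the rule of thumb from these notes that x̄ is likely to be within 2 standard errors of µ.

```python
import math

n = 459          # number of students who responded
x_bar = 566.9    # sample mean number of Facebook friends
s = 589.5        # sample standard deviation

# Estimated standard error of the sample mean: s / sqrt(n).
standard_error = s / math.sqrt(n)

# Range of x_bar plus or minus two standard errors (rule-of-thumb illustration).
lower = x_bar - 2 * standard_error
upper = x_bar + 2 * standard_error

print("standard deviation of the data, s:", s)
print("standard error of the mean, s/sqrt(n):", round(standard_error, 2))
print("x_bar +/- 2 standard errors:", round(lower, 1), "to", round(upper, 1))
```

The data themselves are very spread out (s is larger than the mean), yet the sample mean is pinned down to within a few dozen friends because n is large.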

Summary

- x̄ is always unbiased for µ, even if the population’s distribution is very different from a normal distribution.
- The standard deviation of x̄, σ/√n, measures the variability due to random sampling.
- If the population is approximately normal or if the sample size n is large, we can use the normal distribution to compute probabilities for x̄. We just have to remember to use σ/√n, not σ, in the denominator when calculating z.
- This means we can say something about how close x̄ is likely to be to µ. Generally it is quite likely (95% chance) that it will be within 2 standard errors of µ.
- Not all variables are normally distributed and large samples are not always attainable. In such circumstances, a statistician should be consulted for proper methods of statistical inference and calculation.

Accompanying problems associated with this Chapter

- Quiz 5
- Quiz 6
- Homework 2, Q6.
- Homework 3.
