From sampling distributions to confidence intervals. Sociology 360 Statistics for Sociologists I Chapter 14 Confidence Intervals

Author: Cecil Randall
From sampling distributions to confidence intervals

Sampling distributions will allow us to estimate confidence intervals for our statistics.

Confidence intervals: a more formal definition is to come, but essentially a confidence interval is a range of numeric values that provides a broader estimate of the value of the population mean. The interval is constructed so that it will succeed in including the population mean a specified percent of the time, on average.


We would like to be relatively certain that the population mean falls within our interval, but will have to settle for a compromise between a narrow interval and a high percentage chance of including the mean.


Creating a confidence interval from Soc 360 data

What sampling distribution should we assume?

Suppose we are interested in the mean height of women at UW-Madison. We have a sample of data from female Soc 360 students.

We have a relatively small sample, at n = 23 students. The observed mean, 64.6 inches, is one draw from the sampling distribution of x-bars. What should we assume is the shape of the sampling distribution? Why?

Can we consider the heights of women in Soc 360 to be a random sample of heights of all women at UW-Madison? Suppose we do consider our students to be a random sample. Let’s use them to make a 95% confidence interval for the unknown mean height of all female students at UW-Madison.

If we use the symbols µ and σ to represent the unknown values of the population mean and standard deviation, what should the mean and standard deviation of the sampling distribution be (using the same symbols)?

Here are the summary statistics (Spring 08, section 1):

. summ height if female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      height |        23    64.58696    2.972037         59         72


[Figure: theoretical sampling distribution of x̄, centered at µ, with std. dev. = σ/√n]

Relationships of distributions and a random draw

Population distribution of x values, with mean µ, standard deviation σ:

Sampling distribution of x̄, with mean µ, standard deviation σ/√n:

A single draw of an x̄ value:

We want to create a confidence interval around the x̄ that was drawn. It will be our interval estimate for the unknown value of µ.


Reasoning behind the confidence interval

Constructing a confidence interval for mean height

Here is a larger diagram of the sampling distribution of x-bars. Also shown are several sample means drawn from the distribution, represented as small circles, with error bars extending ± 1.96 standard deviations around each. (Important: In practice we draw only one mean.)

The method just outlined uses the observed sample mean plus and minus about 2 standard deviations of the sampling distribution. The result should cover the population mean 95% of the time.

For our problem the standard deviation of the sampling distribution (also called the standard error) should be σ/√23. But we don’t know the value of σ. Let’s get around this problem for now by assuming that we somehow know the value is 2.7 inches. (Later we will do better.)

Then our required sampling distribution standard deviation is:

2.7/√23 = 0.563

and the required 95% confidence interval for the population mean is:

64.6 ± 1.96(0.563) = 64.6 ± 1.1 = [63.5, 65.7]
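The arithmetic for the height interval can be checked with a short script. This is a minimal sketch, assuming (as the slide does) that σ = 2.7 is known; the function name `z_interval` is a hypothetical helper, not part of any course software.

```python
import math

def z_interval(xbar, sigma, n, z_star=1.96):
    """Return (lower, upper) for xbar ± z* · sigma/√n, treating sigma as known."""
    standard_error = sigma / math.sqrt(n)   # std. dev. of the sampling distribution
    margin = z_star * standard_error
    return xbar - margin, xbar + margin

# Values from the slide: x̄ = 64.6 inches, assumed σ = 2.7, n = 23
lower, upper = z_interval(xbar=64.6, sigma=2.7, n=23)
print(f"standard error = {2.7 / math.sqrt(23):.3f}")
print(f"95% CI = [{lower:.1f}, {upper:.1f}]")
```

Changing `z_star` gives intervals at other confidence levels.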


Things to note about confidence intervals

When writing about confidence intervals…

Confidence intervals are centered around x̄, not µ.

You may say: “We are 95% confident that the mean height of women at UW-Madison is between 63.5 and 65.7 inches.”

If we drew a different sample of the same size from the population, we would calculate a different confidence interval.

But it is important to know what this means. It means: “If random samples of size 23 were repeatedly selected, then in the long run 95% of the confidence intervals formed would contain the true population mean height. In this case our guess is that the population mean is between 63.5 and 65.7 inches, but we don’t know if this is one of the 95% of all intervals which would include the population mean.”

Constructing a 95% CI means that 95% of the CIs constructed in the same way (and with the same n) will contain the parameter, in the long run. For a given sample, our interval either contains the population mean or it doesn’t. Thus, for a given sample, the CI does not mean that the chances are 95% that the CI contains the parameter.

Note again that the calculated confidence intervals will vary from sample to sample, since the sample means on which they are based vary!

It means that, if we drew repeated samples of the same size from the population, in the long run, 95% of the time the intervals we construct would contain µ.
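The long-run interpretation can be illustrated by simulation. The sketch below assumes a Normal population with a made-up mean of 64.0 inches (the true µ is unknown in the slides) and the assumed σ = 2.7; it draws many samples of size 23 and counts how often the interval x̄ ± 1.96 σ/√n contains µ. The function name `coverage` is hypothetical.

```python
import math
import random

def coverage(mu, sigma, n, z_star=1.96, reps=10_000, seed=360):
    """Share of simulated intervals xbar ± z*·sigma/√n that contain mu."""
    rng = random.Random(seed)               # fixed seed so the run is repeatable
    standard_error = sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and compute its mean
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - z_star * standard_error <= mu <= xbar + z_star * standard_error:
            hits += 1
    return hits / reps

# mu = 64.0 is an illustrative assumption, not a value from the slides
rate = coverage(mu=64.0, sigma=2.7, n=23)
print(f"share of intervals containing mu: {rate:.3f}")
```

The share should land close to 0.95. Any single interval in the loop either contains µ or does not, which is the distinction the slide draws.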


Components of correct statements about confidence intervals

The anatomy of a confidence interval

Shorthand: “We are 95% confident that the mean height of women at UW-Madison is between 63.5 and 65.7 inches.”

A level C confidence interval for a parameter has two parts: An interval calculated from the data, usually of the form estimate ± margin of error.

Tells us how confident we are: “95% confident”

Refers to the population parameter of interest in words: “mean height of women at UW-Madison”

A confidence level C, which gives the probability that the interval will capture the true parameter value in repeated samples, or the success rate for the method.

Gives the CI as a range “63.5 to 65.7 inches”


Confidence intervals in action

More on the anatomy of a confidence interval

You might enjoy playing with this statistical applet:

Confidence intervals contain the population mean µ in C% of samples. Different areas under the curve give different confidence levels C.

http://bcs.whfreeman.com/bps4e/default.asp?s=&n=&i=&v=&o=&ns=0&uid=0&rau=0

The confidence interval is: x̄ ± z* σ/√n

z* is called the “critical value” of z. z* is related to the chosen confidence level C: C is the area under the standard normal curve between −z* and z*.

[Figure: standard normal curve with area C shaded between −z* and z*]

Example: For an 80% confidence level C, 80% of the normal curve’s area is contained in the interval.


How to find specific critical values, z*


95% confidence interval

We are 95% confident that the population mean will fall within the range: x̄ ± 1.96 σ/√n

We can use Table C. For a particular confidence level C, the appropriate z* value is just above it.

Ex. For a 98% confidence level, z*=2.326

We often use 2σ as an approximation, as we did with the 68/95/99.7 rule. But 1.96 is more accurate.

We can also use Table A. Note that the most common confidence intervals are 90%, 95%, and 99%, with 95% being the most popular of all time.
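As an alternative to looking up Table C or Table A, critical values can be computed from the inverse of the standard normal CDF. This is a sketch using Python's standard library; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

def critical_value(conf_level):
    """z* such that the area between -z* and z* under the standard normal is conf_level."""
    tail = (1 - conf_level) / 2             # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)

for c in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"C = {c:.0%}: z* = {critical_value(c):.3f}")
```

For C = 98% this gives the 2.326 quoted above, and for C = 95% it gives 1.960.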


Problem: hours of TV each day


On the 1990 GSS, 925 respondents were asked:


“On the average day, about how many hours do you personally watch television?”


The mean in the sample is 2.87. Suppose we know that σ is 2.05.


Give a 98% confidence interval for the number of hours TV is watched.


Carefully describe the meaning of the interval.

σ = 2.05; n = 925; z* = 2.326; x̄ = 2.87

CI = 2.87 ± 2.326 (2.05/√925) = 2.87 ± 0.16 = [2.71, 3.03]

Interpretation: We are 98% confident the population mean is between 2.71 and 3.03. Our method of constructing this interval contains the true mean 98% of the time on average.
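A quick check of this problem, deriving z* from the confidence level instead of a table, is sketched below. It also prints the 95% interval for comparison, so the effect of asking for 98% is visible. The function name `z_interval_for_level` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def z_interval_for_level(xbar, sigma, n, conf_level):
    """Interval xbar ± z*·sigma/√n, with z* derived from the confidence level."""
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

# GSS values from the slide: x̄ = 2.87, σ = 2.05 (assumed known), n = 925
for level in (0.95, 0.98):
    lower, upper = z_interval_for_level(2.87, 2.05, 925, level)
    print(f"{level:.0%} CI: [{lower:.2f}, {upper:.2f}]")
```

The 98% interval is the wider of the two, because its critical value is larger.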

Simple conditions for inference about a mean


What determines the size of a confidence interval?

We have an SRS from the population of interest. There is no nonresponse or other practical difficulty.

margin of error = m = z* σ/√n

The variable we measure has a Normal distribution in the population (Moore pg. 344), or else the sample size is large.

z*: the critical value for confidence level C. The larger the required level of confidence, the wider the confidence interval, since z* is larger.

The second condition works because of the CLT. Rule of thumb: just need a sample size of 40 or more if the population distribution of the variable is not normal or has strong outliers.

σ: If the standard deviation in the population is smaller, µ is easier to pin down, because the interval is smaller.

If we have an n smaller than 40 we can check to see if the distribution looks normal with a histogram. We don’t know the population mean µ, but we do know the population standard deviation σ.


n: the sample size. Note that the sample size must increase by a factor of 4 to decrease m by a factor of 2. It must increase 9 times to reduce the interval by a factor of 3. And so on.
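The way the three components drive the margin of error can be seen by evaluating m = z* σ/√n directly. In this sketch, σ = 2.0 and the starting sample size of 100 are illustrative assumptions, and `margin_of_error` is a hypothetical helper name.

```python
import math

def margin_of_error(z_star, sigma, n):
    """m = z* · sigma / √n"""
    return z_star * sigma / math.sqrt(n)

# Illustrative values (assumptions): sigma = 2.0, base sample size 100, 95% level
base = margin_of_error(1.96, 2.0, 100)
for factor in (1, 4, 9):
    n = 100 * factor
    m = margin_of_error(1.96, 2.0, n)
    print(f"n = {n:4d}: m = {m:.3f}  (base margin divided by {base / m:.0f})")
```

Multiplying n by 4 halves the margin, and multiplying it by 9 cuts the margin to a third.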


The impact of sample size

Graph relating sample size to standard deviation

The spread in the sampling distribution of the mean is a function (in part) of the number of individuals in the sample.

The standard deviation of the sampling distribution decreases in inverse proportion to the square root of the sample size.

The larger the sample size, the smaller the standard deviation (spread) of the sample mean distribution.

[Figure: distribution curves for the population and for sample means with n = 10 and n = 40]

But the spread decreases only in proportion to 1/√n. (Cutting the standard deviation in half requires increasing the sample size fourfold.)

[Graph: sampling distribution std. deviation plotted against sample size, from 0 to 100]
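The relationship between sample size and spread can also be checked by simulation. The sketch below assumes a Normal population with mean 0 and standard deviation 10. These values are assumptions chosen for illustration, since the slides do not state the population behind the figures. It compares the simulated standard deviation of sample means with σ/√n; `sd_of_sample_means` is a hypothetical helper name.

```python
import math
import random
import statistics

def sd_of_sample_means(sigma, n, reps=5_000, seed=360):
    """Standard deviation of `reps` simulated sample means, each from a sample of size n."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return statistics.stdev(means)

sigma = 10.0  # assumed population standard deviation (illustrative)
for n in (10, 40):
    simulated = sd_of_sample_means(sigma, n)
    theoretical = sigma / math.sqrt(n)
    print(f"n = {n:2d}: simulated = {simulated:.2f}, sigma/sqrt(n) = {theoretical:.2f}")
```

Going from n = 10 to n = 40 quadruples the sample size and roughly halves the spread.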

Planning a sample size for a desired margin of error

Planning a sample size for a study of hours of TV

We may want to select the sample size to obtain a confidence interval of a specified size.

A TV executive is going to do a survey to find out about the TV watching habits of Americans. He wants to estimate the mean number of hours of TV Americans watch per day to within 0.1 of an hour with 95% confidence.

The previous formula for the margin of error was: m = z* σ/√n

He knows from prior studies that the standard deviation of the number of hours of TV watched is 2.

We can rearrange this to obtain a formula for the sample size corresponding to a desired margin of error: n = (z* σ / m)²

The sampling method for the study is simple random sampling. How big should the sample be?

To be conservative, we always round up to the nearest whole number, even if the fractional part is less than 0.5.


m = 0.1; σ = 2; z* = 1.96

n = (z* σ / m)² = (1.96(2) / 0.1)² = 1536.64

Rounding up, the study should have 1537 respondents.
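The sample size calculation, including the always-round-up rule, can be sketched as follows. The function name `required_sample_size` is a hypothetical helper; the inputs are the values from the TV executive problem.

```python
import math

def required_sample_size(z_star, sigma, m):
    """Smallest whole n with z*·sigma/√n <= m, from n = (z*·sigma/m)² rounded up."""
    return math.ceil((z_star * sigma / m) ** 2)

# Values from the slide: 95% confidence (z* = 1.96), sigma = 2, margin 0.1 hours
n = required_sample_size(z_star=1.96, sigma=2, m=0.1)
achieved = 1.96 * 2 / math.sqrt(n)
print(f"n = {n}, achieved margin = {achieved:.4f}")
```

Using `math.ceil` instead of ordinary rounding keeps the achieved margin at or below the target even when the fractional part is small.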
