Chapter 7. Inference for Population Proportions

Author: Cornelius Smith
Lecture notes, Lang Wu, UBC

7.1. Introduction

In the previous chapter, we discussed the basic ideas of statistical inference. To illustrate them, we considered confidence intervals and hypothesis testing for the (unknown) mean parameter $\mu$ of a population following a normal distribution with known variance $\sigma^2$. In practice, however, the variance of a normal population is usually unknown, so assuming a known variance is unrealistic. In this chapter, we apply the basic ideas of statistical inference to a more realistic case. We consider a binary population with an unknown population proportion $p$ of an event of interest (e.g., the proportion of “successes”). As discussed in earlier chapters, the proportion $p$ completely determines the population, unlike a normal population, which is determined by both the mean $\mu$ and the variance $\sigma^2$. We discuss confidence intervals and hypothesis testing for $p$ based on the data in a sample.

As discussed before, proportions or percentages are used to summarize binary data. For example, we may be interested in the proportion of adults who smoke in a country, or the proportion of voters who will vote for candidate A in a coming election. Usually the population proportion $p$ is unknown, since the population is often too large to observe in full. However, $p$ can be estimated by the sample proportion $\hat{p}$ if a sample is obtained. The sample proportion $\hat{p}$ may or may not be an accurate estimate of $p$, since a sample is only a small subset of the population (i.e., the estimate carries some uncertainty). A main goal of statistical inference is to account for this estimation uncertainty. The two tools for statistical inference are confidence intervals and hypothesis testing: we can construct a confidence interval for $p$ or perform a hypothesis test for $p$. To make inference about the population proportion $p$ based on the sample proportion $\hat{p}$, we need to know the sampling distribution of $\hat{p}$.
As discussed in earlier chapters, the exact distribution of the sample proportion $\hat{p}$ is difficult to derive. However, an approximate distribution of $\hat{p}$ is available if the sample size is reasonably large and $p$ is not too close to 0 or 1. That is, when $np \ge 10$ and $n(1-p) \ge 10$, we have

$$\hat{p} \sim N\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right), \quad \text{approximately.}$$

The above normal approximation becomes more accurate as the sample size $n$ gets larger. This approximate normal distribution can be used to construct confidence intervals and perform hypothesis tests for $p$, as shown in the next section.
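The quality of the normal approximation can be checked numerically. The following is a minimal sketch, not part of the original notes: it compares the exact binomial probability $P(\hat{p} \le k/n)$ with the value given by the normal approximation above. The function names (`conditions_hold`, `exact_cdf`, `normal_cdf`) and the inputs ($p = 0.4$, with $n = 100$ and $n = 1000$) are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def conditions_hold(n: int, p: float) -> bool:
    """Rule of thumb from the notes: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


def exact_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ B(n, p), which equals P(p_hat <= k/n)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def normal_cdf(k: int, n: int, p: float) -> float:
    """Normal approximation to P(p_hat <= k/n), using N(p, sqrt(p(1-p)/n))."""
    sd = sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=sd).cdf(k / n)


# Hypothetical inputs: the same p with a moderate and a large sample size.
p = 0.4
for n, k in [(100, 45), (1000, 416)]:
    exact = exact_cdf(k, n, p)
    approx = normal_cdf(k, n, p)
    print(f"n={n}: conditions hold={conditions_hold(n, p)}, "
          f"exact={exact:.4f}, normal={approx:.4f}, "
          f"error={abs(exact - approx):.4f}")
```

In this sketch the two cut-offs sit roughly one standard deviation above $p$, so the errors are comparable across sample sizes; the error for the larger sample should be the smaller of the two.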

7.2. Confidence Interval for the Population Proportion

To construct a confidence interval for the unknown population proportion $p$, we can use the normal approximation (assuming $np \ge 10$ and $n(1-p) \ge 10$)

$$\hat{p} \sim N\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right),$$

where the mean and standard deviation of the normal distribution are, respectively,

$$\mu = E(\hat{p}) = p, \qquad \sigma = \sqrt{Var(\hat{p})} = \sqrt{\frac{p(1-p)}{n}}.$$

Note that the standard deviation $\sigma$ in the above normal distribution contains the population proportion (mean) parameter $p$, which is unknown. In practice, when constructing confidence intervals for the unknown parameter $p$, we can obtain an estimate of the standard deviation $\sigma$ by replacing the unknown $p$ with its estimate $\hat{p}$, i.e.,

$$\hat{\sigma} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

Recall that the sample proportion is $\hat{p} = X/n$, where $X$ is the number of “successes” and $n$ is the sample size. Thus, we can use the approximate normal distribution $N\left(p,\ \sqrt{\hat{p}(1-\hat{p})/n}\right)$ to construct an approximate confidence interval for $p$. Based on the results in the previous chapter, i.e., the formulas of confidence intervals for the mean parameter $\mu$ of a normal distribution with known standard deviation $\sigma$, we have, for example, the following approximate 95% confidence interval for $p$:

$$\left(\hat{p} - 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right).$$


In general, when the sample size is reasonably large, an approximate $(1-\alpha) \times 100\%$ confidence interval for $p$ is given by

$$\left(\hat{p} - z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right),$$

where $z^*$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution. The above formula works well when $n$ is large and $p$ is not too close to 0 or 1. When $n$ is small and/or $p$ is close to 0 or 1, we can construct an exact confidence interval for $p$ based on the fact that $X \sim B(n, p)$. This approach involves some algebra, so we omit the details and leave it as an exercise for interested readers.

Example. In an election, candidate A wishes to know the percentage (or proportion) of all potential voters who will support him. His assistant randomly selects 100 potential voters and finds that 40% of them support candidate A. What is the plausible range for the percentage of all potential voters who will support candidate A?

Solution: The population here is all potential voters. This is a binary (discrete) population, since each potential voter either supports candidate A or does not (i.e., either “success” or “failure”). The data contain a sample of size $n = 100$, with sample proportion $\hat{p} = 0.4$. We may construct a 95% confidence interval for the (unknown) population proportion $p$ (i.e., the proportion of all potential voters who will support candidate A). Based on the formula given above, the 95% confidence interval for $p$ is given by

$$\left(\hat{p} - 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$

$$= \left(0.4 - 1.96 \times \sqrt{\frac{0.4 \times 0.6}{100}},\ \ 0.4 + 1.96 \times \sqrt{\frac{0.4 \times 0.6}{100}}\right) = (0.304,\ 0.496).$$

Thus, we are 95% confident that between 30.4% and 49.6% of all potential voters support candidate A. The interpretation of the above confidence interval is as follows: if we repeatedly drew random samples of size 100 and constructed a confidence interval from each sample using the above formula, about 95% of these confidence intervals would contain (or cover) the unknown population proportion $p$.
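The interval formula and its repeated-sampling interpretation can both be sketched in a few lines of Python. This is an illustration under stated assumptions, not the notes' own code: the function names `proportion_ci` and `wald_coverage` are hypothetical, and the coverage check assumes a true proportion of $p = 0.4$ purely for demonstration. Rather than simulating many samples, `wald_coverage` sums the exact binomial probabilities of every sample outcome whose interval contains $p$.

```python
from math import comb, sqrt
from statistics import NormalDist


def proportion_ci(p_hat: float, n: int, conf_level: float = 0.95):
    """Approximate confidence interval: p_hat -/+ z* sqrt(p_hat(1-p_hat)/n)."""
    alpha = 1 - conf_level
    z_star = NormalDist().inv_cdf(1 - alpha / 2)  # 100(1 - alpha/2)th percentile
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def wald_coverage(n: int, p: float, conf_level: float = 0.95) -> float:
    """Exact probability that the interval covers p when X ~ B(n, p)."""
    total = 0.0
    for x in range(n + 1):
        low, high = proportion_ci(x / n, n, conf_level)
        if low <= p <= high:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


# The election example: n = 100 voters, 40% support candidate A.
low, high = proportion_ci(0.4, 100)
print(f"95% CI: ({low:.3f}, {high:.3f})")

# Coverage if the true proportion were 0.4 (an assumption for illustration).
print(f"coverage: {wald_coverage(100, 0.4):.3f}")
```

The first line of output should reproduce the interval from the example. The coverage figure is close to, but not exactly, 95%, which is consistent with the interval being an approximate method.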


7.3. A Test for Population Proportion

A confidence interval provides a range of plausible values for the unknown population proportion $p$, with a certain degree of confidence. In some situations, we wish to know whether $p$ takes a specific value or is larger/smaller than a specific value. For example, some researchers claim that no more than 10% of adults in Vancouver exercise regularly. To check whether this claim is true, we can collect data and then test the hypotheses $H_0: p \ge 0.1$ versus $H_a: p < 0.1$ (or $H_0: p = 0.1$ versus $H_a: p < 0.1$). As another example, it is claimed that over 40% of UBC students take the bus to class. To check this claim, we can survey some randomly selected UBC students and then test the hypotheses $H_0: p \le 0.4$ versus $H_a: p > 0.4$.

In general, a two-sided hypothesis for the population proportion $p$ can be written as

$$H_0: p = p_0 \quad \text{versus} \quad H_a: p \ne p_0,$$

where $p_0$ is a known number (e.g., $p_0 = 0.1$ or $0.4$). We can construct a hypothesis test based on the approximate normal distribution of the sample proportion $\hat{p}$. Note that, in any hypothesis testing problem, the value of the test statistic and its distribution are evaluated under the null hypothesis (i.e., assuming $p = p_0$ holds), and then we use the data to check whether there is strong evidence against the null hypothesis. In the current situation, under the null hypothesis $H_0$, when $np_0 \ge 10$ and $n(1-p_0) \ge 10$, we have

$$\hat{p} \sim N\left(p_0,\ \sqrt{\frac{p_0(1-p_0)}{n}}\right)$$

approximately. That is, after standardization, we have

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \sim N(0, 1),$$

approximately, if the null hypothesis $H_0: p = p_0$ holds.

Note: This differs from the confidence interval setting. For a confidence interval there is no known value $p_0$, so the standard deviation of $\hat{p}$ must be estimated using $\hat{p}$. Here, the standard deviation of $\hat{p}$ under $H_0$ is known, since $p_0$ is known.

Note also that, before we obtain a specific sample, the above $Z$ is viewed as a random variable. Once a specific sample is obtained, we can compute the value of $Z$, so it becomes a specific number (i.e., it is no longer a random variable). In this case, we often write $Z$ as $z$. Sometimes we slightly abuse the notation between $Z$ and $z$, on the understanding that readers can tell the difference from context.


Since the sample proportion $\hat{p}$ is an estimate of the unknown population proportion $p$, if $p \ne p_0$ (i.e., if the null hypothesis does not hold), the value of the test statistic $Z$ should be either too large or too small (say, larger than 2 or smaller than $-2$), because most (say, 95%) values of $Z$ should be between $-2$ and 2 if $p = p_0$ and so $Z \sim N(0, 1)$. In other words, we should reject $H_0$ at the 5% level if $|z| > 2$. On the other hand, if the null hypothesis holds, the value of $Z$ should be close to 0 (say, between $-2$ and 2). That is, we do not reject $H_0$ if $|z| \le 2$. (Note that $z^*_{0.025} = 1.96 \approx 2$. We use 2 since it is easier to remember and the test is only an approximate method.) In general, given a significance level $\alpha$, we reject $H_0$ in a two-sided test if $|z| > z^*_{\alpha/2}$.
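The critical value $z^*_{\alpha/2}$ can be looked up in a table or computed in software. The sketch below uses Python's standard-library `statistics.NormalDist`. The helper names `two_sided_critical_value` and `reject_two_sided` are illustrative, not from the notes.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def two_sided_critical_value(alpha: float) -> float:
    """z*_{alpha/2}: the 100(1 - alpha/2)th percentile of N(0, 1)."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)


def reject_two_sided(z: float, alpha: float = 0.05) -> bool:
    """Two-sided decision rule: reject H0 when |z| exceeds the critical value."""
    return abs(z) > two_sided_critical_value(alpha)


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha}: critical value = {two_sided_critical_value(alpha):.3f}")
```

At $\alpha = 0.05$ the computed value is about 1.96, which is why the rule of thumb $|z| > 2$ is slightly conservative.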

Alternatively, we can compute a p-value and use it as a measure of the evidence against the null hypothesis. Suppose that $z$ is the observed value of the test statistic $Z$ based on a given dataset. An approximate p-value for the two-sided alternative can be computed using the following formula:

$$\text{p-value} = P(Z > |z|) + P(Z < -|z|) = 2P(Z > |z|),$$

where $Z \sim N(0, 1)$ and $|z|$ is the absolute value of $z$. Note that rejecting $H_0$ at level $\alpha$ is equivalent to p-value $< \alpha$. That is, if the p-value is smaller than $\alpha$, we reject $H_0$ at level $\alpha$.

A general one-sided hypothesis can be written as

$$H_0: p \le p_0 \quad \text{versus} \quad H_1: p > p_0.$$

In this case, the test statistic $Z$ remains the same, i.e.,

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}.$$

However, the p-value is now given by

$$\text{p-value} = P(Z > z),$$

where $z$ is the observed value of the test statistic based on a dataset and $Z$ now represents a random variable following the $N(0, 1)$ distribution. Equivalently, we reject $H_0$ if $z > z^*_{\alpha}$ (or p-value $< \alpha$), where $z^*_{\alpha} = z_{1-\alpha}$ is the $100(1-\alpha)$th percentile of the standard normal distribution. The approach for testing the one-sided hypotheses

$$H_0: p \ge p_0 \quad \text{versus} \quad H_1: p < p_0$$

is similar. In this case, p-value $= P(Z < z)$ and the rejection region is $z < -z^*_{\alpha}$.

Note that whether the hypotheses should be one-sided or two-sided is determined by the research objectives. The alternative hypothesis is often something you wish to prove or verify, and this determines whether the alternative should be one-sided or two-sided. It is important to specify the hypotheses correctly.
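The three cases above (two-sided, upper one-sided, lower one-sided) share the same test statistic and differ only in how the p-value is computed. A minimal sketch follows, assuming a hypothetical function name `proportion_z_test` and hypothetical labels `"two-sided"`, `"greater"`, and `"less"` for the alternative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(p_hat: float, n: int, p0: float,
                      alternative: str = "two-sided"):
    """Approximate z-test for a proportion; returns (z, p_value).

    The standard deviation uses p0 (the value under H0), not p_hat.
    """
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "two-sided":      # Ha: p != p0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":      # H1: p > p0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":         # H1: p < p0
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value
```

With this sketch, the decision at level $\alpha$ is simply `p_value < alpha`. The normal approximation behind it still requires $np_0 \ge 10$ and $n(1-p_0) \ge 10$, which the function does not check.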

Example 1. Based on a study conducted two years ago, it is claimed that only 10% of adults in city A exercise regularly. A researcher suspects that this claim is not true, so he randomly surveyed 400 adults and found that 15% of them exercise regularly. (a) Based on the new data, does the previous claim still hold? (b) Another researcher, who has not seen the data the first researcher collected, suspects that more than 10% of adults exercise regularly. Do the new data provide sufficient evidence to support this researcher’s conjecture?

Solution: (a) Let $p$ be the proportion of all adults in city A who exercise regularly. The value of $p$ is usually unknown, since it is difficult to survey all adults. However, we can test the following two-sided hypotheses based on the data:

$$H_0: p = 0.1 \quad \text{versus} \quad H_a: p \ne 0.1,$$

i.e., $p_0 = 0.1$. The summaries of the data are $n = 400$ and sample proportion $\hat{p} = 0.15$. The value of the test statistic is given by

$$z = \frac{0.15 - 0.10}{\sqrt{\frac{0.1 \times 0.9}{400}}} = 3.33.$$

The p-value is

$$\text{p-value} = P(Z > 3.33) + P(Z < -3.33) = 0.00087,$$

where $Z$ is a random variable following the $N(0, 1)$ distribution and 0.00087 is obtained from software (you can also obtain a very close number from a standard normal table). Thus, there is very strong evidence against the null hypothesis, i.e., there is very strong evidence to suggest that the proportion of all adults who exercise regularly is not 10%. This result is statistically significant at the 5% level (and also at the 1% level), so we reject the null hypothesis in favour of the alternative hypothesis.

(b) The second researcher’s question is different from the first one: the second researcher thinks that $p$ should be greater than 10%, while the first researcher thinks that $p$ is not 10%. Thus, the hypotheses based on the second researcher’s question (objective) should be one-sided rather than two-sided, i.e., we should now test the following one-sided hypotheses:

$$H_0: p \le 0.1 \ (\text{or } p = 0.1) \quad \text{versus} \quad H_1: p > 0.1.$$

The value of the test statistic remains the same, i.e., $z = 3.33$. The p-value is now given by

$$\text{p-value} = P(Z > 3.33) = 0.00043.$$

Thus, there is very strong evidence against the null hypothesis, suggesting that the proportion of all adults who exercise regularly is higher than 10%. Again, this result is statistically significant at both the 5% level and the 1% level.
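The arithmetic in Example 1 can be checked with a few lines of Python. This is a sketch for verification only. The variable names are illustrative, and $z$ is rounded to two decimals before computing the tail probability, on the assumption that this is how the notes' p-values were obtained.

```python
from math import sqrt
from statistics import NormalDist

# Data from Example 1: n = 400 adults, 15% exercise regularly, p0 = 0.10.
n, p_hat, p0 = 400, 0.15, 0.10

se0 = sqrt(p0 * (1 - p0) / n)       # standard deviation of p_hat under H0
z = round((p_hat - p0) / se0, 2)    # rounded to two decimals (an assumption)

upper_tail = 1 - NormalDist().cdf(z)
p_two_sided = 2 * upper_tail        # part (a): Ha: p != 0.1
p_one_sided = upper_tail            # part (b): H1: p > 0.1

print(f"z = {z:.2f}")
print(f"two-sided p-value = {p_two_sided:.5f}")
print(f"one-sided p-value = {p_one_sided:.5f}")
```

Using the unrounded value of $z$ gives marginally smaller p-values. The conclusions are unchanged either way.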

Example 2. An instructor wishes to evaluate the performance of all students in STAT 200 in previous years. He randomly selected 100 students from the STAT 200 classes he has taught in the past 10 years. He finds that these 100 students have an average final exam score of 70, with a standard deviation of 16, and that 82% of them passed the final exam (with a score of at least 50%). The instructor wishes to verify that, among all students who took STAT 200 in the past 10 years, more than 80% passed the final exam. Perform a hypothesis test to verify the instructor’s conjecture, with significance level 5%.

Solution: Here the population is all students who took STAT 200 in the past 10 years. Let $p$ be the proportion of all students who took STAT 200 in the past 10 years and passed the final exam. The value of the population proportion $p$ is unknown, since it is difficult to obtain the final exam scores for all students who took STAT 200 in the past 10 years. However, we can test the following one-sided hypotheses

$$H_0: p \le 0.8 \quad \text{versus} \quad H_a: p > 0.8$$

based on the final exam scores of the 100 randomly selected students. The sample proportion is $\hat{p} = 0.82$, with a sample size of $n = 100$. Here $p_0 = 0.8$, $np_0 = 80$ and $n(1-p_0) = 20$, so we can perform an approximate test based on the normal approximation to the sampling distribution of the sample proportion $\hat{p}$. The value of the test statistic is

$$z = \frac{\hat{p} - 0.8}{\sqrt{(0.8 \times 0.2)/100}} = \frac{0.82 - 0.8}{0.04} = 0.5.$$

At $\alpha = 0.05$, the one-sided critical value is $z^*_{0.05} = 1.645$. Since $0.5 < 1.645$ (or p-value $> 0.05$, see below), we fail to reject $H_0$ at the 5% level, i.e., the data do not provide sufficient evidence that more than 80% of students passed the STAT 200 final exam in the past 10 years. The p-value is given by

$$\text{p-value} = P(Z > 0.5) = 1 - 0.691 = 0.309,$$

where $Z$ is a random variable following the $N(0, 1)$ distribution and 0.309 is obtained from software or a standard normal table. The interpretation of the above p-value is: if in fact 80% of all students in the past 10 years passed the STAT 200 final exam (i.e., if $H_0: p = 0.80$ holds), the probability of observing a sample proportion of $\hat{p} = 0.82$ or larger is about 0.309, which is not small. Thus, there is no strong evidence against the null hypothesis (i.e., the null hypothesis may hold).
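The p-value interpretation above can be cross-checked directly. The sketch below recomputes $z$ and the normal-approximation p-value for Example 2, and then, as an addition that is not part of the notes' method, computes the exact binomial probability $P(X \ge 82)$ when $X \sim B(100, 0.8)$. The variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

# Data from Example 2: 82 of the 100 sampled students passed; p0 = 0.8.
n, p0 = 100, 0.8
x_obs = 82

# Normal-approximation test, as in the notes.
z = (x_obs / n - p0) / sqrt(p0 * (1 - p0) / n)
p_normal = 1 - NormalDist().cdf(z)

# Exact binomial tail P(X >= 82) under p = p0 (a cross-check, not the notes' method).
p_exact = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k)
              for k in range(x_obs, n + 1))

print(f"z = {z:.2f}, normal p-value = {p_normal:.3f}, exact tail = {p_exact:.3f}")
```

The two probabilities need not agree exactly, because the normal calculation is only an approximation. Both are far above 0.05, so the conclusion (fail to reject $H_0$) is the same.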

Finally, we wish to point out that two-sided hypothesis testing is closely connected to confidence intervals. Specifically, if the null hypothesis $H_0$ is rejected at the 5% level when testing $H_0: p = p_0$ versus $H_1: p \ne p_0$, then the 95% confidence interval for $p$ will not include the value $p_0$, and vice versa. (For the methods in this chapter the correspondence is approximate rather than exact, because the test uses $p_0$ in the standard deviation while the confidence interval uses $\hat{p}$.)

In this chapter, we consider a binary (discrete) population and make inference for the population proportion of an event of interest. Sometimes a continuous population may be converted into a binary population. For example, exam scores are continuous data. However, if we are only interested in whether a student passes the exam or not (e.g., whether the student’s exam score is greater than 50% or not), the population is converted into a binary one (i.e., a student either passes or fails the exam), and the population proportion is the proportion of all students who pass the exam. Or we may be interested in whether a student obtains a score over 90% or not; again we have a binary population, and the population proportion is then the proportion of all students who obtain scores over 90%.
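The link between a two-sided test at the 5% level and a 95% confidence interval can be illustrated with the data from the two examples in this section. This is a sketch with hypothetical helper names. Example 2 was tested one-sided in the notes, so treating it as two-sided here is purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z_STAR = NormalDist().inv_cdf(0.975)  # critical value for alpha = 0.05, two-sided


def rejects_at_5_percent(p_hat: float, n: int, p0: float) -> bool:
    """Two-sided z-test decision; the standard deviation uses p0."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return abs(z) > Z_STAR


def ci_95_excludes(p_hat: float, n: int, p0: float) -> bool:
    """True if the approximate 95% confidence interval does not contain p0."""
    margin = Z_STAR * sqrt(p_hat * (1 - p_hat) / n)
    return not (p_hat - margin <= p0 <= p_hat + margin)


cases = [("Example 1", 0.15, 400, 0.10), ("Example 2", 0.82, 100, 0.80)]
for label, p_hat, n, p0 in cases:
    print(label,
          "| test rejects:", rejects_at_5_percent(p_hat, n, p0),
          "| CI excludes p0:", ci_95_excludes(p_hat, n, p0))
```

In both cases the two checks agree. Because the test uses $p_0$ in the standard deviation and the interval uses $\hat{p}$, the two decisions can differ in borderline cases.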

7.4. Chapter Summary

In this chapter, we have considered a binary (discrete) population and introduced statistical inference for the population proportion $p$. The basic idea is to use the normal approximation to the sampling distribution of the sample proportion $\hat{p}$, assuming $np \ge 10$ and $n(1-p) \ge 10$. The resulting confidence intervals and hypothesis tests are therefore approximate methods, not exact methods. The material in this chapter may be viewed as an application of the ideas and methods introduced in the previous chapter, where we assumed that the population is normal with known standard deviation. Although statistical inference in this chapter is based on the normal distribution, the approximate normal distribution of the sample proportion $\hat{p}$ has mean $\mu = p$ and standard deviation $\sigma = \sqrt{p(1-p)/n}$, i.e., both the mean and the standard deviation are determined by $p$. This is a distinctive feature of the methods in this chapter. Moreover, the methods in this chapter only work well when the sample size is large. In other words, if the sample size is small or $p$ is too close to 0 or 1, the methods in this chapter may not work well.

7.5. Review Questions

1. Comparing the formulas for confidence intervals and test statistics in this chapter with those in the previous chapter, what are the main differences and similarities? What care must be taken when using the methods in this chapter?

2. When the sample size is small or $p$ is too close to 0 or 1, can we use the methods in this chapter? If not, do you think it is possible to derive a new method by yourself? If so, what is the basic idea behind your method?