SAMPLING DISTRIBUTIONS (REVIEW TOPIC)

Author: Lora Simmons
Introduction

We begin by noting that the concept of a sampling distribution underlies everything that we will do in the first part of this course; it is therefore extremely important to get these ideas straight in your head from the beginning.

Done properly, sampling is a random process, so which particular members of a population are included in a sample is a matter of chance. If there is variation in the population itself, this means that the value of a sample statistic, say the sample mean or sample proportion, will depend on the particular sample chosen. In general, if we choose two different samples from a population, we will get two different values for whatever statistics we wish to calculate. A third sample will give us yet a third set of values for our statistics, and so on.

To see this, imagine a very simple experiment: suppose that we choose a sample of 100 students and ask them how much they earned last summer. Apart from those with zero earnings, we would not expect any two students to have exactly the same earnings; therefore, the value we get for the sample mean depends on which hundred students are in our sample. The number of possible samples of this size is so large that we might as well say it is infinite. Because of variation in the population, these samples will almost never have exactly the same mean, and a sample mean will almost never be exactly equal to the population mean.

At first glance, this is somewhat discouraging, since our purpose in sampling is to get an estimate of the population mean. However, since the sample mean is determined by a random process, it can be regarded as a random variable, and the laws of probability govern its value. Put another way, statistics such as the sample mean, sample standard deviation, sample proportion, and so on, have known probability distributions.

Thus, although we cannot know whether a particular statistic is an accurate estimate of the corresponding population parameter, we can state the probability that it lies within a given distance of the population value. In the example above, we could not say that our estimate of students' summer earnings is accurate, but we could make a statement such as, "There is a 90% probability that the sample mean differs from the population mean by no more than $200." The size of the error, the amount by which the sample mean differs from the population mean, is determined partly by the degree of variation in the population itself and partly by the sample size. For any chosen level of probability, we can state the maximum amount by which a statistic will diverge from the population value. Further, because we can control the sample size, we can control the size of this likely maximum error in our estimate.

To summarize: a sampling distribution is the probability distribution of some statistic, such as the sample mean or sample proportion. The expected value of a statistic is its mean when the statistic is considered as a random variable; the expected value of the sample mean, for example, is the mean of the sample means from all possible samples that can be drawn from a given population. The standard error of a statistic is the standard deviation of the statistic's probability distribution. If we know how a statistic is distributed, then knowing its expected value and standard error allows us to state the probability that values of the statistic will be in any particular range. Therefore, when we talk about the sampling distribution of any statistic, we will want to know three important things about that statistic: 1) the shape of the distribution, such as z, t or several others we will meet; 2) the expected value of the distribution; and 3) the standard error of the distribution.
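The idea that a sample mean is itself a random variable with its own expected value and standard error can be checked by simulation. The following is a minimal sketch, not part of the original notes: the population of summer earnings is assumed (purely for illustration) to be exponential with mean $2,000, so its standard deviation is also $2,000, and the function name simulate_sample_means is hypothetical. With n = 100, the standard error should come out near 2000/√100 = 200.

```python
import random
import statistics


def simulate_sample_means(n, reps, pop_mean=2000.0, seed=1):
    """Draw `reps` random samples of size `n` and return their sample means.

    ASSUMPTION: the population is a right-skewed exponential distribution
    with mean `pop_mean`; this choice is illustrative, not from the notes.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0 / pop_mean) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


means = simulate_sample_means(n=100, reps=2000)

# Mean of the sample means: an estimate of the expected value E(sample mean).
expected_value = statistics.fmean(means)

# Standard deviation of the sample means: an estimate of the standard error.
standard_error = statistics.stdev(means)

print(round(expected_value, 1), round(standard_error, 1))
```

The two printed numbers should land close to the population mean and to σ/√n respectively, even though no single sample mean equals the population mean.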

I. The sampling distribution of X̄ when the population standard deviation σ is known
   A. Definition: The sampling distribution of X̄ is the probability distribution of sample means
      1. This distribution depends, to some extent, on the nature of the population and, to some extent, on the sample
   B. For normal populations, if
      - sampling from a very large population, or
      - sampling with replacement:
      1. the sampling distribution of X̄ is a normal distribution
      2. E(X̄) = µ_X̄ = µ
      3. σ_X̄ = σ/√n, where σ is the population standard deviation and n is the sample size
         a. note that the population standard deviation σ is assumed to be known and that this result holds regardless of the sample size
      4. Example: In Zenith City, the number of days a house is on the market before it is sold is normally distributed with µ = 42 days and σ = 12 days.
         a. Find the expected value and standard error for a sample of 25 house sale periods.
            Solution: E(X̄) = µ = 42; σ_X̄ = σ/√n = 12/√25 = 12/5 = 2.4
         b. Find the expected value and standard error for a sample of 100 house sale periods.
            Solution: E(X̄) = 42; σ_X̄ = 12/√100 = 12/10 = 1.2
         c. What is the probability that a sample of 25 houses has a sample mean of less than 40 days? More than 50 days? Between 37.2 and 46.8 days?
            Solution 1 (X̄ < 40): The population is normal and the population standard deviation is known, so the sampling distribution is normal, and we can find the required probabilities from a table of normal probabilities or Excel's NORMDIST formula. To find the probability that X̄ < 40, enter =normdist(40,42,2.4,true): 0.20
            Solution 2 (X̄ > 50): 0.0004
            Solution 3 (37.2 < X̄ < 46.8): For 37.2, z = (X̄ - µ_X̄)/σ_X̄ = (37.2 - 42)/2.4 = -2. For 46.8, z = (46.8 - 42)/2.4 = +2, so the question asks for the probability of values within two standard deviations of the mean on a normal distribution: about 95%.
         d. Contrasting Problems: What is the probability that any particular house will be sold in less than 40 days? In more than 50 days?
            Solution: These are normal problems, but they refer to the original population distribution (µ = 42, σ = 12), not to the distribution of sample means.
            - P(x < 40) = 0.4338
            - P(x > 50) = 0.2525
            NOTE: The problems discussed under (c) and (d) have different answers because they refer to different probability distributions, respectively the distribution of sample means and the distribution of the population. Because the distribution of sample means is derived from the population distribution, it is easy to confuse the two. It is extremely important for the student to get straight in his or her head that these are different distributions.
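The contrast between problems (c) and (d) can be made concrete in code. This is a minimal sketch, assuming Python's standard-library statistics.NormalDist as a stand-in for the Excel NORMDIST formula the notes use; the variable names are illustrative. The only difference between the two sets of calculations is the standard deviation: σ/√n for the sample mean, σ for a single house.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 42.0, 12.0, 25      # Zenith City example values from the notes
std_error = sigma / sqrt(n)        # standard error of the sample mean

# Distribution of sample means (problem c) vs. the population itself (problem d).
sample_mean_dist = NormalDist(mu, std_error)
population_dist = NormalDist(mu, sigma)

# Problem (c): probabilities for the mean of a sample of 25 houses.
p_mean_below_40 = sample_mean_dist.cdf(40)
p_mean_above_50 = 1 - sample_mean_dist.cdf(50)
p_mean_between = sample_mean_dist.cdf(46.8) - sample_mean_dist.cdf(37.2)

# Problem (d): probabilities for one particular house.
p_house_below_40 = population_dist.cdf(40)
p_house_above_50 = 1 - population_dist.cdf(50)

for label, p in [
    ("P(sample mean < 40)", p_mean_below_40),
    ("P(sample mean > 50)", p_mean_above_50),
    ("P(37.2 < sample mean < 46.8)", p_mean_between),
    ("P(one house < 40)", p_house_below_40),
    ("P(one house > 50)", p_house_above_50),
]:
    print(f"{label}: {p:.4f}")
```

Changing n in this sketch changes only the sample-mean probabilities; the single-house probabilities do not depend on the sample size.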

Stats II, Lecture Notes, page 3

         e. What is the probability that a sample of 100 houses will have a sample mean sales period of less than 40 days?
            Solution: From above, with n = 100, σ_X̄ = 1.2. Accordingly P(X̄ < 40) = 0.04779
            - Notice that the sampling distribution with n = 100 is more concentrated than that with n = 25: σ_X̄ = 1.2 for n = 100 vs. 2.4 for n = 25. Thus the probability of a sample mean falling any given distance away from the population mean is smaller.
   C. For non-normal populations or for populations with unknown distribution:
      1. for small samples, the sampling distribution's shape reflects the shape of the population, and the rules are not simple
         a. unless we know the shape of the population, we cannot determine the shape of the sampling distribution, so we cannot easily calculate probabilities
      2. for large samples we have the following principle:
   D. Central Limit Theorem: for sufficiently large samples (by rule of thumb, n ≥ 30), the sampling distribution of X̄
      1. is approximately normal, regardless of the shape of the population distribution
         a. strictly, the approximation to a normal distribution becomes increasingly close as the sample size grows
      2. E(X̄) = µ
      3. σ_X̄ = σ/√n
      4. The Central Limit Theorem allows us to find probabilities on the sampling distribution, even when the population has some skewed, bimodal, or otherwise strange shape
      5. Example: The distribution of income in the U.S. is skewed to the right; for the population, µ = $12,000 and σ = $3,000.
         a. For a sample of 20 people, what is the probability that X̄ ≥ $12,500?
            Answer: There is no simple solution to this problem; this is a skewed population and n < 30.
         b. For a sample of 144 people, what is the probability that X̄ ≥ $12,500?
            Solution: By the Central Limit Theorem, the sampling distribution is approximately normal, with E(X̄) = $12,000 and σ_X̄ = 3000/√144 = 3000/12 = 250. From Excel, P(X̄ > 12500) = 0.02275
II. For sampling without replacement from a small population: the Finite Correction Factor
   A. The same principles noted under I. apply with respect to the normality of the sampling distribution and the sample size, but
   B. sampling without replacement reduces the possible range of variation in the sample
   C. For a normal population when sampling without replacement from a small population
      1. the sampling distribution is normal
      2. E(X̄) = µ
      3. σ_X̄ = (σ/√n) × fcf, where the finite correction factor is given by
         a. fcf = √((N - n)/(N - 1))
         b. In this expression, N and n have their usual meanings: N is the population size and n is the sample size.


         c. Examples:
            - From 211 faculty members at Ramapithecus County Community College (RCCC) we drew a sample of 24 and asked them how often a week they eat out. The population of meals out is normally distributed with µ = 4.7 and σ = 1.2. Describe the sampling distribution.
              Solution: The sampling distribution is normal with E(X̄) = µ = 4.7. Since the population is small, we must use the finite correction factor. Begin by calculating it:
              fcf = √((N - n)/(N - 1)) = √((211 - 24)/(211 - 1)) = √(187/210) = 0.94365
              Then σ_X̄ = (σ/√n) × fcf = (1.2/√24) × 0.94365 = 0.23
            - 1st Armored Division has 723 tanks that they're taking to Bosnia; the distribution of mileage is skewed downward with a population mean of 1250 miles and standard deviation of 424 miles. In a sample of 64 tanks, what's the probability of having a sample mean greater than 1350 miles?
              Solution: By the Central Limit Theorem, the sampling distribution is approximately normal with E(X̄) = µ = 1250. Since the population is relatively small, we must use the finite correction factor: fcf = √((723 - 64)/(723 - 1)) = 0.9554, and σ_X̄ = (σ/√n) × fcf = (424/√64) × 0.9554 = 50.63. The probability that X̄ > 1350 is 0.024.
   D. From this last example, note that the fcf also applies when sampling from a skewed or otherwise non-normal population, so long as the sample size is at least 30
   E. Note that the quantity (N - n)/(N - 1) approaches 1 as N grows larger. For large populations, we may safely ignore the fcf unless the sample is very, very large
      1. how large is large?
      2. A rule of thumb: use the fcf whenever the sample is 5% or more of the population, that is, if n ≥ 0.05N
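The finite correction factor and the 5% rule of thumb can be combined into one small helper. This is a minimal sketch, assuming Python's statistics.NormalDist in place of Excel; the function names finite_correction_factor and standard_error_mean are hypothetical, and applying the fcf only when n ≥ 0.05N is the rule of thumb from the notes, not a universal requirement.

```python
from math import sqrt
from statistics import NormalDist


def finite_correction_factor(N, n):
    """fcf = sqrt((N - n) / (N - 1)) for population size N and sample size n."""
    return sqrt((N - n) / (N - 1))


def standard_error_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    The fcf is applied only when a population size N is given and the sample
    is at least 5% of the population (the rule of thumb in the notes).
    """
    se = sigma / sqrt(n)
    if N is not None and n >= 0.05 * N:
        se *= finite_correction_factor(N, n)
    return se


# RCCC faculty example: N = 211, n = 24, sigma = 1.2
se_faculty = standard_error_mean(1.2, 24, N=211)

# Tank example: N = 723, n = 64, mu = 1250, sigma = 424
se_tanks = standard_error_mean(424, 64, N=723)
p_tanks = 1 - NormalDist(1250, se_tanks).cdf(1350)

print(f"faculty standard error: {se_faculty:.4f}")
print(f"tank standard error:    {se_tanks:.2f}")
print(f"P(sample mean > 1350):  {p_tanks:.4f}")
```

Calling standard_error_mean without N, or with a very large N, returns the uncorrected σ/√n, which mirrors the observation that the fcf approaches 1 as N grows.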


III. The Sampling Distribution of the sample proportion
   A. The sample proportion is the proportion of objects in a sample that display some particular characteristic
      1. p_s = x/n, where x is the number with the characteristic of interest and n is the sample size
      2. Example: Fifty students were asked whether they currently hold jobs; 26 replied yes. The sample proportion is p_s = 26/50 = 0.52
   B. Note that p_s is an estimate of the population proportion p
      1. the sample proportion is a random variable, dependent on which particular sample is chosen
      2. a different sample composed of 50 other students would give us a different p_s
      3. the sampling distribution of p_s is the probability distribution of sample results; it should depend on the population proportion and on the size of the sample
         a. suppose p = 0.2; then a sample result as big as 0.52 seems somewhat unlikely
         b. but we wouldn't be surprised to find p_s = 0.23
   C. When sampling from an infinite population or when sampling with replacement, the sampling distribution of the sample proportion p_s
      1. is a normal distribution (strictly, an approximation that improves as n grows)
      2. E(p_s) = p
      3. σ_ps = √(p × (1 - p)/n)
   D. Examples:
      1. Of the population of ASU students, last summer 62% worked at paid jobs. Characterize the sampling distribution of p_s for a sample of n = 200.
         Solution: The distribution of p_s is normal with expected value = 0.62. The standard error is given by σ_ps = √(0.62 × (1 - 0.62)/200) = √(0.62 × 0.38/200) = 0.034322
         This sampling distribution would appear as in the following diagram:
         [Figure: "Sampling Distribution: Proportions"; horizontal axis "Sample Proportion", marked from 0.42 to 0.82 in steps of 0.05]
      2. What is the probability of selecting a sample with p_s ≥ 0.72?


         Solution: We require the area to the right of 0.72 on a normal curve with µ = 0.62 and σ = 0.034322. From Excel: area = 0.0018
      3. Choose instead a sample of 50 students. In this new sample, what is the probability of a sample proportion greater than or equal to 0.72 if p = 0.62?
         Solution: The standard error is now σ_ps = √(0.62 × 0.38/50) = 0.068644. The solution is suggested by the following diagram:
         [Figure: normal curve with the area to the right of 0.72 shaded]
         The shaded area represents the probability that p_s ≥ 0.72. From Excel, changing only the standard deviation, we have that the area = 0.073
      4. Reducing the sample size increased the dispersion in the sample proportion. In this case, the probability that a sample value would be as much as 0.1 greater than the actual population value is over 7%; when n = 200, that same probability was only about 0.2%.
      5. Example: 5% of the shirts in a large lot of shirts have serious defects in workmanship. For a sample of 150 shirts, characterize the distribution of p_s and find the probability that fewer than 3% will be defective.
         Solution: The distribution of p_s is normal with E(p_s) = p = 0.05 and σ_ps = √(0.05 × 0.95/150) = 0.017795. The required probability is 0.1305
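The proportion examples above all follow the same recipe: build a normal distribution with mean p and standard error √(p(1 - p)/n), then read off a tail area. This is a minimal sketch, assuming Python's statistics.NormalDist in place of Excel and the normal model for p_s used in the notes; the function name proportion_distribution is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_distribution(p, n):
    """Normal model for the sample proportion: mean p, std. error sqrt(p(1-p)/n)."""
    return NormalDist(p, sqrt(p * (1 - p) / n))


# ASU example, p = 0.62: upper-tail probability P(p_s >= 0.72) for two sample sizes.
p_n200 = 1 - proportion_distribution(0.62, 200).cdf(0.72)
p_n50 = 1 - proportion_distribution(0.62, 50).cdf(0.72)

# Shirt example, p = 0.05, n = 150: lower-tail probability P(p_s < 0.03).
p_shirts = proportion_distribution(0.05, 150).cdf(0.03)

print(f"n = 200: P(p_s >= 0.72) = {p_n200:.4f}")
print(f"n = 50:  P(p_s >= 0.72) = {p_n50:.4f}")
print(f"shirts:  P(p_s < 0.03)  = {p_shirts:.4f}")
```

Running the first two lines side by side shows the effect of sample size directly: the same cutoff of 0.72 is far less extreme when the standard error is larger.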


   E. The Finite Correction Factor also applies when sampling for proportions, so that when sampling from a finite population
      1. σ_ps = √(p × (1 - p)/n) × fcf, where all terms are as previously defined.
      2. Example: During a particular semester, there are 612 students enrolled in Stats I. A professor selects a sample of 100 and asks how many have had the course before. If the population proportion is 20%, what is the probability that the sample proportion is more than 24%?
         Solution: Note first that n > 0.05N, so that by our rule of thumb we must use the fcf. Here the fcf = √((612 - 100)/(612 - 1)) = 0.915. The distribution of the sample proportion p_s is normal with E(p_s) = 0.2 and σ_ps = √(0.2 × 0.8/100) × 0.915 = 0.04 × 0.915 = 0.0366. The required probability is 0.1372
IV. Homework Assignment: LBS, 6.30 (397), 6.34, 6.39 (401), 6.42, 6.46, 6.47 (405), 6.50, 6.52