13 Collecting Statistical Data

Author: Zoe Watkins
13 Collecting Statistical Data
13.1 The Population
13.2 Sampling
13.3 Random Sampling

• Polls, studies, surveys, and other data-collecting tools collect data from a small part of a larger group so that we can learn something about the larger group.
• This is a common and important goal of statistics: learn about a large group by examining data from some of its members.

• Data: collections of observations (such as measurements, genders, survey responses)

• Statistics: the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data

• Population: the complete collection of all individuals (scores, people, measurements, and so on) to be studied; the collection is complete in the sense that it includes all of the individuals to be studied



• Census: the collection of data from every member of a population

• Sample: a subcollection of members selected from a population

A Survey

• The practical alternative to a census is to collect data only from some members of the population and use that data to draw conclusions and make inferences about the entire population.
• Statisticians call this approach a survey (or a poll when the data collection is done by asking questions).
• The subgroup chosen to provide the data is called the sample, and the act of selecting a sample is called sampling.

• The first important step in a survey is to distinguish the population to which the survey applies (the target population) from the actual subset of the population from which the sample will be drawn (the sampling frame).
• The ideal scenario is when the sampling frame is the same as the target population, which means that every member of the target population is a candidate for the sample.
• When this is impossible (or impractical), an appropriate sampling frame must be chosen.

CASE STUDY 2

THE 1936 LITERARY DIGEST POLL

The U.S. presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent Democratic President, Franklin D. Roosevelt. At the time of the election, the nation had not yet emerged from the Great Depression, and economic issues such as unemployment and government spending were the dominant themes of the campaign.

The Literary Digest had used polls to accurately predict the results of every presidential election since 1916, and its 1936 poll was the largest and most ambitious poll ever. The sampling frame for the Literary Digest poll consisted of an enormous list of names that included:

(1) every person listed in a telephone directory anywhere in the United States,
(2) every person on a magazine subscription list, and
(3) every person listed on the roster of a club or professional association.

From this sampling frame a list of about 10 million names was created, and every name on this list was mailed a mock ballot and asked to mark it and return it to the magazine.

• Based on the poll results, the Literary Digest predicted a landslide victory for Landon with 57% of the vote, against Roosevelt’s 43%.
• The election turned out to be a landslide victory for Roosevelt with 62% of the vote, against 38% for Landon.
• The difference between the poll’s prediction and the actual election results was 19 percentage points, the largest error ever in a major public opinion poll.
• For the same election, a young pollster named George Gallup was able to predict accurately a victory for Roosevelt using a sample of “only” 50,000 people.
• What went wrong with the Literary Digest poll, and why was Gallup able to do so much better?

• The first thing seriously wrong with the Literary Digest poll was the sampling frame, consisting of names taken from telephone directories, lists of magazine subscribers, rosters of club members, and so on. Telephones in 1936 were something of a luxury, and magazine subscriptions and club memberships even more so, at a time when 9 million people were unemployed.
• When it came to economic status, the Literary Digest sample was far from being a representative cross section of the voters. This was a critical problem, because voters often vote on economic issues, and given the economic conditions of the time, this was especially true in 1936.
• When the choice of the sample has a built-in tendency (whether intentional or not) to exclude a particular group or characteristic within the population, we say that a survey suffers from selection bias.
• Selection bias must be avoided, but it is not always easy to detect it ahead of time. Even the most scrupulous attempts to eliminate selection bias can fall short.

• The second serious problem with the Literary Digest poll was the issue of nonresponse bias.
• In a typical survey it is understood that not every individual is willing to respond to the survey request (and in a democracy we cannot force them to do so).
• Those individuals who do not respond to the survey request are called nonrespondents, and those who do are called respondents.
• The percentage of respondents out of the total sample is called the response rate.
• For the Literary Digest poll, out of a sample of 10 million people who were mailed a mock ballot, only about 2.4 million mailed a ballot back, resulting in a 24% response rate.
• When the response rate to a survey is low, the survey is said to suffer from nonresponse bias.
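The response-rate arithmetic above can be checked with a short sketch. The function name `response_rate` is an assumption made for illustration; the two figures are the approximate ones quoted for the Literary Digest poll.

```python
def response_rate(respondents: int, sample_size: int) -> float:
    """Return the fraction of the sample that responded."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= respondents <= sample_size:
        raise ValueError("respondents must be between 0 and sample_size")
    return respondents / sample_size


# Approximate figures quoted above for the Literary Digest poll.
rate = response_rate(respondents=2_400_000, sample_size=10_000_000)
print(f"Response rate: {rate:.0%}")  # prints "Response rate: 24%"
```

The remaining 76% of the sample are nonrespondents, and nothing in the returned ballots tells us how they would have voted.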

• One of the significant problems with the Literary Digest poll was that the poll was conducted by mail. This approach is the most likely to magnify nonresponse bias, because people often consider a mailed questionnaire just another form of junk mail.

The Literary Digest story has two morals: (1) You’ll do better with a well-chosen small sample than with a badly chosen large one, and (2) watch out for selection bias and nonresponse bias.
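The first moral can be illustrated with a small simulation. This is only a sketch under invented assumptions: the electorate, the 30% phone-ownership rate, and the voting probabilities are all hypothetical and are not the 1936 data. It compares a large sample drawn from a biased sampling frame with a small simple random sample drawn from the whole population.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Hypothetical electorate: each voter is (has_phone, votes_for_A).
# Every proportion here is invented for illustration.
population = []
for _ in range(100_000):
    has_phone = rng.random() < 0.30
    p_vote_a = 0.60 if has_phone else 0.30
    population.append((has_phone, rng.random() < p_vote_a))


def share_for_a(voters):
    """Fraction of the given voters who vote for candidate A."""
    return sum(vote for _, vote in voters) / len(voters)


true_share = share_for_a(population)

# Large sample, but the sampling frame contains phone owners only.
phone_owners = [voter for voter in population if voter[0]]
big_biased = share_for_a(rng.sample(phone_owners, 20_000))

# Small simple random sample drawn from the whole population.
small_random = share_for_a(rng.sample(population, 1_000))

print(f"true share for A:          {true_share:.3f}")
print(f"biased sample, n = 20,000: {big_biased:.3f}")
print(f"random sample, n = 1,000:  {small_random:.3f}")
```

In this setup the biased sample is twenty times larger, yet it overstates support for candidate A because everyone without a phone is excluded from the frame.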

Examples

• Page 516, problems 17, 18, 19 (Solutions on following slides)
• NOTE: students should omit problem 28 from homework exercises

Examples Solutions

17(a) the citizens of Cleansburg
17(b) the sampling frame is limited to that part of the target population that passes by a city street corner between 4:00 pm and 6:00 pm

18(a) 475
18(b) yes, this survey is subject to nonresponse bias. The response rate is 475/(1313 + 475) ≈ 0.266

19(a) the choice of street corner could make a difference in responses collected
19(b) interviewer D. We are assuming that people who live or work downtown are more likely to answer yes than people in other parts of town.
19(c) yes. People on the street between 4 pm and 6 pm are not representative of the population at large. Also, the five street corners were chosen by the interviewers, and the passers-by are unlikely to represent a cross-section of the target population.
19(d) omit

Convenience Sampling

• One commonly used shortcut in sampling is known as convenience sampling.
• In convenience sampling the selection of which individuals are in the sample is dictated by what is easiest or cheapest for the data collector, with little regard for getting a representative sample.
• A classic example of convenience sampling is when interviewers set up at a fixed location such as a mall or outside a supermarket and ask passersby to be part of a public opinion poll.
• A different type of convenience sampling occurs when the sample is based on self-selection: the sample consists of those individuals who volunteer to be in it.
• Self-selection is the reason why many Area Code 800 polls are not to be trusted.
• Convenience sampling is not always bad. At times there is no other choice, or the alternatives are so expensive that they have to be ruled out.

Quota Sampling

• Quota sampling is a systematic effort to force the sample to be representative of a given population through the use of quotas: the sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas, and so on.
• The proportions in each category in the sample should be the same as those in the population.
• If we can assume that every important characteristic of the population is taken into account when the quotas are set up, it is reasonable to expect that the sample will be representative of the population and produce reliable data.
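Setting quotas so that the sample proportions match the population proportions can be sketched as follows. The function name `quotas`, the category labels, and the shares are hypothetical; the rounding rule (largest remainder) is one reasonable choice rather than a prescribed one.

```python
def quotas(population_shares: dict[str, float], sample_size: int) -> dict[str, int]:
    """Split sample_size across categories in proportion to their population shares."""
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    raw = {cat: share * sample_size for cat, share in population_shares.items()}
    result = {cat: int(value) for cat, value in raw.items()}  # round down first
    # Largest-remainder rounding: hand out the leftover seats so that the
    # quotas add up to sample_size exactly.
    leftover = sample_size - sum(result.values())
    by_remainder = sorted(raw, key=lambda cat: raw[cat] - result[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        result[cat] += 1
    return result


# Hypothetical population breakdown by place of residence.
shares = {"urban": 0.55, "suburban": 0.30, "rural": 0.15}
print(quotas(shares, 1000))  # {'urban': 550, 'suburban': 300, 'rural': 150}
```

The quotas only fix how many people come from each category. Which individuals fill each quota is still left to the interviewer, so the sketch covers the bookkeeping and not the selection itself.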

Random Sampling

• The best alternative to human selection is to let the laws of chance determine the selection of a sample.
• Sampling methods that use randomness as part of their design are known as random sampling methods, and any sample obtained through random sampling is called a random sample (or a probability sample).

Simple Random Sampling

• The most basic form of random sampling is called simple random sampling. It is based on the same principle as a lottery: any set of numbers of a given size has the same chance of being chosen as any other set of numbers of that size.
• In theory, simple random sampling is easy to implement. We put the name of each individual in the population in “a hat,” mix the names well, and then draw as many names as we need for our sample. Of course, “a hat” is just a metaphor.
• These days, the “hat” is a computer database containing a list of members of the population. A computer program then randomly selects the names.
• This is a fine idea for small, compact populations, but a hopeless one when it comes to national surveys and public opinion polls.
• For most public opinion polls, especially those done on a regular basis, the time and money needed to do this are simply not available.
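Drawing names from the computerized “hat” can be sketched with Python's standard `random` module. The population list below is hypothetical; `random.Random.sample` draws without replacement, so every set of 25 names is equally likely to be chosen.

```python
import random

# Hypothetical "hat": a database listing all 500 members of a small population.
population = [f"member_{i:03d}" for i in range(500)]

rng = random.Random(42)                 # fixed seed so the draw is repeatable
sample = rng.sample(population, k=25)   # simple random sample of size n = 25

print(len(sample), "names drawn, for example:", sample[:3])
```

This presupposes a complete list of the population, which is what makes the approach impractical for national surveys.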

13 Collecting Statistical Data
13.1 The Population
13.2 Sampling
13.3 Random Sampling
13.4 Sampling: Terminology and Key Concepts
13.5 The Capture-Recapture Method
13.6 Clinical Studies

Survey

• In a survey, we use a subset of the population, called a sample, as the source of our information, and from this sample, we try to generalize and draw conclusions about the entire population.

• Parameter: a numerical measurement describing some characteristic of a population (population parameter)

• Statistic: a numerical measurement describing some characteristic of a sample (sample statistic)

Sampling Error

• We will use the term sampling error to describe the difference between a parameter and a statistic used to estimate that parameter.
• In other words, the sampling error measures how much the data from a survey differs from the data that would have been obtained if a census had been used.
• Sampling error can be attributed to two factors: chance error and sampling bias.

Chance Error

• Chance error is the result of the basic fact that a sample can only give us approximate information about the population.
• Different samples are likely to produce different statistics for the same population, even when the samples are chosen in exactly the same way. This phenomenon is known as sampling variability.
• While sampling variability, and thus chance error, is unavoidable, careful selection of the sample and the right choice of sample size can keep it to a minimum.
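Sampling variability, and the effect of sample size on it, can be seen in a small simulation. This is a sketch under invented assumptions: a hypothetical population of 100,000 in which the parameter (the share answering "yes") is exactly 40%, sampled repeatedly by simple random sampling.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = "yes", 0 = "no".
population = [1] * 40_000 + [0] * 60_000
parameter = sum(population) / len(population)


def sample_statistic(n):
    """Share of "yes" answers in one simple random sample of size n."""
    return sum(rng.sample(population, n)) / n


spread = {}
for n in (100, 2_500):
    stats = [sample_statistic(n) for _ in range(200)]  # 200 repeated surveys
    spread[n] = statistics.pstdev(stats)               # how much the statistic varies
    print(f"n = {n:>5}: statistics range from {min(stats):.3f} to {max(stats):.3f}, "
          f"spread {spread[n]:.4f}")
```

Every sample is chosen the same way, yet the statistics differ from sample to sample and from the parameter. The larger samples cluster more tightly around 0.40, but the chance error never disappears entirely.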

Sampling Bias

• Sampling bias is the result of choosing a bad sample and is a much more serious problem than chance error.
• Getting a sample that is representative of the entire population can be very difficult and can be affected by many subtle factors.
• As opposed to chance error, sampling bias can be eliminated by using proper methods of sample selection.

Sampling Proportion

• The size of the sample is denoted by the letter n.
• The size of the population is denoted by the letter N.
• The ratio n/N is called the sampling proportion.

Examples

• Page 515, problems 5, 6, 7, 8

5(a) 680/8325
5(b) 45%

6(a) the registered voters in Cleansburg
6(b) the 680 registered voters polled by telephone
6(c) simple random sampling

7. Smith 3%, Jones 3%, and Brown 0%

8. Chance. The sample was chosen randomly to eliminate selection bias, and there was a 100% response rate to eliminate nonresponse bias.

13 Collecting Statistical Data
13.1 The Population
13.2 Sampling
13.3 Random Sampling
13.4 Sampling: Terminology and Key Concepts
13.5 The Capture-Recapture Method
13.6 Clinical Studies

Finding the N-value

• Finding the exact N-value of a large and elusive population can be extremely difficult and sometimes impossible.
• In many cases, a good estimate is all we really need, and such estimates are possible through sampling methods.
• The simplest sampling method for estimating the N-value of a population is called the capture-recapture method.

THE CAPTURE-RECAPTURE METHOD

■ Step 1. Capture (sample): Capture (choose) a sample of size n1, tag (mark, identify) the animals (objects, people), and release them back into the general population.
■ Step 2. Recapture (resample): After a certain period of time, capture a new sample of size n2 and take an exact head count of the tagged individuals (i.e., those that were also in the first sample). Call this number k.
■ Step 3. Estimate: The N-value of the population can be estimated to be approximately (n1 × n2)/k.

Capture-Recapture Method

The capture-recapture method is based on the assumption that both the captured and recaptured samples are representative of the entire population.

• Under this assumption, the proportion of tagged individuals in the recaptured sample is approximately equal to the proportion of tagged individuals in the population.
• In other words, the ratio k/n2 is approximately equal to the ratio n1/N.
• Solving k/n2 ≈ n1/N for N gives N ≈ (n1 × n2)/k.
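The estimate N ≈ (n1 × n2)/k can be written as a short function. This is a minimal sketch: the function name `estimate_population`, the input checks, and the sample figures are assumptions made for illustration.

```python
def estimate_population(n1: int, n2: int, k: int) -> float:
    """Capture-recapture estimate of the population size N.

    n1: size of the first (tagged) sample
    n2: size of the second (recaptured) sample
    k:  number of tagged individuals found in the second sample
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both sample sizes must be positive")
    if k <= 0:
        raise ValueError("no tagged individuals recaptured; the estimate is undefined")
    if k > min(n1, n2):
        raise ValueError("k cannot exceed either sample size")
    # From k/n2 ≈ n1/N, solving for N.
    return n1 * n2 / k


# Hypothetical figures: 50 tagged, 40 recaptured, 10 of those carry tags.
print(estimate_population(n1=50, n2=40, k=10))  # 200.0
```

When k is small the estimate is very sensitive to it: with the same samples, k = 5 would double the estimate to 400.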

Example 13.6 Small Fish in a Big Pond

A large pond is stocked with catfish. As part of a research project we need to estimate the number of catfish in the pond. An actual head count is out of the question (short of draining the pond), so our best bet is the capture-recapture method.

Step 1. For our first sample we capture a predetermined number n1 of catfish, say n1 = 200. The fish are tagged and released unharmed back in the pond.

Step 2. After giving enough time for the released fish to mingle and disperse throughout the pond, we capture a second sample of n2 catfish. While n2 does not have to equal n1, it is a good idea for the two samples to be of approximately the same order of magnitude. Let’s say that n2 = 150. Of the 150 catfish in the second sample, 21 have tags (were part of the original sample).

Assuming the second sample is representative of the catfish population in the pond, the ratio of tagged fish in the second sample (21/150) is approximately the same as the ratio of tagged fish in the pond (200/N). This gives the approximate proportion 21/150 ≈ 200/N, which in turn gives N ≈ (200 × 150)/21 ≈ 1428.57.

Obviously, the value N = 1428.57 cannot be taken literally, since N must be a whole number. Besides, even in the best of cases, the computation is only an estimate. A sensible conclusion is that there are approximately N = 1400 catfish in the pond.
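To see why rounding to the nearest hundred is sensible, the method can be tried on a simulated pond whose size is known. This is only a sketch: the true size of 1,400 is an assumption made for the simulation, and the sample sizes are the ones used in the example.

```python
import random

rng = random.Random(13)  # fixed seed so the run is repeatable

TRUE_N = 1_400       # hypothetical pond size, known only to the simulation
N1, N2 = 200, 150    # sample sizes used in the example

pond = list(range(TRUE_N))           # each fish is just an ID number
tagged = set(rng.sample(pond, N1))   # Step 1: capture, tag, release

estimates = []
for _ in range(1_000):                               # repeat Steps 2 and 3
    recaptured = rng.sample(pond, N2)                # Step 2: recapture
    k = sum(fish in tagged for fish in recaptured)   # count the tagged fish
    if k > 0:                                        # undefined when k == 0
        estimates.append(N1 * N2 / k)                # Step 3: estimate

estimates.sort()
typical = estimates[len(estimates) // 2]             # median estimate
print(f"lowest {estimates[0]:.0f}, typical {typical:.0f}, highest {estimates[-1]:.0f}")
```

Individual recaptures can land well away from the true size, so reporting more than two significant digits would overstate what one recapture can tell us.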
