Chapter 3

Sampling

Contents

3.1 Introduction
  3.1.1 Difference between sampling and experimental design
  3.1.2 Why sample rather than census?
  3.1.3 Principal steps in a survey
  3.1.4 Probability sampling vs. non-probability sampling
  3.1.5 The importance of randomization in survey design
  3.1.6 Model vs. design-based sampling
  3.1.7 Software
3.2 Overview of Sampling Methods
  3.2.1 Simple Random Sampling
  3.2.2 Systematic Surveys
  3.2.3 Cluster sampling
  3.2.4 Multi-stage sampling
  3.2.5 Multi-phase designs
  3.2.6 Panel design - suitable for long-term monitoring
  3.2.7 Sampling non-discrete objects
  3.2.8 Key considerations when designing or analyzing a survey
3.3 Notation
3.4 Simple Random Sampling Without Replacement (SRSWOR)
  3.4.1 Summary of main results
  3.4.2 Estimating the Population Mean
  3.4.3 Estimating the Population Total
  3.4.4 Estimating Population Proportions
  3.4.5 Example - estimating total catch of fish in a recreational fishery
3.5 Sample size determination for a simple random sample
  3.5.1 Example - How many angling-parties to survey
3.6 Systematic sampling
  3.6.1 Advantages of systematic sampling
  3.6.2 Disadvantages of systematic sampling
  3.6.3 How to select a systematic sample
  3.6.4 Analyzing a systematic sample
  3.6.5 Technical notes - Repeated systematic sampling
3.7 Stratified simple random sampling
  3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample
  3.7.2 Notation
  3.7.3 Summary of main results
  3.7.4 Example - sampling organic matter from a lake
  3.7.5 Example - estimating the total catch of salmon
  3.7.6 Sample Size for Stratified Designs
  3.7.7 Allocating samples among strata
  3.7.8 Example - Estimating the number of tundra swans
  3.7.9 Post-stratification
  3.7.10 Allocation and precision - revisited
3.8 Ratio estimation in SRS - improving precision with auxiliary information
  3.8.1 Summary of main results
  3.8.2 Example - wolf/moose ratio
  3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total
3.9 Additional ways to improve precision
  3.9.1 Using both stratification and auxiliary variables
  3.9.2 Regression Estimators
  3.9.3 Sampling with unequal probability - pps sampling
3.10 Cluster sampling
  3.10.1 Sampling plan
  3.10.2 Advantages and disadvantages of cluster sampling compared to SRS
  3.10.3 Notation
  3.10.4 Summary of main results
  3.10.5 Example - estimating the density of urchins
  3.10.6 Example - estimating the total number of sea cucumbers
3.11 Multi-stage sampling - a generalization of cluster sampling
  3.11.1 Introduction
  3.11.2 Notation
  3.11.3 Summary of main results
  3.11.4 Example - estimating number of clams
  3.11.5 Some closing comments on multi-stage designs
3.12 Analytical surveys - almost experimental design
3.13 References
3.14 Frequently Asked Questions (FAQ)
  3.14.1 Confusion about the definition of a population
  3.14.2 How is N defined
  3.14.3 Multi-stage vs. multi-phase sampling
  3.14.4 What is the difference between a population and a frame?
  3.14.5 How to account for missing transects

The suggested citation for this chapter of notes is: Schwarz, C. J. (2015). Sampling. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2015-08-20.

© 2015 Carl James Schwarz (2015-08-20)


3.1 Introduction

Today the word "survey" is used most often to describe a method of gathering information from a sample of individuals, animals, or areas. This "sample" is usually just a fraction of the population being studied. You are exposed to survey results almost every day: election polls, the unemployment rate, and the consumer price index are all examples of the results of surveys. On the other hand, some common headlines are NOT the results of surveys but rather the results of experiments - for example, whether a new drug is as effective as an old drug.

Not only do surveys have a wide variety of purposes, they also can be conducted in many ways - including over the telephone, by mail, or in person. Nonetheless, all surveys do have certain characteristics in common. All surveys require a great deal of planning in order that the results are informative. Unlike a census, where all members of the population are studied, surveys gather information from only a portion of a population of interest - the size of the sample depends on the purpose of the study. Surprisingly to many people, a survey can give better-quality results than a census.

In a bona fide survey, the sample is not selected haphazardly. It is scientifically chosen so that each object in the population has a measurable chance of selection. This way, the results can be reliably projected from the sample to the larger population. Information is collected by means of standardized procedures. The survey's intent is not to describe the particular objects which, by chance, are part of the sample, but to obtain a composite profile of the population.

3.1.1 Difference between sampling and experimental design

There are two key differences between survey sampling and experimental design:

• In experiments, one deliberately perturbs some part of the population to see the effect of the action. In sampling, one wishes to see what the population is like without disturbing it.

• In experiments, the objective is to compare the mean response to changes in levels of the factors. In sampling, the objective is to describe the characteristics of the population.

However, refer to the section on analytical surveys later in this chapter for cases where sampling looks very similar to experimental design.

3.1.2 Why sample rather than census?

There are a number of advantages of sampling over a complete census:

• reduced cost
• greater speed - a much smaller scale of operations is performed
• greater scope - useful if highly trained personnel or specialized equipment are needed
• greater accuracy - it is easier to train a small crew, supervise them, and reduce data-entry errors


• reduced respondent burden
• in destructive sampling you can't measure the entire population - e.g., crash tests of cars

3.1.3 Principal steps in a survey

The principal steps in a survey are:

• formulate the objectives of the survey - a concise statement is needed
• define the population to be sampled - e.g., what is the range of animals or locations to be measured? Note that the population is the set of final sampling units that will be measured - refer to the FAQ at the end of the chapter for more information.
• establish what data are to be collected - collect a few items well rather than many poorly
• decide what degree of precision is required - examine the power needed
• establish the frame - this is a list of sampling units that is exhaustive and exclusive. In many cases the frame is obvious, but in others it is not, and it is often very difficult to establish - e.g., a list of all streams in the lower mainland.
• choose among the various designs - will you stratify? There are a variety of sampling plans, some of which will be discussed in detail later in this chapter. Some common designs in ecological studies are:
  – simple random sampling
  – systematic sampling
  – cluster sampling
  – multi-stage designs
  All designs can be improved by stratification, so this should always be considered during the design phase.
• pre-test - very important to try out field methods and questionnaires
• organization of field work - training, pre-tests, etc.
• summary and data analysis - the easiest part if the earlier parts are done well
• post-mortem - what went well, what went poorly, etc.

3.1.4 Probability sampling vs. non-probability sampling

There are two types of sampling plans: probability sampling, where units are chosen in a 'random fashion', and non-probability sampling, where units are chosen in some deliberate fashion. In probability sampling:

• every unit has a known probability of being in the sample
• the sample is drawn with some method consistent with these probabilities

• these selection probabilities are used when making estimates from the sample

The advantages of probability sampling:

• we can study the biases of the sampling plans
• standard errors and measures of precision (confidence limits) can be obtained

Some types of non-probability sampling plans include:

• quota sampling - e.g., select 50 males and 50 females from the population
  – less expensive than a probability sample
  – may be the only option if no frame exists
• judgmental sampling - select 'average' or 'typical' values. This is a quick-and-dirty sampling method and can perform well if there are a few extreme points which should not be included.
• convenience sampling - select those readily available. This is useful if it is dangerous or unpleasant to sample directly. For example, selecting blood samples from grizzly bears.
• haphazard sampling (not the same as random sampling). This is often useful if the sampling material is homogeneous and spread throughout the population, e.g., chemicals in drinking water.

The disadvantages of non-probability sampling include:

• biases cannot be assessed in any rational way
• no estimates of precision can be obtained. In particular, the simple use of formulae from probability sampling is WRONG!
• experts may disagree on what is the "best" sample
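As a small illustration of the point above that selection probabilities are used when making estimates, the following sketch (with made-up values) expands each sampled unit by the inverse of its inclusion probability; the form sum(y_i / pi_i) is the standard Horvitz-Thompson estimator of a total, not something derived in these notes:

```python
# Sketch: using known inclusion probabilities when estimating a total.
# A unit sampled with probability pi "represents" 1/pi population units,
# so the estimated total is sum(y_i / pi_i) (Horvitz-Thompson form).

def estimated_total(values, inclusion_probs):
    """Design-unbiased estimate of the population total."""
    return sum(y / p for y, p in zip(values, inclusion_probs))

# Hypothetical sample: three units with unequal selection probabilities
# known from the sampling design.
y = [10.0, 4.0, 7.0]
pi = [0.5, 0.1, 0.25]

print(estimated_total(y, pi))  # 10/0.5 + 4/0.1 + 7/0.25 = 88.0
```

Note how the unit with the smallest selection probability (0.1) contributes the most to the estimate: it stands in for the many similar units that had little chance of being drawn.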

3.1.5 The importance of randomization in survey design

[With thanks to Dr. Rick Routledge for this part of the notes.]

. . . I had to make a 'cover degree' study... This involved the use of a Raunkiaer's Circle, a device designed in hell. In appearance it was all simple innocence, being no more than a big metal hoop; but in use it was a devil's mechanism for driving sane men mad. To use it, one stood on a stretch of muskeg, shut one's eyes, spun around several times like a top, and then flung the circle as far away as possible. This complicated procedure was designed to ensure that the throw was truly 'random'; but, in the event, it inevitably resulted in my losing sight of the hoop entirely, and having to spend an unconscionable time searching for the thing.

Farley Mowat, Never Cry Wolf. McClelland and Stewart, 1963.

Why would a field biologist in the early post-war period be instructed to follow such a bizarre-looking scheme for collecting a representative sample of tundra vegetation? Could she not have obtained


a typical cross-section of the vegetation by using her own judgment? Undoubtedly, she could have convinced herself that by replacing an awkward, haphazard sampling scheme with one dependent solely on her own judgment and common sense, she could have been guaranteed a more representative sample. But would others be convinced? A careful, objective scientist is trained to be skeptical. She would be reluctant to accept any evidence whose validity depended critically on the judgment and skills of a stranger. The burden of proof would then rest squarely with Farley Mowat to prove his ability to take representative, judgmental samples. It is typically far easier for a scientist to use randomization in her sampling procedures than it is to prove her judgmental skills.

Hovering and Patrolling Bees

It is often difficult, if not impossible, to take a properly randomized sample. Consider, e.g., the problem faced by Alcock et al. (1977) in studying the behavior of male bees of the species Centris pallida in the deserts of the south-western United States. Females pupate in underground burrows. To maximize the presence of his genes in the next generation, a male of the species needs to mate with as many virgin females as possible. One strategy is to patrol the burrowing area at a low altitude, and nab an emerging female as soon as her presence is detected. This patrolling strategy seems to involve a relatively high risk of confrontation with other patrolling males. The other strategy reported by the authors is to hover farther above the burrowing area, and mate with those females who escape detection by the hoverers. These hoverers appear to be involved in fewer conflicts.

Because the hoverers tend to be less involved in aggressive confrontations, one might guess that they would tend to be somewhat smaller than the more aggressive patrollers. To assess this hypothesis, the authors took measurements of head widths for each of the two subpopulations.
Of course, they could not capture every single male bee in the population. They had to be content with a sample. Sample sizes and results are reported in the table below. How are we to interpret these results? The sampled hoverers obviously tended to be somewhat smaller than the sampled patrollers, although it appears from the standard deviations that some hoverers were larger than the average-sized patroller and vice versa. Hence, the difference is not overwhelming, and may be attributable to sampling errors.

Table: Summary of head width measurements on two samples of bees.

Sample       n     mean       SD
Hoverers     50    4.92 mm    0.15 mm
Patrollers   100   5.14 mm    0.29 mm
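The probable size of the chance error in this comparison can be sketched with a two-sample t statistic computed directly from the summary statistics in the table. This is a sketch only: the Welch (unequal-variance) form used here is our choice, since the original analysis is not reproduced in these notes.

```python
import math

def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Welch two-sample t statistic and approximate df from summary statistics."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    t = (mean2 - mean1) / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Summary statistics from the table: hoverers vs. patrollers.
t, df = welch_t(50, 4.92, 0.15, 100, 5.14, 0.29)
print(round(t, 2), round(df, 1))  # t is about 6.1 - far beyond chance error
```

Taken at face value, such a large t statistic would rule out chance error; the real question, pursued below, is whether the samples were selected in a way that justifies the calculation at all.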

If the sampling were truly randomized, then the only sampling errors would be chance errors, whose probable size can be assessed by a standard t-test. Exactly how were the samples taken? Is it possible that the sampling procedure used to select patrolling bees might favor the capture of larger bees, for example? This issue is indeed addressed by the authors. They carefully explain how they attempted to obtain unbiased samples. For example, to sample the patrolling bees, they made a sweep across the sampling area, attempting to catch all the patrolling bees that they observed. To assess the potential for bias, one must in the end make a subjective judgment.

Why make all this fuss over a technical possibility? It is important to do so because lack of attention to such possibilities has led to some colossal errors in the past. Nowhere are they more obvious than in the field of election prediction. Most of us never find out the real nature of the population that we are sampling. Hence, we never know the true size of our errors. By contrast, pollsters' errors are often painfully obvious. After the election, the actual percentages are available for everyone to see.

Lessons from Opinion Polling


In the 1930s, political opinion polling was in its formative years. The pioneers in this endeavor were training themselves on the job. Of the inevitable errors, two were so spectacular as to make international headlines.

In 1936, an American magazine with a large circulation, The Literary Digest, attempted to poll an enormous segment of the American voting public in order to predict the outcome of the presidential election that autumn. Roosevelt, the Democratic candidate, promised to develop programs designed to increase opportunities for the disadvantaged; Landon, the candidate for the Republican Party, appealed more to the wealthier segments of American society. The Literary Digest mailed out questionnaires to about ten million people whose names appeared in such places as subscription lists, club directories, etc. They received over 2.5 million responses, on the basis of which they predicted a comfortable victory for Landon. The election returns soon showed the massive size of their prediction error.

The cumbersome design of this highly publicized survey provided a young, wily pollster with the chance of a lifetime. Between the time that the Digest announced its plans and released its predictions, George Gallup planned and executed a remarkable coup. By polling only a small fraction of these individuals, and a relatively small number of other voters, he correctly predicted not only the outcome of the election, but also the enormous size of the error about to be committed by The Literary Digest.

Obviously, the enormous sample obtained by the Digest was not very representative of the population. The selection procedure was heavily biased in favor of Republican voters. The most obvious source of bias is the method used to generate the list of names and addresses of the people that they contacted.
At that time, only the relatively affluent could afford magazines, telephones, etc., and the more conservative policies of the Republican Party appealed to a greater proportion of this segment of the American public. The Digest's sample selection procedure was therefore biased in favor of the Republican candidate. The Literary Digest was guilty of taking a sample of convenience. Samples of convenience are typically prone to bias. Any researcher who, either by choice or necessity, uses such a sample has to be prepared to defend his findings against possible charges of bias. As this example shows, the bias can have catastrophic consequences.

How did Gallup obtain his more representative sample? He did not use randomization. Randomization is often criticized on the grounds that once in a while, it can produce absurdly unrepresentative samples. When faced with a sample that obviously contains far too few economically disadvantaged voters, it is small consolation to know that next time around, the error will likely not be repeated. Gallup used a procedure that virtually guaranteed that his sample would be representative with respect to such obvious features as age, race, etc. He did so by assigning quotas which his interviewers were to fill. One interviewer might, e.g., be assigned to interview 5 adult males with specified characteristics in a tough, inner-city neighborhood. The quotas were devised so as to make the sample mimic known features of the population.

This quota sampling technique suited Gallup's needs spectacularly well in 1936, even though he underestimated the support for the Democratic candidate by about 6%. His subsequent polls contained the same systematic error. In 1948, the error finally caught up with him. He predicted a narrow victory for the Republican candidate, Dewey. A newspaper editor was so confident of the prediction that he authorized the printing of a headline proclaiming the victory before the official results were available.
It turned out that the Democrat, Truman, won by a narrow margin. What was wrong with Gallup's sampling technique? He gave his interviewers the final decision as to who would be interviewed. In a tough inner-city neighborhood, an interviewer had the option of passing by a house with several motorcycles parked out in front and sounds of a raucous party coming from within. In the resulting sample, the more conservative (Republican) voters were systematically over-represented.


Gallup learned from his mistakes. His subsequent surveys replaced interviewer discretion with an objective, randomized scheme at the final stage of sample selection. With the dominant source of systematic error removed, his election predictions became even more reliable.

Implications for Biological Surveys

The bias in samples of convenience can be enormous. It can be surprisingly large even in what appear to be carefully designed surveys. It can easily exceed the typical size of the chance error terms. To completely remove the possibility of bias in the selection of a sample, randomization must be employed. Sometimes this is simply not possible, as, for example, appears to be the case in the study on bees. When this happens and the investigators wish to use the results of a non-randomized sample, the final report should discuss the possibility of selection bias and its potential impact on the conclusions. Furthermore, when reading a report containing the results of a survey, it is important to carefully evaluate the survey design and to consider the potential impact of sample selection bias on the conclusions.

Should Farley Mowat really have been content to take his samples by tossing Raunkiaer's Circle to the winds? Definitely not, for at least two reasons. First, he had to trust that by tossing the circle, he was generating an unbiased sample. It is not at all certain that some types of vegetation would not be selected with a higher probability than others. For example, the taller shrubs would tend to intercept the hoop earlier in its descent than would the smaller herbs. Second, he had no guarantee that his sample would be representative with respect to the major habitat types. Leaving aside potential bias, it is possible that the circle could, by chance, land repeatedly in a snowbed community. It seems indeed foolish to use a sampling scheme which admits the possibility of including only snowbed communities when tundra bogs and fellfields may be equally abundant in the population.

In subsequent chapters, we shall look into ways of taking more thoroughly randomized surveys, and into schemes for combining judgment with randomization to eliminate both selection bias and the potential for grossly unrepresentative samples. There are also circumstances in which a systematic sample (e.g., taking transects every 200 meters along a rocky shoreline) may be justifiable, but this subject is not discussed in these notes.

3.1.6 Model vs. design-based sampling

Model-based sampling starts by assuming some statistical model for the data in the population, and the goal is to collect data to estimate the parameters of this distribution. For example, you may be willing to assume that the distribution of values in the population is log-normal. The data collected from the survey are then used, along with a likelihood function, to estimate the parameters of the distribution. Model-based sampling is very powerful precisely because you are willing to make strong assumptions about the data-generating process. However, if your model is wrong, there are big problems. For example, what if you assume log-normality but the data are not log-normally distributed? In these cases, the estimates of the parameters can be extremely biased and inefficient.

Design-based sampling makes no assumptions about the distribution of data values in the population. Rather, it relies upon the randomization procedure to select representative elements of the population. Estimates from design-based methods are unbiased regardless of the distribution of values in the population, but in "strange" populations they can also be inefficient. For example, if a population is highly clustered, a random sample of quadrats will end up with mostly zero observations and a few large values, and the resulting estimates will have a large standard error.

Most of the results in this chapter on survey sampling are design-based, i.e. we don't need to make


any assumptions about normality in the population for the results to be valid.
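The clustered-population point can be illustrated with a small simulation (all numbers invented): a simple random sample from a highly clustered population of quadrat counts gives a design-unbiased estimate of the mean, but with a large standard error.

```python
import random
import statistics

random.seed(1)

# Invented, highly clustered population of 1000 quadrat counts:
# 95% of quadrats are empty, 5% sit in dense patches.
population = [0] * 950 + [200] * 50
true_mean = statistics.mean(population)  # 10.0

# Draw many simple random samples (n = 30, without replacement)
# and record the sample mean from each draw.
estimates = [statistics.mean(random.sample(population, 30))
             for _ in range(2000)]

print(true_mean)
print(round(statistics.mean(estimates), 2))   # near 10: design-unbiased
print(round(statistics.stdev(estimates), 2))  # large spread: inefficient here
```

The average of the repeated estimates sits near the true mean (no bias), but any single sample can be far off, which is exactly the large standard error described above.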

3.1.7 Software

For a review of packages that can be used to analyze survey data, please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html.

CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES

Standard statistical software packages generally do not take into account four common characteristics of sample survey data: (1) unequal probability selection of observations, (2) clustering of observations, (3) stratification, and (4) nonresponse and other adjustments.

Point estimates of population parameters are impacted by the value of the analysis weight for each observation. These weights depend upon the selection probabilities and other survey design features such as stratification and clustering. Hence, standard packages will yield biased point estimates if the weights are ignored.

The estimated standard errors based on sample survey data are impacted by clustering, stratification, and the weights. By ignoring these aspects, standard packages generally underestimate the standard error, sometimes substantially so.

Most standard statistical packages can perform weighted analyses, usually via a WEIGHT statement added to the program code. Use of standard statistical packages with a weighting variable may yield the same point estimates for population parameters as sample survey software packages. However, the estimated standard error often is not correct and can be substantially wrong, depending upon the particular program within the standard software package.

For further information about the problems of using standard statistical software packages in survey sampling, please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Fortunately, for simple surveys, we can often do the analysis using standard software, as will be shown in these notes. Many software packages also have specialized survey routines and, where available, these will be demonstrated. SAS includes many survey design procedures, as shown in these notes.
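A minimal sketch (with hypothetical numbers) of why ignoring the weights biases the point estimate: each analysis weight below is the number of population units a sampled unit represents (the inverse of its selection probability), and the weighted and unweighted means disagree whenever the over-sampled units differ from the rest.

```python
# Hypothetical sample of three units. The third unit came from a
# heavily over-sampled stratum, so it represents only 10 population
# units while each of the first two represents 100.
y = [2.0, 2.0, 8.0]   # observed values
w = [100, 100, 10]    # analysis weights (1 / selection probability)

unweighted = sum(y) / len(y)
weighted = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

print(unweighted)          # 4.0  - treats the over-sampled unit as typical
print(round(weighted, 2))  # 2.29 - the design-consistent estimate
```

The same WEIGHT-statement idea in SAS or other packages fixes the point estimate this way, but, as noted above, the standard error still needs survey-aware software.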

3.2 Overview of Sampling Methods

3.2.1 Simple Random Sampling

This is the basic method of selecting survey units. Each unit in the population is selected with equal probability, and all possible samples are equally likely to be chosen. This is commonly done by listing all the members of the population (the set of sampling units) and then choosing units using a random number table.

An example of a simple random sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots, and a random sample of 24 plots is selected and analyzed using aerial photos.

[Figure: map of the 24 randomly selected plots - not reproduced here.]



Units are usually chosen without replacement, i.e., each unit in the population can be chosen at most once. In some cases (particularly for multi-stage designs), there are advantages to selecting units with replacement, i.e., a unit in the population may potentially be selected more than once.

The analysis of a simple random sample is straightforward. The mean of the sample is an estimate of the population mean. An estimate of the population total is obtained by multiplying the sample mean by the number of units in the population. The sampling fraction, the proportion of units chosen from the entire population, is typically small. If it exceeds 5%, an adjustment (the finite population correction) will result in better


estimates of precision (a reduction in the standard error) to account for the fact that a substantial fraction of the population was surveyed.

A simple random sample design is often 'hidden' in the details of many other survey designs. For example, many surveys of vegetation are conducted using strip transects, where the initial starting point of the transect is randomly chosen and then every plot along the transect is measured. Here the strips are the sampling units, and they are a simple random sample from all possible strips. The individual plots are subsamples from each strip and cannot be regarded as independent samples. For example, suppose a rectangular stand is surveyed using aerial overflights. In many cases, random starting points along one edge are selected, and the aircraft then surveys the entire length of the stand starting at the chosen point. The strips are typically analyzed section-by-section, but it would be incorrect to treat the smaller parts as a simple random sample from the entire stand.

Note that a crucial element of simple random samples is that every sampling unit is chosen independently of every other sampling unit. In strip transects, plots along the same transect are not chosen independently: when a particular transect is chosen, all plots along the transect are sampled, and so the selected plots are not a simple random sample of all possible plots. Strip transects are actually examples of cluster samples. Cluster samples are discussed in greater detail later in this chapter.
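The SRS computations described above can be sketched for the hypothetical 480-plot forest stand (N = 480, n = 24; the plot measurements below are invented). The standard-error formula with the finite population correction is the usual design-based one, anticipating the SRSWOR results given later in the chapter:

```python
import math
import random
import statistics

random.seed(42)

N, n = 480, 24  # population and sample sizes from the example

# Invented measurements (e.g., biomass) for the 24 sampled plots.
y = [random.gauss(50, 10) for _ in range(n)]

ybar = statistics.mean(y)       # estimate of the population mean
total_hat = N * ybar            # estimate of the population total
fpc = 1 - n / N                 # finite population correction (n/N = 5%)
se_mean = math.sqrt(fpc * statistics.variance(y) / n)
se_total = N * se_mean

print(round(ybar, 1), round(total_hat, 1), round(se_mean, 2), round(se_total, 1))
```

Here the sampling fraction is exactly 5%, so the fpc shrinks the standard error only slightly; for larger fractions the reduction becomes substantial.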

3.2.2 Systematic Surveys

In some cases, it is logistically inconvenient to randomly select sample units from the population. An alternative is to take a systematic sample where every kth unit is selected (after a random starting point); k is chosen to give the required sample size. For example, if a stream is 2 km long and 20 samples are required, then k = 100 and samples are chosen every 100 m along the stream after a random starting point. A common alternative when the population does not naturally divide into discrete units is grid sampling. Here sampling points are located using a grid that is randomly located in the area. All sampling points are a fixed distance apart.

An example of a systematic sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots. As a total sample size of 24 is required, we need to sample every 480/24 = 20th plot. We pick a random starting point (the 9th plot in the first row), and then select every 20th plot, reading across rows.

[Figure: map of the systematically selected plots - not reproduced here.]
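The selection rule in this example can be sketched as follows (the random start below will not necessarily be the 9th plot used in the text):

```python
import random

random.seed(3)

# 480 plots, sample size 24 -> sampling interval k = 20.
N, n = 480, 24
k = N // n                             # sampling interval
start = random.randint(1, k)           # random starting plot in 1..k
sample = list(range(start, N + 1, k))  # then every k-th plot thereafter

print(k, start, len(sample))
```

Whatever the random start, exactly n = 24 plots are selected, all a fixed k = 20 plots apart; with start = 9 this reproduces the plan in the example (plots 9, 29, 49, ..., 469).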

© 2015 Carl James Schwarz, 2015-08-20

If a known trend is present in the sample, this can be incorporated into the analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic sample follows an elevation gradient that is known to directly influence the response variable. A regression-type correction can be incorporated into the analysis. However, note that this trend must be known from external sources; it cannot be deduced from the survey.

Pitfall: A systematic sample is typically analyzed in the same fashion as a simple random sample.


However, the true precision of an estimator from a systematic sample can be either worse or better than that of a simple random sample of the same size, depending on whether units within the systematic sample are positively or negatively correlated among themselves. For example, if a systematic sample’s sampling interval happens to match a cyclic pattern in the population, values within the systematic sample are highly positively correlated (the sampled units may all hit the ‘peaks’ of the cyclic trend), and the true sampling precision is worse than a SRS of the same size. What is even more unfortunate is that because the units are positively correlated within the sample, the sample variance will underestimate the true variation in the population, and if the estimated precision is computed using the formula for a SRS, a double dose of bias in the estimated precision occurs (Krebs, 1989, p. 227). On the other hand, if the systematic sample is arranged ‘perpendicular’ to a known trend to try and incorporate additional variability in the sample, the units within a sample are now negatively correlated; the true precision is now better than a SRS of the same size, but the sample variance now overestimates the population variance, and the formula for precision from a SRS will overstate the sampling error.

While logistically simpler, a systematic sample is only ‘equivalent’ to a simple random sample of the same size if the population units are ‘in random order’ to begin with (Krebs, 1989, p. 227). Even worse, there is no information in the systematic sample that allows the manager to check for hidden trends and cycles. Nevertheless, systematic samples do offer some practical advantages over SRS if some correction can be made to the bias in the estimated precision:

• it is easier to relocate plots for long-term monitoring (this is less of an issue now with GPS, as the exact position can easily be recorded and the plots revisited later);

• mapping can be carried out concurrently with the sampling effort because the ground is systematically traversed;

• it avoids the problem of poorly distributed sampling units which can occur with a SRS [but this can also be avoided by judicious stratification].

Solution: Because of the necessity for a strong assumption of ‘randomness’ in the original population, systematic samples are discouraged and statistical advice should be sought before starting such a scheme. If there are no other feasible designs, a slight variation in the systematic sample provides some protection from the above problems. Instead of taking a single systematic sample of every kth unit, take 2 or 3 independent systematic samples of every 2kth or 3kth unit, each with a different starting point. For example, rather than taking a single systematic sample every 100 m along the stream, two independent systematic samples can be taken, each selecting units every 200 m along the stream starting at two random starting points. The total sample effort is still the same, but now some measure of the large-scale spatial structure can be estimated. This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
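The replicated sub-sampling idea for the stream example can be sketched as follows. The discretization of the 2 km stream into 1 m units is an assumption made for illustration:

```python
import random

# Replicated sub-sampling (Kish, 1965): instead of one systematic
# sample of every k-th unit, take two independent systematic samples
# of every 2k-th unit, each with its own random starting point.
N, n = 2000, 20              # 2 km of stream in 1 m units; 20 samples total
k = N // n                   # single-sample interval: 100 m

replicates = []
for _ in range(2):
    start = random.randrange(2 * k)              # independent random start
    replicates.append(list(range(start, N, 2 * k)))  # every 200 m

# Total effort is unchanged, but comparing the two replicate means
# gives a crude estimate of large-scale spatial structure.
assert sum(len(r) for r in replicates) == n
```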

3.2.3 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.


The reason cluster samples are used is that costs can be reduced compared to a simple random sample giving the same precision. Because units within a cluster are close together, travel costs among units are reduced. Consequently, more clusters (and more total units) can be surveyed for the same cost as a comparable simple random sample.

For example, consider the vegetation survey of previous sections. The 480 plots can be divided into 60 clusters of size 8. A total sample size of 24 is obtained by randomly selecting three clusters from the 60 clusters present in the map, and then surveying ALL eight members of the selected clusters. A map of the design might look like:


Alternatively, clusters are often formed when a transect sample is taken. For example, suppose that the vegetation survey picked an initial starting point on the left margin, and then flew completely across the landscape in a straight line measuring all plots along the route. A map of the design might look like:


In this case, there are three clusters chosen from a possible 30 clusters, and the clusters are of unequal size (the middle cluster only has 12 plots measured compared to the 18 plots measured on each of the other two transects).

Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: In order to be confident that the reported standard error really reflects the uncertainty of the estimate, it is important that the analytical methods are appropriate for the survey design. The proper analysis treats the clusters as a random sample from the population of clusters. The methods of simple random samples are applied to the cluster summary statistics (Thompson, 1992, Chapter 12).
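The solution above, applying SRS methods to the cluster summaries rather than to the individual plots, can be sketched as follows. The per-cluster totals here are hypothetical; only the design (3 clusters sampled from 60) comes from the example:

```python
import math

# Proper cluster-sample analysis: treat the per-cluster TOTALS as a
# simple random sample of clusters, not the plots as an SRS of plots.
M = 60                               # clusters in the population
cluster_totals = [14.2, 9.8, 12.5]   # hypothetical totals for the 3 sampled clusters
m = len(cluster_totals)              # clusters sampled

ybar = sum(cluster_totals) / m                             # mean cluster total
s2 = sum((y - ybar) ** 2 for y in cluster_totals) / (m - 1)
f = m / M                                                  # sampling fraction of clusters
tau_hat = M * ybar                                         # estimated grand total
se_tau = M * math.sqrt(s2 / m * (1 - f))                   # se of the estimated total
```

The variation among the three cluster totals, not the variation among the 24 plots, is what drives the standard error here.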

3.2.4 Multi-stage sampling

In many situations, there are natural divisions of the population into several different sizes of units. For example, a forest management unit consists of several stands, each stand has several cutblocks, and each cutblock can be divided into plots. These divisions can be easily accommodated in a survey through the use of multi-stage methods. Selection of units is done in stages. For example, several stands could be selected from a management area; then several cutblocks are selected in each of the chosen stands; then several plots are selected in each of the chosen cutblocks. Note that in a multi-stage design, units at any stage are selected at random only from those larger units selected in previous stages.

Again consider the vegetation survey of previous sections. The population is again divided into 60 clusters of size 8. However, rather than surveying all units within a cluster, we decide to survey only two units within each cluster. Hence, at the first stage we now sample a total of 12 clusters out of the 60. In each cluster, we randomly sample 2 of the 8 units. A sample plan might look like the following, where the rectangles indicate the clusters selected and the checks indicate the sub-sample taken from each cluster:


The advantage of multi-stage designs is that costs can be reduced compared to a simple random sample of the same size, primarily through improved logistics. The precision of the results is worse than an equivalent simple random sample, but because costs are less, a larger multi-stage survey can often be done for the same cost as a smaller simple random sample. This often results in a more precise estimate for the same cost. However, due to the misuse of data from complex designs, simple designs are often highly preferred and end up being more cost efficient when costs associated with incorrect decisions are incorporated.


Pitfall: Although random selections are made at each stage, a common error is to analyze these types of surveys as if they arose from a simple random sample. The plots were not independently selected; if a particular cutblock was not chosen, then none of the plots within that cutblock can be chosen. As in cluster samples, the consequences of this erroneous analysis are that the estimated standard errors are too small and do not fully reflect the actual imprecision in the estimates. A manager will be more confident in the estimate than is justified by the survey.

Solution: Again, it is important that the analytical methods are suitable for the sampling design. The proper analysis of multi-stage designs takes into account that random sampling takes place at each stage (Thompson, 1992, Chapter 13). In many cases, the precision of the estimates is determined essentially by the number of first-stage units selected. Little is gained by extensive sampling at lower stages.

3.2.5 Multi-phase designs

In some surveys, multiple surveys of the same survey units are performed. In the first phase, a sample of units is selected (usually by a simple random sample). Every unit is measured on some variable. Then in subsequent phases, samples are selected ONLY from those units selected in the first phase, not from the entire population. For example, refer back to the vegetation survey. An initial sample of 24 plots is chosen in a simple random survey. Aerial flights are used to quickly measure some characteristic of the plots. A second-phase sample of 6 units (circled below) is then measured using ground-based methods.


Multiphase designs are commonly used in two situations. First, it is sometimes difficult to stratify a population in advance because the values of the stratification variables are not known. The first phase is used to measure the stratification variable on a random sample of units. The selected units are then stratified, and further samples are taken from each stratum as needed to measure a second variable. This avoids having to measure the second variable on every unit when the strata differ in importance. For example, in the first phase, plots are selected and measured for the amount of insect damage. The plots are then stratified by the amount of damage, and second phase allocation of units concentrates on plots


with low insect damage to measure total usable volume of wood. It would be wasteful to measure the volume of wood on plots with much insect damage.

The second common occurrence is when it is relatively easy to measure a surrogate variable (related to the real variable of interest) on selected units, and then in the second phase, the real variable of interest is measured on a subset of the units. The relationship between the surrogate and desired variable in the smaller sample is used to adjust the estimate based on the surrogate variable in the larger sample. For example, managers need to estimate the volume of wood removed from a harvesting area. A large sample of logging trucks is weighed (which is easy to do), and weight serves as a surrogate variable for volume. A smaller sample of trucks (selected from those weighed) is scaled for volume, and the relationship between volume and weight from the second-phase sample is used to predict volume based on weight only for the first-phase sample.

Another example is the count-plot method of estimating volume of timber in a stand. A selection of plots is chosen and the basal area determined. Then a sub-selection of plots is rechosen in the second phase, and volume measurements are made on the second-phase plots. The relationship between volume and area in the second phase is used to predict volume from the area measurements seen in the first phase.
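The logging-truck example can be sketched as a ratio-type adjustment. All of the numbers below are hypothetical; the sketch only illustrates the idea of adjusting a cheap large-sample surrogate (weight) by a ratio estimated from the expensive small second-phase sample (volume):

```python
# Phase 1: weights (tonnes) for a large sample of trucks - cheap to obtain.
phase1_weights = [28.0, 31.5, 25.4, 30.2, 27.8, 29.1, 32.0, 26.5]

# Phase 2: a subset of those trucks is also scaled for volume (m^3).
phase2 = [(28.0, 33.1), (30.2, 35.9), (26.5, 31.0)]  # (weight, volume) pairs

# Ratio of volume to weight estimated from the phase-2 subsample.
r = sum(v for _, v in phase2) / sum(w for w, _ in phase2)

# Predict the total volume of the whole phase-1 sample from weight alone.
est_volume = r * sum(phase1_weights)
```

The same pattern applies to the count-plot method, with basal area playing the role of weight and timber volume the role of volume.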

3.2.6 Panel design - suitable for long-term monitoring

One common objective of long-term studies is to investigate changes over time of a particular population. There are three common designs.

First, separate independent surveys can be conducted at each time point. This is the simplest design to analyze because all observations are independent over time. For example, independent surveys can be conducted at five-year intervals to assess regeneration of cutblocks. However, precision of the estimated change may be poor because of the additional variability introduced by having new units sampled at each time point.

At the other extreme, units are selected in the first survey, permanent monitoring stations are established, and the same units are remeasured over time. For example, permanent study plots can be established that are remeasured for regeneration over time. Ceteris paribus (all else being equal), this design is more efficient (i.e. has higher power) than the previous design. The advantage of permanent study plots occurs because, in comparisons over time, the effects of that particular monitoring site tend to cancel out, and so estimates of variability are free of the additional variability introduced by new units being measured at every time point. One possible problem is that survey units may become ‘damaged’ over time, and the sample size will tend to decline over time, resulting in a loss of power. Additionally, the analysis of these types of designs is more complex because of the need to account for the correlation over time of measurements on the same sample plot and the need to account for possible missing values when units become ‘damaged’ and are dropped from the study.

A compromise between these two designs is the partial replacement design or panel design. In these designs, a portion of the survey units is replaced with new units at each time point. For example, 1/5 of the units could be replaced by new units at each time point; units would normally stay in the study for a maximum of 5 time periods. This design combines the advantages of repeatedly measuring semi-permanent monitoring stations with the ability to replace (or refresh) the sample if units become damaged or are lost. The analysis of these designs is non-trivial, but manageable with modern software.


3.2.7 Sampling non-discrete objects

In some cases, the population does not have natural discrete sampling units. For example, a large section of land may be arbitrarily divided into 1 m² plots, or 10 m² plots. A natural question to ask is what is the ‘best size’ of unit. This has no simple answer and depends upon several factors which must be addressed for each survey:

• Cost. All else being equal, sampling many small plots may be more expensive than sampling fewer larger plots. The primary difference in cost is the overhead in traveling and setup to measure the unit.

• Size of unit. An intuitive feeling is that many smaller plots are better than a few large plots because the sample size is larger. This will be true if the characteristic of interest is ‘patchy’, but, surprisingly, makes no difference if the characteristic is randomly scattered throughout the area (Krebs, 1989, p. 64). Indeed, if the characteristic shows ‘avoidance’, then larger plots are better. For example, competition among trees implies they are spread out more than expected if they were randomly located. Logistic considerations often influence the plot size. For example, if trampling the soil affects the response, then sample plots must be small enough to measure without trampling the soil.

• Edge effects. Because the population does not have natural boundaries, decisions often have to be made about objects that lie on the edge of the sample plot. In general, larger square or circular plots are better because of the smaller edge-to-area ratio. [A long narrow rectangular plot can have more edge than a square plot of the same area.]

• Size of object being measured. Clearly a 1 m² plot is not appropriate when counting mature Douglas-fir, but may be appropriate for a lichen survey.

A pilot study should be carried out prior to a large-scale survey to investigate factors that influence the choice of sampling unit size.

3.2.8 Key considerations when designing or analyzing a survey

Key considerations when designing a survey are:

• What are the objectives of the survey?

• What is the sampling unit? This should be carefully distinguished from the observational unit. For example, you may sample boats returning from fishing, but interview the individual anglers on the boat.

• What frame is available (if any) for the sampling units? If a frame is available, then direct sampling can be used where the units can be numbered and the randomization used to select the sampling units. If no frame is available, then you will need to figure out how to identify the units and how to select them on the fly. For example, there is no frame of boats returning to an access point, so perhaps a systematic survey of every 5th boat could be used.

• Are all the sampling units the same size? If so, then a simple random sample (or a variant thereof) is likely a suitable design. If the units vary considerably in size, then an unequal probability design may be more suitable. For example, if your survey units are forest polygons (as displayed on a GIS), these polygons vary considerably in size with many smaller polygons and fewer larger polygons. A design that selects polygons with a probability proportional to the size of the polygon may be more suited than a simple random sample of polygons.


• Decide upon the sampling design used (i.e. simple random sample, or cluster sample, or multi-stage design, etc.). The availability of the frame and the existence of different-sized sampling units will often dictate the type of design used.

• What precision is required for the estimate? This (along with the variability in the response) will determine the sample size needed.

• If you are not stratifying your design, then why not? Stratification is a low-cost or no-cost way to improve your survey.

When analyzing a survey, the key steps are:

• Recognize the design that was used to collect the data. Key pointers to help recognize various designs are:

  – How were the units selected? A true simple random sample makes a list of all possible items and then chooses from that list.

  – Is there more than one size of sampling unit? For example, were transects selected at random, and then quadrats within transects selected at random? This is usually a multi-stage design.

  – Is there a cluster? For example, transects are selected, and these are divided into a series of quadrats, all of which are measured.

  Any analysis of the data must use a method that matches the design used to collect the data!

• Are there missing values? How did they occur? If the missingness is MCAR (missing completely at random), then life is easy and the analysis proceeds with a reduced sample size. If the missingness is MAR (missing at random), then some reweighting of the observed data will be required. If the missingness is IM (informative missing), seek help; this is a difficult problem.

• Use a suitable package to analyze the results (avoid Excel except for the simplest designs!).

• Report both the estimate and the measure of precision (the standard error).

3.3 Notation

Unfortunately, sampling theory has developed its own notation that is different from that used for the design of experiments or other areas of statistics, even though the same concepts are used in both. It would be nice to adopt a general convention for all of statistics; maybe in 100 years this will happen. Even among sampling textbooks, there is no agreement on notation! (sigh)

In the table below, I’ve summarized the “usual” notation used in sampling theory. In general, capital letters refer to population values, while small letters refer to sample values.


Characteristic        Population value                          Sample value

number of elements    N                                         n

units                 Y_i                                       y_i

total                 τ = Σ_{i=1..N} Y_i                        Σ_{i=1..n} y_i

mean                  µ = (1/N) Σ_{i=1..N} Y_i = τ/N            ȳ = (1/n) Σ_{i=1..n} y_i

proportion            π                                         p = y/n
                      (fraction of population units             (fraction of sampled units
                      in the class)                             in the class)

variance              S² = Σ_{i=1..N} (Y_i − µ)² / (N − 1)      s² = Σ_{i=1..n} (y_i − ȳ)² / (n − 1)

variance of a prop.   S² = N π(1 − π) / (N − 1)                 s² = n p(1 − p) / (n − 1)
Note:

• The population mean is sometimes denoted as Ȳ in many books.

• The population total is sometimes denoted as Y in many books.

• Again note the distinction between the population quantity (e.g. the population mean µ) and the corresponding sample quantity (e.g. the sample mean ȳ).

3.4 Simple Random Sampling Without Replacement (SRSWOR)

This forms the basis of many other more complex sampling plans and is the ‘gold standard’ against which all other sampling plans are compared. It often happens that more complex sampling plans consist of a series of simple random samples that are combined in a complex fashion. In this design, once the frame of units has been enumerated, a sample of size n is selected without replacement from the N population units. Refer to the previous sections for an illustration of how the units will be selected.

3.4.1 Summary of main results

It turns out that for a simple random sample, the sample mean (ȳ) is the best estimator for the population mean (µ). The population total is estimated by multiplying the sample mean by the POPULATION size. And a proportion is estimated by simply coding results as 0 or 1 depending on whether the sampled unit belongs to the class of interest, and taking the mean of these 0/1 values. (Yes, this really does work; refer to a later section for more details.)

As with every estimate, a measure of precision is required. We saw in an earlier chapter that the standard error (se) is such a measure. Recall that the standard error measures how variable the results of our survey would be if the survey were to be repeated. The standard error for the sample mean looks very similar to that for a sample mean from a completely randomized design (refer to later chapters) with a correction for the finite population (the (1 − f) term). The standard error for the population total estimate is found by multiplying the standard error for the mean by the POPULATION SIZE.


The standard error for a proportion is found, again, by treating each data value as 0 or 1 and applying the same formula as the standard error for a mean. The following table summarizes the main results:

Parameter    Population value    Estimator                  Estimated se

Mean         µ                   µ̂ = ȳ                      sqrt( (s²/n) (1 − f) )

Total        τ                   τ̂ = N × µ̂ = N ȳ            N × sqrt( (s²/n) (1 − f) )

Proportion   π                   π̂ = p = ȳ_{0/1} = y/n      sqrt( (p(1 − p)/(n − 1)) (1 − f) )
Notes:

• Inflation factor. The term N/n is called the inflation factor, and the estimator for the total is sometimes called the expansion estimator or the simple inflation estimator.

• Sampling weight. Many statistical packages that analyze survey data will require the specification of a sampling weight. A sampling weight represents how many units in the population are represented by each unit in the sample. In the case of a simple random sample, the sampling weight is equal to N/n. For example, if you select 10 units at random from 150 units in the population, the sampling weight for each observation is 15, i.e. each unit in the sample represents 15 units in the population. The sampling weights are computed differently for various designs and so won’t always be equal to N/n.

• Sampling fraction. The term n/N is called the sampling fraction and is denoted as f.

• Finite population correction (fpc). The term (1 − f) is called the finite population correction factor and reflects that if you sample a substantial part of the population, the standard error of the estimator is smaller than what would be expected from experimental design results. If f is less than 5%, this correction is often ignored. In most ecological studies the sampling fraction is usually small enough that all of the fpc terms can be ignored.
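The rows of the summary table can be computed directly from a sample. A minimal sketch (the data vector below is hypothetical; the population size N = 150 and n = 10 echo the sampling-weight example above):

```python
import math

def srs_estimates(y, N):
    """SRSWOR estimates of the mean and total, with standard errors
    that include the finite population correction (1 - f)."""
    n = len(y)
    f = n / N                                    # sampling fraction
    ybar = sum(y) / n                            # estimates the mean mu
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    se_mean = math.sqrt(s2 / n * (1 - f))
    return {"mean": ybar, "se_mean": se_mean,
            "total": N * ybar, "se_total": N * se_mean}

# Hypothetical sample of 10 units from a population of 150 units,
# so each sampled unit carries a sampling weight of N/n = 15.
est = srs_estimates([4, 7, 5, 6, 3, 8, 5, 4, 6, 7], N=150)
```

Note how the estimate of the total and its standard error are simply the mean results inflated by N, exactly as in the table.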

3.4.2 Estimating the Population Mean

The first line of the above table shows the “basic” results; all the remaining lines in the table can be derived from this line, as will be shown later.

The population mean (µ) is estimated by the sample mean (ȳ). The estimated se of the sample mean is

    se(ȳ) = sqrt( (s²/n) (1 − f) ) = (s/√n) √(1 − f)

Note that if the sampling fraction (f) is small, then the standard error of the sample mean can be approximated by:

    se(ȳ) ≈ sqrt(s²/n) = s/√n

which is the familiar form seen previously. In general, the standard error formula changes depending upon the sampling method used to collect the data and the estimator used on the data. Every different sampling design has its own way of computing the estimator and se.


Confidence intervals for parameters are computed in the usual fashion, i.e. an approximate 95% confidence interval would be found as estimator ± 2 se. Some textbooks use a t-distribution for smaller sample sizes, but most surveys are sufficiently large that this makes little difference.

3.4.3 Estimating the Population Total

Many students find this part confusing because of the term population total. This does NOT refer to the total number of units in the population, but rather to the sum of the individual values over the units. For example, if you are interested in estimating total timber volume in an inventory unit, the trees are the sampling units. A sample of trees is selected to estimate the mean volume per tree. The total timber volume over all trees in the inventory unit is of interest, not the total number of trees in the inventory unit.

As the population total is found by N µ (total population size times the population mean), a natural estimator is formed by the product of the population size and the sample mean, i.e.

    τ̂ = N ȳ

Note that you must multiply by the population size, not the sample size. Its estimated se is found by multiplying the estimated se for the sample mean by the population size as well, i.e.

    se(τ̂) = N sqrt( (s²/n) (1 − f) )

In general, estimates for population totals in most sampling designs are found by multiplying estimates of population means by the population size. Confidence intervals are found in the usual fashion.

3.4.4 Estimating Population Proportions

A “standard trick” used in survey sampling when estimating a population proportion is to replace the response variable by a 0/1 code and then treat this coded data in the same way as ordinary data.

For example, suppose you were interested in the proportion of fish in a catch that was of a particular species. A sample of 10 fish was selected (of course, in the real world, a larger sample would be taken), and the following data were observed (S=sockeye, C=chum):

    S  C  C  S  S  S  S  C  S  S

Of the 10 fish sampled, 3 were chum, so that the sample proportion of fish that were chum is 3/10 = 0.30.

If the data are recoded using 1=Chum, 0=Sockeye, the sample values would be:

    0  1  1  0  0  0  0  1  0  0

The sample average of these numbers gives ȳ = 3/10 = 0.30, which is exactly the proportion seen. It is not surprising then that by recoding the sample using 0/1 variables, the first line in the summary table reduces to the last line in the summary table. In particular, s² reduces to np(1 − p)/(n − 1), resulting in the se seen above.
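The recoding trick is easy to verify numerically: the sample mean of the 0/1 values is the sample proportion, and the usual s² formula collapses to np(1 − p)/(n − 1). Using the fish data above:

```python
codes = [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # 1 = chum, 0 = sockeye
n = len(codes)

p = sum(codes) / n                        # sample mean = sample proportion
s2 = sum((y - p) ** 2 for y in codes) / (n - 1)

assert p == 0.30                          # matches 3/10 from the text
# s^2 of the 0/1 data equals n p (1-p) / (n-1), as claimed.
assert abs(s2 - n * p * (1 - p) / (n - 1)) < 1e-12
```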


Confidence intervals are computed in the usual fashion.

3.4.5 Example - estimating total catch of fish in a recreational fishery

This will illustrate the concepts in the previous sections using a very small illustrative example.

For management purposes, it is important to estimate the total catch by recreational fishers. Unfortunately, there is no central registry of fishers, nor is there a central reporting station. Consequently, surveys are often used to estimate the total catch. There are two common survey designs used in these types of surveys (generically called creel surveys).

In access surveys, observers are stationed at access points to the fishery. For example, if fishers go out in boats to catch the fish, the access points are the marinas where the boats are launched and returned. From these access points, a sample of fishers is selected and interviews conducted to measure the number of fish captured and other attributes.

Roving surveys are commonly used when there is no common access point and you can move among the fishers. In this case, the observer moves about the fishery and questions anglers as they are encountered. Note that in this last design, the chances of encountering an angler are no longer equal; there is a greater chance of encountering an angler who has a longer fishing episode. And you typically don’t encounter the angler at the end of the episode but somewhere in the middle of it. The analysis of roving surveys is more complex; seek help.

The following example is based on a real-life example from British Columbia. The actual survey is much larger, involving several thousand anglers and sample sizes in the low hundreds, but the basic idea is the same. An access survey was conducted to estimate the total catch at a lake in British Columbia. Fortunately, access to the lake takes place at a single landing site and most anglers use boats in the fishery.
An observer was stationed at the landing site but, because of time constraints, could only interview a portion of the anglers returning; the observer was, however, able to get a total count of the number of fishing parties on that day. A total of 168 fishing parties arrived at the landing during the day, of which 30 were sampled. The decision to sample a fishing party was made using a random number table as the boat returned to the dock. The objectives are to estimate the total number of anglers and their catch, and to estimate the proportion of boat trips (fishing parties) that had sufficient life-jackets for the members on the trip. Here is the


raw data - each line is the results for a fishing party:

Number of Anglers   Party Catch   Sufficient Life Jackets?
        1               1               yes
        3               1               yes
        1               2               yes
        1               2               no
        3               2               no
        3               1               yes
        1               0               no
        1               0               no
        1               1               yes
        1               0               yes
        2               0               yes
        1               1               yes
        2               0               yes
        1               2               yes
        3               3               yes
        1               0               no
        1               0               yes
        2               0               yes
        3               1               yes
        1               0               yes
        2               0               yes
        1               1               yes
        1               0               yes
        1               0               yes
        1               0               no
        2               0               yes
        2               1               no
        1               1               no
        1               0               yes
        1               0               yes
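Before turning to the packaged analyses below, the point estimates for this example can be checked by hand using the expansion estimator from the earlier sections. A sketch (data transcribed from the table above; N = 168 parties returned, n = 30 were sampled):

```python
import math

N, n = 168, 30   # fishing parties that returned / parties interviewed

anglers = [1,3,1,1,3,3,1,1,1,1,2,1,2,1,3,1,1,2,3,1,2,1,1,1,1,2,2,1,1,1]
catch   = [1,1,2,2,2,1,0,0,1,0,0,1,0,2,3,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0]
jackets = [1,1,1,0,0,1,0,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,0,1,0,0,1,1]

f = n / N                                  # sampling fraction
total_anglers = N * sum(anglers) / n       # expansion estimator N * ybar
total_catch   = N * sum(catch) / n
p_jackets     = sum(jackets) / n           # proportion of parties with jackets
se_p = math.sqrt(p_jackets * (1 - p_jackets) / (n - 1) * (1 - f))
```

Note that the estimates expand the per-party sample means by the POPULATION size (168), not the sample size, and that the fpc (1 − f) matters here since 30/168 of the parties were sampled.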

What is the population of interest?

The population of interest is NOT the fish in the lake. The Fisheries Department is not interested in estimating the characteristics of the fish, such as mean fish weight or the number of fish in the lake. Rather, the focus is on the anglers and fishing parties. Refer to the FAQ at the end of the chapter for more details.

It would be tempting to conclude that the anglers on the lake are the population of interest. However, note that information is NOT gathered on individual anglers. For example, the number of fish captured by each angler in the party is not recorded; only the total fish caught by the party is recorded. Similarly, it is impossible to say if each angler had an individual life jacket: if there were 3 anglers in the boat and only two life jackets, which angler was without?

[Footnote: If data were collected on individual anglers, then the anglers could be taken as the population of interest. However, in this case, the design is NOT a simple random sample of anglers. Rather, as you will see later in the course, the design is a cluster sample where a simple random sample of clusters (boats) was taken and all members of the cluster (the anglers) were interviewed. As you will see later in the course, a cluster sample can be viewed as a simple random sample if you define the population in terms of clusters.]


For this reason, the population of interest is taken to be the set of boats fishing at this lake. The fisheries agency doesn't really care about the individual anglers: if a boat with 3 anglers catches one fish, the actual person who caught the fish is not recorded, and if there are only two life jackets, does it matter which angler didn't have one? Under this interpretation, the design is a simple random sample of boats returning to the landing.

What is the frame? The frame for a simple random sample is a listing of ALL the units in the population. This list is then used to randomly select which units will be measured. In this case, there is no physical list and the frame is conceptual. A random number table was used to decide which fishing parties to interview.

What is the sampling design and sampling unit? The sampling design will be treated as if it were a simple random sample of all boats (fishing parties) returning, but in actual fact it was likely a systematic sample or a variant. As you will see later, this may or may not be a problem. Special attention should be paid to identifying the correct sampling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were selected, not individual anglers. A mistake is often made when the data are presented on an individual basis rather than on a sampling-unit basis. As you will see in later chapters, this is an example of pseudo-replication.

Excel analysis

As mentioned earlier, Excel should be used with caution in statistical analysis. However, for very simple surveys, it is an adequate tool. A copy of a sample Excel worksheet called creel is available in the AllofData workbook in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is a condensed view of the spreadsheet within the workbook:

[condensed spreadsheet view not reproduced]

The analysis proceeds in a series of logical steps, as illustrated for the number-of-anglers-per-party variable.

Enter the data on the spreadsheet. The metadata (information about the survey) is entered at the top of the spreadsheet. The actual data are entered in the middle of the sheet; one row lists the variables recorded for each angling party.

Obtain the required summary statistics. At the bottom of the data, the required summary statistics are computed using Excel's built-in functions: the sample size, the sample mean, and the sample standard deviation.

Obtain estimates of the population quantity. Because the sample mean is the estimator of the population mean when the design is a simple random sample, no further computations are needed. To estimate the total number of anglers, multiply the average number of anglers per fishing party (1.533 anglers/party) by the POPULATION SIZE (the number of fishing parties for the entire day = 168) to get the estimated total number of anglers (257.6).

Obtain estimates of precision - standard errors. The se of the sample mean is computed using the formula presented earlier; the estimated standard error OF THE MEAN is 0.128 anglers/party. Because the estimated total was found by multiplying the estimated mean number of anglers per boat trip by the number of boat trips (168), the estimated standard error of the POPULATION TOTAL is found by multiplying the standard error of the sample mean by the same value: 0.128 x 168 = 21.5 anglers. Hence, a 95% confidence interval for the total number of anglers fishing this day is 257.6 ± 2(21.5).

Estimating total catch. A similar procedure is followed in the next column to estimate the total catch.

Estimating the proportion of parties with sufficient life-jackets. First, the character values yes/no are translated into 0,1 values using the IF statement of Excel.
Then the EXACT same formula as used for estimating the total number of anglers or the total catch is applied to the 0,1 data! We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4 percentage points.
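The spreadsheet steps above can be sketched in a few lines of code. The following fragment (an illustration, not part of the original workbook; the function name is my own) reproduces the estimates from the party-level data listed earlier:

```python
# Illustrative re-computation of the creel-survey estimates: a simple random
# sample of n = 30 fishing parties out of N = 168.  Data are from the table
# above; the function name is my own, not from the notes.
import math

N = 168  # number of fishing parties returning during the day

anglers = [1,3,1,1,3,3,1,1,1,1,2,1,2,1,3,1,1,2,3,1,2,1,1,1,1,2,2,1,1,1]
# sufficient life-jackets? (yes = 1, no = 0)
lifej   = [1,1,1,0,0,1,0,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,0,1,0,0,1,1]

def srs_estimates(y, N):
    """Mean, expanded total, and their se's under SRSWOR (with fpc)."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    se_mean = math.sqrt(s2 / n * (1 - n / N))  # finite-population correction
    return ybar, N * ybar, se_mean, N * se_mean

mean_a, tot_a, se_mean_a, se_tot_a = srs_estimates(anglers, N)
print(round(mean_a, 3), round(tot_a, 1), round(se_mean_a, 3))
# mean 1.533 anglers/party, total 257.6 anglers, se(mean) 0.128

p_hat, _, se_p, _ = srs_estimates(lifej, N)
print(round(p_hat, 3), round(se_p, 3))  # proportion 0.733, se 0.074
```

Note that the same function handles the 0/1 life-jacket data, which is exactly why the spreadsheet can reuse one formula for proportions.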


SAS analysis

SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of the sample SAS program called creel.sas and the output called creel.lst are available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The program starts with the Data step, which reads in the data and creates the metadata so that the purpose of the program, how the data were collected, etc. are not lost.

data creel;   /* read in the survey data */
   infile 'creel.csv' dlm=',' dsd missover;
   input angler catch lifej $;
   enough = 0;
   if lifej = 'yes' then enough = 1;
run;

The first section of code reads the data and computes the 0,1 variable from the life-jacket information. A Proc Print (not shown) lists the data so that it can be verified that the data were read correctly. Most programs for dealing with survey data require that sampling weights be available for each observation.

data creel;
   set creel;
   sampweight = 168/30;
run;

A sampling weight is the weighting factor representing how many units in the population this observation represents. In this case, each of the 30 parties represents 168/30 = 5.6 parties in the population. The sampling weights need not be specified if only means (and not totals) are being estimated; SAS then assigns equal weight to each observation, which is appropriate for a simple random sample. Finally, Proc SurveyMeans is used to estimate the quantities of interest.

proc surveymeans data=creel
     total=168   /* total population size */
     mean clm    /* find estimates of the mean, its se, and a 95% confidence interval */
     sum clsum   /* find estimates of the total, its se, and a 95% confidence interval */
     ;
   var angler catch lifej;   /* estimate the mean and total for numeric variables,
                                and proportions for the character variable */
   weight sampweight;
   /* Note that it is not necessary to use the coded 0/1 variables in this procedure */
   ods output statistics=creelresults;
run;

It is not necessary to code any formulae as these are built into the SAS procedure. So how does the SAS program know this is a simple random sample? This is the default analysis - more complex designs require additional statements (e.g. a CLUSTER statement) to indicate a more complex design. As well, equal sampling weights indicate that all items were selected with equal probability.


Here are portions of the SAS output:

  Variable  Level      Mean  SE Mean  LCL Mean  UCL Mean      Sum   SE Sum  LCL Sum  UCL Sum
  angler              1.533    0.128     1.272     1.795  257.600   21.496  213.635  301.565
  catch               0.667    0.139     0.382     0.951  112.000   23.382   64.177  159.823
  lifej     Suff.Jac  0.032    0.029     0.000     0.092    5.600    5.057    0.000   15.928
  lifej     no        0.258    0.072     0.111     0.405   44.800   12.524   19.223   70.377
  lifej     yes       0.710    0.075     0.557     0.863  123.200   12.992   96.667  149.733

All of the results match those from the Excel spreadsheet.

3.5 Sample size determination for a simple random sample

I cannot emphasize too strongly the importance of planning in advance of the survey. There are many surveys where the results are disappointing. For example, a survey of anglers may show that the mean catch per angler is 1.3 fish but that the standard error is .9 fish. In other words, a 95% confidence interval stretches from 0 to over 3 fish per angler - something that was known with near certainty even before the survey was conducted. In many cases, a back-of-the-envelope calculation would have shown, even before the survey was started, that the precision obtainable at the proposed sample size was inadequate.

In order to determine the appropriate sample size, you will first need to specify some measure of precision that must be attained. For example, a policy decision may require that the results be accurate to within 5% of the population value. This precision requirement usually occurs in one of two formats:

• an absolute precision, i.e. you wish to be 95% confident that the sample mean will not vary from the population mean by more than a pre-specified amount. For example, a 95% confidence interval for the total number of fish captured should be ± 1,000 fish.

• a relative precision, i.e. you wish to be 95% confident that the sample mean will be within 10% of the population mean.

The latter is more common than the former, but the two are equivalent and interchangeable. For example, if the actual estimate is around 200 with a se of about 50, then the 95% confidence interval is ± 100, and the relative precision is within 50% of the population answer (± 100 / 200). Conversely, a 95% confidence interval that is within ± 40% of the estimate of 200 turns out to be ± 80 (40% of 200), and consequently the se is around 40 (= 80/2).

A common question is: What is the difference between se/est and 2se/est? When is the relative standard error divided by 2? Does se/est have anything to do with a 95% c.i.?


Precision requirements are stated in different ways (replace "blah" below by mean/total/proportion etc.):

  Expression                                                 Mathematics
  - within xxx of the blah                                   se = xxx
  - margin of error of xxx                                   2se = xxx
  - within xxx of the population value 19 times out of 20    2se = xxx
  - within xxx of the population value 95% of the time       2se = xxx
  - the width of the 95% confidence interval is xxx          4se = xxx
  - within 10% of the blah                                   se/est = .10
  - a rse of 10%                                             se/est = .10
  - a relative error of 10%                                  se/est = .10
  - within 10% of the blah 95% of the time                   2se/est = .10
  - within 10% of the blah 19 times out of 20                2se/est = .10
  - margin of error of 10%                                   2se/est = .10
  - width of 95% confidence interval = 10% of the blah       4se/est = .10

As a rough rule of thumb, the following are often used as survey precision guidelines:

• For preliminary surveys, the 95% confidence interval should be ± 50% of the estimate. This implies a target rse of 25%.

• For management surveys, the 95% confidence interval should be ± 25% of the estimate. This implies a target rse of 12.5%.

• For scientific work, the 95% confidence interval should be ± 10% of the estimate. This implies a target rse of 5%.

Next, a preliminary guess for the standard deviation of individual items in the population (S) is needed, along with an estimate of the population size (N) and possibly the population mean (µ) or population total (τ). These are not too crucial and can be obtained by:

• a pilot study;
• previous sampling of similar populations;
• expert opinion.

A very rough estimate of the standard deviation is the usual range of the data divided by 4. If the population proportion is unknown, the value of 0.5 is often used, as this leads to the largest sample size requirement and so is a conservative guess.

These values are then used with the formulae for the confidence interval to determine the relevant sample size. Many textbooks have complicated formulae for this - it is much easier these days to simply code


the formulae in a spreadsheet (see examples) and use either trial and error to find an appropriate sample size, or the "GOAL SEEKER" feature of the spreadsheet. This will be illustrated in the example.

As an approximate answer, recall that the se usually varies as 1/√n. Suppose that the present rse is .075. A rse of 5% is smaller by a factor of .075/.05 = 1.5, which will require an increase of 1.5² = 2.25 in the sample size.

If the raw data are available, you can also do a "bootstrap" selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the parameter and its rse, and increase the sample size until the required rse is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement, so a direct sampling of 3x the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use Table → Subset → Random Sample Size to get the approximate bootstrap sample. Again compute the estimate and its rse, and increase the sample size until the required precision is obtained.

The final sample size should not be treated as exact, but rather as a guide to the amount of effort that needs to be expended. Remember that "guesses" are being made for the standard deviation, the required precision, the approximate value of the estimate, etc. Consequently, there really isn't a defensible difference between a required sample size of 30 and 40. What really is of interest is the order of magnitude of the effort required. For example, if your budget allows for a sample size of 20, and the sample size computations show that a sample size of 200 is required, then doing the survey with a sample size of 20 is a waste of time and money.
If the required sample size is about 30, then you may be OK with an actual sample size of 20. If more than one item is being surveyed, these calculations must be done for each item and the largest sample size needed is then chosen. This may lead to conflict, in which case some response items must be dropped or a different sampling method must be used for the other response variable.

Precision essentially depends only on the absolute sample size, not on the fraction of the population sampled. For example, a sample of 1000 people taken from Canada (population 33,000,000) is just as precise as a sample of 1000 people taken from the US (population 330,000,000)! This is highly counter-intuitive and will be explored more in class.
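This claim is easy to check numerically. The sketch below (the function name is invented for illustration) uses the usual binomial variance with a finite-population correction and compares two polls of 1000 people:

```python
# A quick numerical check: with n = 1000 and p about 0.4, the se is
# essentially identical whether the population is 33 million or 330 million,
# because the finite-population correction is negligible in both cases.
import math

def se_proportion(p, n, N):
    return math.sqrt(p * (1 - p) / n * (1 - n / N))

se_canada = se_proportion(0.4, 1000, 33_000_000)
se_usa    = se_proportion(0.4, 1000, 330_000_000)
print(se_canada, se_usa)  # both about 0.0155 (1.55 percentage points)
```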

3.5.1 Example - How many angling-parties to survey

We wish to repeat the angler creel survey next year.

• How many angling-parties should be interviewed to be 95% confident of being within 10% of the population mean catch?

• What sample size would be needed to estimate the proportion of boats (with sufficient life-jackets) to within 3 percentage points, 19 times out of 20? In this case we are asking that the 95% confidence interval be ± 0.03, or that the se = 0.015.

The sample size spreadsheet is available in an Excel workbook called SurveySampleSize.xls which can be downloaded from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.


A SAS program to compute sample size is also available but, in my opinion, is neither user-friendly nor as flexible for the general user. The code and output are also available in the Sample Program Library referred to above. Here is a condensed view of the spreadsheet:

[condensed spreadsheet view not reproduced]

First note that the computations for sample size require some PRIOR information about the population size, the population mean, or the population proportion. We will use information from the previous survey to help plan future studies.

For example, about 168 boats returned last year. The mean catch per angling party was about .667 fish/boat, and the standard deviation of the catch per party was .844. These values are entered in the spreadsheet in Column C. A preliminary sample size of 40 (in green in Column C) was tried. This led to a 95% confidence interval of ± 35%, which did not meet the precision requirements. Now vary the sample size (in green) in Column C until the 95% confidence interval (in yellow) is below ± 10%. You will find that you will need to interview almost 135 parties - a very high sampling fraction indeed. The problem for this variable is the very high variation among individual data points. If you are familiar with Excel, you can use the Goal Seeker function to speed the search.

Similarly, the proportion of boats with sufficient life-jackets last year was around 73%. Enter this in the blue areas of Column E. The initial sample size of 20 is too small, as the 95% confidence interval is ± .186 (about 18 percentage points). Now vary the sample size (in green) until the 95% confidence interval is ± .03. Note that you need to be careful in dealing with percentages - confidence limits are often specified in terms of percentage points rather than percents, to avoid problems where percents are taken of percents. This will be explained further in class.

Try using the spreadsheet to compare the precision of a poll of 1000 people taken from Canada (population 33,000,000) with a poll of 1000 people taken from the US (population 330,000,000) if both polls have about 40% in favor of some issue.

Technical notes

If you really want to know how the sample size numbers are determined, here is the lowdown.
Suppose that you wish to be 95% sure that the sample mean is within 10% of the population mean. We must solve

   z (S/√n) √((N - n)/N) ≤ εµ

for n, where z is the multiplier for the particular confidence level (for a 95% c.i. use z = 2) and ε is the 'closeness' factor (in this case ε = 0.10). Rearranging this equation gives

   n = N / (1 + N (εµ/(zS))²)
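The rearranged formula can be coded directly. In the sketch below (the function name is my own), the inputs N = 168 parties, mean catch µ = 0.667, and S = 0.844 are the creel-survey values used in the example, with z = 2 for a 95% confidence interval and ε = 0.10:

```python
# Sample size for a SRSWOR so that z*se(mean) <= eps*mu, using
# n = N / (1 + N*(eps*mu/(z*S))**2) derived above.
import math

def srs_sample_size(N, mu, S, eps, z=2.0):
    """Sample size n so that z * se(mean) <= eps * mu under SRSWOR."""
    return N / (1 + N * (eps * mu / (z * S)) ** 2)

n = srs_sample_size(N=168, mu=0.667, S=0.844, eps=0.10)
print(math.ceil(n))  # about 134 parties, in line with the "almost 135"
                     # found by trial and error in the spreadsheet example
```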

3.6 Systematic sampling

Sometimes, logistical considerations make a true simple random sample not very convenient to administer. For example, in the previous creel survey, a true random sample would require that a random number be generated for each boat returning to the marina. In such cases, a systematic sample could be used to select elements. For example, every 5th angler could be selected after a random starting point.

3.6.1 Advantages of systematic sampling

The main advantages of systematic sampling are:


• it is easier to draw units because only one random number is chosen;

• it can be used when a sampling frame is not available but there is a convenient method of selecting items, e.g. the creel survey where every 5th angler is chosen;

• instructions are easier for untrained staff;

• if the population is in random order relative to the variable being measured, the method is equivalent to a SRS. For example, it is unlikely that the number of anglers in each boat changes dramatically over the period of the day. This is an important assumption that should be investigated carefully in any real life situation!

• it distributes the sample more evenly over the population. Consequently, if there is a trend, you will get items selected from all parts of the trend.

3.6.2 Disadvantages of systematic sampling

The primary disadvantages of systematic sampling are:

• Hidden periodicities or trends may cause biased results. In such cases, estimates of the mean and standard errors may be severely biased! See Section 4.2.2 for a detailed discussion.

• Without making an assumption about the distribution of population units, there is no estimate of the standard error. This is an important disadvantage of a systematic sample! Many studies very casually make the assumption that the systematic sample is equivalent to a simple random sample without much justification for this.

3.6.3 How to select a systematic sample

There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every kth record, where k is chosen to meet the sample size requirements (an example of choosing k will be given in class). All of the following methods are equivalent if k divides N exactly. These are the two most common methods:

• Method 1. Choose a random number j from 1 · · · k. Then choose records j, j + k, j + 2k, · · ·. One problem is that different samples may be of different sizes - an example will be given in class where k doesn't divide N exactly. This causes problems in sampling theory, but not too much of a problem if n is large.

• Method 2. Choose a random number from 1 · · · N. Choose every kth item, continuing in a circle when you reach the end, until you have selected n items. This will always give the same sized sample; however, it requires knowledge of N.
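The two selection methods can be sketched as follows (the function names are mine; this is an illustration, not code from the notes). Units are numbered 1..N and every kth unit is taken after a random start:

```python
# Sketch of the two systematic-selection methods described above.
import random

def systematic_method1(N, k, rng=random):
    """Random start j in 1..k, then j, j+k, ...; the sample size may vary."""
    j = rng.randint(1, k)
    return list(range(j, N + 1, k))

def systematic_method2(N, k, n, rng=random):
    """Random start in 1..N, every k-th unit, wrapping around the end."""
    start = rng.randint(1, N)
    return [(start - 1 + i * k) % N + 1 for i in range(n)]

rng = random.Random(42)
s1 = systematic_method1(103, 10, rng)      # 10 or 11 units: 10 doesn't divide 103
s2 = systematic_method2(103, 10, 10, rng)  # always exactly 10 units
print(len(s1), len(s2))
```

Running method 1 repeatedly shows the varying sample size (10 or 11 here), which is exactly the complication mentioned above when k does not divide N.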

3.6.4 Analyzing a systematic sample

Most surveys casually assume that the population was sorted in random order when the systematic sample was selected, and so treat the results as if they had come from a SRSWOR. This is theoretically not correct; if the assumption is false, the results may be biased, and there is no way of examining the biases from the data at hand.


Before implementing or analyzing a systematic survey, please consult with an expert in sampling theory to avoid problems. This is a case where an hour or two of consultation before spending lots of money could turn a survey where nothing can be estimated into a survey that has justifiable results.

3.6.5 Technical notes - Repeated systematic sampling

To avoid many of the potential problems with systematic sampling, a common device is to use repeated systematic samples on the same population. For example, rather than taking a single systematic sample of size 100 from a population, you can take 4 systematic samples (with different starting points) of size 25.

This gives an empirical method of obtaining a standard error from a systematic sample. Rather than choosing one systematic subsample of every kth unit, choose m independent systematic subsamples of size n/m. Then estimate the mean of each subsample. Treat these means as a simple random sample from the population of possible systematic samples and use the usual sampling theory: the variation of the estimates among the subsamples provides an estimate of the standard error (after an appropriate adjustment). This will be illustrated in an example.

Example of replicated subsampling within a systematic sample

A yearly survey has been conducted in the Prairie Provinces to estimate the number of breeding pairs of ducks. One breeding area has been divided into approximately 1000 transects of a certain width, i.e. the breeding area was divided into 1000 strips.

What is the population of interest? As noted in class, the definition of a population depends, in part, upon the interest of the researcher. Two possible definitions are:

• The population is the set of individual ducks on the study area. However, no frame exists for the individual birds, but a frame can be constructed based on the 1000 strips that cover the study area. In this case, the design is a cluster sample, with the clusters being the strips.

• The population consists of the 1000 strips that cover the study area, and the number of ducks in each strip is the response variable. The design is then a simple random sample of the strips.

In either case, the analysis is exactly the same and the final estimates are exactly the same.

Approximately 100 of the transects are flown by an aircraft, and spotters on the aircraft count the number of breeding pairs visible from the aircraft. For administrative convenience, it is easier to conduct systematic sampling. However, there is structure to the data: it is well known that ducks do not spread themselves randomly throughout the breeding area. After discussions with our Statistical Consulting Service, the researchers flew 10 sets of replicated systematic samples; each set consisted of 10 transects. As each transect is flown, the scientists also classify it as 'prime' or 'non-prime' breeding habitat. Here are the raw data reporting the number of nests in each set of 10 transects:


          Prime habitat      Non-prime habitat    ALL       Prime     Non-prime
  Set     Total (b)   n      Total          n    Total (a)  mean (c)  mean (d)  Diff (e)
   1        123       3        345          7      468        41.0      49.3      -8.3
   2         57       2         36          8       93        28.5       4.5      24.0
   3         85       5         46          5      131        17.0       9.2       7.8
   4         97       2        131          8      228        48.5      16.4      32.1
   5         34       5         43          5       77         6.8       8.6      -1.8
   6         85       3         67          7      152        28.3       9.6      18.8
   7         56       7         64          3      120         8.0      21.3     -13.3
   8         46       2         65          8      111        23.0       8.1      14.9
   9         37       4         43          6       80         9.3       7.2       2.1
  10         93       2        104          8      197        46.5      13.0      33.5

  Summary over the 10 sets (columns (b), (a), and (e)):
  Avg          71.3            165.7             10.97
  s            29.5            117.0             16.38
  n              10               10                10
  Est total    7130            16570    Est mean  10.97
  Est se        885             3510    se         4.91

Several different estimates can be formed.

1. Total number of nests in the breeding area (refer to column (a) above). The total number of nests in the breeding area over all types of habitat is of interest. Column (a) in the above table is the data that will be used; it represents the total number of nests in the 10 transects of each set. The principle behind the estimator is that the 1000 transects can be divided into 100 sets of 10 transects, of which a random sample of size 10 was chosen. The sampling unit is the set of transects - the individual transects are essentially ignored. Note that this method assumes that the systematic samples are all of the same size. If the systematic samples had been of different sizes (e.g. some sets had 15 transects, other sets had 5 transects), then a ratio-estimator (see later sections) would have been a better estimator.

• Compute the total number of nests for each set. This is found in column (a).

• The selected sets are then treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as

     µ̂ = (468 + 93 + · · · + 197)/10 = 165.7

  with an estimated se of

     se(µ̂) = √( (s²/n)(1 - n/100) ) = √( (117.0²/10)(1 - 10/100) ) = 35.1

• The average number of nests per set is expanded to cover all 100 sets: τ̂ = 100 µ̂ = 16570 and se(τ̂) = 100 se(µ̂) = 3510.

2. Total number of nests in the prime habitat only (refer to column (b) above). This is formed in exactly the same way as the previous estimate. This is technically known as estimation in a


domain. The number of elements in the domain in the whole population (i.e. how many of the 1000 transects are in prime habitat) is unknown, but it is not needed. All that you need is the total number of nests in prime habitat in each set - you essentially ignore the non-prime habitat transects within each set. The average number of nests per set in prime habitat is found as before:

     µ̂ = (123 + · · · + 93)/10 = 71.3

  with an estimated se of

     se(µ̂) = √( (s²/n)(1 - n/100) ) = √( (29.5²/10)(1 - 10/100) ) = 8.85

• Because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se are τ̂ = 100 µ̂ = 7130 with se(τ̂) = 100 se(µ̂) = 885.

• Note that the total number of transects of prime habitat is not known for the population, so an estimate of the density of nests in prime habitat cannot be computed from this estimated total. However, a ratio-estimator (see later in the notes) could be used to estimate the density.

3. Difference in mean density between prime and non-prime habitats (refer to columns (c)-(e) above). The scientists suspect that the density of nests is higher in prime habitat than in non-prime habitat. Is there evidence of this in the data? Here everything must be transformed to the density of nests per transect (assuming that the transects were all the same size). Also, pairing (refer to the section on experimental design) is taking place, so a difference must be computed for each set and the differences analyzed, rather than treating the prime and non-prime habitats as independent samples. Again, this is an example of domain estimation.

• Compute the domain means by habitat type for each set (columns (c) and (d)). Note that the totals are divided by the number of transects of each type in each set.

• Compute the difference in the means for each set (column (e)).

• Treat these differences as a simple random sample of size 10 taken from the 100 possible sets of transects. What does the final estimated mean difference and se imply?
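For concreteness, estimate 1 can be verified numerically. The fragment below (illustrative only; variable names are my own) re-computes the mean, total, and se from the column (a) set totals:

```python
# Re-computation of estimate 1 (total nests over all habitat).  The 10 set
# totals are column (a) of the table above, treated as a SRSWOR of size 10
# from the 100 possible sets of transects.
import math

set_totals = [468, 93, 131, 228, 77, 152, 120, 111, 80, 197]  # column (a)
n, n_sets = len(set_totals), 100

mean = sum(set_totals) / n                      # 165.7 nests per set
s2 = sum((y - mean) ** 2 for y in set_totals) / (n - 1)
se_mean = math.sqrt(s2 / n * (1 - n / n_sets))  # about 35.1

total, se_total = n_sets * mean, n_sets * se_mean
print(round(total), round(se_total))  # 16570 nests, se about 3510
```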

3.7 Stratified simple random sampling

A simple modification to a simple random sample can often lead to dramatic improvements in precision. This is known as stratification. All survey methods can potentially benefit from stratification (also known as blocking in the experimental design literature). Stratification will be beneficial whenever variability in the response variable among the survey units can be anticipated and strata can be formed that are more homogeneous than the original set of survey units.

All stratified designs have the same basic steps, listed below, regardless of the underlying design.

• Creation of strata. Stratification begins by grouping the survey units into homogeneous groups (strata): survey units within a stratum should be similar, and strata should differ from each other. For example, suppose you wished to estimate the density of animals. The survey region is divided into a large number of quadrats based on aerial photographs. The quadrats can be stratified into high and low quality habitat because it is thought that the density within the high quality quadrats may be similar but different from the density in the low quality habitats. The strata do not have to be physically contiguous - for example, the high quality habitats could be scattered throughout the survey region and can be grouped into one single stratum.


• Determine total sample size. Use the methods in previous sections to determine the total sample size (number of survey units) to select. At this stage, some sort of "average" standard deviation will be used to determine the sample size.

• Allocate effort among the strata. There are several ways to allocate the total effort among the strata.

  – Equal allocation. In equal allocation, the total effort is split equally among all strata. Equal allocation is preferred when equally precise estimates are required for each stratum. [2]

  – Proportional allocation. In proportional allocation, the total effort is allocated to the strata in proportion to stratum importance. Stratum importance could be related to stratum size (e.g. when allocating effort between the U.S. and Canada, because the U.S. is 10 times larger than Canada, more effort should be allocated to surveying the U.S.). But if density is your measure of importance, allocate more effort to the higher density strata. Proportional allocation is preferred when more precise estimates are required in more important strata.

  – Neyman allocation. Neyman determined that if you also have information on the variability within each stratum, then more effort should be allocated to strata that are more important and more variable, to give the most precise overall estimate for a given sample size. This is rarely performed in ecology because information on intra-stratum variability is usually unknown. [3]

  – Cost allocation. In general, effort should be allocated to more important strata, more variable strata, or strata where sampling is cheaper, to give the best overall precision for the entire survey. As with the previous allocation method, ecologists rarely have sufficiently detailed cost information to use this allocation method.

• Conduct separate surveys in each stratum. Separate independent surveys are conducted in each stratum. It is not necessary to use the same survey method in all strata.
For example, low density quadrats could be surveyed using aerial methods, while high density strata may require ground based methods. Some strata may use simple random samples, while other strata may use cluster samples. Many textbooks show examples where the same survey method is used in all strata, but this is NOT required. The ability to use different sampling methods in the different strata often leads to substantial cost savings and is a very good reason to use stratified sampling!

• Obtain stratum specific estimates. Use the appropriate estimators to estimate the stratum means and the se for EACH stratum. Then expand the estimated mean to get the estimated total (and its se) in the usual way.

• Rollup. The individual stratum estimates of the TOTAL are then combined to give an overall Grand Total for the entire survey region. The se of the Grand Total is found as

     se(GT) = √( se(τ̂₁)² + se(τ̂₂)² + . . . )

Finally, if you want the overall grand average, simply divide the grand total (and its se) by the appropriate divisor.

Stratification is normally carried out prior to the survey (pre-stratification), but can also be done after the survey (post-stratification) - refer to a later section for details. Stratification can be used with any type of sampling design - the concepts introduced here deal with stratification applied to simple random samples but are easily extended to more complex designs.

[2] Recall from previous sections that the absolute sample size is one of the drivers for precision.

[3] However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations based on stratum means often capture this variation as well.

The advantages of stratification are:

2015 Carl James Schwarz

141

2015-08-20

• standard errors of the mean or of the total will be smaller (i.e. estimates are more precise) when compared to the respective standard errors from an unstratified design, if the units within strata are more homogeneous (i.e., less variable) compared to the variability over the entire unstratified population.

• different sampling methods may be used in each stratum for cost or convenience reasons. [In the detail below we assume that each stratum has the same sampling method, but this is only for simplification.] This can often lead to reductions in cost as the most appropriate and cost-effective sampling method can be used in each stratum.

• because randomization occurs independently in each stratum, corruption of the survey design due to problems experienced in the field may be confined to the affected strata.

• separate estimates for each stratum with a given precision can be obtained.

• it may be more convenient to take a stratified random sample for administrative reasons. For example, the strata may refer to different district offices.

3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample

You may find it useful to compare a simple random sample of 24 vs. a stratified random sample of 24 using the following visual plans. Select a sample of 24 in each case.

Simple Random Sampling

[The visual sampling plan from the original notes is not reproduced here.]

Describe how the sample was taken.


Stratified Simple Random Sampling

Suppose that there is a gradient in habitat quality across the population. Then a more efficient (i.e. leading to smaller standard errors) sampling design is a stratified design. Three strata are defined, consisting of the first 3 rows, the next 5 rows, and finally, the last two rows. In many cases, the same sampling design is used in all strata. For example, suppose it was decided to conduct a simple random sample within each stratum, with sample sizes of 8, 10, and 6 in the three strata respectively. [The decision process for allocating samples to strata will be covered later.]
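The selection step just described can be sketched in code. This is an illustrative sketch, not part of the original notes: the 10 × 10 grid of plots and the stratum labels A, B, C are hypothetical, but the row-based strata and the per-stratum sample sizes of 8, 10, and 6 follow the text. An independent SRSWOR is drawn within each stratum.

```python
import random

# Hypothetical 10 x 10 grid of survey plots, identified by (row, col).
plots = [(row, col) for row in range(10) for col in range(10)]

# Strata defined by rows: first 3 rows, next 5 rows, last 2 rows.
strata = {
    "A": [p for p in plots if p[0] < 3],
    "B": [p for p in plots if 3 <= p[0] < 8],
    "C": [p for p in plots if p[0] >= 8],
}
sizes = {"A": 8, "B": 10, "C": 6}   # allocation from the text

random.seed(1)
# random.sample draws without replacement, i.e. an SRSWOR in each stratum.
sample = {h: random.sample(units, sizes[h]) for h, units in strata.items()}
for h in "ABC":
    print(h, len(sample[h]), "units selected of", len(strata[h]))
```

Because each stratum is sampled independently, a field problem that corrupts one stratum's draw leaves the other strata's randomization intact.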


Stratified Sampling with a different method in each stratum

It is quite possible, and often desirable, to use different methods in the different strata. For example, it may be more efficient to survey desert areas using a fixed-wing aircraft, while ground surveys need to be used in heavily forested areas. As an illustration, consider the following design: in the first (top-most) stratum, a simple random sample was taken; in the second stratum a cluster sample was taken; in the third stratum a cluster sample (via transects) was also taken. [The visual plans are not reproduced here.]


3.7.2 Notation

Common notation is to use h as a stratum index and i or j as unit indices within each stratum.

Characteristic        Population quantity                   Sample quantity
number of strata      H                                     H
stratum sizes         N_1, N_2, ..., N_H                    n_1, n_2, ..., n_H
units                 Y_hj (h=1,...,H; j=1,...,N_h)         y_hj (h=1,...,H; j=1,...,n_h)
stratum totals        τ_h                                   y_h
stratum means         µ_h                                   ȳ_h
variance              S_h²                                  s_h²
standard deviation    S_h                                   s_h

Population total:  τ = N Σ_{h=1}^{H} W_h µ_h, where W_h = N_h/N
Population mean:   µ = Σ_{h=1}^{H} W_h µ_h

3.7.3 Summary of main results

It is assumed that from each stratum, a SRSWOR of size n_h is selected independently of ALL OTHER STRATA! The results below summarize the computations, which can be more easily thought of as occurring in four steps:

1. Compute the estimated mean and its se for each stratum. In this chapter, we use a SRS design in each stratum, but it is not necessary to use this design and each stratum could have a different design. In the case of an SRS, the estimate of the mean for each stratum is found as:

      µ̂_h = ȳ_h

   with associated standard error:

      se(µ̂_h) = sqrt( (s_h²/n_h)(1 − f_h) )

   where the subscript h refers to each stratum.

2. Compute the estimated total and its se for each stratum. In many cases this is simply the estimated mean for the stratum multiplied by the STRATUM POPULATION size. In the case of an SRS in each stratum this gives:

      τ̂_h = N_h × µ̂_h = N_h × ȳ_h
      se(τ̂_h) = N_h × se(µ̂_h) = N_h × sqrt( (s_h²/n_h)(1 − f_h) )

3. Compute the grand total and its se over all strata. The grand total is the sum of the individual totals; its se is computed in a special way:

      τ̂ = τ̂_1 + τ̂_2 + ...
      se(τ̂) = sqrt( se(τ̂_1)² + se(τ̂_2)² + ... )


4. Occasionally, the grand mean over all strata is needed. This is found by dividing the estimated grand total by the total POPULATION size:

      µ̂ = τ̂ / (N_1 + N_2 + ...)
      se(µ̂) = se(τ̂) / (N_1 + N_2 + ...)

This can be summarized in succinct form as follows. Note that the stratum weights W_h are formed as N_h/N and are often used to derive weighted means etc.:

Quantity   Population value                        Estimator                            se
Mean       µ = Σ_{h=1}^{H} W_h µ_h                 µ̂_str = Σ_{h=1}^{H} W_h ȳ_h         sqrt( Σ W_h² se²(ȳ_h) ) = sqrt( Σ W_h² (s_h²/n_h)(1 − f_h) )
Total      τ = N Σ W_h µ_h = Σ τ_h = Σ N_h µ_h     τ̂_str = N Σ W_h ȳ_h = Σ N_h ȳ_h     sqrt( Σ N_h² se²(ȳ_h) ) = sqrt( Σ N_h² (s_h²/n_h)(1 − f_h) )

Notes

• The estimator for the grand population mean is a weighted average of the individual stratum means using the POPULATION weights rather than the sample weights. This is NOT the same as the simple unweighted average of the estimated stratum means unless the n_h/n equal the N_h/N; such a design is known as proportional allocation in stratified sampling.

• The estimated standard error for the grand total is found as sqrt(se_1² + se_2² + ... + se_H²), i.e. the square root of the sum of the individual se² of the strata TOTALS.

• The estimators for a proportion are IDENTICAL to those for the mean, except that the variable of interest is replaced by a 0/1 variable, where 1 = has the characteristic of interest and 0 = does not.

• Confidence intervals. Once the se has been determined, the usual ±2se will give approximate 95% confidence intervals if the sample sizes are relatively large in each stratum. If the sample sizes are small in each stratum, some authors suggest using a t-distribution with degrees of freedom determined using a Satterthwaite approximation - this will not be covered in this course.
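The four-step computation above can be sketched as a small function. This is an illustrative sketch (the function name and data layout are mine, not from the notes), implementing the SRSWOR-within-stratum formulas: stratum means with the finite population correction, expansion to stratum totals, and se's combined as the square root of the sum of squares.

```python
import math

def stratified_estimates(strata):
    """Stratified expansion estimates from independent SRSWOR samples.

    `strata` maps a stratum label to (N_h, list of sampled y values).
    Returns (grand total, se(total), grand mean, se(mean)).
    """
    total, var_total, N = 0.0, 0.0, 0
    for N_h, ys in strata.values():
        n_h = len(ys)
        ybar = sum(ys) / n_h
        s2 = sum((y - ybar) ** 2 for y in ys) / (n_h - 1)
        f_h = n_h / N_h                       # sampling fraction
        se_mean = math.sqrt(s2 / n_h * (1 - f_h))
        total += N_h * ybar                   # tau_hat_h = N_h * ybar_h
        var_total += (N_h * se_mean) ** 2     # se's combine via sum of squares
        N += N_h
    se_total = math.sqrt(var_total)
    return total, se_total, total / N, se_total / N

# Tiny hand-checkable example: two strata of sizes 10 and 20.
tot, se_tot, mean, se_mean = stratified_estimates({"A": (10, [1, 3]),
                                                   "B": (20, [4, 6])})
print(round(tot), round(se_tot, 2))   # 120 20.98
```

Note that the grand mean returned here is the population-weighted average of the stratum means, not the unweighted average, matching the first bullet above.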

3.7.4 Example - sampling organic matter from a lake

[With thanks to Dr. Rick Routledge for this example.] Suppose that you were asked to estimate the total amount of organic matter suspended in a lake just after a storm. The first scheme that might occur to you could be to cruise around the lake in a haphazard fashion and collect a few sample vials of water which you could then take back to the lab. If you knew


the total volume of water in the lake, then you could obtain an estimate of the total amount of organic matter by taking the product of the average concentration in your sample and the total volume of the lake.

The accuracy of your estimate of course depends critically on the extent to which your sample is representative of the entire lake. If you used the haphazard scheme outlined above, you have no way of objectively evaluating the accuracy of the sample. It would be more sensible to take a properly randomized sample. (How might you go about doing this?)

Nonetheless, taking a randomized sample from the entire lake would still not be a totally sensible approach to the problem. Suppose that the lake were to be fed by a single stream, and that most of the organic matter were concentrated close to the mouth of the stream. If the sample were indeed representative, then most of the vials would contain relatively low concentrations of organic matter, whereas the few taken from around the mouth of the stream would contain much higher concentration levels. That is, there is a real potential for outliers in the sample. Hence, confidence limits based on the normal distribution would not be trustworthy. Furthermore, the sample mean is not as reliable as it might be. Its value will depend critically on the number of vials sampled from the region close to the stream mouth. This source of variation ought to be controlled. Finally, it might be useful to estimate not just the total amount of organic matter in the entire lake, but the extent to which this total is concentrated near the mouth of the stream.

You can simultaneously overcome all three deficiencies by taking what is called a stratified random sample. This involves dividing the lake into two or more parts called strata. (These are not the horizontal strata that naturally form in most lakes, although these natural strata might be used in a more complex sampling scheme than the one considered here.)
In this instance, the lake could be divided into two parts, one consisting roughly of the area of high concentration close to the stream outlet, the other comprising the remainder of the lake. Then if a simple random sample of fixed size were to be taken from within each of these “strata”, the results could be used to estimate the total amount of organic matter within each stratum. These subtotals could then be added to produce an estimate of the overall total for the lake. This procedure, because it involves constructing separate estimates for each stratum, permits us to assess the extent to which the organic matter is concentrated near the stream mouth. It also permits the investigator to control the number of vials sampled from each of the two parts of the lake. Hence, the chance variation in the estimated total ought to be sharply reduced. Finally, we shall soon see that the confidence limits that one can construct are free of the outlier problem that invalidated the confidence limits based on a simple random sampling scheme.

A randomized sample is to be drawn independently from within each stratum. How can we use the results of a stratified random sample to estimate the overall total? The simplest way is to construct an estimate of the total within each of the strata, and then to sum these estimates. A sensible estimate of the average within the h'th stratum is ȳ_h. Hence, a sensible estimate of the total within the h'th stratum is τ̂_h = N_h ȳ_h, and the overall total can be estimated by τ̂ = Σ_{h=1}^{H} τ̂_h = Σ_{h=1}^{H} N_h ȳ_h. If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the size of the population, N. The resulting estimator is called the stratified random sampling estimator of the population average, and is given by µ̂ = Σ_{h=1}^{H} N_h ȳ_h / N.


This can be expressed as a fancy average if we adjust the order of operations in the above expression. If, instead of dividing the sum by N, we divide each term by N and then sum the results, we shall obtain the same result. Hence,

      µ̂_stratified = Σ_{h=1}^{H} (N_h/N) ȳ_h = Σ_{h=1}^{H} W_h ȳ_h,

where W_h = N_h/N. These W_h values can be thought of as weighting factors, and µ̂_stratified can then be viewed as a weighted average of the within-stratum sample averages. The estimated standard error is found as:

      se(µ̂_stratified) = se( Σ_{h=1}^{H} W_h ȳ_h ) = sqrt( Σ_{h=1}^{H} W_h² [se(ȳ_h)]² ),

where the estimated se(ȳ_h) is given by the formula for simple random sampling:

      se(ȳ_h) = sqrt( (s_h²/n_h)(1 − f_h) ).

A Numerical Example

Suppose that for the lake sampling example discussed earlier the lake were subdivided into two strata, and that the following results were obtained. (All readings are in mg per litre.)

Stratum   N_h         n_h   Sample Observations              ȳ_h     s_h
1         7.5 × 10⁸   5     40.4, 37.2, 46.6, 45.3, 38.1     41.52   4.23
2         2.5 × 10⁷   5     403, 365, 344, 388, 347          369.4   25.7

We begin by computing the estimated mean for each stratum and its associated standard error. The sampling fraction n_h/N_h is so close to 0 that it can be safely ignored. For example, the standard error of the mean for stratum 1 is found as:

      se(µ̂_1) = sqrt( (s_1²/n_1)(1 − f_1) ) = sqrt( 4.23²/5 ) = 1.89

This gives the summary table:

Stratum   n_h   µ̂_h     se(µ̂_h)
1         5     41.52    1.8935
2         5     369.4    11.492

Next, we estimate the total organic matter in each stratum. This is found by multiplying the mean concentration (and its se) for each stratum by the stratum volume:

      τ̂_h = N_h × µ̂_h
      se(τ̂_h) = N_h × se(µ̂_h)


For example, the estimated total organic matter in stratum 1 is found as:

      τ̂_1 = N_1 × µ̂_1 = 7.5 × 10⁸ × 41.52 = 311.4 × 10⁸
      se(τ̂_1) = N_1 × se(µ̂_1) = 7.5 × 10⁸ × 1.89 = 14.175 × 10⁸

This gives the summary table:

Stratum   n_h   µ̂_h     se(µ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52    1.8935     311.4 × 10⁸    14.175 × 10⁸
2         5     369.4    11.492     92.3 × 10⁸     2.873 × 10⁸

Next, we total the organic content of the two strata and find the se of the grand total as sqrt(14.175² + 2.873²) × 10⁸ = 14.46 × 10⁸, to give the summary table:

Stratum   n_h   µ̂_h     se(µ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52    1.8935     311.4 × 10⁸    14.175 × 10⁸
2         5     369.4    11.492     92.3 × 10⁸     2.873 × 10⁸
Total                               403.7 × 10⁸    14.46 × 10⁸

Finally, the overall grand mean is found by dividing by the total volume of the lake, 7.75 × 10⁸, to give:

      µ̂ = (403.7 × 10⁸) / (7.75 × 10⁸) = 52.09 mg/L
      se(µ̂) = (14.46 × 10⁸) / (7.75 × 10⁸) = 1.87 mg/L

The calculations required to compute the stratified estimate can also be done using the method of weighted averages, as shown in the following table:

Stratum   N_h          W_h (= N_h/N)   ȳ_h     W_h ȳ_h   se(ȳ_h)   W_h² [se(ȳ_h)]²
1         7.5 × 10⁸    0.9677          41.52    40.180    1.8935     3.3578
2         2.5 × 10⁷    0.0323          369.4    11.916    11.492     0.1374
Totals    7.75 × 10⁸   1.0000                   52.097               3.4952

                                                          se = sqrt(3.4952)

Hence the estimate of the overall average is 52.097 mg/L, and the associated estimated standard error is sqrt(3.4952) = 1.870 mg/L; an approximate 95% confidence interval is then found in the usual fashion. As expected, these match the previous results.

This discussion swept a number of practical difficulties under the carpet. These include (a) estimating the volume of each of the two portions of the lake, (b) taking properly randomized samples from within each stratum, (c) selecting the appropriate size of each water sample, (d) measuring the concentration in each water sample, and (e) choosing the appropriate number of water samples from each stratum. None of these tasks is simple. Estimating the volume of a portion of a lake, for example, typically involves taking numerous depth readings and then applying a formula for approximating integrals. This problem is beyond the scope of these notes.
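The worked example above can be replicated in a few lines as a check on the hand calculations. This is an illustrative sketch, not part of the original notes; the fpc is ignored, as in the text, because the sampling fractions are essentially zero.

```python
import math

# Stratum volumes (litres) and sampled readings (mg/L) from the example.
N = {1: 7.5e8, 2: 2.5e7}
obs = {1: [40.4, 37.2, 46.6, 45.3, 38.1],
       2: [403, 365, 344, 388, 347]}

tau_hat, var_tau = {}, 0.0
for h in (1, 2):
    n = len(obs[h])
    ybar = sum(obs[h]) / n
    s2 = sum((y - ybar) ** 2 for y in obs[h]) / (n - 1)
    se_mean = math.sqrt(s2 / n)        # fpc ignored: n_h/N_h is ~0
    tau_hat[h] = N[h] * ybar           # stratum total (mg)
    var_tau += (N[h] * se_mean) ** 2   # stratum total variances add

grand_total = sum(tau_hat.values())    # ~403.7e8 mg
se_total = math.sqrt(var_tau)          # ~14.5e8 mg (text rounds to 14.46e8)
N_tot = sum(N.values())                # 7.75e8 L
print(round(grand_total / N_tot, 2), round(se_total / N_tot, 2))  # 52.1 1.87
```

The small difference from the table's 14.46 × 10⁸ arises only because the text carries the rounded se of 1.89 into the stratum-1 total.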


The standard error in the estimator of the overall average is markedly reduced in this example by the stratification. The standard error for the stratified estimator was just estimated to be around 2, for a sample of total size 10. By contrast, for an estimator based on a simple random sample of the same size, the standard error can be shown to be about 20. [This involves methods not covered in this class.] Stratification has reduced the standard error by an order of magnitude.

It is also possible that we could reduce the standard error even further, without increasing our sampling effort, by allocating this effort more efficiently. Perhaps we should take fewer water samples from the region far from the outlet, and take more from the other stratum. This will be covered later in this course.

One can also read in more comprehensive accounts how to construct estimates from samples that are stratified after the sample is selected. This is known as post-stratification. These methods are useful if, e.g., you are sampling a population with a known sex ratio. If you observe that your sample is biased in favor of one sex, you can use this information to build an improved estimate of the quantity of interest by stratifying the sample by sex after it is collected. It is not necessary that you start out with a plan for sampling some specified number of individuals from each sex (stratum).

Nonetheless, in any survey work, it is crucial that you begin with a plan. There are many examples of surveys that produced virtually useless results because the researchers failed to develop an appropriate plan. This should include a statement of your main objective, and detailed descriptions of how you plan to generate the sample, collect the data, enter them into a computer file, and analyze the results. The plan should contain discussion of how you propose to check for and correct errors at each stage.
It should be tested with a pilot survey, and modified accordingly. Major, ongoing surveys should be reassessed continually for possible improvements. There is no reason to expect that the survey design will be perfect the first time that it is tried, nor that all flaws will be discovered in the first round. On the other hand, one should expect that after many years' experience, the researchers will have honed the survey into a solid instrument. George Gallup's early surveys were seriously biased. Although it took over a decade for the flaws to come to light, once they did, he corrected his survey design promptly, and continued to build a strong reputation.

One should also be cautious in implementing stratified survey designs for long-term studies. An efficient stratification of the Fraser Delta in 1994, e.g., might be hopelessly out of date 50 years from now, with a substantially altered configuration of channels and islands. You should anticipate the need to revise your stratification periodically.

3.7.5 Example - estimating the total catch of salmon

DFO needs to monitor the catch of sockeye salmon as the season progresses so that stocks are not overfished. The season in one statistical sub-area in a year was a total of 2 days (!) and 250 vessels participated in the fishery on these 2 days. A census of the catch of each vessel at the end of each day is logistically difficult. In this particular year, observers were randomly placed on selected vessels, and at the end of each day the observers contacted DFO managers with a count of the number of sockeye caught on that day. Here is the raw data - each line corresponds to the observer's count for that vessel on that day. On the second day, a new random sample of vessels was selected. On both days, 250 vessels participated in the fishery.


Date        Sockeye
29-Jul-98   337
29-Jul-98   730
29-Jul-98   458
29-Jul-98   98
29-Jul-98   82
29-Jul-98   28
29-Jul-98   544
29-Jul-98   415
29-Jul-98   285
29-Jul-98   235
29-Jul-98   571
29-Jul-98   225
29-Jul-98   19
29-Jul-98   623
29-Jul-98   180
30-Jul-98   97
30-Jul-98   311
30-Jul-98   45
30-Jul-98   58
30-Jul-98   33
30-Jul-98   200
30-Jul-98   389
30-Jul-98   330
30-Jul-98   225
30-Jul-98   182
30-Jul-98   270
30-Jul-98   138
30-Jul-98   86
30-Jul-98   496
30-Jul-98   215

What is the population of interest?

The population of interest is the set of vessels participating in the fishery on the two days. [The fact that each vessel likely participated on both days is not really relevant.] The population of interest is NOT the salmon captured - the catch is the response variable for each boat, and it is its total that is of interest.

What is the sampling frame?

It is not clear how the list of fishing boats was generated. It seems unlikely that an aerial survey actually had a picture of the boats on the water from which DFO selected some boats. More likely, the observers were taken onto the water in some systematic fashion, and then each observer selected a boat at random from those seen at that point. Hence the sampling frame is the set of locations chosen to drop off the observers and the set of boats visible from these points.


What is the sampling design?

The sampling unit is a boat on a day. The strata are the two days. On each day, a random sample was selected from the boats participating in the fishery. This is a stratified design with a simple random sample selected each day. Note that in this survey it is logistically impossible to take a simple random sample over both days, as the number of vessels participating really isn't known for any day until the fishery starts. Here, stratification takes the form of administrative convenience.

Excel analysis

A copy of an Excel spreadsheet is available in the sockeye tab of the AllofData workbook available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A summary of the page appears below:


14:43 Saturday, July 4, 2015

Number of sockeye caught - example of stratified simple random sampling
raw data from the survey

Obs   date     sockeye   sampweight
1     29-Jul   337       16.6667
2     29-Jul   730       16.6667
3     29-Jul   458       16.6667
4     29-Jul   98        16.6667
5     29-Jul   82        16.6667
6     29-Jul   28        16.6667
7     29-Jul   544       16.6667
8     29-Jul   415       16.6667
9     29-Jul   285       16.6667
10    29-Jul   235       16.6667
11    29-Jul   571       16.6667
12    29-Jul   225       16.6667
13    29-Jul   19        16.6667
14    29-Jul   623       16.6667
15    29-Jul   180       16.6667
16    30-Jul   97        16.6667
17    30-Jul   311       16.6667
18    30-Jul   45        16.6667
19    30-Jul   58        16.6667
20    30-Jul   33        16.6667
21    30-Jul   200       16.6667
22    30-Jul   389       16.6667
23    30-Jul   330       16.6667
24    30-Jul   225       16.6667
25    30-Jul   182       16.6667
26    30-Jul   270       16.6667
27    30-Jul   138       16.6667
28    30-Jul   86        16.6667
29    30-Jul   496       16.6667
30    30-Jul   215       16.6667
The data are listed on the spreadsheet on the left.

Summary statistics

The Excel built-in functions are used to compute the summary statistics (sample size, sample mean, and sample standard deviation) for each stratum. Some caution needs to be exercised that the range of each function covers only the data for that stratum.⁴

⁴ If you are proficient with Excel, Pivot-Tables are an ideal way to compute the summary statistics for each stratum. An application of Pivot-Tables is demonstrated in the analysis of a cluster sample where the cluster totals are needed for the summary statistics.


You will also need to specify the stratum size (the total number of sampling units in each stratum), i.e. 250 vessels on each day.

Find estimates of the mean catch for each stratum

Because the sampling design in each stratum is a simple random sample, the same formulae as in the previous section can be used. The mean and its estimated se for each day of the opening are reported in the spreadsheet.

Find the estimates of the total catch for each stratum

The estimated total catch is found by multiplying the average catch per boat by the total number of boats participating in the fishery. The estimated standard error of the total for that day is found by multiplying the standard error of the mean by the stratum size, as in the previous section.

For example, in the first stratum (29 July), the estimated total catch is found by multiplying the estimated mean catch per boat (322) by the number of boats participating (250) to give an estimated total catch of 80,500 salmon for the day. The se for the total catch is found by multiplying the se of the mean (57) by the number of boats participating (250) to give an se of the total catch for the day of about 14,200 salmon.

Find estimate of grand total

Once an estimated total is found for each stratum, the estimated grand total is found by summing the individual stratum estimated totals. The estimated standard error of the grand total is the square root of the sum of the squares of the standard errors in each stratum - the Excel function sumsq is useful for this computation.

Estimates of the overall grand mean

This was not done in the spreadsheet, but is easily computed by dividing the total catch by the total number of boat-days in the fishery (250+250=500). The se is found by dividing the se of the total catch also by 500. Note this is interpreted as the mean number of fish captured per day per boat.
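The spreadsheet computations just described can be replicated in code. This is an illustrative sketch, not the author's workbook; it reproduces the stratified totals from the raw sockeye counts, including the finite population correction with N_h = 250 boats per day.

```python
import math

day1 = [337, 730, 458, 98, 82, 28, 544, 415, 285, 235,
        571, 225, 19, 623, 180]
day2 = [97, 311, 45, 58, 33, 200, 389, 330, 225, 182,
        270, 138, 86, 496, 215]
N_h = 250                                         # boats fishing each day

grand_total, var_grand = 0.0, 0.0
for ys in (day1, day2):
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    se_mean = math.sqrt(s2 / n * (1 - n / N_h))   # includes the fpc
    grand_total += N_h * ybar                     # daily total catch
    var_grand += (N_h * se_mean) ** 2             # the "sumsq" step

se_grand = math.sqrt(var_grand)
print(round(grand_total), round(se_grand))        # 131750 16541
mean_per_boat_day = grand_total / 500             # 263.5 fish per boat-day
```

The printed values match the Excel and SAS results quoted below (grand total 131,750 with se 16,541).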

SAS analysis

As noted earlier, some care must be used when standard statistical packages are used to analyze survey data, as many packages ignore the design used to select the data.

A sample SAS program for the analysis of the sockeye example called sockeye.sas and its output called sockeye.lst are available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The program starts with reading in the raw data and the computation of the sampling weights.

data sockeye;  /* read in the data */
   infile 'sockeye.csv' dlm=',' dsd missover firstobs=2;


   length date $8.;
   input date $ sockeye;
   /* compute the sampling weight. In general, these will be
      different for each stratum */
   if date = '29-Jul' then sampweight = 250/15;
   if date = '30-Jul' then sampweight = 250/15;

Because the population size and sample size are the same for each stratum, the sampling weights are common to all boats. In general, this is not true, and a separate sampling weight computation is required for each stratum. A separate file is also constructed with the population sizes for each stratum so that estimates of the population total can be constructed.
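The role of the sampling weight 250/15 = 16.6667 can be illustrated directly: each sampled boat "stands in" for 16.67 boats, so multiplying each observed catch by its weight and summing reproduces the expansion estimate of each day's total. This sketch is mine, not the SAS program.

```python
# Each sampled boat carries weight N_h / n_h = 250/15, so
# sum(w * y) = (N_h/n_h) * sum(y) = N_h * ybar, the expansion estimate.
day1 = [337, 730, 458, 98, 82, 28, 544, 415, 285, 235,
        571, 225, 19, 623, 180]
day2 = [97, 311, 45, 58, 33, 200, 389, 330, 225, 182,
        270, 138, 86, 496, 215]

w = 250 / 15                                  # sampling weight per boat
totals = [sum(w * y for y in day) for day in (day1, day2)]
print([round(t) for t in totals], round(sum(totals)))  # [80500, 51250] 131750
```

This is why the WEIGHT statement alone is enough for SAS to produce design-consistent estimates of the totals; had the two strata been sampled at different rates, each stratum would get its own weight.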

data n_boats;  /* you need to specify the stratum sizes if you want stratum totals */
   length date $8.;
   date = '29-Jul'; _total_=250; output;  /* the stratum sizes must be in variable _total_ */
   date = '30-Jul'; _total_=250; output;
run;

Proc SurveyMeans then uses the STRATA statement to identify that this is a stratified design. The default design and analysis within each stratum is again a simple random sample. It is not necessary that a simple random sample be done in each stratum, nor that the same design be used in each stratum - in these cases the analysis will be more complex.

proc surveymeans data=sockeye
      N=n_boats   /* dataset with the stratum population sizes present */
      mean clm    /* average catch/boat along with standard error */
      sum clsum;  /* request estimates of total */
   strata date / list;   /* identify the stratum variable */
   var sockeye;          /* which variable to get estimates for */
   weight sampweight;
   ods output stratainfo=stratainfo;
   ods output statistics=sockeyeresults;
run;

The SAS output consists of two parts. First is information about each stratum:

Stratum Index   date     Number of Observations in a Stratum   Population Total   Sampling Rate   Variable Name   N
1               29-Jul   15                                    250                6.00%           sockeye         15
2               30-Jul   15                                    250                6.00%           sockeye         15

Then come the actual results:

Variable Name   Mean    SE Mean   LCL Mean   UCL Mean   Sum      SE Sum   LCL Sum   UCL Sum
sockeye         263.5   33.1      195.7      331.3      131750   16541    97867     165633

which match the results from Excel (as they must). The only thing of “interest” is to note that, by default, SAS labels the precision of the estimated grand mean as a standard error while it labels the precision of the estimated total as a standard deviation! Both are correct - a standard error is a standard deviation, not of individual units in the population, but of the estimates over repeated sampling from the same population. I think it is clearer to label both as standard errors to avoid any confusion, as I did using the label statements in the Proc Print that created the output (see code for details).

If separate analyses are wanted for each stratum, the SurveyMeans procedure has to be run a second time with a BY statement to estimate the means and totals in each stratum.

proc surveymeans data=sockeye
      N=n_boats   /* dataset with the stratum population sizes present */
      mean clm    /* average catch/boat along with standard error */
      sum clsum;  /* request estimates of total */
   by date;
   var sockeye;          /* which variable to get estimates for */
   weight sampweight;
   ods output stratainfo=stratainfosep;
   ods output statistics=sockeyeresultssep;
run;

This gives separate results for each stratum:

date     Variable Name   Mean    SE Mean   LCL Mean   UCL Mean   Sum     SE Sum   LCL Sum   UCL Sum
29-Jul   sockeye         322.0   56.8      200.2      443.8      80500   14195    50055     110945
30-Jul   sockeye         205.0   34.0      132.1      277.9      51250   8492     33036     69464

Again, it is likely easiest to do planning for future experiments in an Excel spreadsheet rather than using SAS.

When should the various estimates be used?

In a stratified sample, there are many estimates obtained, each with a different standard error. It can sometimes be confusing as to which estimate should be used for which purpose. Here is a brief review of the four possible estimates and the level of interest in each.


Parameter       Estimator             se
Stratum mean    µ̂_h = ȳ_h            sqrt( (s_h²/n_h)(1 − f_h) )
Stratum total   τ̂_h = N_h ȳ_h        N_h sqrt( (s_h²/n_h)(1 − f_h) )
Grand total     τ̂ = τ̂_1 + τ̂_2      sqrt( se(τ̂_1)² + se(τ̂_2)² )
Grand mean      µ̂ = τ̂ / N           se(τ̂) / N

• Stratum mean. Stratum 1 estimate is 322 with se 56.8. The estimated average catch per boat was 322 fish (se 56.8 fish) on 29 July. Of interest to a fisher who wishes to fish ONLY the first day of the season and wants to know if it will meet expenses.

• Stratum total. Stratum 1 estimate is 80,500 = 250 × 322 with se 14,195 = 250 × 56.8. The estimated total catch over all boats on 29 July was 80,500 (se 14,195). Of interest to DFO, which wishes to estimate the TOTAL catch over ALL boats on this single day so that the quota for the next day can be set.

• Grand total. Estimate is 131,750 = 80,500 + 51,250 with se sqrt(14195² + 8492²) = 16,541. The estimated total catch over all boats over all days is 132,000 fish (se 17,000 fish). Of interest to DFO, which wishes to know the total catch over the entire fishing season so that impacts on the stock can be examined.

• Grand mean. N = 500 vessel-days; estimate is 131,750/500 = 263.5 with se 16,541/500 = 33.0. The estimated catch per boat per day over the entire season was 263 fish (se 33 fish). Of interest to a fisher who wants to know the average catch per boat per day for the entire season to see if it will meet expenses.

3.7.6 Sample Size for Stratified Designs

As before, the question arises of how many units should be selected in a stratified design. Two questions need to be answered. First, what total sample size is required? Second, how should this total be allocated among the strata?

The total sample size can be determined using the same methods as for a simple random sample. I would suggest that you initially ignore the fact that the design will be stratified when finding the initial required total sample size. If stratification proves to be useful, then your final estimate will be more precise than you anticipated (always a nice thing to happen!), but seeing as you are making guesses as to the standard deviations and the precision required, I wouldn't worry too much about the extra cost in sampling. If you must, it is possible to derive formulae for the overall sample size when accounting for stratification, but these are relatively complex. It is likely easier to build a general spreadsheet where a single cell holds the total sample size and all other cells depend upon this quantity according to the allocation used. Then the total sample size can be manipulated to obtain the desired precision. The following information will be required:

• The sizes (or relative sizes) of each stratum (i.e. the N_h or W_h).

• The standard deviation of measurements in each stratum. This can be obtained from past surveys, a literature search, or expert opinion.

• The desired precision – overall – and, if needed, for each stratum.

Again refer to the sockeye worksheet.
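The allocation rules described earlier (equal, proportional, Neyman) can be sketched as a small planning helper. The function and its argument names are mine, not from the notes; the standard deviation guesses in the example are roughly those of the two sockeye strata. Fractional sample sizes are fine for planning purposes.

```python
def allocate(n_total, N_h, S_h=None, method="equal"):
    """Split a total sample size among strata (a planning sketch).

    N_h: stratum sizes; S_h: guessed stratum standard deviations
    (needed only for Neyman allocation).
    """
    H = len(N_h)
    if method == "equal":
        return [n_total / H] * H
    if method == "proportional":           # n_h proportional to N_h
        tot = sum(N_h)
        return [n_total * N / tot for N in N_h]
    if method == "neyman":                 # n_h proportional to N_h * S_h
        tot = sum(N * S for N, S in zip(N_h, S_h))
        return [n_total * N * S / tot for N, S in zip(N_h, S_h)]
    raise ValueError(method)

# e.g. two strata of 250 boats each, with sd guesses of 227 and 136:
print(allocate(75, [250, 250]))                        # [37.5, 37.5]
print(allocate(75, [250, 250], [227, 136], "neyman"))  # more effort to day 1
```

With equal stratum sizes, Neyman allocation reduces to allocating in proportion to the guessed standard deviations, which is why the more variable first day receives the larger share.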


Number of sockeye caught - example of stratified simple random sampling: raw data from the survey

Obs  date    sockeye  sampweight
  1  29-Jul      337     16.6667
  2  29-Jul      730     16.6667
  3  29-Jul      458     16.6667
  4  29-Jul       98     16.6667
  5  29-Jul       82     16.6667
  6  29-Jul       28     16.6667
  7  29-Jul      544     16.6667
  8  29-Jul      415     16.6667
  9  29-Jul      285     16.6667
 10  29-Jul      235     16.6667
 11  29-Jul      571     16.6667
 12  29-Jul      225     16.6667
 13  29-Jul       19     16.6667
 14  29-Jul      623     16.6667
 15  29-Jul      180     16.6667
 16  30-Jul       97     16.6667
 17  30-Jul      311     16.6667
 18  30-Jul       45     16.6667
 19  30-Jul       58     16.6667
 20  30-Jul       33     16.6667
 21  30-Jul      200     16.6667
 22  30-Jul      389     16.6667
 23  30-Jul      330     16.6667
 24  30-Jul      225     16.6667
 25  30-Jul      182     16.6667
 26  30-Jul      270     16.6667
 27  30-Jul      138     16.6667
 28  30-Jul       86     16.6667
 29  30-Jul      496     16.6667
 30  30-Jul      215     16.6667

The standard deviations from this survey will be used as 'guesses' for what might happen next year. As in this year's survey, the total sample size will be allocated evenly between the two days. You will see several methods for allocating a total sample size among strata in a later section, but for now assume that the total sample is allocated equally among both strata. Hence the proposed sample size of 75 is split in half to give a proposed sample size of 37.5 in each stratum. Don't worry about the fractional sample size - this is only a planning exercise. We create one cell that has the total sample size, and then use formulae to allocate the total sample size equally to the


two strata. The total and the se of the overall total are found as before, and the relative precision (denoted the relative standard error (rse) and, unfortunately, in some books the coefficient of variation cv) is found as the estimated standard error / estimated total. Again, this portion of the spreadsheet is set up so that changes in the total sample size are propagated throughout the sheet. If you change the total sample size from 75 to some other number, this is automatically split among the two strata, which then affects the estimated standard error for each stratum, which then affects the estimated standard error for the total, which then affects the relative standard error. The proposed total sample size can be varied using trial and error, or the Excel Goal-Seek option can be used. Here is what happens when a sample size of 75 is used. Don't be alarmed by the fractional sample sizes in each stratum - the goal is again to get a rough feel for the required effort for a certain precision.

Total n=75

Stratum      n  Mean  std dev  vessels  Est total  se Est total
29-Jul    37.5   322    226.8      250     80,500         8,537
30-Jul    37.5   205    135.7      250     51,250         5,107
Total                                     131,750         9,948
rse                                                        7.6%

A sample size of 75 is too small. Try increasing the sample size until the rse is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to find the sample size that gives a relative standard error of 5% or less, as shown below:

Total n=145

Stratum      n  Mean  std dev  vessels  Est total  se Est total
29-Jul    72.5   322    226.8      250     80,500         5,611
30-Jul    72.5   205    135.7      250     51,250         3,357
Total                                     131,750         6,539
rse                                                        5.0%
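The Goal-Seek step can be mimicked with a simple search (an illustrative Python sketch, using the stratum figures from the tables above):

```python
import math

# Walk the total sample size up until the relative standard error (rse)
# of the estimated overall total drops to 5%, mimicking Excel's Goal Seek.
def rse_for_n(n_total):
    strata = [(250, 322, 226.8), (250, 205, 135.7)]  # (N_h, mean, sd)
    est, var = 0.0, 0.0
    for N_h, mean_h, sd_h in strata:
        n_h = n_total / 2                            # equal split
        est += N_h * mean_h
        var += N_h**2 * (1 - n_h / N_h) * sd_h**2 / n_h
    return math.sqrt(var) / est

n = 2
while rse_for_n(n) > 0.05:
    n += 1
print(n)   # 144 with integer steps; Goal Seek on the worksheet lands on 145
```

An integer search stops at 144 vessel-days (rse just under 5%); the worksheet's Goal Seek, working with the fractional split of 72.5 per stratum, reports 145 with an rse of 5.0% - the same answer for planning purposes.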

3.7.7 Allocating samples among strata

There are a number of ways of allocating a sample of size n among the various strata. For example:

1. Equal allocation. Under an equal allocation scheme, all strata get the same sample size, i.e. nh = n/H. This allocation is best if the variances of the strata are roughly equal, equally precise estimates are required for each stratum, and you wish to test for differences in means among strata (i.e. an analytical survey as discussed in previous sections).

2. Proportional allocation. Under proportional allocation, sample sizes are allocated proportional to the number of sampling units in the strata, i.e. ni = n × Ni/N = n × Ni/(N1 + N2 + ··· + NH) = n × Wi. This allocation is simple to plan and intuitively appealing. However, it is not the best design. It may waste effort, because large strata get large sample sizes but precision is determined by the sample size, not by the ratio of sample size to population size. For example, if one stratum is 10 times larger than any other stratum, it is not necessary to allocate 10 times the sampling effort to get the same precision in that stratum.

3. Neyman allocation. In Neyman allocation (named after the statistician Neyman), the sample is allocated to minimize the overall standard error for a given total sample size. Tedious algebra shows that the sample should be allocated proportional to the product of the stratum size and the stratum standard deviation, i.e. ni = n × WiSi/Σ(WhSh) = n × NiSi/Σ(NhSh) = n × NiSi/(N1S1 + N2S2 + ··· + NHSH). Intuitively, the strata that have the most sampling units should be weighted larger, and strata with larger standard deviations must have more samples allocated to them to get the se of the sample mean within the stratum down to a reasonable level. A key assumption of this allocation is that the cost to sample a unit is the same in all strata.

4. Optimal allocation when costs are involved. In some cases, the costs of sampling differ among the strata. Suppose that it costs Ci to sample each unit in stratum i. Then the total cost of the survey is C = Σ(nhCh). The allocation rule is that sample sizes should be proportional to the product of the stratum size, the stratum standard deviation, and the inverse of the square root of the cost of sampling, i.e. ni = n × (WiSi/√Ci)/Σ(WhSh/√Ch) = n × (NiSi/√Ci)/(N1S1/√C1 + N2S2/√C2 + ··· + NHSH/√CH). This implies that large samples are found in strata that are larger, more variable, or cheaper to sample.
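The four rules can be sketched in a few lines of Python (illustrative; the function name and interface are my own, and the stratum sizes and standard deviations are the caribou figures used in the next example):

```python
# Sketch of the allocation rules. N: stratum sizes; S: stratum standard
# deviations; C: per-unit sampling costs. Supplying more information moves
# from equal, to Neyman, to cost-optimal allocation.
def allocate(n, N, S=None, C=None):
    H = len(N)
    if S is None:
        return [n / H] * H                              # equal allocation
    if C is None:
        w = [Nh * Sh for Nh, Sh in zip(N, S)]           # Neyman: N_h * S_h
    else:
        w = [Nh * Sh / Ch**0.5
             for Nh, Sh, Ch in zip(N, S, C)]            # cost-optimal
    return [n * wh / sum(w) for wh in w]

N = [400, 40, 100, 40, 70, 120]                 # caribou stratum sizes
S = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]     # previous-year sds

proportional = [211 * Nh / sum(N) for Nh in N]  # proportional allocation
print(round(proportional[0], 1))                # 109.6, as in the example
print(round(allocate(211, N, S)[2], 1))         # 92.9 for stratum 3 (Neyman)
```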

In practice, most of the gain in precision occurs in moving from equal to proportional allocation, while often only small improvements in precision are gained in moving from proportional to Neyman allocation. Similarly, unless cost differences are enormous, there isn't much improvement in precision in moving to an allocation based on costs.

Example - estimating the size of a caribou herd

This section is based on the paper:

Siniff, D.B. and Skoog, R.O. (1964). Aerial Censusing of Caribou Using Stratified Random Sampling. The Journal of Wildlife Management, 28, 391-401. http://dx.doi.org/10.2307/3798104

Some of the values have been modified slightly for illustration purposes.

The authors wished to estimate the size of a caribou herd. The density of caribou differs dramatically based on the habitat type. The survey area was divided into six strata based on habitat type. The survey design divides each stratum into 4 km2 quadrats, from which a random sample will be selected. The number of caribou in each selected quadrat will be counted from an aerial photograph.

The computations are available in the caribou tab in the Excel workbook ALLofData.xls, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The key point in examining different allocations is to make a single cell represent the total sample size and then make each stratum sample size a function of that total. The total sample size can then be varied until the desired precision is found.

Results from previous year's survey: Here are the summary statistics from the survey in a previous year:


Stratum   Nh (map-squares)   nh (sampled)      y      s   Est total   se(total)
1                      400             98   24.1   74.7        9640        2621
2                       40             10   25.6   63.7        1024         698
3                      100             37  267.6  589.5       26760        7693
4                       40              6  179.0  151.0        7160        2273
5                       70             39  293.7  351.5       20559        2622
6                      120             21   33.2   99.0        3984        2354
Total                  770            211                     69127        9172

The estimated size of the herd is 69,127 animals with an estimated se of 9,172 animals.

Equal allocation

What would happen if an equal allocation were used? We now split the 211 total sample size equally among the 6 strata. In this case, the sample sizes are 'fractional', but this is OK as we are interested only in planning to see what would have happened. Notice that the estimate of the overall population would NOT change, but the se changes.

Stratum    Nh     nh      y      s   Est total   se(total)
1         400   35.2   24.1   74.7        9640        4810
2          40   35.2   25.6   63.7        1024         149
3         100   35.2  267.6  589.5       26760        8005
4          40   35.2  179.0  151.0        7160         354
5          70   35.2  293.7  351.5       20559        2927
6         120   35.2   33.2   99.0        3984        1684
Total     770    211                     69127        9938

An equal allocation gives rise to worse precision than the original survey. Examining the table in more detail, you see that an equal allocation puts far too many samples in strata 2 and 4 and not enough in strata 1 and 3.

Proportional allocation

What about proportional allocation? Now the sample size is proportional to the stratum population sizes. For example, the sample size for stratum 1 is found as 211 × 400/770. The following results are obtained:

Stratum    Nh      nh      y      s   Est total   se(total)
1         400   109.6   24.1   74.7        9640        2431
2          40    11.0   25.6   63.7        1024         656
3         100    27.4  267.6  589.5       26760        9596
4          40    11.0  179.0  151.0        7160        1554
5          70    19.2  293.7  351.5       20559        4787
6         120    32.9   33.2   99.0        3984        1765
Total     770     211                     69127       11263

This has an even worse standard error! It looks like not enough samples are placed in stratum 3 or 5.

Optimal allocation

What if both the stratum sizes and the stratum variances are used in allocating the sample? We create a new column (at the extreme right) equal to NhSh. Now the sample sizes are proportional to these values, i.e. the sample size for the first stratum is found as 211 × 29866.4/133893.8. Again the estimate of the total doesn't change, but the se is reduced.

Stratum    Nh     nh      y      s   Est total   se(total)      NhSh
1         400   47.1   24.1   74.7        9640        4089   29866.4
2          40    4.0   25.6   63.7        1024        1206    2550.0
3         100   92.9  267.6  589.5       26760        1629   58953.9
4          40    9.5  179.0  151.0        7160        1709    6039.6
5          70   38.8  293.7  351.5       20559        2639   24607.6
6         120   18.7   33.2   99.0        3984        2522   11876.4
Total     770    211                     69127        6089  133893.8
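The se(total) column can be checked against the standard stratified-SRSWOR formula, Var(total) = Σ Nh²(1 - nh/Nh) sh²/nh. A quick Python verification (illustrative; the inputs are the rounded values from the table, so the result is approximate):

```python
import math

# Check the overall se in the optimal-allocation table from the standard
# stratified variance formula (rounded inputs, so the result is approximate).
strata = [  # (N_h, n_h, s_h)
    (400, 47.1,  74.7), (40,  4.0,  63.7), (100, 92.9, 589.5),
    (40,  9.5, 151.0), (70, 38.8, 351.5), (120, 18.7,  99.0)]

var_total = sum(N**2 * (1 - n / N) * s**2 / n for N, n, s in strata)
print(round(math.sqrt(var_total)))   # about 6090; the table shows 6089
```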

3.7.8 Example: Estimating the number of tundra swans

The Tundra Swan Cygnus columbianus, formerly known as the Whistling Swan, is a large bird with white plumage and black legs, feet, and beak.[5] The USFWS is responsible for conserving and protecting tundra swans as a migratory bird under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation Act of 1980. As part of these responsibilities, it conducts regular aerial surveys at one of their prime breeding areas in Bristol Bay, Alaska. The Bristol Bay population of tundra swans is of particular interest because suitable habitat for nesting is available earlier than in most other nesting areas. This example is based on one such survey.[6]

Tundra swans are highly visible on their nesting grounds, making them easy to monitor during aerial surveys. The Bristol Bay refuge has been divided into 186 survey units, each being a quarter section. These survey units have been divided into three strata based on density, and previous years' data provide the following information about the strata:

Density      Total          Past      Past
Stratum      Survey Units   Density   Std Dev
High                   60        20        10
Medium                 68        10         6
Low                    58         2         3
Total                 186

Based on past years' results and budget considerations, approximately 30 survey units can be sampled. The three strata all have approximately the same total area (number of survey units), so allocations based on stratum area would be approximately equal across strata. However, that would place about 1/3 of the effort into the low density stratum, which typically has fewer birds.

[5] Additional information about the tundra swan is available at http://www.hww.ca/hww2.asp?id=78&cid=7
[6] Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska Peninsula, June 2002.


It is felt that stratum density is a suitable measure of stratum importance (notice the close relationship between stratum density and stratum standard deviation, which is often found in biological surveys). Consequently, an allocation based on stratum density was used. The sum of the density values is 20 + 10 + 2 = 32. A proportional allocation would then place about 30 × 20/32 = 18 units in the high density stratum; about 30 × 10/32 = 9 units in the medium density stratum; and the remainder (3 units) in the low density stratum.

The survey was conducted with the following results:

Survey                  Area   Swans in   Single            All
Unit        Stratum    (km2)     flocks    Birds   Pairs   birds
dilai2      h            148                  12       6      24
naka41      h            137                  13      15      43
naka43      h            137                   6      16      38
naka51      h             16          3        2              17
nakb32      h            137                  10      10      30
nakb44      h            135          6       18      12      48
nakc42      h             83          4        5       6      21
nakc44      h            109                  17      15      47
nakd33      h            134                  11      11      33
ugac34      h             65                   2      10      22
ugac44      h            138                  28      15      58
ugad5/63    h            159                   9      20      49
dugad56/4   m            102                   7       4      15
guad43      m            137                   6       4      14
ugad42      m            137                  11      15      46
low1        l            143                   2               2
low3        l            138                   1               1

The first thing to notice from the table above is that not all survey units could be surveyed because of poor weather. As always with missing data, it is important to determine if the data are Missing Completely at Random (MCAR). In this case, it seems reasonable that the swans did not adjust their behavior knowing that certain survey units would be sampled on the poor weather days, so there is no impact of the missing data other than a loss of precision compared to a survey with the full 30 survey units. Also notice that "blanks" in the table (missing values) represent zeros and not really missing data.

Finally, not all of the survey units are the same area. This could introduce additional variation into our data, which may affect our final standard errors. Even though the survey units are of different areas, the survey units were chosen as a simple random sample, so ignoring the area will NOT introduce bias into the estimates (why?). You will see in later sections how to compute a ratio estimator, which could take the area of each survey unit into account and potentially lead to more precise estimates.

The data are read into SAS in the usual fashion with the code fragment:

data swans;
   infile 'tundra.csv' dlm=',' dsd missover firstobs=2;
   length survey_unit $10 stratum $1;
   input survey_unit $ stratum $ area num_flocks num_single num_pairs;
   num_swans = num_flocks + num_single + 2*num_pairs;

The total number of survey units in each stratum is also read into SAS using the code fragment:

data total_survey_units;
   length stratum $1.;
   input stratum $ _total_;  /* must use _total_ as variable name */
   datalines;
h 60
m 68
l 58
;;;;

Notice that the variable that has the number of stratum units must be called _total_, as required by the SurveyMeans procedure. Next the data are sorted by stratum (not shown), and the number of survey units actually surveyed in each stratum is found using Proc Means:

proc means data=swans noprint;
   by stratum;
   var num_swans;
   output out=n_units n=n;
run;

Most survey procedures in SAS require the use of sampling weights. These are the reciprocal of the probability of selection. In this case, this is simply the number of units in the stratum divided by the number sampled in that stratum:

data swans;
   merge swans total_survey_units n_units;
   by stratum;
   sampling_weight = _total_ / n;
run;

Now the individual stratum estimates are obtained using the code fragment:

/* first estimate the numbers in each stratum */
proc surveymeans data=swans
      total=total_survey_units  /* inflation factors */
      sum clsum mean clm;
   by stratum;                  /* separate estimates by stratum */
   var num_swans;
   weight sampling_weight;
   ods output statistics=tundraresultssep;
   ods output stratinfo =tundrastratainfo;
run;

This gives the output:

stratum  Variable Name  Mean  SE Mean  LCL Mean  UCL Mean   Sum  SE Sum  LCL Sum  UCL Sum
h        num_swans      35.8      3.4      28.3      43.4  2150     206     1697     2603
l        num_swans       1.5      0.5      -4.7       7.7    87      28     -275      449
m        num_swans      25.0     10.3     -19.2      69.2  1700     698    -1305     4705

The estimates in the L and M strata are not very precise because of the small number of survey units selected. SAS has incorporated the finite population correction factor when estimating the se for the individual stratum estimates. We estimate about 2150 swans in the H stratum, 1700 in the M stratum, and fewer than 100 in the L stratum. The grand total is found by adding the estimated totals from the strata, 2150 + 87 + 1700 = 3937, and the standard error of the grand total is found in the usual way as sqrt(206² + 28² + 698²) = 729.

Proc SurveyMeans can be used to estimate the grand total over all strata using the code fragment:

/* now to estimate the grand total */
proc surveymeans data=swans
      total=total_survey_units  /* inflation factors for each stratum */
      sum clsum mean clm;       /* want to estimate grand totals */
   title2 'Estimate total number of swans';
   strata stratum /list;        /* which variable defines the strata */
   var num_swans;               /* which variable to analyze */
   weight sampling_weight;      /* sampling weight for each obs */
   ods output statistics=tundraresults;
run;

This gives the output:

Variable Name  Mean  SE Mean  LCL Mean  UCL Mean   Sum  SE Sum  LCL Sum  UCL Sum
num_swans      21.2      3.9      12.8      29.6  3937     729     2374     5500

The standard error is larger than desired, mostly because of the very small sample size in the M stratum where only 3 of the 9 proposed survey units could be surveyed.
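The hand roll-up described above (stratum totals add, and squared standard errors add because the strata are sampled independently) is easy to check; an illustrative Python sketch:

```python
import math

# Combine the per-stratum estimates: totals add, and variances (se^2) add
# because the strata are sampled independently.
stratum_estimates = {'h': (2150, 206), 'l': (87, 28), 'm': (1700, 698)}

grand_total = sum(t for t, se in stratum_estimates.values())
grand_se = math.sqrt(sum(se**2 for t, se in stratum_estimates.values()))
print(grand_total, round(grand_se))  # 3937 and 728 (729 with unrounded ses)
```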


3.7.9 Post-stratification

In some cases, it is inconvenient or impossible to stratify the population elements into strata before sampling because the value of the variable used for stratification is only available after the unit is sampled. For example:

- we wish to stratify a sample of baby births by birth weight to estimate the proportion of birth defects;
- we wish to stratify by family size when looking at day care costs;
- we wish to stratify by soil moisture, but this can only be measured when the plot is actually visited.

We don't know the birth weight, the family size, or the soil moisture until after the data are collected. There is nothing formally wrong with post-stratification, and it can lead to substantial improvements in precision.

How would post-stratification work in practice? Suppose that 20 quadrats (each 1 m2) were sampled out of a 100 m2 survey area using a simple random sample, and the number of insect grubs counted in each quadrat. When the units were sampled, the soil was classified into high or low quality habitat for these grubs:

Grubs  Post-strat
   10  h
    2  l
    3  l
    8  h
    1  l
    3  l
   11  h
    2  l
    2  l
   11  h
   17  h
    1  l
    0  l
   11  h
   15  h
    2  l
    2  l
    4  l
    2  l
    1  l

The data are available in the post-stratify.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. They are imported into SAS in the usual way:

data grubs;
   infile 'post-stratify.csv' dlm=',' dsd missover firstobs=2;
   length post_stratum $1.;
   input grubs post_stratum;
run;

The first few lines of the raw data are shown below:

Obs  post_stratum  grubs
  1  h                10
  2  l                 2
  3  l                 3
  4  h                 8
  5  l                 1
  6  l                 3
  7  h                11
  8  l                 2
  9  l                 2
 10  h                11

If stratification is ignored, then the usual analysis using Proc SurveyMeans can be used after creating the appropriate survey weights (not shown, but see the code):

proc surveymeans data=grubs
      total=100  /* inflation factor */
      sum clsum mean clm;
   var grubs;
   weight sampling_weight_overall;
   ods output statistics=poststratifyresultssimple;
run;

which gives:

Variable Name  Mean  SE Mean  LCL Mean  UCL Mean  Sum  SE sum  LCL Sum  UCL Sum
grubs          5.40     1.05      3.21      7.59  540     105      321      759

The overall mean density is estimated to be 5.40 insects/m2 with a se of 1.17 insects/m2 (ignoring any fpc).[7] The estimated total number of insects over all 100 m2 of the study area is 100 × 5.40 = 540 insects with a se of 100 × 1.17 = 117 insects.[8]

[7] The se is 1.05 insects/m2 if the fpc is used.
[8] The se of the total is 105 insects if the fpc is used.


Now suppose we look at the summary statistics by the post-stratification variable using Proc Tabulate:

proc tabulate data=grubs;
   class post_stratum;
   var grubs;
   table post_stratum, grubs*(n*f=5.0 mean*f=5.2 std*f=5.2);
run;

This gives some simple summary statistics about each stratum:

post_stratum    N   Mean   Std
h               7  11.86  3.08
l              13   1.92  1.04

If the areas of the post-strata are known (and this is NOT always possible), you can use the standard roll-up for a stratified design. Suppose that there were 30 m2 of high quality habitat and 70 m2 of low quality habitat. Then the roll-up proceeds as before. The usual stratified analysis can then be done:

proc surveymeans data=grubs
      total=total_survey_units  /* inflation factors for each stratum */
      sum clsum mean clm;       /* want to estimate grand totals */
   strata post_stratum /list;   /* which variable defines the strata */
   var grubs;                   /* which variable to analyze */
   weight sampling_weight_post_strata;  /* sampling weight for each obs */
   ods output statistics=poststratifyresults;
run;

giving:

Variable Name  Mean  SE Mean  LCL Mean  UCL Mean  Sum  SE sum  LCL Sum  UCL Sum
grubs           4.9      0.4       4.2       5.7  490      36      416      565

Now the estimated total number of grubs is 490 with a se of 40 [9] - a substantial improvement over the non-stratified analysis. The difference in the estimates (i.e. 540 vs. 490) is well within the range of uncertainty summarized by the standard errors.

There are several potential problems when using post-stratification:

- The sample size in each post-stratum cannot be controlled. This implies that it is not possible to use any of the allocation methods discussed earlier to improve precision. As well, the survey may end up with a very small sample size in some strata.

- The reported se must be increased to account for the fact that the sample size in each stratum is no longer fixed. This introduces an additional source of variation for the estimate, i.e. estimates will vary from sample to sample not only because a new sample is drawn each time, but also because the sample size within a stratum will change. However, in practice this is rarely a problem because the actual increase in the se is usually small, and this additional adjustment is rarely ever done.

- In the above example, the area of each stratum in the ENTIRE study area could be found after the fact. But in some cases, it is impossible to find the area of each stratum in the entire study area, and so the roll-up could not be done. In these cases, you could use the results from the post-stratification to also estimate the area of each stratum, but now the expansion factor for each stratum also has a se, and this must also be taken into account. Please consult a standard book on sampling theory for details.

[9] The se is 36 if the fpc is used.
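The stratified roll-up used above can be reproduced by hand from the per-stratum summary statistics (an illustrative Python sketch; the areas of 30 m2 and 70 m2 are the ones assumed in the text):

```python
import math

# Post-stratified roll-up: each post-stratum is treated as its own SRSWOR.
# (N_h, n_h, mean, sd) for the high- and low-quality post-strata.
strata = [(30, 7, 11.86, 3.08),
          (70, 13, 1.92, 1.04)]

total = sum(N * mean for N, n, mean, sd in strata)
var = sum(N**2 * (1 - n / N) * sd**2 / n for N, n, mean, sd in strata)
print(round(total), round(math.sqrt(var)))   # 490 and 36 (with the fpc)
```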

3.7.10 Allocation and precision - revisited

A student wrote:

I'm a little confused about sample allocation in stratified sampling. Earlier in the course, you stated that precision is independent of population size, i.e. a sample of 1000 gave estimates that were equally precise for Canada and the US (assuming a simple random sample). Yet in stratified sampling, you also said that precision is improved by proportional allocation, where larger strata get larger sample sizes.

Both statements are correct. If you are interested in estimates for the individual populations, then absolute sample size is important. If you wanted equally precise estimates for BOTH Canada and the US, then you would take equal sample sizes from both populations, say 1000 from each, even though their overall population sizes differ by a factor of 10:1.

However, in stratified sampling designs, you may also be interested in the OVERALL estimate over both populations. In this case, a proportional allocation, where sample size is allocated proportional to population size, often performs better. Here, an overall sample of 2000 people would be allocated proportional to the population sizes as follows:

Stratum   Population     Fraction of total population   Sample size
US        300,000,000    91%                            91% x 2000 = 1818
Canada     30,000,000     9%                             9% x 2000 = 181
Total     330,000,000    100%                           2000

Why does this happen? If you are interested in the overall population, then the US results essentially drive everything and Canada has little effect on the overall estimate. Consequently, it doesn't matter that the Canadian estimate is not as precise as the US estimate.


3.8 Ratio estimation in SRS - improving precision with auxiliary information

An association between the measured variable of interest and a second variable can be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related to soil nitrogen content. A simple random sample of plots is selected, and the height of trees in each sample plot is measured along with the soil nitrogen content in the plot. A regression model is fit (Thompson, 1992, Chapters 7 and 8) between the two variables to account for some of the variation in tree height as a function of soil nitrogen content. This can be used to make precise predictions of the mean height in stands if the soil nitrogen content can be easily measured. This method will be successful if there is a direct relationship between the two variables, and the stronger the relationship, the better it will perform. This technique is often called ratio-estimation or regression-estimation. Note that multi-phase designs often use an auxiliary variable as well, but there the second variable is measured on only a subset of the sample units; this should not be confused with the ratio estimators of this section.

Ratio estimation has two purposes. First, in some cases you are interested in the ratio of two variables, e.g. what is the ratio of wolves to moose in a region of the province? Second, a strong relationship between two variables can be used to improve precision without increasing sampling effort. This is an alternative to stratification when you can measure two variables on each sampling unit.

We define the population ratio as R = τY/τX = µY/µX. Here Y is the variable of interest, and X is a secondary variable not really of interest. Note that notation differs among books - some books reverse the roles of X and Y.

Why is the ratio defined in this way? There are two common ratio estimators, traditionally called the mean-of-ratios and the ratio-of-means estimators. Suppose you had the following data for Y and X, which represent the counts of animals of species 1 and 2 taken on 3 different days:

Sample   1    2   3
Y       10  100  20
X        3   20   1

The mean-of-ratios estimator would compute the estimated ratio between Y and X as:

R_mean-of-ratios = (10/3 + 100/20 + 20/1) / 3 = 9.44

while the ratio-of-means would be computed as:

R_ratio-of-means = ((10 + 100 + 20)/3) / ((3 + 20 + 1)/3) = 130/24 = 5.42
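The two estimators for this small data set, in illustrative Python:

```python
# Mean-of-ratios vs. ratio-of-means for the three-day example above.
Y = [10, 100, 20]   # counts of species 1
X = [3, 20, 1]      # counts of species 2

mean_of_ratios = sum(y / x for y, x in zip(Y, X)) / len(Y)
ratio_of_means = sum(Y) / sum(X)
print(round(mean_of_ratios, 2), round(ratio_of_means, 2))   # 9.44 5.42
```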

Which is "better"? The mean-of-ratios estimator should be used when you wish to give equal weight to each pair of numbers regardless of the magnitude of the numbers. For example, you may have three plots of land, and you measure Y and X on each plot, but because of observer efficiencies that differ among plots, the raw numbers cannot be compared. For example, on a cloudy, rainy day it is hard to see animals (first case), but on a clear, sunny day it is easy to see animals (second case). The actual numbers themselves cannot be combined directly.

The ratio-of-means estimator (considered in this chapter) gives every value of Y and X equal weight. Here the fact that unit 2 has 10 times the number of animals as unit 1 is important, as we are interested in the ratio over the entire population of animals. Hence, by adding the values of Y and X first, each animal is given equal weight.

When is a ratio estimator better - what other information is needed? The higher the correlation between Xi and Yi, the better the ratio estimator performs compared to a simple expansion estimator. It turns out that the ratio estimator is the 'best' linear estimator if:

- the relation between Yi and Xi is linear through the origin;
- the variation around the regression line is proportional to the X value, i.e. the spread around the regression line increases as X increases, unlike an ordinary regression line where the spread is assumed to be constant along the line.

In practice, plot yi vs. xi from the sample and see what type of relation exists.

When can a ratio estimator be used? A ratio estimator requires that another variable (the X variable) be measured on the selected sampling units. Furthermore, if you are estimating the overall mean or total, the total value of the X variable over the entire population must also be known. For example, as seen in the examples to come, the total area must be known to estimate the total number of animals once the density (animals/ha) is known.

3.8.1 Summary of main results

Quantity   Population value      Sample estimate       se
Ratio      R = τY/τX = µY/µX     r = ȳ/x̄              sqrt( (1/µX²)(s²diff/n)(1-f) )
Total      τY = R τX             τ̂ratio = r τX         τX × sqrt( (1/µX²)(s²diff/n)(1-f) )
Mean       µY = R µX             µ̂Y,ratio = r µX       µX × sqrt( (1/µX²)(s²diff/n)(1-f) )

Notes: Don't be alarmed by the apparent complexity of the formulae above. They are relatively simple to implement in spreadsheets.

- The term s²diff = Σ(yi - r xi)² / (n - 1), summing over the n sampled units, is computed by creating a new column of the yi - r xi and finding the (sample standard deviation)² of this new derived variable. This will be illustrated in the examples.

- In some cases the µX² in the denominator may not be known, and its estimate x̄² can be used in its place. There doesn't seem to be any empirical evidence that either is better.

- The term τX²/µX² reduces to N².

- Confidence intervals: Confidence limits are found in the usual fashion. In general, the distribution of r is positively skewed, and so the upper bound is usually too small. This skewness is caused by the variation in the denominator of the ratio. For example, suppose that a random variable Z has a uniform distribution between 0.5 and 1.5, centered on 1. The inverse of the random variable (i.e. 1/Z) then ranges between 0.667 and 2 - no longer symmetrical around 1. So if a symmetric confidence interval is created, its width will tend not to match the true distribution. This skewness is not generally a problem if the sample size is at least 30 and the relative standard errors of ȳ and x̄ are both less than 10%.

- Sample size determination: The sample size needed to obtain a specified size of confidence interval can be found by inverting the formulae for the se of the ratio. This can be done on a spreadsheet using trial and error or the goal seek feature of the spreadsheet, as illustrated in the examples that follow.
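The table and notes above translate directly into code. An illustrative sketch (the function is my own, not from the text; it uses x̄ in place of the unknown µX, and the data are the wolf/moose counts from the example in the next section):

```python
import math

# Ratio-of-means estimator and its se under SRSWOR, following the summary
# table: s2_diff is the sample variance of the derived values y_i - r*x_i.
def ratio_estimate(y, x, N):
    n = len(y)
    r = sum(y) / sum(x)
    d = [yi - r * xi for yi, xi in zip(y, x)]
    s2_diff = sum(di**2 for di in d) / (n - 1)
    x_bar = sum(x) / n                    # estimate of mu_X
    se_r = math.sqrt((1 - n / N) * s2_diff / n) / x_bar
    return r, se_r

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
r, se_r = ratio_estimate(wolves, moose, N=200)
print(round(r, 4))   # about 0.0322 wolves per moose
```

Note that sum(d) is zero by construction (because r = Σy/Σx), so the sample variance of the d values needs no mean correction here.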

3.8.2 Example - wolf/moose ratio

[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs interchanges the use of x and y in the ratio.]

Wildlife ecologists interested in measuring the impact of wolf predation on moose populations in BC obtained estimates, by aerial counting, of the population sizes of wolves and moose on 11 sub-areas (all roughly equal size) selected as an SRSWOR from a total of 200 sub-areas in the game management zone. In this example, the actual ratio of wolves to moose is of interest. Here are the raw data:

Sub-area   1    2    3    4    5   6    7    8    9   10   11
Wolves     8   15    9   27   14   3   12   19    7   10   16
Moose    190  370  460  725  265  87  410  675  290  370  510

What is the population and parameter of interest? As in previous situations, there is some ambiguity: • The population of interest is the 200 sub-areas in the game-management zone. The sampling units are the 11 sub-areas. The response variables are the wolf and moose populations in the game management sub-area. We are interested in the wolf/moose ratio. • The populations of interest are the moose and wolves. If individual measurements were taken of each animal, then this definition would be fine. However, only the total number of wolves and moose within each sub-area are counted - hence a more proper description of this design would be a cluster design. As you will see in a later section, the analysis of a cluster design starts by summing to the cluster level and then treating the clusters as the population and sampling unit as is done in this case.


Having said this, do the counts of moose and wolves on each sub-area include young moose and young wolves, or just adults? How will immigration and emigration be taken care of?

What was the frame? Was it complete?
The frame consists of the 200 sub-areas of the game management zone. Presumably these 200 sub-areas cover the entire zone, but what about emigration and immigration? Moose and wolves may move into and out of the zone.

What was the sampling design?
It appears to be an SRSWOR design - the sampling units are the sub-areas of the zone. How did they determine the counts in the sub-areas? Perhaps they simply looked for tracks in the snow in winter - it seems difficult to get estimates from the air in summer when there is lots of vegetation blocking the view.

Excel analysis
A copy of the worksheet to perform the analysis of these data is called wolf and is available in the Allofdata workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is a screenshot of the spreadsheet:


Assessing conditions for a ratio estimator
The ratio estimator works well if the relationship between Y and X is linear, through the origin, with variance increasing with X. Begin by plotting Y (wolves) vs. X (moose).


The data appear to satisfy the conditions for a ratio estimator.

Compute summary statistics for both Y and X
Refer to the screenshot of the spreadsheet. The Excel built-in functions are used to compute the sample size, sample mean, and sample standard deviation for each variable.

Compute the ratio
The ratio is computed using the formula for a ratio estimator in a simple random sample, i.e. r = ȳ/x̄.

Compute the difference variable
Then, for each observation, the difference between the observed Y (the actual number of wolves) and the predicted Y based on the number of moose (Ŷ_i = r·X_i) is found. Notice that the sum of the differences must equal zero. The standard deviation of the differences will be needed to compute the standard error of the estimated ratio.

Estimate the standard error of the estimated ratio
Use the formula given at the start of the section.


Final estimate
Our final result is that the estimated ratio is 0.03217 wolf/moose with an estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval would be computed in the usual fashion.

Planning for future surveys
Our final estimate has an approximate rse of 0.00244/0.03217 = 7.5%, which is pretty good. You could try different values of n to see what sample size would be needed to get an rse better than 5%, or perhaps this is too precise and you only want an rse of about 10%.

As an approximate answer, recall that the se usually varies as 1/√n. An rse of 5% is smaller by a factor of 0.075/0.05 = 1.5, which will require an increase of 1.5² = 2.25 in the sample size, or about n_new = 2.25 × 11 ≈ 25 units (ignoring the fpc).

If the raw data are available, you can also do a "bootstrap" selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the ratio and its se, and increase the sample size until the required se is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement, so directly sampling 3× the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use the Table → Subset → Random Sample Size feature to get the approximate bootstrap sample. Again compute the ratio and its se, and increase the sample size until the required precision is obtained.

If you want to be more precise about this, notice that the formula for the se of a ratio is found as:

se(r) = sqrt( (1 − f) × s²_diff / (n × µ²_X) )

From the spreadsheet we extract the various values and find that the se of the ratio is

se(r) = sqrt( (1 − n/200) × 3.29² / (n × 395.64²) )

Different values of n can be tried until the rse is 5%. This gives a sample size of about 24 units.

If the actual raw data are not available, all is not lost.
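The spreadsheet steps above (compute the ratio, the difference variable, its standard deviation, and then try different values of n) can be sketched in a few lines of code. This is not part of the original notes; it is a minimal Python illustration (variable and function names are my own) using the raw wolf/moose data:

```python
import math

# Wolf and moose counts on the 11 sampled sub-areas (SRSWOR from N = 200)
wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
n, N = len(wolves), 200

xbar = sum(moose) / n
r = sum(wolves) / sum(moose)  # ratio estimator r = ybar / xbar

# Difference variable: observed Y minus predicted Y (r * X); sums to zero
diff = [y - r * x for y, x in zip(wolves, moose)]
s2_diff = sum(d * d for d in diff) / (n - 1)

def se_ratio(m):
    """se(r) for a sample of size m, reusing s2_diff from the pilot data."""
    return math.sqrt((1 - m / N) * s2_diff / (m * xbar ** 2))

se = se_ratio(n)
print(round(r, 5), round(se, 5))  # about 0.03217 and 0.00244

# Trial-and-error search (the goal-seek step): smallest m with rse <= 5%
n_req = next(m for m in range(2, N + 1) if se_ratio(m) / r <= 0.05)
print(n_req)  # about 24 sub-areas
```

The loop plays the role of the spreadsheet's goal-seek feature: it reuses s²_diff from the observed sample and searches for the smallest n whose relative standard error is at most 5%.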
You would require the approximate MEAN of X (µ_X), the standard DEVIATION of Y, the standard DEVIATION of X, the CORRELATION between Y and X, the approximate ratio (R), and the total number of sample units (N). The correlation determines how closely Y can be predicted from X and essentially determines how much better you will do using a ratio estimator. If the correlation is zero, there is NO gain in precision from using a ratio estimator over a simple mean. The se of r is then found as:

se(r) = sqrt( (1 − n/N) × [ V(y) + R²·V(x) − 2·R·corr(y, x)·sqrt(V(x)·V(y)) ] / (n × µ²_X) )

Different values of n can be tried to obtain the desired rse. This is again illustrated on the spreadsheet.
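If only summary statistics are available, the se formula above can be coded directly. This is a Python sketch of my own (not from the notes); the summary values plugged in are ones I computed from the raw wolf/moose data, so the result should agree with the direct computation:

```python
import math

def se_ratio_from_summaries(n, N, mu_x, var_y, var_x, corr, R):
    """se(r) from summary statistics only:
    sqrt((1 - n/N) * (V(y) + R^2 V(x) - 2 R corr sqrt(V(x) V(y))) / (n mu_x^2))."""
    inner = var_y + R ** 2 * var_x - 2 * R * corr * math.sqrt(var_x * var_y)
    return math.sqrt((1 - n / N) * inner / (n * mu_x ** 2))

# Summary values for the wolf/moose example (computed from the raw data)
se = se_ratio_from_summaries(n=11, N=200, mu_x=395.64, var_y=43.22,
                             var_x=37103.5, corr=0.8689, R=0.032169)
print(round(se, 5))  # about 0.00244 - matches the direct computation
```

This confirms that the summary-statistic formula reproduces the se obtained from the difference variable: the bracketed term equals s²_diff.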

SAS Analysis
The above computations can also be done in SAS with the program wolf.sas, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. It uses Proc SurveyMeans, which gives the output contained in wolf.lst. The SAS program again starts with the DATA step to read in the data:

data wolf;
   infile 'wolf.csv' dlm=',' dsd missover firstobs=2;
   input subregion wolf moose;
run;

Because the sampling weights are equal for all observations, it is not necessary to include them when estimating a ratio (the weights cancel out in the formula used by SAS). The Proc SGplot procedure creates a plot similar to that in the Excel spreadsheet:

proc sgplot data=wolf;
   title2 'plot to assess assumptions';
   scatter x=moose y=wolf;   /* plot Y (wolves) vs. X (moose) */
run;

giving:

There appears to be a linear relationship between the two variables that goes through the origin, which is the condition under which a ratio estimator is sensible. Finally, the Proc SurveyMeans procedure does the actual computation:

proc surveymeans data=wolf ratio clm N=200;
   /* ratio clm - request a ratio estimator with confidence intervals */
   /* N=200 specifies the total number of units in the population */
   title2 'Estimate of wolf to moose ratio';
   var moose wolf;
   ratio wolf/moose;   /* this statement asks for the ratio estimator */
   ods output statistics=wolfresults;
   ods output Ratio=wolfratio;
run;

Estimates are obtained for each variable:

Variable Name   Mean    SE Mean   LCL Mean   UCL Mean
moose           395.6   56.5      269.8      521.4
wolf            12.7    1.9       8.4        17.0

The RATIO statement in the SurveyMeans procedure requests the computation of the ratio estimator. Here is the output:

Numerator Variable   Denominator Variable   Ratio      LowerCL      StdErr     UpperCL
wolf                 moose                  0.032169   0.02673676   0.002438   0.03760148

The results are identical to those from the spreadsheet. Again, it is easier to do the planning in the Excel spreadsheet than in the SAS program.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:
• Assumes a simple random sample. If your data were NOT collected using a simple random sample, then ordinary regression methods should NOT be used.
• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.
• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem.
For further information about the problems of using standard statistical software packages in survey sampling, please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Using ordinary regression
Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/the number of moose) has been created. We start by plotting the data to assess if the relationship is linear and through the origin. The Y variable is the number of wolves; the X variable is the number of moose. If the relationship is not through the origin, then a more complex analysis called regression estimation is required.

The graph looks linear through the origin, which is one of the assumptions of the ratio estimator. Now we wish to fit a straight line THROUGH THE ORIGIN. By default, most computer packages include the intercept, which we want to force to zero. We must also specify that the inverse of the X variable (1/X) is the weighting variable.

We see that the estimated ratio (.032 wolves/moose) matches the Excel output, but the estimated standard error (.0026) does not quite match Excel. The difference is a bit larger than can be accounted for by not using the finite population correction factor. As a matter of interest, if you repeat the analysis WITHOUT using the inverse of the X variable as the weighting variable, you obtain an estimated ratio of .0317 (se .0022). All of these estimates are similar, and it likely makes very little difference which is used.

Finding the required sample size is trickier because of the weighted regression approach used by the packages, the slightly different way the se is computed, and the lack of an fpc. The latter two issues are usually not important in determining the approximate sample size, but the first issue is crucial. Start by REFITTING Y vs. X WITHOUT using the weighting variable. This will give you roughly the same estimate and se, but now it is much easier to extract the necessary information for sample size determination. When the UNWEIGHTED model is fit, you will see that the Root Mean Square Error has the value 3.28. This is the value of s_diff that is needed. The approximate se for r (ignoring the fpc) is

se(r) ≈ s_diff / (µ_X × √n) = 3.28 / (395.64 × √n)

Again, different values of n can be tried to get the appropriate rse. This gives an n of about 25 or 26, which is sufficient for planning purposes.

Post mortem
No population numbers can be estimated using the ratio estimator in this case because of a lack of suitable data. In particular, if you had wanted to estimate the total wolf population, you would have to use the simple inflation estimator that we discussed earlier unless you had some way of obtaining the total number of moose present in the ENTIRE management zone. This seems unlikely. However, refer to the next example, where the appropriate information is available.

3.8.3

Example - Grouse numbers - using a ratio estimator to estimate a population total

In some cases, a ratio estimator is used to estimate a population total. In these cases, the improvement in precision is caused by the close relationship between two variables.


Note that the population total of the auxiliary variable will have to be known in order to use this method.

Grouse Numbers
A wildlife biologist has estimated the grouse population in a region containing isolated areas (called pockets) of bush as follows: she selected 12 pockets of bush at random, and attempted to count the numbers of grouse in each of these. (One can assume that the grouse are almost all found in the bush, and for the purpose of this question, that the counts were perfectly accurate.) The total number of pockets of bush in the region is 248, comprising a total area of 3015 hectares. Results are as follows:

Area (ha)   Number of Grouse
8.9         24
2.7         3
6.6         10
20.6        36
3.7         8
4.1         8
25.8        60
1.8         5
20.1        35
14.0        34
10.1        18
8.0         22

The data are available in the grouse.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The SAS program starts in the usual fashion to read the data:

data grouse;
   infile 'grouse.csv' dlm=',' dsd missover firstobs=2;
   input area grouse;   /* sampling weights not needed */
run;

The first few lines of the raw data are shown below:

Obs   area   grouse   _TYPE_   _FREQ_   n    sampweight
1     8.9    24       0        12       12   20.6667
2     2.7    3        0        12       12   20.6667
3     6.6    10       0        12       12   20.6667
4     20.6   36       0        12       12   20.6667
5     3.7    8        0        12       12   20.6667
6     4.1    8        0        12       12   20.6667
7     25.8   60       0        12       12   20.6667
8     1.8    5        0        12       12   20.6667
9     20.1   35       0        12       12   20.6667
10    14.0   34       0        12       12   20.6667
11    10.1   18       0        12       12   20.6667
12    8.0    22       0        12       12   20.6667

What is the population of interest and parameter to be estimated?
As before, there is some ambiguity:
• The population of interest is the pockets of brush in the region. The sampling unit is the pocket of brush. The number of grouse in each pocket is the response variable.
• The population of interest is the grouse. These happen to be clustered into pockets of brush. This leads back to the previous case.

What is the frame?
Here the frame is explicit - the set of all pockets of bush. It isn't clear if all grouse will be found in these pockets - will some be itinerant and hence missed? What about movement between visits to the pockets of bush?

Summary statistics

Variable   n    mean    std dev
area       12   10.53   7.91
grouse     12   21.92   16.95

Simple inflation estimator ignoring the pocket areas
Proc SurveyMeans can be used to compute the simple inflation estimator based on the pockets surveyed, ignoring any relationship between the number of grouse and the area of the pockets. Don't forget that in SAS we need to provide the survey weight, which in the case of a simple random sample is the inverse of the sampling fraction.

proc means data=grouse noprint;
   var grouse;
   output out=sampsize n=n;
run;

data grouse;   /* get the survey weights */
   set grouse;
   one=1;
   set sampsize point=one;
   sampweight = 248 / n;
run;

proc surveymeans data=grouse mean sum clm clsum N=248;
   title2 'Estimation using a simple expansion estimator';
   var grouse;
   weight sampweight;
   ods output statistics=grousesimpleresults;
run;

giving:

Variable Name   Mean     SE Mean   LCL Mean   UCL Mean   Sum        SE Sum     LCL Sum    UCL Sum
grouse          21.917   4.772     11.413     32.420     5435.333   1183.488   2830.493   8040.173

If we wish to adjust for the sampling fraction, we can use our earlier results for the simple inflation estimator. Our estimate of the total number of grouse is τ̂ = N·ȳ = 248 × 21.92 = 5435.3, with an estimated se of

se(τ̂) = N × sqrt( (s²/n) × (1 − f) ) = 248 × sqrt( (16.95²/12) × (1 − 12/248) ) = 1183.5

The estimate isn't very precise, with an rse of 1183.5/5435.3 = 22%.

Ratio estimator - why?
Why did the inflation estimator do so poorly? Part of the reason is the relatively large standard deviation in the number of grouse among the pockets. Why does this number vary so much? It seems reasonable that larger pockets of brush will tend to have more grouse. Perhaps we can do better by using the relationship between the area of the bush and the number of grouse through a ratio estimator.

Excel analysis
An Excel worksheet is available in the grouse tab in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Preliminary plot to assess if ratio estimator will work
First plot the numbers of grouse vs. area and see if this has a chance of succeeding.


The graph shows a linear relationship, through the origin. There is some evidence that the variance is increasing with X (area of the plot). The spreadsheet is set up similarly to the previous example:


The total of the X variable (area) will need to be known. As before, you find summary statistics for X and Y, compute the ratio estimate, find the difference variable, find the standard deviation of the difference variable, and find the se of the estimated ratio. The estimated ratio is: r = ȳ/x̄ = 21.92/10.53 = 2.081 grouse/ha.


The se of r is found as

se(r) = sqrt( (1/x̄²) × (s²_diff/n) × (1 − f) ) = sqrt( (1/10.533²) × (4.7464²/12) × (1 − 12/248) ) = 0.1269 grouse/ha

In order to estimate the population total of Y, you now multiply the estimated ratio by the population total of X. We know the pockets cover 3015 ha, and so the estimated total number of grouse is found by τ̂_Y = τ_X × r = 3015 × 2.081 = 6273.3 grouse. To estimate the se of the total, multiply the se of r by 3015 as well: se(τ̂_Y) = τ_X × se(r) = 3015 × 0.1269 = 382.6 grouse. The precision is much improved compared to the simple inflation estimator. This improvement is due to the very strong relationship between the number of grouse and the area of the pockets.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. It is tempting to use ordinary regression methods to compute the ratio and then expand this ratio to estimate the total. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:
• Assumes that a simple random sample was taken. If the sampling design is not a simple random sample, then regular regression cannot be used.
• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.
• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem.
For further information about the problems of using standard statistical software packages in survey sampling, please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.
However, in a simple random sample, the correct analysis can be created using ordinary regression if a weighted regression is used. Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/area of pocket) has been created. This is the method used when JMP is used to analyze the data. Because SAS and R have procedures for the analysis of survey samples, it is not necessary to do the weighted regression. Nor is it necessary to include a computation of the sampling weight if the data are collected in a simple random sample for a ratio estimator - the weights will cancel out when these packages are used.
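The grouse calculations can be reproduced in a few lines of code. This is a Python sketch of my own (the notes use Excel and SAS), computing both the simple inflation estimator and the ratio estimator of the total from the raw data:

```python
import math

# Pocket areas (ha) and grouse counts for the 12 sampled pockets
area = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
n, N, tau_x = len(area), 248, 3015  # 248 pockets, 3015 ha in total
f = n / N

# Simple inflation estimator: tau-hat = N * ybar
ybar = sum(grouse) / n
s2_y = sum((y - ybar) ** 2 for y in grouse) / (n - 1)
tot_infl = N * ybar
se_infl = N * math.sqrt(s2_y / n * (1 - f))

# Ratio estimator: r = ybar/xbar, expanded by the known total area
xbar = sum(area) / n
r = sum(grouse) / sum(area)
diff = [y - r * x for y, x in zip(grouse, area)]
s2_diff = sum(d * d for d in diff) / (n - 1)
se_r = math.sqrt((1 - f) * s2_diff / (n * xbar ** 2))
tot_ratio = tau_x * r
se_tot_ratio = tau_x * se_r

print(round(tot_infl, 1), round(se_infl, 1))      # 5435.3 and about 1183.5
print(round(tot_ratio, 1), round(se_tot_ratio, 1))  # 6273.3 and about 382.6
```

The ratio estimator's se of the total (about 383 grouse) is roughly a third of the inflation estimator's (about 1183), matching the output shown above.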

SAS Analysis
Proc SGplot creates the standard plot of the numbers of grouse vs. the area of each pocket:

proc sgplot data=grouse;
   title2 'plot to assess assumptions';
   scatter y=grouse x=area;
run;

giving:

The relationship between the number of grouse and the pocket area is linear through the origin – the conditions under which a ratio estimator will perform well. Proc SurveyMeans procedure can estimate the ratio of grouse/ha but cannot directly estimate the population total.

proc surveymeans data=grouse ratio clm N=248;
   /* the ratio and clm keywords request a ratio estimator and a confidence interval */
   title2 'Estimation using a ratio estimator';
   var grouse area;
   ratio grouse / area;
   ods output statistics=grouseresults;
   ods output ratio=grouseratio;   /* extract information so that the total can be estimated */
run;

The ODS statement redirects the results from the RATIO statement to a new dataset that is processed further to multiply by the total area of the pockets:

data outratio;   /* compute estimates of the total */
   set grouseratio;
   Est_total = ratio * 3015;
   Se_total  = stderr * 3015;
   UCL_total = uppercl * 3015;
   LCL_total = lowercl * 3015;
   format est_total se_total ucl_total lcl_total 7.1;
   format ratio stderr lowercl uppercl 7.3;
run;

The output is as follows:

Numerator Variable   Denominator Variable   Ratio      LowerCL      StdErr     UpperCL
grouse               area                   2.080696   1.80140636   0.126893   2.35998605

Obs   Ratio   StdErr   LowerCL   UpperCL   Est_total   Se_total   LCL_total   UCL_total
1     2.081   0.127    1.801     2.360     6273.3      382.6      5431.2      7115.4
The results are exactly the same as computed using Excel. Again, it is easiest to do the sample size computations in Excel. The ratio estimator is much more precise than the inflation estimator because of the strong relationship between the number of grouse and the area of the pocket.

Sample size for future surveys
If you wish to investigate different sample sizes, the simplest way would be to modify the cell corresponding to the count of the differences. This will be left as an exercise for the reader. The final ratio estimate has an rse of about 6% - quite good. It is relatively straightforward to investigate the sample size needed for a 5% rse. We find this to be about 17 pockets.

Post mortem - a question to ponder What if it were to turn out that grouse population size tended to be proportional to the perimeter of a pocket of bush rather than its area? Would using the above ratio estimator based on a relationship with area introduce serious bias into the ratio estimate, increase the standard error of the ratio estimate, or do both?

3.9

Additional ways to improve precision

This section will not be examined on the exams or term tests


3.9.1

Using both stratification and auxiliary variables

It is possible to use both methods to improve precision. However, this comes at a cost of increased computational complexity. There are two ways of combining ratio estimators in stratified simple random sampling.

1. Combined ratio estimator: estimate the numerator and denominator using stratified random sampling and then form the ratio of these two estimates:

r_stratified,combined = µ̂_Y,stratified / µ̂_X,stratified

and

τ̂_Y,stratified,combined = (µ̂_Y,stratified / µ̂_X,stratified) × τ_X

We won't consider the estimates of the se in this course, but they can be found in any textbook on sampling.

2. Separate ratio estimator: form a ratio estimate of the total in each stratum, and form a grand ratio by taking a weighted average of these estimates. Note that we weight by the covariate totals rather than the stratum sizes. We get the following estimators for the grand ratio and grand total:

r_stratified,separate = (1/τ_X) × Σ_{h=1..H} τ_Xh × r_h

and

τ̂_Y,stratified,separate = Σ_{h=1..H} τ_Xh × r_h

Again, we won't worry about the estimates of the se.

Why use one over the other?
• You need the stratum totals of X for the separate estimator, but only the population total of X for the combined estimator.
• The combined ratio is less subject to the risk of bias (see Cochran, p. 165 and following). In general, the biases in the separate estimator are added together, and if they fall in the same direction this can cause trouble. In the combined estimator these biases are reduced through stratification of both numerator and denominator.
• When the ratio estimator is appropriate (regression through the origin and variance proportional to the covariate), the last term vanishes. Consequently, the combined ratio estimator will have greater standard error than the separate ratio estimator unless R is relatively constant from stratum to stratum. However, as noted above, the bias may be more severe for the separate ratio estimator. You must consider the combined effects of bias and precision, i.e. the MSE.
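The two stratified ratio estimators can be contrasted on a small invented data set. This Python sketch is my own illustration; the stratum sizes, covariate totals, and samples are hypothetical, not from the notes:

```python
# Hypothetical two-stratum example (numbers invented for illustration):
# stratum sizes N_h, known covariate totals tau_Xh, and SRSWOR samples.
strata = [
    # (N_h, tau_Xh, x-sample, y-sample)
    (30, 300.0, [10.0, 20.0], [5.0, 10.0]),
    (10, 100.0, [8.0, 12.0], [2.0, 3.0]),
]
N = sum(Nh for Nh, *_ in strata)
tau_x = sum(tXh for _, tXh, *_ in strata)

# Separate ratio estimator: expand each stratum ratio by its covariate total
tot_sep = sum(tXh * (sum(y) / sum(x)) for _, tXh, x, y in strata)

# Combined ratio estimator: form the stratified means first, then one ratio
mu_y = sum(Nh * (sum(y) / len(y)) for Nh, _, x, y in strata) / N
mu_x = sum(Nh * (sum(x) / len(x)) for Nh, _, x, y in strata) / N
tot_comb = mu_y / mu_x * tau_x

print(round(tot_sep, 2), round(tot_comb, 2))
```

With these made-up numbers the two estimators disagree (175 vs. about 182) because the stratum ratios differ (0.50 vs. 0.25); when the ratio is constant across strata the two estimators coincide.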

3.9.2

Regression Estimators

A ratio estimator works well when the relationship between Y_i and X_i is linear, through the origin, with the variance of observations about the ratio line increasing with X. In some cases, the relationship may be linear, but not through the origin.


In these cases, the ratio estimator is generalized to a regression estimator where the linear relationship is no longer constrained to go through the origin. We won't be covering this in this course. Regression estimators are also useful if there is more than one X variable. Whenever you use a regression estimator, be sure to plot y vs. x to assess if the assumptions for a ratio estimator are reasonable.

CAUTION: If ordinary statistical packages are used to do regression analysis on survey data, you could obtain misleading results because the usual packages ignore the way in which the data were collected. Virtually all standard regression packages assume you've collected data under a simple random sample. If your sampling design is more complex, e.g. a stratified design, cluster design, multi-stage design, etc., then you should use a package specifically designed for the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.

3.9.3

Sampling with unequal probability - pps sampling

All of the designs discussed in previous sections have assumed that each sample unit was selected with equal probability. In some cases, it is advantageous to select units with unequal probabilities, particularly if they differ in their contribution to the overall total. This technique can be used with any of the sampling designs discussed earlier. An unequal probability sampling design can lead to smaller standard errors (i.e. better precision) for the same total effort compared to an equal probability design. For example, forest stands may be selected with probability proportional to the area of the stand (i.e. a stand of 200 ha will be selected with twice the probability of a stand of 100 ha) because large stands contribute more to the overall population total and it would be wasteful of sampling effort to spend much effort on smaller stands.

The variable used to assign the probabilities of selection to individual study units does not need to have an exact relationship with an individual unit's contribution to the total. For example, in probability proportional to prediction (3P) sampling, all trees in a small area are visited. A simple, cheap characteristic is measured and used to predict the value of each tree. A sub-sample of the trees is then selected with probability proportional to the predicted value and remeasured using a more expensive measuring device, and the relationship between the cheap and expensive measurements in the second phase is used with the simple measurements from the first phase to obtain a more precise estimate for the entire area. This is an example of two-phase sampling with unequal probability of selection.

Please consult with a sampling expert before implementing or analyzing an unequal probability sampling design.
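The basic idea of pps estimation can be sketched with a tiny example. This is my own hypothetical illustration (not from the notes): units are drawn with replacement with probability proportional to size, and the total is estimated with the Hansen-Hurwitz estimator, τ̂ = (1/n) Σ y_i/p_i.

```python
# Hypothetical pps example - all numbers invented for illustration
sizes = [100.0, 200.0, 300.0]  # e.g. stand areas (ha)
y = [10.0, 22.0, 31.0]         # response on each unit (true total = 63)
p = [s / sum(sizes) for s in sizes]  # selection probability of each unit

# Suppose the pps draw (with replacement) happened to select units 3 and 2
sample = [2, 1]  # 0-based indices of the drawn units
tau_hat = sum(y[i] / p[i] for i in sample) / len(sample)
print(tau_hat)
```

Each drawn unit contributes y_i/p_i, an unbiased guess at the total; when y is roughly proportional to size, these contributions are nearly constant, which is where the precision gain comes from.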

3.10

Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling


units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot, while trees within the plot are sub-samples.

Some examples of cluster samples are:
• urchin estimation - transects are taken perpendicular to the shore and a diver swims along the transect and counts the number of urchins in each m² along the line.
• aerial surveys - a plane flies along a line and observers count the number of animals they see in a strip on both sides of the aircraft.
• forestry surveys - often circular plots are located on the ground and ALL trees within that plot are measured.

Pitfall
A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to produce an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e. treat the cluster as a whole as the sampling unit, and deal only with the cluster totals as the response measure.
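The pitfall can be made concrete with a tiny invented example. This Python sketch (my own, not from the notes) compares the naive SRS analysis of pooled quadrats against the correct cluster-level analysis:

```python
import math

# Hypothetical quadrat counts from 3 transects (clusters) of 3 quadrats each;
# counts within a transect are similar (positively correlated), as is typical
clusters = [[5, 6, 7], [1, 2, 1], [9, 8, 10]]

def mean_se(values):
    """Mean and SRS-style standard error (fpc ignored) of a list of values."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, math.sqrt(s2 / n)

# WRONG: pool all 9 quadrats and pretend they form an SRS of quadrats
quadrats = [q for c in clusters for q in c]
mean_naive, se_naive = mean_se(quadrats)

# CORRECT: the transect is the sampling unit - analyze the 3 cluster means
mean_cluster, se_cluster = mean_se([sum(c) / len(c) for c in clusters])

print(round(se_naive, 3), round(se_cluster, 3))  # about 1.144 vs 2.231
```

With these positively correlated counts the naive se (about 1.14) is roughly half the correct cluster-based se (about 2.23) - exactly the over-optimism described above. (With equal cluster sizes the two point estimates of the mean coincide; only the se differs.)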

3.10.1

Sampling plan

In simple random sampling, a frame of all elements was required in order to draw a random sample, and individual units are selected one at a time. In many cases this is impractical: it may not be possible to list all of the individual units, or it may be logistically impossible to do so. In many cases, the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - almost always you measure things at the individual quadrat level, but the actual sampling unit is the cluster. This problem is analogous to pseudo-replication in experimental design - breaking the transect into individual quadrats is like having multiple fish within the same tank.

A visual comparison of a simple random sample vs. a cluster sample
You may find it useful to compare a simple random sample of 24 vs. a cluster sample of 24 using the following visual plans. Select a sample of 24 in each case.


Simple Random Sampling

Describe how the sample was taken.

Cluster Sampling
First, the clusters must be defined. In this case, the units are naturally clustered in blocks of size 8. The following units were selected.


Describe how the sample was taken. Note the differences between stratified simple random sampling and cluster sampling!

3.10.2

Advantages and disadvantages of cluster sampling compared to SRS

• Advantage: It may not be feasible to construct a frame for every elemental unit, but it may be possible to construct a frame for larger units; e.g. it is difficult to locate individual quadrats on the sea floor, but easy to lay out transects from the shore.
• Advantage: Cluster sampling is often more economical. Because all units within a cluster are close together, travel costs are much reduced.


• Disadvantage: Cluster sampling has a higher standard error than an SRSWOR of the same total size because units are typically homogeneous within clusters. The cluster itself serves as the sampling unit, so for the same number of units, cluster sampling almost always gives worse precision. This is the problem of pseudo-replication that we have seen earlier.
• Disadvantage: A cluster sample is more difficult to analyze, but with modern computing equipment this is less of a concern. The difficulties are not arithmetic but rather being forced to treat the clusters as the survey unit - there is a natural tendency to think that data are being thrown away.

The perils of ignoring a cluster design

The cluster design is frequently used in practice, but often analyzed incorrectly. For example, whenever the quadrats have been gathered along a transect of some sort, you have a cluster sampling design. The key thing to note is that the sampling unit is the cluster, not the individual quadrats. The biggest danger of ignoring the clustering and treating the individual quadrats as if they came from an SRS is that, typically, your reported se will be too small. That is, the true standard error from your design may be substantially larger than the standard error estimated from an SRS analysis. The precision is (erroneously) thought to be far better than is justified by the survey results. This has been seen before - refer to the paper by Underwood where the dangers of estimation with positively correlated data were discussed.

3.10.3 Notation

The key thing to remember is to work with the cluster TOTALS. Traditionally, the cluster size is denoted by M rather than by X but, as you will see in a few moments, estimation in cluster sampling is nothing more than ratio estimation performed on the cluster totals.

Attribute             Population value   Sample value
Number of clusters    N                  n
Cluster totals        τi                 yi
Cluster sizes         Mi                 mi
Total area            M

NOTE: τi and yi are the TOTALS for cluster i.

3.10.4 Summary of main results

The key concept in cluster sampling is to treat the cluster TOTAL as the response variable and ignore all the individual values within the cluster. Because the clusters are a simple random sample from the population of clusters, simply apply all the results you had before for an SRS to the CLUSTER TOTALS. The analysis of a cluster design will require the size of each cluster - this is simply the number of sub-units within each cluster. If the clusters are roughly equal in size, a simple inflation estimator can be used. But, in many cases, there is a strong relationship between the size of the cluster and the cluster total - in these cases a ratio estimator would likely be more suitable (i.e. will give you a smaller standard error), where the X variable is the cluster size. If there is no relationship between cluster size and cluster total, a simple inflation estimator can be used even in the case of unequal cluster sizes. You should do a preliminary plot of the cluster totals against the cluster sizes to see if this relationship holds.

Extensions of cluster analysis - unequal size sampling

In some cases, the clusters are of quite unequal sizes. A better design choice may be to select clusters with an unequal probability design rather than using a simple random sample. In this case, clusters that are larger typically contribute more to the population total, and would be selected with a higher probability.

Computational formulae

Parameter      Population value                       Estimator                            Estimated se
Overall mean   µ = Σ_{i=1..N} τi / Σ_{i=1..N} Mi      µ̂ = Σ_{i=1..n} yi / Σ_{i=1..n} mi   se(µ̂) = sqrt( (1/m̄²) × (s²diff/n) × (1 − f) )
Overall total  τ = M × µ                              τ̂ = M × µ̂                          se(τ̂) = sqrt( M² × (1/m̄²) × (s²diff/n) × (1 − f) )

where m̄ is the mean cluster size in the sample.

• You never use the mean per unit within a cluster.
• The term s²diff = Σ_{i=1..n} (yi − µ̂ mi)² / (n − 1) is again found in the same fashion as in ratio estimation - create a new variable which is the difference yi − µ̂ mi, find the sample standard deviation of it, and then square that standard deviation.
• Sometimes the ratio of two variables measured within each cluster is required, e.g. you conduct aerial surveys to estimate the ratio of wolves to moose - this has already been done in an earlier example! In these cases, the actual cluster length is not used.

Confidence intervals

As before, once you have an estimator for the mean and for the se, use the usual ±2se rule. If the number of clusters is small, then some textbooks advise using a t-distribution for the multiplier - this is not covered in this course.

Sample size determination

Again, this is no real problem - except that you will get a value for the number of CLUSTERS, not the individual quadrats within the clusters.
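The recipe above can be sketched in a few lines of Python. The cluster totals, cluster sizes, and population size M below are hypothetical numbers chosen only to exercise the formulas (the fpc term (1 − f) is ignored here):

```python
from math import sqrt

# Hypothetical cluster summaries: (cluster total y_i, cluster size m_i).
clusters = [(12, 10), (30, 22), (18, 15), (25, 20)]
M = 500          # total number of sub-units in the population (assumed)
n = len(clusters)

sum_y = sum(y for y, m in clusters)
sum_m = sum(m for y, m in clusters)
mu_hat = sum_y / sum_m                      # estimated mean per sub-unit

# s2_diff: squared sample std dev of the differences y_i - mu_hat * m_i
diffs = [y - mu_hat * m for y, m in clusters]
s2_diff = sum(d * d for d in diffs) / (n - 1)

m_bar = sum_m / n                           # mean cluster size in the sample
se_mu = sqrt(s2_diff / n / m_bar ** 2)      # fpc (1 - f) ignored

tau_hat = M * mu_hat                        # estimated population total
se_tau = M * se_mu
print(mu_hat, se_mu, tau_hat, se_tau)
```

Note how only the cluster totals and sizes enter the computation - the individual sub-unit values inside each cluster are never used, exactly as the summary above prescribes.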

3.10.5 Example - estimating the density of urchins

Red sea urchins are considered a delicacy and the fishery is worth several million dollars to British Columbia. In order to set harvest quotas and to monitor the stock, it is important that the density of sea urchins be determined each year.


To do this, the managers lay out a number of transects perpendicular to the shore in the urchin beds. Divers then swim along each transect, rolling a 1 m² quadrat along the transect line, and count the number of legal-sized and sub-legal-sized urchins in the quadrat. The number of possible transects is so large that the correction for finite population sampling can be ignored. SAS v.8 has procedures for the analysis of survey data taken in a cluster design. A program to analyze the data is urchin.sas and is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The SAS program starts by reading in the data at the individual quadrat level:

data urchin;
   infile 'urchin.csv' dlm=',' dsd firstobs=2 missover; /* the first record has the variable names */
   input transect quadrat legal sublegal;
   /* no need to specify sampling weights because transects are an SRS */
run;

The dataset contains variables for the transect, the quadrat within each transect, and the number of legal and sub-legal sized urchins counted in that quadrat:

Obs   transect   quadrat   legal   sublegal
  1          1         1       0          0
  2          1         2       0          0
  3          1         3       0          0
  4          1         4       0          1
  5          1         5       0          0
  6          1         6       0          0
  7          1         7       0          0
  8          1         8       0          0
  9          1         9       0          0
 10          1        10       0          0

What is the population of interest and the parameter?

The population of interest is the sea urchins in the harvest area. These happen to be (artificially) "clustered" into transects which are sampled. All sea urchins within the cluster are measured. The parameter of interest is the density of legal-sized urchins.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather, they pick random points along the shore and then lay the transects out from those points.

What is the sampling design?


The sampling design is a cluster sample - the clusters are the transect lines, while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?). As the points along the shore were chosen using a simple random sample, the analysis proceeds as an SRS design on the cluster totals.

Excel Analysis

An Excel worksheet with the data and analysis is called urchin and is available in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A reduced view appears below:



The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the length of the transect). The Pivot Table feature of Excel is quite useful for doing this automatically. Unfortunately, you still have to play around with the final table in order to get the data displayed in a nice format. Note that there was no transect numbered 5, 12, 17, 19, or 32. Why are these transects missing? According to the records of the survey, inclement weather caused cancellation of the missing transects. It seems reasonable to treat the missing transects as missing completely at random (MCAR). In this case,


there is no problem in simply ignoring the missing data - all that happens is that the precision is reduced compared to the design with all data present. We compare the maximum quadrat number to the number of quadrat values actually recorded and see that they all match, indicating that no empty quadrats went unrecorded. In many transect studies, there is a tendency NOT to record quadrats with 0 counts as they don't affect the cluster sum. However, you still have to know the correct size of the cluster (i.e. how many quadrats), so you can't simply ignore these 'missing' values. In this case, you could examine the maximum of the quadrat number and the number of listed quadrats to see if these agree (why?).

Plot the cluster totals vs. the cluster sizes to see if a ratio estimator is appropriate, i.e. a linear relationship through the origin with variance increasing with cluster size. The plot (not shown) shows a weak relationship between the two variables.

Compute the summary statistics on the cluster TOTALS. You will need the totals over all sampled clusters of both variables:

sum(legal)   sum(quad)   n(transect)
1507         1120        28

The estimated density is then found as a ratio estimator using the cluster totals: density-hat = sum(legal)/sum(quad) = 1507/1120 = 1.345536 urchins/m².

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation, s_diff = 48.09933. The estimated se is then found as: se(density-hat) = sqrt( (1/40²) × 48.09933²/28 ) = 0.2272 urchins/m², where 40 = 1120/28 is the mean number of quadrats per transect.

In order to estimate the total number of urchins in the harvesting area, you simply multiply the estimated ratio and its standard error by the area to be harvested.
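The arithmetic above is easy to check directly (a sketch using the totals and the s_diff value from the worksheet):

```python
from math import sqrt

sum_legal = 1507      # total legal-sized urchins over all sampled transects
sum_quad = 1120       # total number of quadrats over all sampled transects
n = 28                # number of transects
s_diff = 48.09933     # std dev of the diff column (from the worksheet)

density = sum_legal / sum_quad                # estimated urchins per m^2
quad_bar = sum_quad / n                       # mean transect length: 40 quadrats
se_density = sqrt(s_diff**2 / n) / quad_bar   # fpc ignored
print(density, se_density)                    # about 1.3455 and 0.2272
```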

SAS Analysis

SAS can use the raw data directly - it is not necessary to compute the cluster totals. However, computing them is a good first step to check the assumption of the (ratio) analysis, i.e. that the cluster total is approximately linear through the origin. The totals of the urchin counts and the transect lengths are computed using Proc Means:

proc sort data=urchin;
   by transect;
run;
proc means data=urchin noprint;
   by transect;
   var quadrat legal;
   output out=check min=min max=max n=n sum(legal)=tlegal;
run;

and then plotted:



proc sgplot data=check;
   title2 'Plot the relationship between the cluster total and cluster size';
   scatter y=tlegal x=n / datalabel=transect; /* use the transect number as the point label */
run;

Because we are computing a ratio estimator from a simple random sample of transects, it is not necessary to specify the sampling weights for the individual quadrats or the transect. The key feature of the Proc SurveyMeans is the use of the CLUSTER statement to identify the clusters in the data.

proc surveymeans data=urchin; /* do not specify a pop size as fpc is negligible */
   cluster transect;
   var legal;
   ods output statistics=urchinresults;
run;

The population number of transects was not specified as the finite population correction is negligible. Here are the results:

Variable Name   Mean    SE Mean   LCL Mean   UCL Mean
legal           1.346   0.227     0.879      1.812

The results are identical to those obtained via Excel.

Planning for future experiments

The rse of the estimate is 0.2272/1.3455 = 17% - not terrific. The determination of sample size is done in the same manner as in the ratio estimator case dealt with in earlier sections, except that the number of CLUSTERS is found. If we wanted an rse near 5%, we would need almost 320 transects - this is likely too costly.
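Because the rse of the estimate falls roughly as one over the square root of the number of transects, the claim of "almost 320 transects" can be sketched as:

```python
n_now = 28
rse_now = 0.2272 / 1.345536    # about 17%
rse_target = 0.05

# rse scales as 1/sqrt(n), so the required n scales as the squared rse ratio
n_needed = n_now * (rse_now / rse_target) ** 2
print(n_needed)                # about 319 transects, i.e. almost 320
```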

3.10.6 Example - estimating the total number of sea cucumbers

Sea cucumbers are considered a delicacy by some, and the fishery is of growing importance. In order to set harvest quotas and to monitor the stock, it is important that the number of sea cucumbers in a certain harvest area be estimated each year. The following is an example taken from Griffith Passage in BC, 1994. To do this, the managers lay out a number of transects across the cucumber harvest area. Divers then swim along each transect and, while carrying a 4 m wide pole, count the number of cucumbers within the width of the pole during the swim. The number of possible transects is so large that the correction for finite population sampling can be ignored. Here is the summary information at the transect level (the preliminary raw data are unavailable):


Transect Area   Sea Cucumbers
260             124
220             67
200             6
180             62
120             35
200             3
200             1
120             49
140             28
400             1
120             89
120             116
140             76
800             10
1460            50
1000            122
140             34
180             109
80              48

The total harvest area is 3,769,280 m² as estimated by a GIS system. The transects were laid out from one edge of the bed, and the length of the edge is 51,436 m. Note that because each transect was 4 m wide, the number of possible transects is 1/4 of this value. The SAS program is available in cucumber.sas from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are read in the Data step already summarized to the cluster level:

data cucumber;
   infile 'cucumber.csv' dlm=',' dsd missover firstobs=2;
   input area cucumbers;
   transect = _n_; /* number the transects */
run;

There is no explicit transect number, so one was created based on the row number in the data file.

What is the population of interest and the parameter?

The population of interest is the sea cucumbers in the harvest area. These happen to be (artificially) "clustered" into transects which are the sampling units. All sea cucumbers within the transect (cluster) are measured.


The parameter of interest is the total number of cucumbers in the harvest area.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather, they pick random points along the edge of the harvest area, and then lay out the transect from there.

What is the sampling design?

The sampling design is a cluster sample - the clusters are the transect lines, while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?).

The worksheet cucumber, available in the AllofData.xls workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, illustrates the computations in Excel. There are three different surveys illustrated. It also computes the two estimators when two potential outliers are deleted, and for a second harvest area.

The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the area of the transect). This has already been done in the above data, so you don't need to use a pivot table. This summary table is simply an SRSWOR from the set of all transects. We first estimate the density, and then multiply by the area to estimate the total. Note that after summarizing up to the transect level, this example proceeds in an analogous fashion to the grouse-in-pockets-of-brush example that we looked at earlier. A plot of the cucumber total vs. the transect size shows a very poor relationship between the two variables. It will be interesting to compare the results from the simple inflation estimator and the ratio estimator.

Simple Inflation Estimator

First, estimate the number ignoring the area of the transects by using a simple inflation estimator. The summary statistics that we need are:

n         19 transects
Mean      54.21 cucumbers/transect
Std Dev   42.37 cucumbers/transect

We compute an estimate of the total as τ̂ = N × ȳ = (51,436/4) × 54.21 = 697,093 sea cucumbers. [Why did we use 51,436/4 rather than 51,436?] We compute an estimate of the se of the total as: se(τ̂) = sqrt( N² s²/n × (1 − f) ) = sqrt( (51,436/4)² × 42.37²/19 ) = 124,981 sea cucumbers.

The finite population correction factor is so small we simply ignore it. This gives a relative standard error (se/est) of 18%.

Ratio Estimator


We use the methods outlined earlier for ratio estimators from SRSWOR to get the following summary table:

             Mean
area         320.00
cucumbers    54.21 per transect

The estimated density of sea cucumbers is then density-hat = mean(cucumbers)/mean(area) = 54.21/320.00 = 0.169 cucumbers/m².

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation, s_diff = 73.63. The estimated se of the ratio is then found as: se(density-hat) = sqrt( (1/320²) × 73.63²/19 ) = 0.053 cucumbers/m². We once again ignore the finite population correction factor.

In order to estimate the total number of cucumbers in the harvesting area, you simply multiply the above by the area to be harvested: τ̂_ratio = area × density-hat = 3,769,280 × 0.169 = 638,546 sea cucumbers. The se is found as: se(τ̂_ratio) = area × se(density-hat) = 3,769,280 × 0.053 = 198,983 sea cucumbers, for an overall rse of 31%.
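Both estimators can be reproduced from the transect summaries tabulated above (a sketch; the fpc is ignored throughout, as in the text):

```python
import statistics as st
from math import sqrt

# (transect area in m^2, sea cucumbers) for the 19 Griffith Passage transects
data = [(260, 124), (220, 67), (200, 6), (180, 62), (120, 35), (200, 3),
        (200, 1), (120, 49), (140, 28), (400, 1), (120, 89), (120, 116),
        (140, 76), (800, 10), (1460, 50), (1000, 122), (140, 34),
        (180, 109), (80, 48)]
areas = [a for a, y in data]
cukes = [y for a, y in data]
n = len(data)
N = 51436 / 4            # number of possible 4 m wide transects
harvest_area = 3769280   # m^2, from the GIS system

# Simple inflation estimator: N * mean cucumbers per transect
tau_infl = N * st.mean(cukes)
se_infl = sqrt(N**2 * st.variance(cukes) / n)

# Ratio estimator: density = mean(cucumbers) / mean(area), inflated by area
dens = st.mean(cukes) / st.mean(areas)
diffs = [y - dens * a for a, y in data]
s_diff = st.stdev(diffs)
se_dens = sqrt(s_diff**2 / n) / st.mean(areas)
tau_ratio = harvest_area * dens
se_ratio = harvest_area * se_dens

print(round(tau_infl), round(se_infl))    # ~697093 and ~124981
print(round(tau_ratio), round(se_ratio))  # ~638546 and ~198983
```

The ratio estimator's larger standard error here is exactly what the comparison in the next section explains.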

SAS Analysis

Because only the summary data are available, you cannot use the CLUSTER statement of Proc SurveyMeans. Rather, as noted earlier in the notes, you form a ratio estimator based on the cluster totals. We begin with a plot to see the relationship between transect area and numbers of cucumbers:

proc sgplot data=cucumber;
   title2 'Plot the relationship between the cluster total and cluster size';
   scatter y=cucumbers x=area / datalabel=transect; /* use the transect number as the point label */
run;



Because the relationship between the number of cucumbers and transect area is not very strong, a simple inflation estimator will be tried first. The sample weights must be computed. Each weight is equal to the total number of possible transects divided by the number of transects taken:

/* First compute the sampling weight and add to the dataset */
/* The sampling weight is simply the total pop size / # sampling units in an SRS */
/* In this example, transects were an SRS from all possible transects */
proc means data=cucumber n mean std;
   var cucumbers; /* get the total number of transects */
   output out=weight n=samplesize;
run;

data cucumber;
   merge cucumber weight;
   retain samplingweight;
   /* we divide the shore length by 4 because each transect is 4 m wide */
   if samplesize > . then samplingweight = 51436/4 / samplesize;
run;

And then the simple inflation estimator is computed via Proc SurveyMeans:

proc surveymeans data=cucumber mean clm sum clsum cv;
   /* N not specified as we ignore the fpc in this problem */


   /* mean clm  - find estimate of mean and confidence intervals */
   /* sum clsum - find estimate of grand total and confidence intervals */
   title2 'Simple inflation estimator using cluster totals';
   var cucumbers;
   weight samplingweight;
   ods output statistics=cucumberresultssimple;
run;

Variable Name   Mean     SE Mean   LCL Mean   UCL Mean   Sum          SE Sum       LCL Sum      UCL Sum
cucumbers       54.211   9.719     33.791     74.630     697093.158   124980.863   434518.107   959668.208

Now for the ratio estimator. First use Proc SurveyMeans to compute the density, and then inflate the density by the total area of the cucumber bed:

proc surveymeans data=cucumber ratio clm;
   /* the ratio clm keywords request a ratio estimator and a confidence interval */
   title2 'Estimation using a ratio estimator';
   var cucumbers area;
   ratio cucumbers / area;
   ods output ratio=cucumberratio; /* extract information so that the total can be estimated */
run;

data cucumbertotal; /* compute estimates of the total */
   set cucumberratio;
   cv = stderr / ratio; /* the relative standard error of the estimate */
   Est_total = ratio   * 3769280;
   Se_total  = stderr  * 3769280;
   UCL_total = uppercl * 3769280;
   LCL_total = lowercl * 3769280;
   format est_total se_total ucl_total lcl_total 7.1;
   format cv 7.2;
   format ratio stderr lowercl uppercl 7.3;
run;

This gives the final results:

Numerator Variable   Denominator Variable   Ratio      LowerCL      StdErr     UpperCL
cucumbers            area                   0.169408   0.05849883   0.052791   0.28031696

Obs   Ratio   StdErr   LowerCL   UpperCL   Est total   Se total   LCL total   UCL total
1     0.169   0.053    0.058     0.280     638546      198983     220498      1056593


Comparing the two approaches

Why did the ratio estimator do worse than the simple inflation estimator in this case in Griffith Passage? The plot of the number of sea cucumbers vs. the area of the transect:

shows virtually no relationship between the two - hence there is no advantage to using a ratio estimator. In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator if the correlation between the two variables is greater than 1/2 of the ratio of their respective relative variations (std dev/mean). Advanced computations show that half of the ratio of their relative variations is 0.732, while the correlation between the two variables is 0.041. Hence the ratio estimator will not do well. The Excel worksheet also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than for an inflation estimator. It appears that in Griffith Passage there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate that even the best-laid plans can go astray - always plot the data. A third worksheet in the workbook analyses the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide margin.


3.11 Multi-stage sampling - a generalization of cluster sampling

3.11.1 Introduction

All of the designs considered above select a sampling unit from the population and then do a complete measurement upon that item. In the case of cluster sampling, this is facilitated by dividing the sampling unit into small observational units, but all of the observational units within the sampled cluster are measured. If the units within a cluster are fairly homogeneous, then it seems wasteful to measure every unit. In the extreme case, if every observational unit within a cluster were identical, only a single observational unit from the cluster would need to be selected in order to estimate (without any error) the cluster total.

Suppose then that the observational units within a cluster were not identical, but had some variation. Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than every quadrat on the transect? This method is called two-stage sampling. In the first stage, larger sampling units are selected using some probability design. In the second stage, smaller units within the selected first-stage units are selected according to a probability design. The design used at each stage can be different, e.g. first-stage units selected using a simple random sample, but second-stage units selected using a systematic design as proposed for the urchin survey above. This sampling design can be generalized to multi-stage sampling.

Some examples of multi-stage designs are:
• Vegetation Resource Inventory. The forest land mass of BC has been mapped using aerial methods and divided into a series of polygons representing homogeneous stands of trees (e.g. a stand dominated by Douglas-fir). In order to estimate timber volumes in an inventory unit, a sample of polygons is selected using a probability-proportional-to-size design. In the selected polygons, ground measurement stations are selected on a 100 m grid and crews measure standing timber at these selected ground stations.
• Urchin survey. Transects are selected using a simple random sample design. Every second or third quadrat is measured after a random starting point.
• Clam surveys. Beaches are divided into 1 ha sections. A random sample of sections is selected and a series of 1 m² quadrats are measured within each section.
• Herring spawn biomass. Schweigert et al. (1985, CJFAS, 42, 1806-1814) used a two-stage design to estimate herring spawn in the Strait of Georgia.
• Georgia Strait Creel Survey. The Georgia Strait Creel Survey uses a multi-stage design to select landing sites within strata, times of day to interview at these selected sites, and which boats to interview in a survey of angling effort on the Georgia Strait.

Some consequences of simple two-stage designs are:
• If the selected first-stage units are completely enumerated, then complete cluster sampling results.
• If every first-stage unit in the population is selected, then a stratified design results.
• A complete frame is required for all first-stage units. However, a frame of second-stage and lower-stage units need only be constructed for the selected upper-stage units.


• The design is very flexible, allowing (in theory) different selection methods to be used at each stage, and even different selection methods within each first-stage unit.
• A separate randomization is done within each first-stage unit when selecting the second-stage units.
• Multi-stage designs are less precise than a simple random sample of the same number of final sampling units, but more precise than a cluster sample of the same number of final sampling units. [Hint: think of what happens if the second-stage units are very similar.]
• Multi-stage designs are cheaper than a simple random sample of the same number of final sampling units, but more expensive than a cluster sample of the same number of final sampling units. [Hint: think of the travel costs in selecting more transects or measuring quadrats within a transect.]
• As in all sampling designs, stratification can be employed at any level, and ratio and regression estimators are available. As expected, the theory becomes more and more complex the more "variations" are added to the design.

The primary incentives for multi-stage designs are that
1. frames of the final sampling units are typically not available; and
2. it often turns out that most of the variability in the population occurs among first-stage units. Why spend time and effort in measuring lower-stage units that are relatively homogeneous within the first-stage unit?

3.11.2 Notation

A sample of n first-stage units (FSU) is selected from a total of N first-stage units. Within the ith first-stage unit, mi second-stage units (SSU) are selected from the Mi units available.

Item                  Population value   Sample value
First stage units     N                  n
Second stage units    Mi                 mi
SSUs in population    M = Σ Mi
Value of SSU          Yij                yij
Total of FSU          τi                 τ̂i = (Mi/mi) Σ_{j=1..mi} yij
Total in pop          τ = Σ τi
Mean in pop           µ = τ/M

3.11.3 Summary of main results

We will only consider the case when simple random sampling occurs at both stages of the design.


The intuitive explanation for the results is that a total is estimated for each FSU selected (based on the SSUs selected). These estimated totals are then used in a similar fashion to a cluster sample to estimate the grand total.

Parameter   Population value   Estimate
Total       τ = Σ τi           τ̂ = (N/n) Σ_{i=1..n} τ̂i
Mean        µ = τ/M            µ̂ = τ̂/M

with estimated standard errors

se(τ̂) = sqrt( N² (1 − f1) s1²/n + (N² f1/n²) Σ_{i=1..n} Mi² (1 − f2i) s2i²/mi )
se(µ̂) = sqrt( se²(τ̂)/M² )

where

s1² = Σ_{i=1..n} (τ̂i − τ-bar)² / (n − 1)
s2i² = Σ_{j=1..mi} (yij − ȳi)² / (mi − 1)
τ-bar = (1/n) Σ_{i=1..n} τ̂i
f1 = n/N and f2i = mi/Mi

Notes:
• There are two contributions to the estimated se - variation among first-stage totals (s1²) and variation among second-stage units (s2i²).
• If the FSU vary considerably in size, a ratio estimator (not discussed in these notes) may be more appropriate.

Confidence Intervals

The usual large sample confidence intervals can be used.
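The formulas above can be sketched directly. The numbers here are hypothetical (3 FSUs selected from an assumed N = 50, with an SRS of SSUs inside each), chosen only to exercise both variance components:

```python
from math import sqrt
import statistics as st

N = 50   # first-stage units in the population (assumed)
# For each selected FSU: (M_i, list of sampled SSU values y_ij)
fsus = [(40, [3, 5, 4]), (60, [8, 7, 9, 8]), (30, [1, 2, 2])]
n = len(fsus)
f1 = n / N

# Estimated total for each FSU: tau_i_hat = (M_i / m_i) * sum(y_ij)
tau_i = [M / len(ys) * sum(ys) for M, ys in fsus]
tau_hat = N / n * sum(tau_i)                 # estimated grand total

s1_sq = st.variance(tau_i)                   # among-FSU component
second = sum(M**2 * (1 - len(ys) / M) * st.variance(ys) / len(ys)
             for M, ys in fsus)              # within-FSU component
se_tau = sqrt(N**2 * (1 - f1) * s1_sq / n + N**2 * f1 / n**2 * second)
print(tau_hat, se_tau)
```

Note how the among-FSU term dominates here - the usual situation that motivates sub-sampling rather than full enumeration of each FSU.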

3.11.4 Example - estimating number of clams

A First Nations band wished to develop a wild oyster fishery. As a first stage in the development of the fishery, a survey was needed to establish the current stock in a number of oyster beds. This example looks at the estimate of oyster numbers from a survey conducted in 1994. The survey was conducted by running a line through the oyster bed - the total length was 105 m. Several random locations were selected along the line. At each randomly chosen location, the width of the bed was measured, and about 3 random locations along the perpendicular transect at that point were taken. A 1 m² quadrat was applied, and the number of oysters of various sizes was counted in each quadrat.



Location   transect   width (m)   quadrat   seed   xsmall   small   med   large   total count   Net weight (kg)
Lloyd      5          17          3         18     18       41      48    14      139           14.6
Lloyd      5          17          5         6      4        30      9     4       53            5.2
Lloyd      5          17          10        15     21       44      13    11      104           8.2
Lloyd      7          18          5         8      10       14      5     3       40            6.0
Lloyd      7          18          12        10     38       36      16    4       104           10.2
Lloyd      7          18          13        0      15       12      3     3       33            4.6
Lloyd      18         14          1         11     8        5       9     19      52            7.8
Lloyd      18         14          5         13     23       68      18    11      133           12.6
Lloyd      18         14          8         1      29       60      2     1       93            10.2
Lloyd      30         11          3         17     1        13      13    2       46            5.4
Lloyd      30         11          8         12     16       23      22    14      87            6.6
Lloyd      30         11          10        23     15       19      17    1       75            7.0
Lloyd      49         9           3         10     27       15      1     0       53            2.0
Lloyd      49         9           5         13     7        14      11    4       49            6.8
Lloyd      49         9           8         10     25       17      16    11      79            6.0
Lloyd      76         21          4         3      3        11      7     0       24            4.0
Lloyd      76         21          7         15     4        32      26    24      101           12.4
Lloyd      76         21          11        2      19       14      19    0       54            5.8
Lloyd      79         18          1         14     13       7       9     0       43            3.6
Lloyd      79         18          4         0      32       32      27    16      107           12.8
Lloyd      79         18          11        16     22       43      18    8       107           10.6
Lloyd      84         19          1         14     32       25      39    7       117           10.2
Lloyd      84         19          8         25     43       42      17    3       130           7.2
Lloyd      84         19          15        5      22       61      30    13      131           14.2
Lloyd      86         17          8         1      19       32      10    8       70            8.6
Lloyd      86         17          11        8      17       13      10    3       51            4.8
Lloyd      86         17          12        7      22       55      11    4       99            9.8
Lloyd      95         20          1         17     12       20      18    4       71            5.0
Lloyd      95         20          8         32     4        26      29    12      103           11.6
Lloyd      95         20          15        3      34       17      11    1       66            6.0

The data are available in the wildoyster.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms and are read in the usual way:

data oyster;
   infile 'wildoyster.csv' dlm=',' dsd missover firstobs=2;
   input loc $ transect width quad seed xsmall small med large total weight;
   sampweight = 105/10 * width/3; /* sampling weight = product of the inverse sampling fractions at each stage */


run;

The sample weight is computed as the product of the inverses of the sampling fractions at the first and second stages. These multi-stage designs are complex to analyze. Rather than trying to implement the various formulae, I would suggest that a proper sampling package (such as SAS or R) be used rather than trying to do the computations by hand. If using simpler packages, the first step is to move everything up to the primary sampling unit level: we need to estimate the total for each primary sampling unit, and to compute some components of the variance from the second stage of sampling.

Excel Spreadsheet The analysis was done in Excel as shown in the wildoyster worksheet in the ALLofData.xls workbook from the Sample Program Library. Because Excel does not have any explicit functions for the analysis of survey data, we need to first estimate the cluster size and cluster totals at the first stage, and then use the standard ratio estimators on these estimated totals. As in the case of a pure cluster sample, the PivotTable feature can be used to compute summary statistics needed to estimate the various components.

SAS Analysis Proc SurveyMeans is used directly with the two-stage design. The cluster statement identifies the first stage of the sampling.

/* Estimate the total biomass on the oyster bed.
   Note that SurveyMeans only uses the first-stage variance in its computation
   of the standard error.  As the first-stage sampling fraction is usually
   quite small, this will tend to give only slight underestimates of the true
   standard error of the estimate. */
proc surveymeans data=oyster
     total=105                /* length of first reference line */
     mean clmean sum clsum;   /* interested in total biomass estimate */
   cluster transect;          /* identify the perpendicular transects */
   var weight;
   weight sampweight;
   ods output statistics=oysterresults;
run;

Note that Proc SurveyMeans computes the se using only the first-stage standard errors. As the


first-stage sampling fraction is usually quite small, this will tend to give only slight underestimates of the true standard error of the estimate. The final results are:

Variable    Mean  SE Mean  LCL Mean  UCL Mean        Sum    SE Sum    LCL Sum    UCL Sum
weight       8.2      0.5       7.1       9.2  14070.000  1444.920  10801.364  17338.636

Our final estimate is a total biomass of 14,070 kg with an estimated se of 1445 kg. A similar procedure can be used for the other variables.
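The confidence limits in the table follow the usual "estimate ± t × se" recipe. The sketch below (not from the original notes) reproduces them assuming the degrees of freedom are n − 1 = 9, i.e. 10 sampled transects, consistent with the 105/10 sampling fraction used earlier; the t quantile is a standard tabled value.

```python
total_hat = 14070.0
se_total = 1444.92
t_crit = 2.2622            # t quantile, 0.975, 9 df (assumed: 10 transects - 1)

lcl = total_hat - t_crit * se_total
ucl = total_hat + t_crit * se_total
# lcl and ucl come out very close to the tabled 10801.364 and 17338.636
```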

3.11.5 Some closing comments on multi-stage designs

The above example barely scratches the surface of multi-stage designs. Multi-stage designs can be quite complex, and the formulae for the estimates and estimated standard errors can be fearsome. If you have to analyze such a design, it is likely better to invest some time in learning one of the statistical packages designed for surveys (e.g. SAS v.8) rather than trying to program the tedious formulae by hand. There are also several important design decisions for multi-stage designs.

• Two-stage designs have reduced costs of data collection because units within the FSU are easier to collect, but they also have poorer precision compared to a simple random sample with the same number of final sampling units. However, because of the reduced cost, it often turns out that more units can be sampled under a multi-stage design, leading to improved precision for the same cost as a simple random sample design. There is a tradeoff between sampling more first-stage units and taking a small sub-sample at the secondary stage. An optimal allocation strategy can be constructed to decide upon the best strategy – consult some of the reference books on sampling for details.

• As with ALL sampling designs, stratification can be used to improve precision. The stratification usually takes place at the first-stage unit level, but can take place at all stages. The details of estimation under stratification can be found in many sampling texts.

• Similarly, ratio or regression estimators can also be used if auxiliary information is available that is correlated with the response variable. This leads to very complex formulae!

One very nice feature of multi-stage designs is that if the first stage is sampled with replacement, then the formulae for the estimated standard errors simplify considerably to a single term regardless of the design used in the lower stages!
If there are many first stage units in the population and if the sampling fraction is small, the chances of selecting the same first stage unit twice are very small. Even if this occurs, a different set of second stage units will likely be selected so there is little danger of having to measure the same final sampling unit more than once. In such situations, the design at second and lower stages is very flexible as all that you need to ensure is that an unbiased estimate of the first-stage unit total is available.
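A sketch of this with-replacement simplification (illustrative numbers, not from the notes): the estimated standard error uses only the spread of the estimated first-stage totals, with no finite population correction and no second-stage term.

```python
import math

# Estimated totals for the sampled first-stage units, however obtained
# (any design that gives unbiased estimates of the PSU totals will do).
t_hat = [74.8, 44.4, 155.4, 91.0]    # hypothetical values
N, n = 105, len(t_hat)

mean_t = sum(t_hat) / n
s2_t = sum((t - mean_t) ** 2 for t in t_hat) / (n - 1)

total_hat = N * mean_t
se_total = N * math.sqrt(s2_t / n)   # a single term, regardless of lower-stage design
```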


CHAPTER 3. SAMPLING

3.12 Analytical surveys - almost experimental design

In descriptive surveys, the objective was simply to obtain information about one large group. In observational studies, two deliberately chosen sub-populations are selected and surveyed, but no attempt is made to generalize the results to the whole population. In analytical studies, sub-populations are selected and sampled in order to generalize the observed differences among the sub-populations to this and other similar populations. As such, there are similarities between analytical and observational surveys and experimental design. The primary difference is that in experimental studies, the manager controls the assignment of the explanatory variables while measuring the response variables, whereas in analytical and observational surveys, neither set of variables is under the control of the manager. [Refer back to Examples B, C, and D in the earlier chapters.]

The analysis of complex surveys for analytical purposes can be very difficult (Kish, 1984, 1987; Rao, 1973; Sedransk, 1965a, 1965b, 1966). As in experimental studies, the first step in analytical surveys is to identify potential explanatory variables (similar to factors in experimental studies). At this point, analytical surveys can usually be further subdivided into three categories depending on the type of stratification:

• the population is pre-stratified by the explanatory variables and surveys are conducted in each stratum to measure the outcome variables;

• the population is surveyed in its entirety, and post-stratified by the explanatory variables; or

• the explanatory variables are used as auxiliary variables in ratio or regression methods.

[It is possible that all three types of stratification take place - these are very complex surveys.] The choice among the categories is usually made by the ease with which the population can be pre-stratified and the strength of the relationship between the response and explanatory variables.
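As a sketch of the second category, suppose a single simple random sample of plots is classified into pH classes after measurement; the point estimate then weights each class mean by its stratum size, which must be known (or estimable) for the whole stand. All class names and numbers below are hypothetical.

```python
# pH class -> (N_h = number of plots in the stand in that class,
#              densities measured on the sampled plots that fell into it)
strata = {
    "low pH":  (400, [1.2, 0.8, 1.0]),
    "mid pH":  (300, [2.1, 1.9]),
    "high pH": (300, [3.0, 2.6, 2.8, 3.2]),
}

# Post-stratified estimates: weight each class mean by its stratum size
total_hat = sum(N_h * sum(y) / len(y) for (N_h, y) in strata.values())
N = sum(N_h for (N_h, _) in strata.values())
mean_hat = total_hat / N
```

The standard error of a post-stratified estimate needs extra terms to account for the random stratum sample sizes; consult a sampling text for the details.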
For example, sample plots can be easily pre-stratified by elevation or by exposure to the sun, but it would be difficult to pre-stratify by soil pH. Pre-stratification has the advantage that the manager has control over the number of sample points collected in each stratum, whereas in post-stratification the numbers are not controllable and may lead to very small sample sizes in certain strata simply because they form only a small fraction of the population.

For example, a manager may wish to investigate the difference in regeneration (as measured by the density of new growth) as a function of elevation. Several cut blocks will be surveyed. In each cut block, the sample plots will be pre-stratified into three elevation classes, and a simple random sample will be taken in each elevation class. The allocation of effort in each stratum (i.e. the number of sample plots) will be equal. The density of new growth will be measured on each selected sample plot.

On the other hand, suppose that regeneration is a function of soil pH. This cannot be determined in advance, and so the manager must take a simple random sample over the entire stand, measure the density of new growth and the soil pH at each sampling unit, and then post-stratify the data based on measured pH. The number of sampling units in each pH class is not controllable; indeed, it may turn out that certain pH classes have no observations.

If explanatory variables are treated as auxiliary variables, then there must be a strong relationship between the response and explanatory variables. Additionally, we must be able to measure the auxiliary variable precisely for each unit. Then methods like multiple regression can be used to investigate the relationship between the response and the explanatory variable. For example, rather than classifying elevation into three broad elevation classes or soil pH into broad pH classes, the actual elevation or soil


pH must be measured precisely to serve as an auxiliary variable in a regression of regeneration density vs. elevation or soil pH.

If the units have been selected using a simple random sample, then the analysis of analytical surveys proceeds along similar lines as the analysis of designed experiments (Kish, 1987; also refer to Chapter 2). In most analyses of analytical surveys, the observed results are postulated to have been taken from a hypothetical super-population of which the current conditions are just one realization. In the above example, cut blocks would be treated as a random blocking factor; elevation class as an explanatory factor; and sample plots as samples within each block and elevation class. Hypothesis testing about the effect of elevation on mean density of regeneration occurs as if this were a planned experiment.

Pitfall: Any one of the sampling methods described in Section 2 for descriptive surveys can be used for analytical surveys. Many managers incorrectly use the results from a complex survey as if the data were collected using a simple random sample. As Kish (1987) and others have shown, this can lead to substantial underestimates of the true standard error, i.e., the precision is thought to be far better than is justified based on the survey results. Consequently, the manager may erroneously detect differences more often than expected (i.e., make a Type I error) and make decisions based on erroneous conclusions.

Solution: As in experimental design, it is important to match the analysis of the data with the survey design used to collect it. The major difficulties in the analysis of analytical surveys are:

1. Recognizing and incorporating the sampling method used to collect the data in the analysis. The survey design used to obtain the sampling units must be taken into account in much the same way as the analysis of the collected data is influenced by the actual experimental design.
A table of ‘equivalences’ between terms in a sample survey and terms in experimental design is provided in Table 1.

Table 1: Equivalences between terms used in surveys and in experimental design.

Survey Term            Experimental Design Term
Simple Random Sample   Completely randomized design
Cluster Sampling       (a) Clusters are random effects; units within a cluster treated as sub-samples; or
                       (b) Clusters are treated as main plots; units within a cluster treated as sub-plots in a split-plot analysis.
Multi-stage sampling   (a) Nested designs with units at each stage nested in units at higher stages; effects of units at each stage are treated as random effects; or
                       (b) Split-plot designs with factors operating at higher stages treated as main-plot factors and factors operating at lower stages treated as sub-plot factors.
Stratification         Fixed factor or random block depending on the reasons for stratification.
Sampling Unit          Experimental unit or treatment unit
Sub-sample             Sub-sample

There is no quick, easy method for the analysis of complex surveys (Kish, 1987). The super-population approach seems to work well if the selection probabilities of each unit are known (these are used to weight each observation appropriately) and if random effects corresponding to the various strata or stages are employed. The major difficulty caused by complex survey designs is that the observations are not independent of each other.


2. Unbalanced designs (e.g. unequal numbers of sample points in each combination of explanatory factors). This typically occurs if post-stratification is used to classify units by the explanatory variables, but can also occur in pre-stratification if the manager decides not to allocate equal effort in each stratum. The analysis of unbalanced data is described by Milliken and Johnson (1984).

3. Missing cells, i.e., certain combinations of explanatory variables may not occur in the survey. The analysis of such surveys is complex; refer to Milliken and Johnson (1984).

4. If the range of the explanatory variable is naturally limited in the population, then extrapolation outside of the observed range is not recommended.

More sophisticated techniques can also be used in analytical surveys. For example, correspondence analysis, ordination methods, factor analysis, multidimensional scaling, and cluster analysis all search for post-hoc associations among measured variables that may give rise to hypotheses for further investigation. Unfortunately, most of these methods assume that units have been selected independently of each other using a simple random sample; extensions to units selected via a complex sampling design have not yet been developed. Simpler designs are often preferred to avoid erroneous conclusions based on inappropriate analysis of data from complex designs.

Pitfall: While the analyses of analytical surveys and designed experiments are similar, the strength of the conclusions is not. In general, causation cannot be inferred without manipulation. An observed relationship in an analytical survey may be the result of a common response to a third, unobserved variable. For example, consider the following two experiments. In the first experiment, the explanatory variable is elevation (high or low). Ten stands are randomly selected at each elevation.
The amount of growth is measured, and it appears that stands at higher elevations have less growth. In the second experiment, the explanatory variable is the amount of fertilizer applied. Ten stands are randomly assigned to each of two doses of fertilizer. The amount of growth is measured, and it appears that stands that receive the higher dose of fertilizer have greater growth. In the first experiment, the manager is unable to say whether the differences in growth are a result of differences in elevation or amount of sun exposure or soil quality, as all three may be highly related. In the second experiment, all uncontrolled factors are present in both groups and their effects will, on average, be equal. Consequently, the assignment of cause to the fertilizer dose is justified because it is the only factor that differs (on average) among the groups.

As noted by Eberhardt and Thomas (1991), there is a need for a rigorous application of the techniques of survey sampling when conducting analytical surveys. Otherwise, they are likely to be subject to biases of one sort or another. Experience and judgment are very important in evaluating the prospects for bias, and in attempting to find ways to control and account for these biases. The most common source of bias is the selection of survey units, and the most common pitfall is to select units based on convenience rather than on a probabilistic sampling design. The potential problems that this can lead to are analogous to those that occur when it is assumed that callers to a radio phone-in show are representative of the entire population.

3.13 References

• Cochran, W.G. (1977). Sampling Techniques. New York: Wiley.
One of the standard references for survey sampling. Very technical.

• Gillespie, G.E. and Kronlund, A.R. (1999). A manual for intertidal clam surveys. Canadian Technical Report of Fisheries and Aquatic Sciences 2270.
A very nice summary of using sampling methods to estimate clam numbers.

• Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New York: American Chemical Society.


A series of papers on sampling, mainly for environmental contaminants in ground and surface water, soils, and air. A detailed discussion on sampling for pattern.

• Kish, L. (1965). Survey Sampling. New York: Wiley.
An extensive discussion of descriptive surveys, mostly from a social science perspective.

• Kish, L. (1984). On analytical statistics from complex samples. Survey Methodology, 10, 1-7.
An overview of the problems in using complex surveys in analytical surveys.

• Kish, L. (1987). Statistical Design for Research. New York: Wiley.
One of the more extensive discussions of the use of complex surveys in analytical surveys. Very technical.

• Krebs, C. (1989). Ecological Methodology.
A collection of methods commonly used in ecology, including a section on sampling.

• Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999). Survey methodology for intertidal bivalves. Canadian Technical Report of Fisheries and Aquatic Sciences 2214.
An overview of how to use surveys for assessing intertidal bivalves - more technical than Gillespie and Kronlund (1999).

• Myers, W.L. and Shelton, R.L. (1980). Survey Methods for Ecosystem Management. New York: Wiley.
A good primer on how to measure common ecological data using direct survey methods, aerial photography, etc. Includes a discussion of common survey designs for vegetation, hydrology, soils, geology, and human influences.

• Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal of the Royal Statistical Society, Series B, 27, 264-278.

• Thompson, S.K. (1992). Sampling. New York: Wiley.
A good companion to Cochran (1977). Has many examples of using sampling for biological populations. Also has chapters on mark-recapture, line-transect methods, spatial methods, and adaptive sampling.

3.14 Frequently Asked Questions (FAQ)

3.14.1 Confusion about the definition of a population

What is the difference between the "population total" and the "population size"?

Population size normally refers to the number of “final sampling” units in the population. Population total refers to the total of some variable over these units. For example, if you wish to estimate the total family income of families in Vancouver, the “final” sampling units are families, the population size is the number of families in Vancouver, the response variable is the income of each family, and the population total is the total family income over all families in Vancouver.

Things become a bit confusing when the sampling units differ from the “final” units, i.e. when the final units are clustered and you are interested in estimates of the number of “final” units. For example, in the grouse/pocket-of-brush example, the population consists of the grouse, which are clustered into 248 pockets of brush. The grouse is the final sampling unit, but the sampling unit is a pocket of brush. In cluster sampling, you


must expand the estimator by the number of CLUSTERS, not by the number of final units. Hence the expansion factor is the number of pockets (248), the variable of interest for a cluster is the number of grouse in each pocket, and the population total is the number of grouse over all pockets.

Similarly for the oysters on the lease. The population is the oysters on the lease. But you don’t randomly sample individual oysters - you randomly sample quadrats, which are clusters of oysters. The expansion factor is now the number of quadrats.

In the salmon example, the boats are surveyed. The fact that the number of salmon was measured is incidental - you could have measured the amount of food consumed, etc. In the angling survey problem, the boats are the sampling units. The fact that they contain anglers or that they caught fish is what is being measured, but it is the set of boats that were at the lake that day that is of interest.

3.14.2 How is N defined

How is N (the expansion factor) defined? What is the best way to find this value?

This can get confusing in the case of cluster or multi-stage designs, as there are different N’s at each stage of the design. It might be easier to think of N as an expansion factor. The expansion factor will be known once the frame is constructed. In some cases, this can only be done after the fact - for example, when surveying angling parties, the total number of parties returning in a day is unknown until the end of the day. For planning purposes, some reasonable guess may have to be made in order to estimate the sample size. If this is impossible, just choose some arbitrarily large number - the estimated sample size will be an overestimate (by a small amount) but close enough. Of course, once the survey is finished, you would then use the actual value of N in all computations.
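A numeric illustration of N as an expansion factor (all numbers hypothetical): once the day's count of returning parties is known, the estimated total is just the expansion factor times the sample mean.

```python
catches = [3, 0, 5, 2, 1]   # fish caught by the n sampled angling parties
N = 120                     # parties that returned that day (often known only at day's end)
n = len(catches)

ybar = sum(catches) / n     # mean catch per sampled party
total_hat = N * ybar        # expand the sample mean by N to estimate the day's total catch
```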

3.14.3 Multi-stage vs. Multi-phase sampling

What is the difference between Multi-stage sampling and multi-phase sampling?

In multi-stage sampling, the selection of the final sampling units takes place in stages. For example, suppose you are interested in sampling angling parties as they return from fishing. The region is first divided into different landing sites. A random selection of landing sites is made. At each selected landing site, a random selection of angling parties is taken.

In multi-phase sampling, the units are NOT divided into larger groups. Rather, a first phase selects some units and they are measured quickly. A second phase takes a sub-sample of the first-phase units and measures them more intently. Returning to the angling survey: a multi-phase design would select angling parties, and all of the selected parties could fill out a brief questionnaire. A week later, a sample of the questionnaires is selected, and those angling parties are RECONTACTED for more details.

The key difference is that in multi-phase sampling, some units are measured TWICE; in multi-stage sampling, there are different sizes of sampling units (landing sites vs. angling parties), but each sampling unit is selected only once.



3.14.4 What is the difference between a Population and a frame?

Frame = the list of sampling units from which a sample will be taken. The sampling units may not be the same as the “final” units that are measured. For example, in cluster sampling, the frame is the list of clusters, but the final units are the objects within the clusters.

Population = the list of all “final” units of interest. Usually the “final” units are the actual things measured in the field, i.e. the final object upon which a measurement is taken.

In some cases, the frame doesn’t match the population, which may cause biases; in ideal cases, the frame covers the population.

3.14.5 How to account for missing transects.

What do you do if an entire cluster is “missing”?

Missing data can occur at various points in a survey and for various reasons. The easiest data to handle are data ‘missing completely at random’ (MCAR). In this situation, the missing data provide no information about the problem that is not already captured by other data points, and the ‘missingness’ is also non-informative. In this case, and if the design was a simple random sample, the missing data point is just ignored. So if you wanted to sample 80 transects but were only able to get 75, only the 75 transects are used. If some of the data are missing within a transect, the problem changes from a cluster sample to a two-stage sample, so the estimation formulae change slightly.

If the data are not MCAR, this is a real problem - welcome to a Ph.D. in statistics on how to deal with it!

