Chapter Six. Hypothesis Testing: Applications

Contents

    Introduction
    Comparison of Measurements with a Fixed Value
        Hypotheses
        Test Statistic
    Comparison of Two Measurement Means
        Independent Sets of Measurements
        Paired Measurements
    Comparison of Many Measurement Means
    Comparison of Measurement Precision (F-test)
        Hypotheses
        Test Statistic
    Hypothesis Testing in Regression Analysis
    Outlier Testing
        Introduction
        Identification of Contaminants
        Testing Single Outliers
        Testing Multiple Outliers
    Advanced Topic: Data Contamination Models
        Choosing Test Statistics
    Chapter Checkpoint

Introduction

In the last chapter, we described the most important concepts of statistical hypothesis testing; in this chapter, we will apply the formal procedure of hypothesis testing to a variety of situations. Let's quickly review the steps in the testing procedure, as applied to measurement data:

1. Construct the null hypothesis, H0, and the alternate hypothesis, H1, in terms of measurement population parameters.
2. Choose the confidence level to give the desired levels of false positive and false negative errors.
3. Choose a test statistic and calculate its value from the measurements.
4. Determine the critical value(s) from the null distribution of the test statistic, and make the decision to accept H0 or H1.
5. State the conclusion in plain English (along with the confidence level). Remember that a decision to accept H0 does not imply proof.

Before proceeding further, a word of caution is necessary. The limitations of hypothesis testing must be kept firmly in mind:

• A positive conclusion (proof of the alternate hypothesis) is no more valid than the data on which it is based. As the saying goes, "garbage in, garbage out." The presence of bias in the measurements may render the results of a hypothesis test meaningless.
• The assumptions behind any specific hypothesis test must be kept firmly in mind. For example, in order to use the testing procedure outlined above, the form of the probability distribution of the test statistic must be known.

This last point especially needs to be emphasized, because many statistical tests are used (and accepted as fact) even when the assumptions on which they are based are not met. Once you get the hang of it, hypothesis testing is deceptively easy, especially with the abundance of computer programs that are available for assistance. It is too easy to get caught up in number crunching without examining the validity of the test procedure. That is why it is important to understand the previous chapter, and to read the "fine print" in any testing procedure.

In spite of these caveats, you should still appreciate the scope of the hypothesis testing procedure. Any assertion that can be stated in terms of population parameters can be tested; for some test procedures (called nonparametric tests), it is not even necessary to know the underlying distribution of the data. In this chapter, we will discuss the following hypothesis tests:

1. comparison of the mean of a set of measurements with a fixed value;
2. comparison of the means of two sets of measurements;
3. comparison of the precision of two sets of measurements;
4. testing in regression analysis; and
5. testing of outliers.


In all of these cases, we are testing to see if there is a significant difference between two numbers. A difference is considered “significant” if it is greater than can be reasonably explained by the random variability of the measurements.


Comparison of Measurements with a Fixed Value

We have covered this test in some detail in chapter 5, where we were attempting to compare the measured nicotine levels of cigarettes with the manufacturer's claimed level. This test is used when we want to test if there is a significant difference between a group of measurements and some postulated value.

Hypotheses

• H0: µx = k
• H1: µx ≠ k (two-tailed), or µx > k (one-tailed), or µx < k (one-tailed)

Test Statistic

The following table gives the test statistics for these types of hypothesis tests.

                   Test Statistic          Null Distribution
    σ is known     T = (x̄ − k)/σ(x̄)        The z-distribution, assuming the measurement
                                           mean is normally distributed.
    σ is unknown   T = (x̄ − k)/s(x̄)        A Student's t-distribution with n−1 degrees of
                                           freedom and a mean of zero, assuming the
                                           measurement mean is normally distributed.

Here σ(x̄) = σx/√n is the standard error of the measurement mean, and s(x̄) = sx/√n is its sample estimate.

A hypothesis test is often referred to by the name of the null distribution of the test statistic. Thus, to test the significance between the measurement mean and a fixed value, we use a z-test if the standard deviation, σx, of the measurements is known, and a t-test if σx is unknown.
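Although the worked examples in this chapter were prepared in MathCAD, the same logic is easy to express elsewhere. Here is a minimal sketch in Python with scipy; the measurement values and the claimed level k are hypothetical, chosen only to illustrate the mechanics:

    import numpy as np
    from scipy import stats

    x = np.array([15.6, 16.1, 15.9, 16.4, 15.8])   # hypothetical nicotine measurements (mg)
    k = 15.4                                       # hypothetical manufacturer's claim

    # sigma unknown: studentize with s(xbar) = s/sqrt(n); t-distribution, n-1 df
    n = len(x)
    T = (x.mean() - k) / (x.std(ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(abs(T), df=n - 1)           # two-tailed P-value

    # scipy's built-in one-sample t-test gives the same result
    T_check, p_check = stats.ttest_1samp(x, popmean=k)
    print(T, p, T_check, p_check)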


Comparison of Two Measurement Means

Independent Sets of Measurements

Hypotheses

We want to compare two sets of measurements. The first set contains nx measurements, and is a sample of the population of a random variable x. The second set contains ny measurements of a random variable y. Since random measurement error will be present in all of the values, the two measurement means x̄ and ȳ will almost certainly not be equal; however, we want to see if the observed difference between them is significant. So our null and alternate hypotheses will be

• H0: µx = µy
• H1: µx ≠ µy (two-tailed), or µx > µy (one-tailed), or µx < µy (one-tailed)

Test Statistic

We have two groups of measurements, group x and group y. Let's define d as the difference between the sample means of the two groups:

    d = x̄ − ȳ

If the sample means are normally-distributed variables, then d will also be normally distributed, with a mean given by

    µd = µx − µy

and a standard deviation (assuming the groups of measurements are independent)

    σd = √(σx²/nx + σy²/ny)                                              [6.1]

[Note: you should be able to derive this equation by propagation of error.]

Equation 6.1 can be simplified under one important condition. If the population variance of the two sets of measurements is equal (i.e., homogeneous variance), then σx = σy = σ, and the equation becomes

    σd = σ·√(1/nx + 1/ny)            homogeneous variance                [6.2]

where σ is the common population standard deviation of the two sets of measurements.

The test statistic that we will use in the comparison of measurement means will be either the standardized or the studentized value of d. Thus, if the variances of the groups are known (which is not very common), we use

    T = d/σd                         test statistic, known variances     [6.3]

In other words, in comparing the means of two independent sets of measurements, we can use the standardized difference of sample means.

Example 6.1

The composition of a flake of paint found on the clothes of the victim of a hit-and-run accident was compared with that of paint from the car suspected of causing the accident. Do the following data for the spectroscopic determination of titanium in the paint suggest a difference in composition between the two materials? From previous experience, the standard deviation for the method is known to be 0.35% Ti.

    Paint from clothes: %Ti = 4.0, 4.6
    Paint from car:     %Ti = 4.5, 5.3, 5.5, 5.0, 4.9

The null hypothesis will be that the concentration of Ti in all the paint samples is the same; assuming no bias in the measurements, this can be stated in terms of the population parameters as

    H0: µclothes = µcar

while the alternate hypothesis will simply be that they are different:

    H1: µclothes ≠ µcar    (2-tailed test)

Let's set the confidence level at 80%, a fairly low value, because we are less concerned with the consequences of a false positive than with a false negative (which would result in wrongfully accusing an innocent man).

    clothes = (4.0 4.6)·%                      σ = 0.35·%
    car     = (4.5 5.3 5.5 5.0 4.9)·%

    d    = mean(car) − mean(clothes)       d = 0.7400 %      difference in sample means
    σd   = σ·√(1/2 + 1/5)                  σd = 0.2928 %     standard deviation of the difference
                                                             in means (eqn. 6.2)
    Tobs = d/σd                            Tobs = 2.5271     standardized difference in means;
                                                             seems pretty large

Looking in the z-tables (which describe the null distribution of T), we find the Tcrit = ±1.2816. Thus, we reject H0 at the 80% confidence level. Assuming no measurement bias, we can say with 80% confidence that the paints contain different levels of Ti. Note that the observed two-tailed P-value is Pobs = 0.0115; thus, we can still reject H0 at the 98% confidence level, if we so chose.
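For readers who want to reproduce this z-test outside MathCAD, here is a minimal sketch in Python (scipy assumed); it follows eqns 6.2 and 6.3 exactly:

    import numpy as np
    from scipy import stats

    clothes = np.array([4.0, 4.6])                 # %Ti
    car = np.array([4.5, 5.3, 5.5, 5.0, 4.9])      # %Ti
    sigma = 0.35                                   # known method std deviation, %Ti

    d = car.mean() - clothes.mean()                          # 0.7400
    sigma_d = sigma * np.sqrt(1/len(clothes) + 1/len(car))   # eqn 6.2: 0.2928
    T_obs = d / sigma_d                                      # 2.5271

    T_crit = stats.norm.ppf(0.90)          # two-tailed test at 80% confidence: 1.2816
    p_obs = 2 * stats.norm.sf(abs(T_obs))  # two-tailed P-value: 0.0115
    print(T_obs, T_crit, p_obs)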


In the last example, since the measurement variance is known, and it is homogeneous, we used eqn. 6.2 to calculate σd, the standard error of the difference of sample means. If the two sets of measurements had not shared the same variance, we would have had to use eqn. 6.1.

Normally, of course, the standard deviation of the measurements will not be known. In this case, our test statistic will be the studentized difference in sample means:

    T = d/sd                         test statistic, unknown variances   [6.4]

where d is the difference in sample means. In the case of homogeneous (but unknown) variance of the two sets of measurements, we calculate the standard deviation, sd, of the difference of means by using an analog of eqn. 6.2:

    sd = spool·√(1/nx + 1/ny)        homogeneous (but unknown) variance  [6.5]

Since the two groups share the same variance, we may pool data to obtain a standard deviation with more degrees of freedom. In this case, the test statistic will follow a Student's t-distribution with ν = nx + ny − 2 degrees of freedom. Since T follows a t-distribution, this test is a type of t-test. The next example illustrates the t-test for the comparison of means in the case of homogeneous variance.

Example 6.2

A procedure is developed for the analysis of the trace iron content of water. Five measurements of the blank yield the following values (in ppm):

    0.53, 0.56, 0.51, 0.53, 0.50 ppm Fe        blank measurements

A sample is analyzed three times, giving the following measurements:

    0.56, 0.58, 0.56 ppm Fe                    sample measurements

Can we say with some degree of certainty that there is iron in the sample? The null hypothesis is that there is no iron in the sample; assuming no measurement bias, we can state this as

    H0: µsample = µblank

The alternate hypothesis is that there is indeed iron in the sample, which would result in measurements significantly larger than those on the blank:

    H1: µsample > µblank    (1-tailed test)

Let's use a 95% confidence level. Calculating the difference between the sample and the blank, we obtain

    blank  = (0.53 0.56 0.51 0.53 0.50)·ppm
    sample = (0.56 0.58 0.56)·ppm

    d = mean(sample) − mean(blank)         d = 0.0407 ppm    we want to test whether this value
                                                             is significantly larger than zero

The test statistic is given by eqn. 6.4; we must obtain an estimate, sd, of the standard error of d. It is common in analytical chemistry to assume that the measurements on the sample and the blank, when they are similar in magnitude, are of the same precision. In other words, we may assume homogeneous variance, and calculate the pooled standard deviation:

    spool = √[(4·var(blank) + 2·var(sample))/(4 + 2)]        spool = 0.0199 ppm

Now we can estimate the standard deviation of the difference in sample means:

    sd   = spool·√(1/5 + 1/3)          sd = 0.0146 ppm
    Tobs = d/sd                        Tobs = 2.7920     the test statistic follows a t-distribution
                                                         with 6 degrees of freedom
    Pobs = 0.0157                      one-tailed P-value

The one-tailed critical value for testing at 95% confidence is obtained from the t-tables: Tcrit = 1.9432. Since Tobs > Tcrit, we can reject H0 and accept H1. We can say with 95% confidence that there is iron in the sample, assuming no measurement bias. Note that the one-tailed P-value is Pobs = 0.0157. Thus, we would be able to prove that there is iron in the sample even at the 98% confidence level.

If two groups of measurements do not share the same variance, then we cannot pool the data. In this case, we say that the variance of the two populations is nonhomogeneous, and we must estimate σd with sd calculated from the following expression:

    sd = √(sx²/nx + sy²/ny)          nonhomogeneous (unknown) variance   [6.6]

Using this estimate, T will follow a t-distribution with ν degrees of freedom, where ν must be calculated using the following expression:

    ν = νx·νy / [νy·c² + νx·(1 − c)²]                                    [6.7]

where νx and νy are the degrees of freedom in sx and sy, respectively, and the value of c in this equation is calculated from

    c = (sx²/nx)/sd²

Note that sd in this expression is calculated by eqn. 6.6. If necessary, you should round down the value of ν calculated from eqn. 6.7 to the nearest integer (this gives a slightly more conservative test).
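Equations 6.5 through 6.7 are mechanical enough to be worth encoding once. A sketch in Python (the function names here are my own, not standard library routines):

    import numpy as np

    def sd_pooled(x, y):
        """Homogeneous variance: pooled standard error of d (eqn 6.5); nu = nx + ny - 2."""
        nx, ny = len(x), len(y)
        s2_pool = ((nx - 1)*np.var(x, ddof=1) + (ny - 1)*np.var(y, ddof=1)) / (nx + ny - 2)
        return np.sqrt(s2_pool) * np.sqrt(1/nx + 1/ny), nx + ny - 2

    def sd_welch(x, y):
        """Nonhomogeneous variance: standard error of d (eqn 6.6) and df (eqn 6.7)."""
        nx, ny = len(x), len(y)
        vx, vy = np.var(x, ddof=1) / nx, np.var(y, ddof=1) / ny
        sd = np.sqrt(vx + vy)
        c = vx / sd**2
        nu = (nx - 1)*(ny - 1) / ((ny - 1)*c**2 + (nx - 1)*(1 - c)**2)
        return sd, nu   # round nu down to the nearest integer for a more conservative test

Applied to the blank and sample data of example 6.2, sd_pooled reproduces sd = 0.0146 ppm with 6 degrees of freedom.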


Example 6.3 Two barrels of wine were analyzed for alcohol content. On the basis of 6 analyses, the mean content of the first barrel was 12.61% ethanol (s = 0.065%). Four analyses of the second barrel gave a mean of 12.50% ethanol (s = 0.080%). Do the data indicate a difference between the wines? The null and alternate hypotheses are H0: µfirst = µsecond

“same alcohol content in both barrels (assuming no bias)”

H1: µfirst ≠ µsecond

“different alcohol content, 2-tailed test”

Let's test at the 95% level. Normally in this case I would assume homogeneous variance: presumably, the same analytical technique was used in both cases, and the signal levels, and precisions, seem about the same. However, for illustrative purposes, let's assume that the variance is nonhomogeneous.

    d = 12.61·% − 12.50·%                          d = 0.1100 %

    sfirst  = 0.065·%        nfirst  = 6
    ssecond = 0.080·%        nsecond = 4

    sd = √(sfirst²/nfirst + ssecond²/nsecond)      sd = 0.0480 %     (eqn. 6.6)

    c = (sfirst²/nfirst)/sd²                       c = 0.3056

    ν = (nfirst − 1)·(nsecond − 1) / [(nsecond − 1)·c² + (nfirst − 1)·(1 − c)²]
                                                   ν = 5.5739        this will (approximately) follow a
                                                                     t-distribution with 5 degrees of freedom

    Tobs = d/sd                                    Tobs = 2.2916
    Pobs = 0.0705                                  2-tailed P-value

Based on the observed P-value, we cannot prove H1 at the 95% level. The 2-tailed critical values for testing at the 95% level are ±2.5706. Since Tobs=2.2916, we must accept H0. We cannot prove a significant difference in alcohol content in the barrels with 95% confidence.
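scipy can run this Welch-type test directly from the summary statistics; note that it uses the unrounded value of ν from eqn 6.7 (about 5.57 rather than 5), so its P-value comes out slightly smaller than the hand calculation above:

    from scipy import stats

    # summary statistics from example 6.3; equal_var=False requests the Welch test
    res = stats.ttest_ind_from_stats(mean1=12.61, std1=0.065, nobs1=6,
                                     mean2=12.50, std2=0.080, nobs2=4,
                                     equal_var=False)
    print(res.statistic, res.pvalue)   # T ~ 2.29; two-tailed P ~ 0.066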

Summary

The following table summarizes the test statistics used in comparing the means of two independent sets of measurements. Note that the statements about the null distribution assume that both measurement means x̄ and ȳ are normally distributed.


    σ known?   σx = σy?   test statistic   std error                    null distribution
    yes        yes        T = d/σd         σd = σ·√(1/nx + 1/ny)        z-distribution
    yes        no         T = d/σd         σd = √(σx²/nx + σy²/ny)      z-distribution
    no         yes        T = d/sd         sd = spool·√(1/nx + 1/ny)    t-distribution with ν = nx + ny − 2
    no         no         T = d/sd         sd = √(sx²/nx + sy²/ny)      t-distribution with ν given by eqn. 6.7

The most common situation is that you have two sets of measurements of unknown precision (i.e., neither σx nor σy is known). You then must decide whether to assume homogeneous variance (σx = σy?). The questions to be aware of are:

1. When can we assume homogeneous variance?
2. What are the consequences of incorrectly assuming homogeneous variance (or of incorrectly assuming nonhomogeneous variance)?

In real life, the first question is usually a judgment call. Based on a knowledge of the experiment and of the likely sources of error, you must decide whether the magnitudes of random measurement error are comparable. In analytical chemistry, this is more likely to be the case when the same analytical measurement technique is used by the same analyst on samples that contain comparable quantities of analyte in similar sample matrices. Homogeneous variance is not likely when comparing two dissimilar measurement techniques, or in the analysis of very different types of samples.

The assumption of homogeneous variance will always result in the greatest degrees of freedom, resulting in more powerful statistical tests (which are able to label smaller differences as "significant"). If the sample sizes and the standard deviations are comparable (nx = ny and sx ≈ sy), then eqn. 6.7 yields a value of ν that is comparable in magnitude to nx + ny − 2, the value obtained when the data is pooled. The following rules of thumb express this idea:

• If the sample sizes are equal (nx = ny) and the population standard deviations do not differ by a factor of more than three (in other words, σx < 3σy, where group x is the group with the greater variability), then homogeneous variance may be assumed with reasonable justification.
• When the sample sizes are different, the most serious case is when the group with the smaller sample size is associated with the larger variance: in other words, if nx < ny and σx > σy. In this case we must use equations 6.6 and 6.7 in calculating the sample statistic.

If we make an incorrect decision (homogeneous vs nonhomogeneous), then our critical values, and the calculated P-value, are not quite what they should be. If we assume homogeneous variance incorrectly, then the confidence level of our test procedure is actually not as high as we think it is; in other words, the probability of a false positive (α) is higher than we want. We can minimize this effect by designing the experiment so that nx = ny (if possible). In any event, this consequence is more serious than incorrectly assuming nonhomogeneous variance, which results in a more conservative test (lower α) than we expect.

Based on this argument, a very conservative approach would be to always assume nonhomogeneous variance. However, this approach is needlessly restrictive, since there are many situations where we can be reasonably certain that the measurement precision of two data sets is the same. If we assumed nonhomogeneous variance in these cases, we are not making full use of our data, and our test will not be as powerful.

In example 6.3, we assumed nonhomogeneous variance and accepted H0 for a test at the 95% level. Let's try this example again, this time making the reasonable assumption of equal measurement precision.

    spool = √[(5·sfirst² + 3·ssecond²)/8]        spool = 0.0710 %    we can pool our variances now

    sd = spool·√(1/nfirst + 1/nsecond)           sd = 0.0458 %       revised estimate of std error

    Tobs = d/sd                                  Tobs = 2.4002       follows a t-distribution with
                                                                     8 degrees of freedom
    Pobs = 0.0432                                two-tailed P-value indicates that we can reject
                                                 H0 at the 95% level; let's be sure

Since we now have 8 degrees of freedom, the critical values are ±2.3060; since Tobs is outside of this range, we may reject H0 and accept H1 at the 95% level. For this particular case, the decision of homogeneous vs nonhomogeneous variance changes our conclusion. Note that the P-value changed from 0.0705 to 0.0432, which doesn't seem a remarkable change; it is enough, however, to make a difference when testing at the 95% level.

Paired Measurements

Throughout the last section, we assumed that there was no correlation between the two groups of measurements (i.e., that they were independent). Now consider the following example:


Example 6.4

A new flame atomic-absorption spectroscopic method of determining antimony in the atmosphere was compared with the recommended colorimetric method. For samples from an urban atmosphere, the following results were obtained:

    antimony found (mg/m3)
    sample #    new method    standard method
    1           22.2          25.1
    2           19.2          19.5
    3           15.7          16.6
    4           20.4          21.3
    5           19.6          20.7
    6           15.7          16.8

Do the results obtained by the two methods differ significantly? This is a good example of paired measurements, when two measurements are made on the same object. The two sets of values from paired measurements are often correlated, particularly when some changing property of the object strongly affects both measurements. In this particular example, we are comparing the ability of two methods to analyze Sb. Presumably the standard method has been well characterized, and is free of measurement bias. The new method may be cheaper, faster and/or more sensitive, but we need to test it against the standard method to ensure that it gives comparable (i.e., bias-free) results. This has been done by measuring [Sb] in a number of samples, each of which may contain different levels of analyte, using both methods. Each measurement contains random error, but it is also strongly affected (we hope!) by the concentration of analyte in the sample. This being the case, it is likely that there will be significant correlation between the two methods: when a measurement by one method gives a high result, a measurement by the other method on the same sample will also tend to give a high result, since [Sb] for the sample is presumably relatively high. The sample linear correlation coefficient between the two sets of measurements is r = 0.9716. Recall that truly independent variables will have ρ = 0, and that values of ρ close to one indicate a strong linear relationship between variables − a fact that is confirmed by a glance at the following plot of the measurements.


    [Scatter plot: results from the standard method (mg/m3) versus results from the new method (mg/m3); the points fall close to a straight line.]

Figure 6.1: a plot of the "paired" measurement data; a strong linear correlation between the measurement pairs is evident. Note that this is a good example of how correlation does not imply causation: just because two variables are correlated doesn't mean that one variable "causes" the other to occur. In this case, these variables are affected by a third factor: the concentration of Sb in the sample.

So how do we test if measurements from the two methods differ significantly? Let's define a variable e such that

    ei = xi − yi

where xi and yi are the ith measurement pair. In other words, the e values are the differences between the paired measurements. Let's calculate the values of e observed in example 6.4:

    x = (22.2 19.2 15.7 20.4 19.6 15.7)·mg/m3      new method
    y = (25.1 19.5 16.6 21.3 20.7 16.8)·mg/m3      standard method

    e = x − y = (−2.9 −0.3 −0.9 −0.9 −1.1 −1.1)·mg/m3

Looking at the measurements, and the values of e, we see that the second set of measurements, using the standard analytical method, gave results that were consistently larger than those of the new method. This observation suggests that the methods give different results for the same samples, but we need to test whether the difference is statistically significant.


Even though the original two sets of data, x and y, are correlated, the difference between them is a random variable whose values are not correlated with one another. We have changed the problem back to one with which we are more familiar. The null and alternate hypotheses will be

    H0: µe = 0    "the methods give the same results"
    H1: µe ≠ 0    "the methods give different results (2-tailed)"

Let's test at the 90% confidence level. We can use

    T = ē/s(ē)

as our test statistic; by the central limit theorem, ē will tend to be normally distributed, even if e does not follow a normal probability distribution. Note that if the random errors in the x and y measurements are normally distributed, then e will be normally distributed. Let's solve the problem now, shall we? First let's calculate the observed value of the test statistic:

    ē = mean(e)                  ē = −1.2000 mg/m3
    std_err = stdev(e)/√6        std_err = 0.3606 mg/m3
    Tobs = ē/std_err             Tobs = −3.3282

Remember that if e is normally distributed, then the null distribution of T will follow a Student's t-distribution with n−1 = 5 degrees of freedom. From the t-tables, we find that the critical values Tcrit = ±2.0150 should be used for a two-tailed test at the 90% confidence level. Since the observed value Tobs = −3.3282 falls outside this range, we may reject H0 and accept H1. Thus, we can state with 90% certainty that there is a significant difference in the measurements yielded by the two methods. Note that the observed P-value is Pobs = 0.02082, so we could prove H1 at the 97.9% level from this data.
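The same paired test in Python, both from the definition of e and with scipy's built-in paired t-test (a sketch; the two calculations should agree):

    import numpy as np
    from scipy import stats

    new = np.array([22.2, 19.2, 15.7, 20.4, 19.6, 15.7])   # mg/m3
    std = np.array([25.1, 19.5, 16.6, 21.3, 20.7, 16.8])   # mg/m3

    e = new - std                                  # paired differences
    n = len(e)
    T = e.mean() / (e.std(ddof=1) / np.sqrt(n))    # -3.3282; t with n-1 = 5 df
    p = 2 * stats.t.sf(abs(T), df=n - 1)           # two-tailed: 0.0208

    T_check, p_check = stats.ttest_rel(new, std)   # built-in paired t-test
    print(T, p, T_check, p_check)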


Comparison of Many Measurement Means

Note to students: we did not cover this in class, and you will not be expected to know it. This section is just a "placeholder" to introduce this (important) subject in a future version of this class.

• introduction: mention the term analysis of variance (ANOVA), and that this is only the tip of the iceberg
• multiple sets of measurements; illustrate by example. Restrictions on this test: (i) homogeneous variance, and (ii) same # of measurements in each set
• solve the problem, explaining the entire time


Comparison of Measurement Precision (F-test)

Let's say that we have two independent sets of measurements, x and y, and we want to compare the precision (i.e., the sample standard deviations) of the two groups. In this section, we will describe the test that can be used in this situation. Note that this test can be used to disprove (but never to prove!) homogeneity of variance.

Hypotheses

    H0: σx = σy    "the two sets of measurements have the same precision"
    H1: σx ≠ σy    (2-tailed)
        σx > σy    (1-tailed)
        σx < σy    (1-tailed)

Test Statistic

We have two sets of measurements, x and y; we will use as our test statistic the ratio of the sample variances of the two sets:

    T = sx²/sy²

If we assume that the x and y measurements follow a normal distribution, then the probability distribution of T is called an F-distribution. The exact distribution of T depends on the degrees of freedom in sx and sy, so that there is a family of F-distributions, F(νx, νy). The null hypothesis is that σx = σy, so that the null distribution of T will be an F(νx, νy)-distribution with a mean µF near one.

The appendix contains tables with critical values for the F-test, which is used to compare measurement precision. These tables are for right-tailed areas, so that in constructing T, we must be sure to put the larger variance in the numerator: in the F-test, the test statistic, T, must be larger than one. Of course, for two-tailed tests, the area in the right tail of the F-distribution must be α/2, so that (for example) we would use the 97.5% F-tables for two-tailed testing at the 95% confidence level.

Example 6.5

One analyst who analyzed a sample 16 times calculated a sample standard deviation of 3.07. Another analyst measured the same sample 11 times and calculated a standard deviation of 1.20. Are we justified in assuming that the second analyst is more precise than the first?

The null and alternate hypotheses are

    H0: σx = σy    "the analysts have the same measurement precision"
    H1: σx > σy    "analyst y is more precise than analyst x (1-tailed test)"

Let's test at the 95% level. Our test statistic will be

    T = sx²/sy²

Note that the numerator must be larger than the denominator. Let's calculate the observed value of the test statistic:

    sx = 3.07        n = 16; 15 degrees of freedom
    sy = 1.20        n = 11; 10 degrees of freedom

    Tobs = sx²/sy²   Tobs = 6.5451

The null distribution will be an F(15,10)-distribution with a mean of one. Since we are interested in examining whether or not the y measurements are significantly more precise than the x measurements, we will obtain a one-tailed critical value on the right side of the null distribution. The critical value turns out to be Tcrit = 2.85. Thus, we may reject H0 and accept H1 at the 95% level. We can state with 95% certainty that analyst 2’s measurements are more precise than analyst 1’s measurements. For this example, the observed P-value (1-tailed) is Pobs = 0.002452. Thus, we could test at the 99.7% confidence level and still accept H1.
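A sketch of the same F-test in Python; scipy has no one-line two-sample variance test, but the F-distribution itself supplies the critical value and P-value:

    from scipy import stats

    sx, nx = 3.07, 16    # analyst 1 (the larger variance goes in the numerator)
    sy, ny = 1.20, 11    # analyst 2

    T_obs = sx**2 / sy**2                                # 6.5451
    T_crit = stats.f.ppf(0.95, dfn=nx - 1, dfd=ny - 1)   # one-tailed, 95%: ~2.85
    p_obs = stats.f.sf(T_obs, dfn=nx - 1, dfd=ny - 1)    # one-tailed P-value: ~0.0025
    print(T_obs, T_crit, p_obs)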


Hypothesis Testing in Regression Analysis

We have applied the z-test and t-test to measurement means; however, these tests can be used for any normally-distributed variable. The defining feature of the z- and t-tests is that the test statistic follows a z- or t-distribution, respectively. We can easily use either of these tests in regression analysis, as will now be demonstrated.

Hypothesis testing in regression analysis usually involves the least-squares estimates in linear regression. In chemistry, one can easily imagine cases in which you would want to:

• compare the value of the slope to some theoretical or established value;
• test to see if the slope is significantly different from zero. This is a way to eliminate extra independent variables: if the slope is not significantly different from zero, then you can make a case for stating that there is no discernible effect of x on µy;
• test to see if the intercept is significantly different from zero. In quantitative analysis, for example, this test can be used to check for the presence of background ("additive") interferences.

If we can assume homogeneous variance in the regression data, we can easily calculate the standard deviations of the least-squares estimates; the equations in chapter 4 can be used for the first-order linear model. Further, if we assume that the measurement error of the dependent variable is normally distributed, then the least-squares estimates themselves follow a normal distribution. Thus, we can form a test statistic that can be used in a t-test (or a z-test, if the magnitude of the homogeneous variance, σ, is known), as shown in the following example.


Example 6.6

Municipal drinking water is usually fluoridated during water treatment processes. A F− ISE electrode is to be used to measure the fluoride concentration in a drinking water sample, using a method based on the measurement of the voltage response of the ISE to a set of calibration standards (i.e., the calibration curve method). The following data was collected:

    [F−] (ppm)    log[F−]    signal (mV)
    10.55         1.0232     87.63
    50.21         1.7008     62.18
    130.2         2.1146     29.63
    361.2         2.5577     0.22
    682.3         2.8339     −10.39
    unknown                  39.28

(a) Construct a calibration curve and obtain a confidence interval for the slope. Also use the calibration curve to obtain an estimate of the measurement noise on each measurement. Remember that the Nernst equation predicts a linear relationship between voltage and the logarithm of the concentration.

(b) The theoretical value of the slope of the electrode response is −59.2 mV/decade. Is there a significant difference between the theoretical and measured sensitivities?

part (a)

    Sxy = Σ xi·yi − 5·mean(x)·mean(y)       Sxy = −117.1428
    Sxx = 4·var(x)                          Sxx = 2.0527

    b1 = Sxy/Sxx                            b1 = −57.0682       LS estimates of regression params
    b0 = mean(y) − b1·mean(x)               b0 = 150.6198

    res = y − (b1·x + b0)
    sres = √[(1/3)·Σ resi²]                 sres = 6.2105       estimate of homogeneous measurement noise

    s_b1 = sres/√Sxx                        s_b1 = 4.3348       std error of the slope

    t = qt(0.025, 3)                        t = −3.1824         95% confidence, 3 degrees of freedom
    |t|·s_b1 = 13.795                                           95% CI for slope: −57 ± 14 mV/decade


part (b)

Simply by examining the confidence interval of the slope from part (a), we can see that there is no significant difference between this value and the theoretical value. But let's do the test anyway.

    H0: β1 = −59.2
    H1: β1 ≠ −59.2

Let's test at the 90% level.

    Tobs = (b1 − (−59.2))/s_b1              Tobs = 0.4918       difference between the values, in
                                                                units of std deviations

The 2-tailed critical value is ±2.3534 (3 degrees of freedom). Since the observed test statistic is between the two critical values, we must accept H0. Thus, we cannot prove with 90% certainty that the measured slope is significantly different from the theoretical value.
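The slope test is easy to reproduce in Python; scipy's linregress returns the least-squares slope and its standard error, from which the studentized difference follows (a sketch):

    import numpy as np
    from scipy import stats

    logF = np.array([1.0232, 1.7008, 2.1146, 2.5577, 2.8339])
    signal = np.array([87.63, 62.18, 29.63, 0.22, -10.39])   # mV

    fit = stats.linregress(logF, signal)   # fit.slope ~ -57.07, fit.stderr ~ 4.33

    # H0: beta1 = -59.2 (theoretical Nernstian slope); t-test with n-2 = 3 df
    T_obs = (fit.slope - (-59.2)) / fit.stderr               # ~0.49
    T_crit = stats.t.ppf(0.95, df=3)                         # two-tailed 90%: 2.3534
    print(T_obs, T_crit, abs(T_obs) > T_crit)                # False: accept H0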


Outlier Testing

Introduction

Have you ever had a data set in which one particular measurement seemed to stick out like a sore thumb? Consider the following sequence of measurements (which might, for example, be measurements of analyte concentration in a single sample):

    24, 21, 62, 16, 19

The third measurement, 62, doesn't seem to belong with the others. Its presence has a serious effect on the value of sample statistics, including the sample mean, the sample standard deviation, and any test statistics for hypothesis testing. Measurements that are "surprising" or "unexpected" in some way are called outlier measurements, or simply outliers (because the measurement "lies outside" the rest of the data).

What do we do with outliers? There are three basic tactics: accommodation, investigation and rejection.

• outlier accommodation

Outliers are a concern because they can have a huge effect on sample statistics (and hence confidence intervals and hypothesis tests). So we just choose sample statistics that are not as sensitive to the presence of outliers; these are called robust statistics. An example is the sample median, x̃, as an indicator of location. For symmetric distributions, x̃ is an unbiased estimate of µx; however, the standard error of the sample median is somewhat larger than that of the sample mean, and so it is not considered as good an estimate of µx. However, x̃ is much less affected by outliers than x̄. Let's consider the above data set, with and without the outlier:

    x = (24 21 62 16 19)        mean(x) = 28.4      median(x) = 21.0
    x = (24 21 16 19)           mean(x) = 20.0      median(x) = 20.0

As you can see, removing the outlier had a much greater effect on the sample mean than on the sample median, which is more "tolerant" of the outlier. There are a number of robust estimates of location (e.g., of µx) and dispersion (e.g., of σx); an interesting example of the latter is the median deviation, sm:

    sm = median{|xi − x̃|}

The use of such robust statistics is becoming more common. Robust linear regression is another useful subject along these lines.

• outlier investigation

At the very least, the occurrence of an outlier sends most people scrambling for their laboratory books, or has them racking their brains to remember if there was something done differently when that particular measurement was obtained! The first thought that crosses most people's minds is that there must be some gross error in that particular measurement. Therefore the measurement procedure comes under scrutiny to try to explain the presence of the supposed "error." In many situations, however, the outlier does not signify a gross error at all, but rather is a cause for concern about the assumptions of a particular data-generating mechanism. For example, consider the measurements in the following plot, which shows a calibration curve such as might be observed in quantitative analysis:

    [Calibration plot: signal versus concentration; the first five points fall on a straight line, while the circled sixth point deviates from it.]

Figure 6.2: a typical calibration curve in quantitative analysis, where instrument response (the “signal”) is plotted as a function of analyte concentration. The circled data point is an outlier; however, it might not be due to any “gross error,” but a failure of the linear model for that particular data point.

The last data point is an outlier: it doesn't fall on the line, like the first five points. However, it may be that there is no linear dependence of signal on concentration at high concentrations: the model is not valid for this last data point. The point is, don't just assume that an outlier is "wrong." Very often, the experiment that generated the outlier might need to be repeated; if outliers still occur, then the assumptions about the nature of the experiment might need to be modified. In the calibration curve shown in figure 6.2, if more data points were collected at higher concentrations, the line might be seen to "curve off" for these measurements.

Finally, the last common response to outliers is:

• outlier rejection

Many (some might say most) researchers simply delete outliers from the data set, with or without justification. Certainly if some gross error in the outlying data is identified, then deletion of the data is justified. All too often, however, an experimenter will adopt a "delete and repeat" attitude to outliers. Don't do this. The data you are rejecting may very well be representative of the population you are trying to sample, and by the "delete and repeat" approach, you are simply pruning the data to match your expectations more closely. One purpose of including a section on outliers in a chapter on hypothesis testing is to provide some measure of statistical justification for outlier rejection, if that's what you want to do.

Identification of Contaminants

Let's clarify two closely related concepts through the following definitions[1]:

• an outlier is a measurement that is surprising or unexpected in some respect.
• a contaminant is a measurement that is generated by some different data-generating mechanism than the bulk of a data set.

What this means in practice is that contaminants belong to a different probability distribution than the rest of the data. Not all outliers are contaminants, just as not all contaminants manifest themselves as outlying measurements. Consider the following figure:

    [Figure: two overlapping curves, the probability distribution of "normal" data points and the probability distribution of contaminant data points, with three "outlying" data points marked in the tail.]

Figure 6.3: distinction between contaminants and outliers. All of the data in this group of measurements originates either from the "normal" probability distribution or the "contaminating" probability distribution. The three circled data points might be identified as "outliers" in this data set. However, only two of these points are true contaminants in the sense of belonging to the contaminating probability distribution. In addition, not all contaminant data points are identified as outliers.

In general, there are two distinct phases to dealing with outliers, particularly when deciding whether or not to reject them from a data set:

[1] The distinction between outlier (a "surprising" measurement) and contaminant (a measurement from a different probability distribution) is an important one. However, the terms "outlier" and "contaminant" are not universally used in this manner. Many texts refer to discordant values as "potential" outliers and contaminants as "true" outliers.


1. Identification of outliers. For univariate data, this means sorting the data into ascending order; any outliers will be at either the lower or upper extreme of the data set. For multivariate data (particularly when more than two variables are involved), outlier identification is sometimes not nearly so straightforward.

2. Testing of outliers. This is a process, sometimes called a discordancy test, of deciding whether to label one or more outliers as contaminants.

For truly objective outlier testing, the principles of hypothesis testing must be applied. In the hypothesis testing of a single outlier, the general forms of the null and alternate hypotheses are

    H0: the outlier belongs to the same probability distribution as the rest of the data (i.e., the outlier is not a contaminant).
    H1: the outlier does not belong to the same probability distribution as the rest of the data (the outlier is a contaminant).

Of course, hypotheses involving specific probability distributions in the null and alternate hypotheses can be (and usually are) adopted. For our purposes, however, these generic statements will do just fine.

After deciding on the appropriate confidence level, the next step in outlier testing would be to choose a test statistic. There is a bewildering array of statistics from which to choose; interestingly, however, most commercial statistical programs, such as SAS or SPSS, do not treat outlier testing in a comprehensive or consistent manner. Likewise, outlier testing is not a topic that is covered in most beginning statistics textbooks. This omission is a little puzzling, to say the least, since any scientist who deals with quantitative data must eventually deal with outliers. A good reference is: V. Barnett and T. Lewis, "Outliers in Statistical Data," 3rd ed., Wiley Series in Probability and Mathematical Statistics, 1995; much of the material in these notes, as well as the tables for outlier testing in the Appendix, are from this source.

Table 6.1: five test statistics (labeled T1–T5) for the testing of univariate, normally distributed data; see the appendix for critical values and the text for more detail.

                              upper outliers                lower outliers                comments
    "t-like" statistics       T1 = (xn − x̄)/sx              T1 = (x̄ − x1)/sx              Tests a single outlier; sometimes
                                                                                          called the Tn test.
                              T2 = Σ(xi − x̄)/sx             T2 = Σ(x̄ − xi)/sx             Block test for k upper or lower
                                                                                          outliers (sum over the k outliers).
                              T3 = (xn − x1)/sx                                           Block test for one upper and one
                                                                                          lower outlier.
    "Dixon-like" statistics   T4 = (xn − xn−1)/(xn − x1)    T4 = (x2 − x1)/(xn − x1)      Tests for a single outlier; sometimes
                                                                                          called the Q test.
                              T5 = (xn − xn−2)/(xn − x2)    T5 = (x3 − x1)/(xn−1 − x1)    Tests for a single outlier; a form of
                                                                                          the Q-test that provides some
                                                                                          protection from masking.


Barnett and Lewis present a large number of test statistics for a variety of purposes: by my count, at least 56 different test statistics are presented for univariate data, along with the appropriate distribution tables. From these I have culled five test statistics that will be enough to get us started. These statistics are collected in table 6.1 for easy reference, along with a brief description; we will encounter and use all of them in the following discussion.

It is worth mentioning at this point that an outlier can be classified as either an upper or a lower outlier, depending on its position with respect to the main mass of data. The form of a particular test statistic will be slightly different in testing upper or lower outliers. The five test statistics have been grouped into two classifications: the three "t-like" statistics and the two "Dixon-like" statistics. The t-like statistics are all studentized deviations of some sort, where the deviations follow a normal distribution (assuming that the original data itself is normally distributed). The Dixon-like statistics get their name from the man who first proposed their use; they are easy to remember and calculate, and are widely used in analytical chemistry.

The test statistics in the table are used to test outlying data points under the assumption (i.e., the null hypothesis) that the data should be described by a normal probability distribution. The identity of the distribution assumed in the null hypothesis is important, since it will determine the null distribution of the test statistic; indeed, it will likely determine which test statistic is best used in the hypothesis test. The assumed distribution also affects outlier identification, since values that might be considered to be outliers from a normal distribution might not be considered so surprising from, say, exponential or Poisson distributions.

Testing Single Outliers

A Simple Test

Let's take a few of these outlier test statistics out for a spin.

Example 6.7

Test the following measurements for contaminants: 18, 23, 55, 13, 17, 20, 14

First, let's sort the seven measurements: x = 13, 14, 17, 18, 20, 23, 55. Obviously, the measurement value 55 is an upper outlier, which we will test for discordancy.

H0: all the measurements are sampled from a population described by a single normal probability distribution.
H1: the largest value is a contaminant: it originates from a different probability distribution than the rest of the measurements.

We will use the following statistics to test H1:

    T1 = (xn − x̄)/sx        T4 = (xn − xn−1)/(xn − x1)        [6.8]

The first statistic is sometimes referred to as Tn, while the second is usually called Dixon's Q by analytical chemists. T1 is the studentized deviation of the outlier from the sample mean; we would expect large values to be indicative of data contamination. Dixon-like statistics such as T4 will always be a fraction; as the outlier becomes more separated from the main body of data, the value of a Dixon-like statistic will approach unity. Let's calculate the values of these test statistics.

[Note: I use the software package MathCAD to work out these problems. MathCAD indexes its vectors and arrays starting with 0, and not 1, so that the expressions for the test statistics will appear slightly different. If you become confused, work out the problem yourself, using the expressions given in table 6.1; you should get the same numbers for the test statistics.]

    x = (18 23 55 13 17 20 14)
    x = sort(x)                 x = (13 14 17 18 20 23 55)

    T1obs = (x6 − mean(x))/stdev(x)    T1obs = 2.2051    critical values: T1.05 = 1.94, T1.01 = 2.10
    T4obs = (x6 − x5)/(x6 − x0)        T4obs = 0.7619    critical values: T4.05 = 0.507, T4.01 = 0.637

Thus, in either case we can accept H1 at the 99% level.

Comparing the observed values of either test statistic results in rejection of the null hypothesis at the 99% confidence level. Thus, we can be 99% certain that the largest measurement value is a contaminant.
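The T1 and T4 statistics are simple to compute in any language; a sketch in Python (the critical values still have to come from the tables in the appendix):

    import numpy as np

    def T1_upper(x):
        """Studentized deviation of the largest value from the mean (the Tn statistic)."""
        x = np.sort(x)
        return (x[-1] - x.mean()) / x.std(ddof=1)

    def T4_upper(x):
        """Dixon's Q for a single upper outlier."""
        x = np.sort(x)
        return (x[-1] - x[-2]) / (x[-1] - x[0])

    x = [18, 23, 55, 13, 17, 20, 14]
    print(T1_upper(x), T4_upper(x))   # 2.2051, 0.7619; compare to the tabulated critical values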

Important Warning!

The null distributions (i.e., the critical values) of all of the test statistics have been calculated by assuming that the measurements should be normally distributed. The labeling of an outlier as a "contaminant" using these critical values rests upon the assumption that the outlier isn't likely to be from the same population as the other measurements, assuming those other measurements are normally distributed. The critical values would be much different if the measurements were distributed according to, for example, an exponential distribution. Instead of deciding that the outlier is from a different probability distribution, another possible explanation for the outlier is that the measurements are not normally distributed at all, but are instead described by a probability distribution with "heavier tails" (i.e., a higher probability of extreme values) than a normal distribution. This is one reason that you must always be very careful in deleting measurements that are labeled as contaminants. A better approach is to use robust statistics, or simply to collect more measurements (remember that the sample mean will always tend towards a normal distribution as n increases).

A Complication: Outlier Masking

One problem in testing for a single outlier is the susceptibility of the test statistics to the phenomenon of masking. The following example illustrates the problem.


Example 6.8

Test the following measurements for contaminants: 18, 23, 55, 38, 17, 20, 14

    x = (18 23 55 38 17 20 14)
    x = sort(x)                 x = (14 17 18 20 23 38 55)

H0: all the measurements are described by the same normal probability distribution.
H1: the largest measurement is a contaminant.

Note: this set of measurements is almost identical to the measurements in the last example; only a single value has changed.

    T1obs = (x6 − mean(x))/stdev(x)    T1obs = 1.9280    critical values: T1.05 = 1.94, T1.01 = 2.10
    T4obs = (x6 − x5)/(x6 − x0)        T4obs = 0.4146    critical values: T4.05 = 0.507, T4.01 = 0.637

For both of these test statistics, the second-largest measurement value has masked the largest value. The largest value is not identified as a contaminant by either test statistic at the 95% confidence level. Dixon's Q statistic (i.e., T4) is especially susceptible to masking effects, although the Tn statistic (i.e., T1) is not too much better. Masking can occur when a second outlier is observed with a value similar to the first. In this example, the second-largest value sharply decreases the numerator of T4; it also inflates both the sample mean and the sample standard deviation, which decreases the numerator and increases the denominator of T1.

One solution to this problem is to use test statistics that are less sensitive to masking. There are a number of Dixon-like statistics for this purpose; here is a good general-purpose statistic to protect against masking (see table 6.1 for the corresponding statistic for a lower outlier):

    T5 = (xn − xn−2)/(xn − x2)

In this case, the numerator is the difference between the outlier and its second-nearest neighbor; no masking by the measurement value xn−1 is observed, as we can see:

    T5obs = (x6 − x4)/(x6 − x1)        T5obs = 0.8421    critical values: T5.05 = 0.780, T5.01 = 0.885

The T5 Dixon-like statistic identifies the largest value as a contaminant at the 95% confidence level (but not at the 99% confidence level).

Masking can also occur when an upper and a lower outlier are observed in the same data set, as seen in the following example.

Example 6.9

Test the following measurements for contaminants: 18, 39, −15, 17, 20, 14


Let's work out this problem (remember that indexing in MathCAD begins at 0, not 1).

    x = (18 39 −15 17 20 14)
    x = sort(x)                 x = (−15 14 17 18 20 39)

It looks like both the smallest and the largest values are outliers. Let's test the smallest value to see if it is a contaminant.

H0: all the measurements are described by the same normal probability distribution.
H1: the smallest value is a contaminant.

    T1obs = (mean(x) − x0)/stdev(x)    T1obs = 1.7530    critical values: T1.05 = 1.82, T1.01 = 1.94
    T4obs = (x1 − x0)/(x5 − x0)        T4obs = 0.5370    critical values: T4.05 = 0.560, T4.01 = 0.698

The observed values of both of these statistics are too small to allow rejection of H0 at the 95% level. The upper outlier has masked the lower outlier. Let's try the Dixon-like T5:

    T5obs = (x2 − x0)/(x4 − x0)        T5obs = 0.9143    critical values: T5.05 = 0.872, T5.01 = 0.995

T5 allows rejection of H0 at the 95% (but not 99%) confidence level. Thus, we can state with 95% confidence that the lowest value is a contaminant.

From this example, we see that the upper outlier can mask the lower outlier (and vice versa, if we are testing the upper outlier). As before, the Dixon-like T5 statistic is not vulnerable to this effect; it protects against masking of a lower outlier x1 by either x2 (the nearest neighbor) or xn (an upper outlier). Likewise, T5 protects against masking of an upper outlier xn by either xn−1 or x1.
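T5 is just as easy to encode; a sketch covering both tails (the critical values again come from the tables):

    import numpy as np

    def T5(x, tail="upper"):
        """Dixon-like statistic with protection from masking by the nearest
        neighbor or by an outlier in the opposite tail (table 6.1)."""
        x = np.sort(x)
        if tail == "upper":
            return (x[-1] - x[-3]) / (x[-1] - x[1])
        return (x[2] - x[0]) / (x[-2] - x[0])

    print(T5([18, 23, 55, 38, 17, 20, 14], "upper"))   # 0.8421 (example 6.8)
    print(T5([18, 39, -15, 17, 20, 14], "lower"))      # 0.9143 (example 6.9)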

Testing Multiple Outliers

Introduction

Outlier masking happens because there are really two (or more) outliers, and the T1 and T4 statistics work best when testing data sets with a single outlier. If a data set contains more than one outlier, we should modify our approach somewhat, depending on the nature of the outliers. Let's list how multiple outliers might occur: we may have a data set with

• two or more upper outliers,
• two or more lower outliers, or
• a combination of one (or more) upper outlier(s) and one or more lower outlier(s).

Let's say that k outliers have been identified in a data set, where k > 1. There are two approaches to testing multiple outliers:


1. the consecutive testing approach, where a test statistic such as T1 or T4 is applied repeatedly to a data set (one outlier at a time); or
2. the block testing approach, where a statistic simultaneously tests all k outliers in the data set.

Consecutive Testing of Multiple Outliers

Consecutive testing is conceptually easy, and can occur in two varieties: the inward approach and the outward approach. In the inward approach, the most extreme outlier is tested; if it gives a positive test, it is removed from the data set, and then the most extreme remaining outlier is tested (we are moving "inward" towards the sample mean). This procedure is repeated until all the outliers are tested, or until an outlier gives a negative test. The following flow chart illustrates the procedure in testing for multiple upper outliers.

    [Flowchart: Data Set → test largest value → contaminant? If yes, remove from data set and test the next largest value; if no, stop the testing procedure.]

Figure 6.4: The inward consecutive outlier testing procedure. A test statistic like T1 or T4 is used to test measurements one at a time as we move from the most extreme outlier “inward” toward the mean of the data set. Note that contaminants are not permanently removed from the data set, but only temporarily removed for the purposes of the testing procedure. In other words, the purpose of the test is to label measurements as contaminants, not to make a decision to permanently reject these values.

The advantage of the inward approach is that it is not necessary to determine the value of k prior to testing; indeed, it is not even necessary to identify the outliers at all. However, the biggest disadvantage is the susceptibility of the procedure to masking effects. T1 and T4 are certainly fairly poor candidates for this type of testing procedure.

The outward approach is as follows: after identifying the k outliers, the "innermost" (least-extreme) outlier is tested, with the remaining outliers removed from the data set. If this inner outlier gives a positive test, then all of the remaining outliers are considered to be contaminants. If, however, the inner outlier gives a negative test, then the next innermost outlier is tested, again with the more extreme outliers removed from the data set. This procedure is repeated until all of the outliers are tested (moving "outward" throughout the procedure), or until the first positive test, whichever comes first. When an outlier gives a positive test, that outlier, and all of the more extreme outliers, are considered to be contaminants. The flow chart in figure 6.5 describes the outward consecutive testing process.

    [Flowchart: Data Set with all but the least extreme outlier removed → test the least extreme outlier → contaminant? If no, add the next most extreme outlier to the data set and repeat; if yes, stop the testing procedure.]

Figure 6.5: The outward consecutive outlier testing procedure. A test statistic like T1 or T4 is used to test measurements one at a time as we move from the least extreme outlier “outward” from the mean of the data set. Initially, all but one outlier is removed from the data set, and these are added until a positive test result (or until we run out of outliers). Again, measurements are not permanently removed from the data set, but only temporarily removed for the purposes of the testing procedure.

The advantage of this procedure over the inward approach is its reduced susceptibility to masking; the test of the innermost outlier should not exhibit masking at all. However, the k outliers must be identified beforehand. Another, more fundamental, objection is that removing the extreme outliers amounts to outlier rejection before the outliers have been tested at all. The data subset that consists of only the more central data points might not be representative of the parent population, and the outlier test would be more apt to give a positive result. Let's look at the next example to see how outward consecutive testing works.

Example 6.10

Test the two outliers of the data set in example 6.8 using a consecutive test procedure.

From the answer to example 6.8, we already know that the inward consecutive test procedure using either T1 or T4 would fail to identify any contamination, since neither statistic could give a positive test for the most extreme outlier value. So let's try the outward test procedure.

x := (18 23 55 38 17 20 14)^T
x := sort(x)
x^T = (14 17 18 20 23 38 55)

The largest two values are upper outliers. Let's create a subset of this data that omits the largest outlier so that we may test the "inner" upper outlier.

y := submatrix(x, 0, 5, 0, 0)
y^T = (14 17 18 20 23 38)

H0: all the measurements in this subset of the data are described by the same normal probability distribution
H1: the largest value in this data subset is a contaminant

T1obs := (y[last(y)] − mean(y)) / stdev(y)
T1obs = 1.9108     critical values: 1.82 (95%) and 1.94 (99%)

T4obs := (y[last(y)] − y[last(y)−1]) / (y[last(y)] − y[0])
T4obs = 0.6250     critical values: 0.560 (95%) and 0.698 (99%)

In both cases, the innermost outlier tests positive at the 95% level. Thus, using the outward consecutive test procedure, we identify both upper outliers as contaminants at the 95% level.
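The worksheet values are easy to reproduce. Here is a quick sketch (we use the sample standard deviation, which is what the quoted T1 value implies, and reconstruct T4 as the Dixon-type ratio that matches the quoted 0.6250):

from statistics import mean, stdev

y = sorted([14, 17, 18, 20, 23, 38])       # the subset from example 6.10
t1_obs = (y[-1] - mean(y)) / stdev(y)      # studentized extreme deviation
t4_obs = (y[-1] - y[-2]) / (y[-1] - y[0])  # Dixon-type ratio
print(round(t1_obs, 4), round(t4_obs, 4))  # 1.9108 0.625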

A problem with any consecutive test procedure lies in the interpretation of the confidence level, because the confidence levels associated with the critical values of the test statistics T1, T4 and T5 were determined for the testing of a single outlier. A conservative approach is advisable in choosing the confidence level for consecutive tests, particularly if the goal is to identify contaminants for subsequent removal from the data set.
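One way to make this concern concrete is a small simulation. The sketch below is our own illustration; it assumes clean standard-normal data, k = 2 upper outliers, the 95% critical value 1.82 for n = 6 quoted above, and a 95% critical value of about 1.94 for n = 7 taken from standard tables. Because the outward procedure takes two chances to trip a test calibrated at 5%, the estimated family-wise false positive rate comes out above the nominal 5%.

import random
from statistics import mean, stdev

def grubbs_positive(x, crit):
    """True if the studentized maximum (T1) exceeds the critical value."""
    return (max(x) - mean(x)) / stdev(x) > crit

def family_rate(n=7, crit_inner=1.82, crit_full=1.94, trials=20_000):
    """Estimate how often outward testing of k = 2 upper 'outliers' flags
    at least one contaminant in clean standard-normal samples of size n."""
    hits = 0
    for _ in range(trials):
        x = sorted(random.gauss(0.0, 1.0) for _ in range(n))
        inner = grubbs_positive(x[:-1], crit_inner)  # innermost tested first
        hits += inner or grubbs_positive(x, crit_full)
    return hits / trials

print(family_rate())  # compare with the nominal 0.05 of a single test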

Block Testing of Multiple Outliers

An alternative to consecutive testing is block testing, where a single statistic is used to test all k outliers at once. T2 is a t-like statistic used to test for k upper or k lower outliers:

k upper outliers:   T2 = Σ(x_i − x̄) / s_x   (sum over the k largest values)
k lower outliers:   T2 = Σ(x̄ − x_i) / s_x   (sum over the k smallest values)

The T2 statistic is the studentized sum of the deviations of all k outliers from the mean.

Example 6.11 Test the two outliers of the data set in example 6.8 using a block test procedure.

x := (18 23 55 38 17 20 14)^T
x := sort(x)
x^T = (14 17 18 20 23 38 55)

H0: all measurements are described by the same normal probability distribution
H1: the two largest measurements are contaminants

We will use a block-testing procedure.

T2obs := [(x[last(x)] − mean(x)) + (x[last(x)−1] − mean(x))] / stdev(x)
T2obs = 2.7088

The critical values are T2(.05) = 2.66 and T2(.01) = 2.79.

Thus, we can reject H0 and accept H1 at the 95% level. We can say with 95% confidence that the two upper outliers are contaminants.
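As a cross-check on the worksheet, here is the same T2 computation in Python (the function name is ours; statistics.stdev is the sample standard deviation, which matches the worksheet value):

from statistics import mean, stdev

def t2_upper(data, k):
    """T2 for k upper outliers: studentized sum of their deviations."""
    x = sorted(data)
    m = mean(x)
    return sum(v - m for v in x[-k:]) / stdev(x)

print(round(t2_upper([18, 23, 55, 38, 17, 20, 14], k=2), 4))  # 2.7088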

The statistic T3 is another block-testing statistic, intended for situations where an upper outlier and a lower outlier are both present in the data set:

T3 = (x_n − x_1) / s_x

T3 is simply the studentized difference between the two extreme values of the data set.

Example 6.12 Test the two outliers of the following data set using a block test procedure.

x := (18 39 −15 17 20 14)^T
x := sort(x)
x^T = (−15 14 17 18 20 39)

H0: all the measurements are described by a single normal probability distribution
H1: the smallest and the largest measurement values are contaminants

Let's use a block test procedure:

T3obs := (x[last(x)] − x[0]) / stdev(x)
T3obs = 3.1038

critical values: 3.22 (95%) and 3.34 (99%)

Thus, we must accept H0 at the 95% level. We cannot say with 95% confidence that the largest and smallest measurements are contaminants.
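The T3 value is just as easy to verify (a short sketch, using the sample standard deviation and the data set as given above):

from statistics import stdev

def t3(data):
    """T3: studentized range, (x_n - x_1) / s_x."""
    return (max(data) - min(data)) / stdev(data)

print(round(t3([18, 39, -15, 17, 20, 14]), 4))  # 3.1038, below the 3.22 cutoff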

Assuming that all of the outliers have been identified, statistics intended for block tests are not susceptible to outlier masking. That's a good thing, but block testing creates another problem: outlier swamping. In block testing, all k outliers are either labeled as contaminants or they are all accepted as "normal" measurements. There is no middle ground, as there is in consecutive testing, where we can say that some outliers are contaminants and some are not. Thus, there is the possibility that a marginal outlier might be falsely declared a contaminant because it is "carried along" in the block testing procedure by other, more extreme, outliers. Conversely, a few marginal outliers may cause the block test to fail, in which case the true contaminants in the block will not be identified.


Advanced Topic: Data Contamination Models

Note to students: we did not cover this in class, and you will not be expected to know it. This section is just a "placeholder" to discuss this subject in a future version of this class.

• discuss how outliers can arise. Use actual numbers, and pictures of probability distributions. See book for a good example of some pictures. Contaminants are presumed to be due to a different data-generating mechanism than the rest of the data.
• describe the slippage model, and maybe the "inherent alternative" model.

Choosing Test Statistics

Note to students: we did not cover this in class, and you will not be expected to know it. This section is just a "placeholder" to discuss this subject in a future version of this class.

Introduction

• which test statistic? Experience, common sense. Can use more than one; they only differ in terms of false negatives, not false positives.
• I will present more alternatives (didn't want to confuse the issue earlier in this section).

Variations on a Theme

• the notion of one-tailed and two-tailed outlier tests is a little hard to define.
• give T6 and T7, the two-sided versions of T1 and T4, and explain why they might be preferred to their one-sided brethren.
• give T8 and T9, the externally studentized versions of T1 and T6.
• give T10 as an example of a Dixon-like stat that corrects for a specific masking effect.

More Choices

• give three more test stats: T11 (N4), T12 (N5), and T13 (N15), along with their properties.

Summary

• summarize the advantages of the t-like stats: optimal if k is correct.
• summarize the advantages of the Dixon-like stats: easy to calculate; easy to choose one to correct for a specific instance of masking.
• give the advantages of the F-like stats.
• give the advantages of the sample kurtosis.
• block testing vs. consecutive testing. Give the problems with either, and when it is best to use one or the other (block: multiple outliers of a similar size; consecutive: multiple outliers of different sizes). You must be especially careful in rejecting multiple outliers. In using the outward test, you are essentially trimming the data before you begin. One possibility: use the inward approach, but apply a Dixon-like stat that guards against masking for the first test. Then test at a fairly high confidence level for the second test to assuage my guilt. Maybe combine with a block test to further strengthen the argument.


Chapter Checkpoint

The following terms/concepts were introduced in this chapter:

block testing of outliers          outlier masking
consecutive testing of outliers    outlier swamping
contaminant                        paired measurements
F-test                             robust statistic
nonparametric test                 t-test
outlier                            z-test

In addition to being able to understand and use these terms, after mastering this chapter, you should
• be able to perform z-tests (1- or 2-tailed)
• be able to perform t-tests (1- or 2-tailed)
• be able to perform F-tests (1- or 2-tailed)
• know when to apply the above, and be able to apply them to linear regression
• identify outliers in a data set
• objectively determine whether outliers are contaminants by a hypothesis test using the appropriate test statistic (guarding against masking or swamping effects, when appropriate)
