Chapter 6: Confidence intervals and hypothesis tests


Contents

Section 6.0: What we need to know when we finish this chapter
Section 6.1: Introduction
Section 6.2: The basis of confidence intervals and hypothesis tests
Section 6.3: Confidence Intervals
Section 6.4: Hypothesis Tests
Section 6.4.1: Two-tailed tests
Section 6.4.2: One-tailed tests
Section 6.4.3: Type I and type II errors
Section 6.5: The relationship between confidence intervals and hypothesis tests
Exercises

Section 6.0: What we need to know when we finish this chapter

This chapter reviews the topics of confidence intervals and hypothesis tests. Confidence intervals give us ranges that contain the parameters with pre-specified degrees of certainty. They are more useful if they are narrower. Hypothesis tests evaluate whether the data at hand are consistent or inconsistent with pre-specified beliefs about parameter values. They are more useful if they are unlikely to contradict these beliefs when the beliefs are really true, and if they are unlikely to be consistent with these beliefs when the beliefs are really false. Here are the essentials:

1. Equation 6.7, section 6.2: The fundamental equation of this chapter is

$$1 - \alpha = P\left( -t_{\alpha/2}^{(df)} < \frac{d - \delta}{SD(d)} < t_{\alpha/2}^{(df)} \right).$$

2. Equation 6.8, section 6.3: Confidence intervals consist of known boundaries with a fixed probability of containing the unknown value of the parameter of interest. The general expression is

$$1 - \alpha = P\left( d - t_{\alpha/2}^{(df)} SD(d) < \delta < d + t_{\alpha/2}^{(df)} SD(d) \right).$$

Confidence intervals ask the data for instruction.

3. Section 6.4: Hypothesis tests ask the data for validation. The null hypothesis is the opposite of what we expect to find. Estimates in the acceptance region validate the null hypothesis. Estimates in the rejection region contradict it.

4. Equation 6.14, section 6.4.1: The two-sided hypothesis test is

$$1 - \alpha = P\left( \delta_0 - t_{\alpha/2}^{(df)} SD(d) < d < \delta_0 + t_{\alpha/2}^{(df)} SD(d) \right).$$


5. Section 6.4.1: Reject the null hypothesis when the estimate falls in the rejection region, when the test statistic is greater than or equal to the critical value, or when the prob-value is less than or equal to the significance level. These decision rules are all equivalent.

6. Equation 6.30, section 6.4.2: The one-sided, upper-tailed hypothesis test is

$$1 - \alpha = P\left( d < \delta_0 + t_{\alpha}^{(df)} SD(d) \right).$$

7. Section 6.4.3: The size of the test is its significance level, the probability of a type I error. A type I error occurs when the null hypothesis is rejected even though it is true. It is the statistical equivalent of convicting an innocent person.

8. Equation 6.36, section 6.4.3: A type II error occurs when the null hypothesis is accepted even though it is false. It is the equivalent of acquitting a guilty person. The power of the test is the difference between one and the probability of a type II error.

9. Section 6.4.3: All else equal, reducing the probability of either a type I or a type II error increases the probability of the other.

10. Equation 6.40, section 6.4.3: Statistical distance is what matters, not algebraic distance. The standard deviation of the estimator is the metric for statistical distance.

11. Section 6.5: Any value within the confidence interval constructed at the (1−α)% confidence level would, if chosen as the null hypothesis, not be rejected by a two-sided hypothesis test at the α% significance level.

Section 6.1: Introduction

Chapter 5 demonstrated that, with the appropriate assumptions regarding the disturbances, OLS estimates of β and α are better than any other convenient estimator. This is certainly nice to know. However, all it really says is that any other estimator we might think of would tell us less about β and α. The question of how much b and a actually tell us about these two parameters remains. This is the question of inference.

At the moment, all we really know about β and α is based on the expected values and variances of b and a. Because E(b) = β and E(a) = α, the two sample statistics are unbiased estimators for their respective parameters. This means that collections of values for b and a, where each pair was calculated from an independent sample, would tend to cluster around the true values of β and α. Because b and a are BLUE, these clusters would be tighter than they would be for any other linear unbiased estimators of β and α. If they are also BCE, the clusters would be tighter than for any other consistent estimators.

However, apart from in this textbook, we're rarely in a position to examine multiple independent values of the same estimator. If we want to know more about β and α, we have to see if there is more that we can make of the one set of estimators we're likely to have. In particular, there's more to be learned from their variances. We've already examined them to the point of verifying that they are smaller than they would be for any other linear unbiased estimator, under the Gauss-Markov theorem, or as small as those for any other consistent estimator, under the assumptions of maximum likelihood estimation. But these are all comparisons with other potential estimators. We have yet to examine whether the clusters that we might expect for b and a, as given by their variances, tell us enough about where β and α might be to be of practical value. Inference is the name that we give to this examination, because we are attempting to "infer" something relatively specific about the parameter values from the sample information.

The inferential task begins with some basic techniques that we learned in statistics. We may already understand this material, based on our previous training. However, it's very important, and occasionally subtle. For that reason, this chapter presents a thorough review. We'll apply the principles and techniques that we revisit here to b and a in chapter 7. That, of course, is our ultimate destination. Whether we go there directly or pause here to refresh is simply a question of how comfortable we already feel with the task of inference.

Section 6.2: The basis of confidence intervals and hypothesis tests

So far what we have in b and a are point estimators. Point estimators are individual values that are our single best guesses of β and α. Point estimators are obviously convenient in many contexts. For example, point estimates of β and α yield a single value for the prediction of yi associated with any specific value xi. A single, unambiguous predicted value is certainly convenient, if nothing else.

However, point estimates are almost surely wrong, to some degree. No one with any sense is going to claim that their point estimate is exactly equal, to the very last decimal point, to the unknown value of the underlying parameter. We would certainly be more confident of being correct if we weren't compelled to commit ourselves to a single value.


For many purposes, we aren't forced to specify a point estimate. In these circumstances, we can avoid most of the errors inevitably associated with doing so by relying on interval estimates instead. Inference is, essentially, the construction of interval estimators.

All of the inference in which we will be interested, at least until chapter 12, begins with a single probabilistic statement. Let δ, the lower-case Greek letter "delta", represent a population parameter. The sample statistic d is an unbiased estimator of δ, E(d) = δ. Its true, or theoretical, variance is V(d) and its estimated standard deviation is SD(d).[1] Assume, finally, that d is normally distributed. Accordingly,

$$d \sim N\left( \delta, V(d) \right). \tag{6.1}$$

The "standardized" value of d, d*, is

$$d^* = \frac{d - \delta}{+\sqrt{V(d)}}. \tag{6.2}$$

We've subtracted from d its expected value, δ, and divided by its population standard deviation, the positive square root of its population variance.

Let's take a moment to review the properties of d*. The expected value of d* is

$$E(d^*) = E\left( \frac{d - \delta}{+\sqrt{V(d)}} \right).$$

Applying equation 5.34 to the expectation to the right of the equality,

$$E(d^*) = \frac{1}{+\sqrt{V(d)}} E(d - \delta).$$

Invoking equation 5.13,

$$E(d^*) = \frac{1}{+\sqrt{V(d)}} \left( E(d) - E(\delta) \right).$$

E(d) = δ by assumption and E(δ) = δ by equation 5.15. Therefore,

$$E(d^*) = \frac{1}{+\sqrt{V(d)}} \left( E(d) - E(\delta) \right) = \frac{1}{+\sqrt{V(d)}} (\delta - \delta) = 0. \tag{6.3}$$

[1] It's good to be aware of the notational ambiguity. Here we use "V" to represent the population and "SD" to refer to the sample.

The derivation of the population variance of d* is similar. It begins, naturally enough, with

$$V(d^*) = V\left( \frac{d - \delta}{+\sqrt{V(d)}} \right).$$

Equation 5.43 implies that the variance to the right of the equality is

$$V(d^*) = \left( \frac{1}{+\sqrt{V(d)}} \right)^2 V(d - \delta).$$

The squared ratio to the right of the equality is just 1/V(d). The second term to the right of the equality is

$$V(d - \delta) = V(d) + V(\delta) - 2COV(d, \delta),$$

according to equation 5.44. In exercise 6.1 we demonstrate that the last two terms are equal to zero. Therefore,

$$V(d - \delta) = V(d) + 0 - 2(0) = V(d).$$

Consequently,

$$V(d^*) = \left( \frac{1}{V(d)} \right) V(d) = 1. \tag{6.4}$$
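To see equations 6.3 and 6.4 at work numerically, here is a minimal simulation sketch in Python (the language, the seed, and the parameter values are our illustrative assumptions, not the text's). It draws many values of an estimator d and standardizes each one as in equation 6.2:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
delta, v = 5.0, 4.0             # hypothetical parameter delta and variance V(d)

d = rng.normal(loc=delta, scale=v ** 0.5, size=100_000)  # many draws of d
d_star = (d - delta) / v ** 0.5                          # equation 6.2

print(d_star.mean())  # near 0, as equation 6.3 requires
print(d_star.var())   # near 1, as equation 6.4 requires
```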

Equation 6.2 gives d* as a linear function of d: It is equal to the difference between d multiplied by 1/√V(d) and δ multiplied by 1/√V(d). Linear functions of normal random variables are also distributed normally. This, along with equations 6.3 and 6.4, implies that the true distribution of d* is

$$d^* \sim N(0, 1).$$

As given in equation 6.2, d* varies only because d can vary from sample to sample. However, we don't know the value of V(d). It's a parameter of the distribution of d. All we have is an estimate, SD(d), from our sample. In practice, we have to standardize d with an estimate of the standard deviation rather than its true value. The consequence of this is that the actual value of d* will vary from sample to sample both because d and its estimated standard deviation, SD(d), can vary. We have to account for this additional source of variability by adjusting the distribution of d*. We recall that, for this reason, we have to treat d* as if it has the t distribution, rather than the standard normal. Utilizing SD(d), we have


$$d^* = \frac{d - \delta}{SD(d)} \sim t^{(df)}, \tag{6.5}$$

where t(df) represents the t distribution with df degrees of freedom.

We recall that, when df is "large", the relevant t distribution closely resembles the standard normal distribution. This is because, roughly speaking, large samples yield pretty good estimates of the standard deviation of d. In consequence, we don't introduce much of a distortion when we standardize d by its estimated rather than by its true standard deviation. Therefore, it's not important to adjust for this distortion by switching from the normal to the t distribution. In other words, in large samples, the standard normal distribution is a good approximation of the relevant t distribution.[2]

We're going to proceed on the expectation that at least some of our samples will be too small to support this approximation. Accordingly, the t distribution will feature in all of our derivations, in order to ensure that they apply regardless of sample size. If, in a particular application, the standard normal approximation is appropriate and preferred, the equations here will work if we simply substitute Z wherever t(df) appears, and refer to Appendix table 1 rather than table 2. With these preliminaries, the probabilistic statement with which we begin is

$$1 - \alpha = P\left( -t_{\alpha/2}^{(df)} < d^* < t_{\alpha/2}^{(df)} \right). \tag{6.6}$$

[2] How big does df have to be? Most tables for the t distribution present approximately 40 values for df: all integer values from one to thirty, several values up to, perhaps, 120, and then df = ∞. If we compare Appendix tables 1 and 2, we can verify that the t distribution is identical to the standard normal distribution when df = ∞. Moreover, if we examine Appendix table 2, we'll see that the critical values for the t distribution with df > 30 begin to look very similar to those for the t distribution with df = ∞. Therefore, the normal distribution becomes an acceptable approximation for the t distribution somewhere in the range of df > 30.
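The claim in footnote 2 is easy to check numerically. A short sketch (Python with scipy, our assumed tooling rather than anything the text uses) compares the two-tailed critical values t with α = .05 to the standard normal value Z, which is about 1.960:

```python
from scipy import stats

print(f"normal Z_.025: {stats.norm.ppf(0.975):.3f}")  # about 1.960

# t critical values shrink toward the normal value as df grows
for df in (5, 19, 30, 120, 999):
    print(f"df={df:4d}: t_.025 = {stats.t.ppf(0.975, df):.3f}")
```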

© Jeffrey S. Zax 2008

- 6.9 -

In equation 6.6, α represents the probability that is to be omitted from our interval estimates.[3] The value for the t(df) distribution that leaves one half of this probability in the upper tail of the distribution, outside of our interval, is t_{α/2}^{(df)}. The negative of this value leaves the same probability outside of our interval in the lower tail.[4] Consequently, the probability in the two tails together is α.

There's a notational subtlety here that will be endlessly confusing if we don't grasp it right now. The symbol t(df) represents a random variable that follows the particular t distribution associated with df degrees of freedom. A range of values is possible for this, as for all random variables. In this case, the range is from −∞ to ∞. The probability of observing an outcome within any part of this range is given, in principle, by the integral of the density function for this distribution, as discussed in footnote 5 of chapter 5. In practice, the probabilities of observing relevant outcomes are given, instead, by the numbers in Appendix table 2.

The symbol t_{α/2}^{(df)} is an example of such a number. It represents the specific value for the random variable t(df) above which lies (α/2)% of the probability in the distribution for t(df), and below which lies (1−α/2)% of the probability in the distribution. To summarize, the symbol t(df) represents a random variable. The same symbol, augmented by a subscript as in the case of t_{α/2}^{(df)}, represents a constant, a particular point in the distribution of the random variable. The value of the subscript identifies the probability that remains in the distribution above the point t_{α/2}^{(df)}.

[3] It is really unfortunate that α is the standard notation for this probability. It is also, as in chapter 5, frequently used to represent the intercept of the population relationship. We're going to have to decide which interpretation is appropriate based on context. Fortunately, we're rarely going to be in the business of constructing interval estimates for the intercept, so we shouldn't have to use α for both of its meanings at the same time. This should make it relatively easy for us to ascertain which usage of α is relevant.

[4] Just to reiterate, if df were large, we could treat d* as a standard normal random variable in equation 6.5. Under this treatment, we would replace t_{α/2}^{(df)} with Z_{α/2}.


[Figure 6.1: The density function for d*. P(−t_{α/2}^{(df)} < d* < t_{α/2}^{(df)}) = 1 − α; P(d* ≥ t_{α/2}^{(df)}) = α/2; P(d* ≤ −t_{α/2}^{(df)}) = α/2.]

Figure 6.1 illustrates the relationship in equation 6.6. The density function is centered at the origin of the horizontal axis, because E(d*) = 0. The area beneath this density function between −t_{α/2}^{(df)} and t_{α/2}^{(df)} is equal to 1 − α. This indicates that the probability of observing a value for the random variable d* between −t_{α/2}^{(df)} and t_{α/2}^{(df)} is equal to 1 − α. The probability of observing a value that is less than −t_{α/2}^{(df)} is α/2, as is the probability of observing a value that is greater than t_{α/2}^{(df)}. The probability of observing a value that is either less than −t_{α/2}^{(df)} or greater than t_{α/2}^{(df)} is equal to α.[5]

[5] Figure 6.1, and indeed the rest of this book, uses strict inequalities to define the area comprising (1−α)% of the probability. Nothing would change if this area included the points that define its boundaries, because the probability associated with any individual value for a continuous random variable is negligible. The convention that we adopt here is consistent with the typical practice of forming hypothesis tests. We won't discuss this practice until section 6.4, but in general it takes standardized estimators identically equal to the critical value, or prob-values identically equal to α, as rejections of the null hypothesis.


Replacing d* in equation 6.6 with its equivalent in terms of our original population parameter and estimator from equation 6.2, we have

$$1 - \alpha = P\left( -t_{\alpha/2}^{(df)} < \frac{d - \delta}{SD(d)} < t_{\alpha/2}^{(df)} \right). \tag{6.7}$$

This is the fundamental equation of this chapter and, for that matter, of inference. Both of our interval estimators, confidence intervals and hypothesis tests, are simply restatements of equation 6.7.

Section 6.3: Confidence Intervals

The confidence interval is the easier interval estimate to understand. A confidence interval has two important attributes. First, the confidence level is equal to (1−α)%, and is therefore determined by our choice of α. Second, the interval is defined by a lower bound and an upper bound, determined jointly by our choice of confidence level and the information in the sample at hand. Lastly, the purpose of the confidence interval is to establish a range within which we believe, with the pre-specified level of confidence, that the true parameter value lies. In other words, the confidence level gives the probability that our interval contains the value in which we are interested.

With this objective, it is easy to see how confidence intervals emerge from equation 6.7.



If we want an interval in which δ is likely to lie, we need to rewrite the inequality in equation 6.7,

$$-t_{\alpha/2}^{(df)} < \frac{d - \delta}{SD(d)} < t_{\alpha/2}^{(df)},$$

so as to isolate δ in the middle term. Multiplying all three terms by SD(d),

$$-t_{\alpha/2}^{(df)} SD(d) < d - \delta < t_{\alpha/2}^{(df)} SD(d).$$

Subtracting d from all three terms and then multiplying through by −1, which reverses the inequalities, yields

$$d + t_{\alpha/2}^{(df)} SD(d) > \delta > d - t_{\alpha/2}^{(df)} SD(d).$$

Finally, we obtain the conventional representation of the confidence interval by reversing the order of the terms and, consequently, the inequalities, and reinserting in equation 6.7:

$$1 - \alpha = P\left( d - t_{\alpha/2}^{(df)} SD(d) < \delta < d + t_{\alpha/2}^{(df)} SD(d) \right). \tag{6.8}$$

The probabilities in equations 6.7 and 6.8 have to be the same because values for d that satisfy the inequality in one must always satisfy the inequality in the other. Values that violate one inequality must violate both.
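As a computational summary of equation 6.8, here is a minimal sketch (Python with scipy; the function name and sample numbers are our illustrative assumptions) that turns an estimate d, its estimated standard deviation SD(d), and the degrees of freedom into a (1 − α)% confidence interval:

```python
from scipy import stats

def confidence_interval(d, sd_d, df, alpha=0.05):
    """Equation 6.8: the (1 - alpha)% confidence interval for delta."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # t_{alpha/2}^{(df)}
    return d - t_crit * sd_d, d + t_crit * sd_d

# Hypothetical values: d = 10, SD(d) = 2, df = 19
print(confidence_interval(10.0, 2.0, 19))  # roughly (5.81, 14.19)
```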


[Figure 6.2: The (1−α)% confidence interval. The interval is centered on d; its lower bound is d − t_{α/2}^{(df)}SD(d) and its upper bound is d + t_{α/2}^{(df)}SD(d), so each half of the interval has width t_{α/2}^{(df)}SD(d).]

Figure 6.2 presents a graphical representation of the confidence interval in equation 6.8. This representation restates the interpretation we provided for the confidence interval just before the formal development. First, the confidence interval is centered on d, the estimator from our sample. Second, the lower and upper bounds of the confidence interval are known: Given our choice of a specific value for α and the degrees of freedom in our sample, we obtain a specific value for t_{α/2}^{(df)} from the table for the t distribution, Appendix table 2. Our sample has provided us with a specific value for d, and an estimate of its standard deviation, SD(d). We believe, with (1−α)% confidence, that the unknown parameter δ lies between the known bounds d − t_{α/2}^{(df)}SD(d) and d + t_{α/2}^{(df)}SD(d). Third, figure 6.2 doesn't contain the density function for d, because we don't know where to put it. The problem is that the density function is centered on δ. But that's what we're looking for. All we know about δ is that our confidence interval is (1−α)% likely to include it.


The only remaining question is the choice of α. The advantage of confidence intervals is that, while point estimators are almost surely "wrong", in the sense of not being precisely right, confidence intervals can be constructed to provide any level of assurance. So, in general, we construct them so that they provide a very high degree of confidence.

Of course, the highest degree would be 100% confidence, or α = 0. This degree of confidence would certainly be reassuring. The problem is, as exercise 6.2 asks us to demonstrate, that the lower bound of the corresponding confidence interval is −∞ and the upper bound is ∞. In other words, we can be 100% certain that δ lies between −∞ and ∞. This assertion is as useless as it is true. Imagine that someone told us, in our running example, that β, the annual return to a year of education, was surely between −∞ and ∞. Would we dash to the nearest institution of higher learning and enroll in twenty credits?

This illustrates a more general problem. Confidence intervals that include a wide range of possible behaviors don't give us much guidance as to which particular type of behavior we can expect. Predictions at different points of such confidence intervals would lead us to make very different choices. In consequence, confidence intervals of this sort aren't very helpful.

This is the essential tension in the construction of confidence intervals. All else equal, higher certainty requires that we expand the range of possible values for the parameter. If we expand the range of possible values, we are more confident that we are correct. However, we run a greater risk that values at different points of this range represent very different behaviors. The choice of α always represents a compromise between confidence and usefulness. In practice, we require confidence of at least 90%. Therefore, we always choose α to be .1 or less. We usually set it at .05, indicating that we wish to be 95% confident that our interval contains the true value of δ.


Usefulness requires that confidence intervals be narrow enough to imply relatively consistent behavior throughout their ranges. Equation 6.8 and figure 6.2 demonstrate that the width of the confidence interval is equal to 2t_{α/2}^{(df)}SD(d). More confidence requires lower values of α, correspondingly higher values of t_{α/2}^{(df)}, and wider confidence intervals. With α fixed, the width of the confidence interval depends only on SD(d). We attempt to ensure that our confidence intervals are useful by relying on data that are rich enough to yield acceptably small estimates for the sample standard deviation.

Estimators will tend to have smaller sample standard deviations when based on data in which the behavior at issue is expressed clearly and consistently. In the example of the effects of education on earnings, clarity would require that both variables be measured with some precision. Consistency would require that education typically increased productivity, and productivity was reliably related to earnings. We might encounter behavior that was neither clear nor consistent in, for example, a poor command economy. There, records might be haphazard and earnings determined by ideology or access to economic rents.

Estimators will also, of course, ordinarily have smaller sample standard deviations if the number of observations is larger. This is because we're more likely to be able to discern the true relationship between xi and yi, no matter how weak, the more evidence we have. If narrower, more informative confidence intervals are desired, here is the most effective strategy: Get more data.

We can illustrate all of these issues by continuing with the examples of earlier chapters. Let's repeat an exercise from our statistics course. We'll construct a 95% confidence interval for the expected value of earnings in the population that gave rise to the sample of table 3.3. This expected value is the unknown population parameter in which we are interested. We usually represent expected values with the Greek letter μ, pronounced "myoo".


This symbol, representing a specific population parameter, will replace the generic parameter δ in equation 6.8. We typically estimate μ, an expected value in a population, with an average from a corresponding sample. We've already labeled average earnings in the sample of table 3.3 as ȳ. Accordingly, this symbol, representing a specific estimator, will replace the generic estimator d in equation 6.8.

In addition, equation 6.8 requires three interrelated quantities, SD(d), df and t_{α/2}^{(df)}. The first is the sample standard deviation of the estimator. Again, as we learned in our statistics course, the sample standard deviation of our particular estimator, ȳ, can be written as[6]

$$SD(\bar{y}) = \frac{SD(y_i)}{\sqrt{n}}. \tag{6.9}$$

According to equation 3.11, the calculation of the sample variance, and therefore SD(yi), uses n − 1 in the denominator. This is therefore the degrees of freedom, or df, that we need in equation 6.8. Lastly, we'll follow our convention and set α = .05. Consequently, for this example, equation 6.8 becomes

$$.95 = P\left( \bar{y} - t_{.025}^{(n-1)} \frac{SD(y_i)}{\sqrt{n}} < \mu < \bar{y} + t_{.025}^{(n-1)} \frac{SD(y_i)}{\sqrt{n}} \right). \tag{6.10}$$

Now we need actual quantities for all of the symbols in equation 6.10 with the exception of μ. Three come from what we already know about this sample: Table 3.3 reports that the sample average for earnings, ȳ, is $28,415 and the sample size, n, is 20. The end of section 3.4 gives the standard deviation of earnings in this sample, SD(yi), as $30,507. The fourth quantity in equation 6.10, now that we know what n − 1 must be, comes from Appendix table 2: t_{.025}^{(19)} = 2.093. With these values, the confidence interval for μ is

$$.95 = P\left( 28{,}415 - 2.093 \frac{30{,}507}{\sqrt{20}} < \mu < 28{,}415 + 2.093 \frac{30{,}507}{\sqrt{20}} \right) = P\left( 14{,}137 < \mu < 42{,}693 \right). \tag{6.11}$$

[6] Exercise 6.4 reviews these results from our statistics course.

In other words, we are 95% confident that the expected value of earnings in the population from which we drew the sample of table 3.3 is somewhere between the lower bound of $14,137 and the upper bound of $42,693. This isn't very helpful. The lower bound would be, for many families, below the poverty level. The upper bound could be consistent with a relatively comfortable lifestyle.

What if we would like to have more precise knowledge of this expected value? As we've just said, there's really only one responsible way to accomplish this: Augment or replace our sample with more or richer data, in the hopes of reducing SD(ȳ). Let's replace the sample of 20 from table 3.3 with a sample of 1,000 observations such as those tabulated in the third line of table 5.5. Average earnings in this sample are very close to those in our smaller sample, ȳ = 29,146. In addition, SD(yi) = 42,698 and df = 999. Consequently, t_{.025}^{(999)} ≈ 1.960. Therefore, equation 6.10 becomes

$$.95 = P\left( 29{,}146 - 1.960 \frac{42{,}698}{\sqrt{1{,}000}} < \mu < 29{,}146 + 1.960 \frac{42{,}698}{\sqrt{1{,}000}} \right) = P\left( 26{,}500 < \mu < 31{,}792 \right). \tag{6.12}$$
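Both intervals are easy to reproduce. Here is a self-contained sketch (Python with scipy, our assumed tooling; the function name is ours, the sample statistics are the text's):

```python
from scipy import stats

def ci_for_mean(ybar, sd_y, n, alpha=0.05):
    """Equations 6.9 and 6.10: a (1 - alpha)% confidence interval for mu."""
    sd_ybar = sd_y / n ** 0.5                   # equation 6.9
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)  # t_{alpha/2}^{(n-1)}
    return ybar - t_crit * sd_ybar, ybar + t_crit * sd_ybar

print(ci_for_mean(28_415, 30_507, 20))     # about (14,137, 42,693), equation 6.11
print(ci_for_mean(29_146, 42_698, 1_000))  # about (26,496, 31,796); the text's
                                           # (26,500, 31,792) rounds t to 1.960
```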

The confidence interval in equation 6.11 is $28,555 wide. That in equation 6.12 is only $5,293 wide. Clearly, the latter is a lot more informative than the former. The difference is mostly because SD(ȳ) is much smaller for the larger sample of equation 6.12. The smaller value for t_{.025}^{(df)} in this equation also plays a subsidiary role.

As we can see, when we construct these confidence intervals, we assume an intellectual posture that is without prejudice. In other words, we impose no preconceived notions regarding the value of δ. We simply ask the data to instruct us as to where it might lie.

We can illustrate this posture with what might be a useful metaphor. Imagine playing horseshoes with an invisible post. Ordinarily, this would be frustrating. But here, think of the unknown value of the population parameter, δ, as the invisible post. Think of d − t_{α/2}^{(df)}SD(d) and d + t_{α/2}^{(df)}SD(d) as the ends of the horseshoe. This horseshoe is different from ordinary horseshoes, because it actually has an affinity for the post. This affinity arises from the relationship between the unknown value of the parameter δ and the known value of d, given in equation 6.1. In consequence, if we throw this particular horseshoe at this particular invisible post, the probability of a ringer is (1−α)%.
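The horseshoe metaphor can be checked by simulation. A minimal sketch (the population parameters, sample size, and seed below are our illustrative assumptions): draw many samples, build a 95% interval from each, and count how often the interval rings the invisible post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
mu, sigma, n, alpha = 25_000.0, 30_000.0, 20, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)

ringers, trials = 0, 10_000
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    ybar = sample.mean()
    sd_ybar = sample.std(ddof=1) / n ** 0.5   # SD(ybar), equation 6.9
    if ybar - t_crit * sd_ybar < mu < ybar + t_crit * sd_ybar:
        ringers += 1

print(ringers / trials)  # close to .95, the probability of a ringer
```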

Section 6.4: Hypothesis Tests

The arithmetic of hypothesis tests is virtually identical to that of confidence intervals. However, the intellectual posture is completely different. We invoke hypothesis tests when we have hypotheses, or strong prior convictions, regarding what values for δ might be of interest. For our purposes, the source of these convictions will usually be in behavioral intuition, economic theory or substantively relevant thresholds. When we make hypothesis tests, we address the data not for instruction, but for validation. This requires that the test be constructed with a certain amount of artifice.


Imagine that we believe something strongly. We want to convince someone else. They say, "Prove it!" What would be an effective response? A response like, "I just believe it," would ordinarily be considered pretty weak. If we were able to cite a couple of facts that were consistent with our belief, that would be stronger. If we were able to point out a large number of empirical observations that couldn't be explained any other way, that would usually be pretty convincing.

The objective here is to be very convincing. Therefore, we begin, essentially, by asserting the opposite of what we really expect. We refer to this assertion as the "null hypothesis", represented by H0. It specifies a value for the parameter that would be incompatible with our expectations, δ0:

$$H_0: \delta = \delta_0.$$

As will become apparent in a bit, we construct the test so as to "stack the deck" in favor of this hypothesis. Then, if we're still able to conclude that it isn't plausible, we can reasonably claim that it must be discarded, and our actual expectations adopted.

Section 6.4.1: Two-tailed tests

Intuitively, a hypothesis test takes the form of a question. Is d, the observable estimate of δ, close enough to δ0 to be taken as validation of the null hypothesis? In order to put this question formally, we identify a fairly wide range of values for the estimator d that would be sufficiently close to δ0.


This range is called the acceptance region. It is an interval defined by a lower and an upper bound, within which values for d should lie with pre-specified probability. As such, it seems natural to derive this region by returning to the inequality in equation 6.7. Now, we replace δ with its hypothesized value of δ0, and rearrange the inequality so as to isolate d in the middle term. As in the construction of confidence intervals, we multiply all three terms in this inequality by SD(d):

$$-t_{\alpha/2}^{(df)} SD(d) < d - \delta_0 < t_{\alpha/2}^{(df)} SD(d).$$

However, this time we simply add δ0 to each term, yielding

$$\delta_0 - t_{\alpha/2}^{(df)} SD(d) < d < \delta_0 + t_{\alpha/2}^{(df)} SD(d). \tag{6.13}$$

This expression defines the acceptance region. Its lower bound is δ0 − t_{α/2}^{(df)}SD(d) and its upper bound is δ0 + t_{α/2}^{(df)}SD(d). Both bounds are known: The null hypothesis gives us the value for δ0. The choice of α, the degrees of freedom in the sample and the distributional assumptions that we have made about d give us t_{α/2}^{(df)}. Lastly, the data give us SD(d).

Values for d between these bounds are taken as sufficiently close to δ0 as to be consistent with, or supportive of, the null hypothesis that δ = δ0. Correspondingly, values of d below δ0 − t_{α/2}^{(df)}SD(d) and above δ0 + t_{α/2}^{(df)}SD(d) are in the rejection region. These values are sufficiently far from δ0 as to be inconsistent with the null hypothesis, even when the "deck" is "stacked" in its favor. Because the rejection region has two components, one at either extreme of the range of possible values for d, we refer to this type of hypothesis test as a two-tailed test.

The hypothesis test, itself, is formed by reinserting the expression of equation 6.13 into equation 6.7:


$$1 - \alpha = P\left( \delta_0 - t_{\alpha/2}^{(df)} SD(d) < d < \delta_0 + t_{\alpha/2}^{(df)} SD(d) \right). \tag{6.14}$$

This equation states that, if δ = δ0, then d, an unbiased estimate of δ, should lie within the acceptance region with probability 1 − α. In order to use this test, we ask if our calculated value of d lies within the known bounds of the probabilistic interval in equation 6.13.[7]

The decision rule accompanying this test is as follows: If the calculated value of d actually lies within the acceptance region, the data are not inconsistent with the null hypothesis that δ = δ0. We choose to retain δ0 as our belief about δ.[8] In this case, we announce that we have "failed to reject the null hypothesis." If, instead, this value lies in the rejection region, it is clearly inconsistent with the null hypothesis. Accordingly, we choose to abandon it.

[7] Notice the typographical similarity between equations 6.8 and 6.14. The only big difference is that d and δ have switched places. There's also the minor difference that δ has acquired a "0" subscript. This similarity is both good news and bad. The good news is that once we've mastered the arithmetic for either the confidence interval or the hypothesis test, the other should be easy to learn. The bad news is that we have to be careful not to mistake one for the other.

[8] In this circumstance, the data are sometimes described as "consistent" with the null hypothesis, which is therefore "accepted". This language is convenient, but gives δ0 too much credit. "Consistency" should indicate some measure of active agreement. Here, the test accepts the null hypothesis unless a serious contradiction occurs. Therefore, a test can often fail to reject the null hypothesis even if the point estimate is quite different from δ0, in terms of what it implies about the behavior in question.


[Figure 6.3: Two-tailed hypothesis test at the α% significance level. The density function for d under H0 is centered on δ0. The acceptance region runs from δ0 − t_{α/2}^{(df)}SD(d) to δ0 + t_{α/2}^{(df)}SD(d); the rejection region lies beyond those bounds on either side.]

Figure 6.3 illustrates this test. The null hypothesis specifies that E(d) = δ0. Under the null hypothesis, the density function for d is therefore centered on δ0. The acceptance region extends t_{α/2}^{(df)}SD(d) above and below δ0. The rejection region lies beyond in either direction.

In the context of hypothesis tests, the value of α is called the significance level. As in the case of the confidence interval, the value of α affects the width of the relevant interval. Higher values imply more probability in the rejection regions, lower values for t_{α/2}^{(df)}, and narrower acceptance regions. Therefore, higher values of α make it easier to reject the null hypothesis.


For example, suppose that α = 1. In this case, the acceptance region becomes the single point δ0.[9] If d is equal to anything else, the decision rule states that we should reject the null hypothesis. Obviously, this feels as though the test is fixed to guarantee rejection.[10] Returning to the introductory discussion of this section, it would amount to dismissing an opposing position without any consideration of its merits. While we all know people who behave in this way, it's not good science. Moreover, if this is how we're going to behave, why bother with the test? We already know what decision we'll make.

So the question of what value we should assign to α is clearly important. It becomes the question of the extent to which we want to predispose the test to rejection or acceptance. To begin an answer, recall that we actually want to make a very strong case that the null hypothesis is wrong. To do so, we have to give it the benefit of the doubt. In other words, we want to be very careful that we don't reject the null hypothesis when it is really true. Therefore, we are prepared to construe a relatively wide range of values for d as not inconsistent with the hypothesis that δ = δ0. For this reason, we choose values for α that are low.

At the same time, we never set α as low as zero. This significance level would imply boundaries for the acceptance region of −∞ and ∞. In other words, we would take any evidence as consistent with the null hypothesis, the proposition that we dispute! This makes even less sense than setting α = 1. Therefore, we always set α > 0.

As in the case of confidence intervals, α is never more than .1, and most often .05. These values ensure that we reject the null hypothesis only when the value of the estimator is so far from the hypothesized value that it would be most unlikely to occur, if the null hypothesis were correct.

[9] Why? Exercise 6.5 will help.

[10] Essentially, this is a restatement of the point made previously that point estimates are almost certainly wrong. Again, can we see why?


When we reject the null hypothesis, we declare that the hypothesis test is significant at the level of α. This practice unfortunately involves some awkward terminology. If we choose a smaller value for α, we give greater benefit of the doubt to the null hypothesis. If we reject the null hypothesis even so, it's more noteworthy. So we often describe rejections at lower values for α, lower significance levels, as being of greater significance. This terminology is almost universal, so there's no way to fix the ambiguity. We just have to be alert to it.

In any case, rejection is a big moment. We have constructed the test so that a wide range of values for d would validate the null hypothesis. Nevertheless, the data contradict it. We can therefore discard it with conviction.

If, instead, we fail to reject the null hypothesis, we say that the test is insignificant. This can also be called a "failure to disprove" and, more derisively, "proving the null hypothesis". No one gets famous proving null hypotheses. This is because, after all, we've constructed our test so that this hypothesis is very hard to disprove. Also, there is the nagging possibility that the reason we have failed to disprove is not because the null hypothesis is actually true, but because we just weren't industrious enough to collect data that were informative enough to reveal its falsity. Better data might have yielded a more precise estimate of δ, in the form of a smaller sample standard deviation for d, and an unambiguous rejection. For these reasons, an insignificant test is usually regarded as a modest accomplishment, if an accomplishment at all.

We can illustrate these ideas by returning to the problem of the expected value for earnings. We rewrite equation 6.14 with the symbols that are appropriate for this problem, as we discussed in section 6.3:


$$.95 = P\left( \mu_0 - t_{.025}^{(n-1)} \frac{SD(y_i)}{\sqrt{n}} < \bar{y} < \mu_0 + t_{.025}^{(n-1)} \frac{SD(y_i)}{\sqrt{n}} \right). \tag{6.15}$$

The only new symbol here is μ0. Therefore, the only thing we need to add to the information from section 6.3 is a null hypothesis. Let's use

$$H_0: \mu = 25{,}000.$$

This null hypothesis asserts that the expected value of earnings is $25,000. Section 6.3 reproduced the values for all of the other quantities in the sample of 20 observations presented in table 3.3. If we now replace all of the symbols in equation 6.15 with these values, we obtain

$$.95 = P\left( 25{,}000 - 2.093 \frac{30{,}507}{\sqrt{20}} < \bar{y} < 25{,}000 + 2.093 \frac{30{,}507}{\sqrt{20}} \right) = P\left( 10{,}722 < \bar{y} < 39{,}278 \right). \tag{6.16}$$

In other words, the acceptance region for this null hypothesis, given this sample, ranges from $10,722 to $39,278. If the average value for earnings in our sample were outside of this range, it would be in the rejection region. We would conclude that the evidence in our sample was inconsistent with, and therefore contradicted, our null hypothesis. However, as we've already seen, for this sample ȳ = 28,415. This estimator lies within the acceptance region. The appropriate conclusion is that this sample is not inconsistent with the null hypothesis, and therefore fails to reject it.
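In code, the bounds of equation 6.16 and the decision follow directly (a sketch with scipy, our assumed tooling; the numbers are the text's):

```python
from scipy import stats

mu_0, sd_y, n, alpha = 25_000, 30_507, 20, 0.05
ybar = 28_415

half_width = stats.t.ppf(1 - alpha / 2, n - 1) * sd_y / n ** 0.5
lower, upper = mu_0 - half_width, mu_0 + half_width
print(lower, upper)  # about 10,722 and 39,278

# ybar falls inside the acceptance region, so we fail to reject H0
print("reject H0" if not (lower < ybar < upper) else "fail to reject H0")
```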


The next step would be to wonder whether we have failed to reject the null hypothesis because it is actually plausible, or because our sample isn't sufficiently informative. After all, the acceptance region in equation 6.16 is $28,555 wide! We can compare these two possibilities by asking the same question of the larger sample of 1,000 observations that we examined in section 6.3:

$$.95 = P\left( 25{,}000 - 1.960 \frac{42{,}698}{\sqrt{1{,}000}} < \bar{y} < 25{,}000 + 1.960 \frac{42{,}698}{\sqrt{1{,}000}} \right) = P\left( 22{,}354 < \bar{y} < 27{,}646 \right). \tag{6.17}$$

The acceptance region in equation 6.17 offers a much more discriminating test. Although it is at the same significance level as the test in equation 6.16, it is only $5,293 wide. Therefore, a sample average that failed to reject the null hypothesis would actually have to be somewhat close, in substantive terms, to the hypothesized value of $25,000.

In the event, it's not. The average in this sample is $29,146. It lies in the rejection region. We conclude that the first sample was consistent with the null hypothesis, not because the hypothesis was true, but rather because the average for that sample was estimated too imprecisely. It would have been consistent with a wide range of values for the null hypothesis. The average for the second sample is estimated much more precisely. Consequently, it does a much better job of distinguishing between values for μ0 that are plausible and those that are not.

Now that we understand the foundations of the two-tailed hypothesis test, we're in a position to discuss two methods that perform the same test but simplify the execution a bit. The first method begins with equation 6.14, but rewrites it in the form of equation 6.7:


$$1 - \alpha = P\left( -t_{\alpha/2}^{(df)} < \frac{d - \delta_0}{SD(d)} < t_{\alpha/2}^{(df)} \right). \tag{6.18}$$

The standardized value of d under the null hypothesis is

$$d_0^* = \frac{d - \delta_0}{SD(d)}. \tag{6.19}$$

Equation 6.18, with the substitution of equation 6.19, becomes

$$1 - \alpha = P\left( -t_{\alpha/2}^{(df)} < \frac{d - \delta_0}{SD(d)} < t_{\alpha/2}^{(df)} \right) = P\left( -t_{\alpha/2}^{(df)} < d_0^* < t_{\alpha/2}^{(df)} \right). \tag{6.20}$$

The second equality of equation 6.20 demonstrates that the test rejects the null hypothesis if d0* ≤ −t_{α/2}^{(df)} or t_{α/2}^{(df)} ≤ d0*. These two conditions can be combined as

$$t_{\alpha/2}^{(df)} \le \left| d_0^* \right|. \tag{6.21}$$

In other words, the test can be construed as a comparison of the absolute value of d0* to t_{α/2}^{(df)}. We can calculate the former using d and SD(d) from our sample, and δ0 from H0. We obtain the latter from Appendix table 2 for the t(df) random variable. The absolute value |d0*| is called the test statistic. The test rejects the null hypothesis if the test statistic equals or exceeds t_{α/2}^{(df)} and fails to reject it otherwise. Therefore, t_{α/2}^{(df)} is the decisive, or critical, value for this test.

Exercise 6.7 demonstrates that d lies in the rejection region of equation 6.14 whenever equation 6.21 is true, and vice versa. Conversely, d lies in the acceptance region of equation 6.14 whenever equation 6.21 is false.


In our example of earnings, equation 6.19 becomes

$$\mu_0^* = \frac{\bar{y} - \mu_0}{SD(\bar{y})}. \tag{6.22}$$

Using the values from the first of our two samples of earnings observations, this is

$$\mu_0^* = \frac{\bar{y} - \mu_0}{SD(\bar{y})} = \frac{28{,}415 - 25{,}000}{30{,}507 / \sqrt{20}} = .501. \tag{6.23}$$

In section 6.3 we identified the critical value for this sample as t_{.025}^{(19)} = 2.093. This clearly exceeds the test statistic of equation 6.23. Therefore, just as we concluded in our discussion of equation 6.16, we do not reject the null hypothesis with these data. As we might now expect, the second of our two samples of earnings yields a different outcome. Using the values we've already established for this sample, equation 6.22 becomes

$$\mu_0^* = \frac{\bar{y} - \mu_0}{SD(\bar{y})} = \frac{29{,}146 - 25{,}000}{42{,}698 / \sqrt{1{,}000}} = 3.071. \tag{6.24}$$

The critical value for this sample is t_{α/2}^{(df)} = 1.960. This is much less than 3.071, the test statistic in equation 6.24. Therefore, as we saw in equation 6.17, this sample decisively rejects the null hypothesis that μ0 = 25,000.
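A sketch of this critical-value method for both samples (scipy assumed; the values come from the text):

```python
from scipy import stats

def t_test_statistic(ybar, mu_0, sd_y, n):
    """Equation 6.22: the standardized value of ybar under H0."""
    return (ybar - mu_0) / (sd_y / n ** 0.5)

for ybar, sd_y, n in ((28_415, 30_507, 20), (29_146, 42_698, 1_000)):
    stat = abs(t_test_statistic(ybar, 25_000, sd_y, n))
    crit = stats.t.ppf(0.975, n - 1)  # t_{.025}^{(n-1)}
    verdict = "reject" if stat >= crit else "fail to reject"
    print(f"|t| = {stat:.3f}, critical value = {crit:.3f}: {verdict} H0")
```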


The second simplification of the two-tailed hypothesis test also returns to equation 6.14. The decision rule that we have adopted there implies that we reject the null hypothesis if d differs so greatly from δ0 that, were δ = δ0, the probability of observing d would be α or less. Equation 6.20 tells us that this is equivalent to rejecting the null hypothesis if the probability of observing the value |d0*| for the random variable t(df) is α or less. This probability is the prob-value.[11]

In other words, another way to evaluate the null hypothesis is to identify this probability, P(t(df) ≤ −|d0*| or t(df) ≥ |d0*|), where |d0*| is the test statistic. We first locate the probability of a positive outcome |d0*| units or more above zero directly from the entry for this value in the table for the random variable t(df).[12] This is equal to the probability of a negative outcome |d0*| units or more below zero, because of the symmetry of the distribution for this random variable:

$$P\left( t^{(df)} \le -\left| d_0^* \right| \right) = P\left( t^{(df)} \ge \left| d_0^* \right| \right).$$

Therefore, the probability of observing an outcome for t(df) at least |d0*| units away from zero in either direction is twice the probability of observing a value at least |d0*| greater than zero, as given in the table for t(df):

$$P\left( t^{(df)} \le -\left| d_0^* \right| \text{ or } t^{(df)} \ge \left| d_0^* \right| \right) = P\left( t^{(df)} \le -\left| d_0^* \right| \right) + P\left( t^{(df)} \ge \left| d_0^* \right| \right) = 2P\left( t^{(df)} \ge \left| d_0^* \right| \right).$$

The prob-value is therefore

$$\text{prob-value} = 2P\left( t^{(df)} \ge \left| d_0^* \right| \right).$$

The hypothesis test then becomes the comparison of the prob-value – the probability of observing the estimated absolute value or a greater value – to the significance level α – the probability that we have established as our threshold for rejection in either direction from the null hypothesis.

[11] We first encountered prob-values in section 1.8, where they were introduced as a companion to the F-statistic. Here, it's a companion to the t-statistic, but the underlying meaning is the same.

[12] Remember, |d*| is always positive, regardless of the sign on d*. Similarly, −|d*| is always negative.


If the prob-value is larger than α, the probability of observing our value of d0* is greater than our significance level. Therefore, our value of d is in the acceptance region for the hypothesis test, and consistent with H0. In other words, we accept the null hypothesis if

$$\text{prob-value} > \alpha. \tag{6.25}$$

Analogously, we reject the null hypothesis if the probability of observing our value of d0* is less than or equal to the threshold probability:

$$\text{prob-value} \le \alpha. \tag{6.26}$$

When this is true, d must be in the rejection region for H0. Exercise 6.7 asks us to confirm that the decision rules in equations 6.25 and 6.26 are consistent with those that we originally established for the hypothesis test in equation 6.14, and reiterated for the reformulation of equation 6.21 in terms of critical values.

Once again, we can illustrate this version of the two-tailed hypothesis test with the two samples that we have been following since section 6.3. From equation 6.23, the test statistic for the first is .501. This value doesn't appear in Appendix table 2 in the row for df = 19.


[Figure 6.4: P(t^{(19)} ≥ .501). In the density function for t^{(19)}, the probability above .688 is .25 and the probability above 2.093 is .025. The test statistic .501 lies below .688, so the probability above .501 exceeds .25.]

However, that row gives us all that we really need to know. It tells us that

$$P\left( t^{(19)} \ge .688 \right) = .25.$$

Figure 6.4 illustrates this situation. The probability in the density function above .688 is .25. The test statistic .501 lies below .688. Therefore, the probability in the density function above .501 must exceed the probability above .688:

$$P\left( t^{(19)} \ge .501 \right) > .25. \tag{6.27}$$

Similarly, the probability that our test statistic would be less than −.501 must also exceed .25. Together, these implications demonstrate that the prob-value associated with the test of the null hypothesis μ0 = 25,000 must exceed .5. This is more than ten times as large as our significance level, α = .05. In this case, equation 6.25 directs us to accept the null hypothesis.


The procedure is similar for the second sample. Equation 6.24 gives the test statistic for this sample as 3.071. Appendix table 2 doesn't have a row for df = 999, but this is so large that we can safely use the row for df = ∞. Still, 3.071 doesn't appear in that row. What does appear is

$$P\left( t^{(\infty)} \ge 2.576 \right) = .005. \tag{6.28}$$

This implies that

$$P\left( t^{(\infty)} \ge 3.071 \right) < .005.$$

This, in turn, implies that the prob-value associated with the test statistic in this sample is less than .01.[13] With α still set at .05, equation 6.26 instructs us to reject the null hypothesis.

The null hypothesis δ0 = 0 is often of special interest, as will be discussed in section 7.4. In this case, equation 6.7 simplifies to

$$1 - \alpha = P\left( -t_{\alpha/2}^{(df)} < \frac{d}{SD(d)} < t_{\alpha/2}^{(df)} \right).$$

The test of this hypothesis is especially simple: It consists of the comparison of the absolute value of d/SD(d) to the critical value t_{α/2}^{(df)}. If

$$t_{\alpha/2}^{(df)} \le \frac{\left| d \right|}{SD(d)} = \left| d_{\delta_0 = 0}^* \right|,$$

then d must lie in the rejection region.[14]

The quantity |d*_{δ0=0}| = |d|/SD(d) is often referred to as the t-statistic. This is a little misleading, because d0* has the t-distribution for all values of δ0, not just for δ0 = 0. However, this designation emphasizes the importance of the null hypothesis that δ0 = 0 in statistical practice, and in regression analysis in particular.

If t_{α/2}^{(df)} ≤ |d*_{δ0=0}|, we often summarize this result by stating that "the t-statistic is significant at the α% level". If it's not, we often describe this as a circumstance in which "we cannot statistically distinguish the estimate from zero". Sometimes we're more dismissive, and leave out the qualifier "statistically".

[13] Exercise 6.8 directs us to derive this result explicitly.

[14] Informal applications of this test compare the ratio of the estimator to its estimated standard deviation to two, which, when α = .05, is a reasonable approximation to Z_{α/2} or to t_{α/2}^{(df)} for all but the smallest samples. We previously alluded to this practice in section 1.5.
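The prob-value method is also easy to sketch in code (scipy assumed; the test statistics .501 and 3.071 come from equations 6.23 and 6.24):

```python
from scipy import stats

# Two-sided prob-values: 2 * P(t(df) >= |test statistic|)
for stat, df in ((0.501, 19), (3.071, 999)):
    prob_value = 2 * stats.t.sf(stat, df)  # sf(x) = P(t(df) >= x)
    verdict = "reject" if prob_value <= 0.05 else "accept"
    print(f"prob-value = {prob_value:.4f}: {verdict} H0")
    # first: above .5, accept; second: below .01, reject -- as in the text
```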

Section 6.4.2: One-tailed tests

When we reject the null hypothesis, regardless of its value, we have to confront the question of what we will choose to believe instead. What is our alternative hypothesis, or H1? Though we haven't discussed it yet, part of the answer to this question is implicit in the construction of the rejection region. In the hypothesis test of equation 6.14, we reject the null hypothesis if the observed value of d is either too low or too high. In other words, when we reject, we don't care whether values of d are above or below δ0, so long as they are far enough away from it. Implicitly, our alternative hypothesis is "anything but δ0", or H1: δ ≠ δ0.

This is a somewhat negative posture. It's also largely non-constructive.



How comfortable can we be with predictions, decisions or policies based wholly on the belief that δ isn't δ0? Often, we find ourselves with more instructive instincts with regard to the alternative to δ0. If nothing else, we're often in the position to suspect that if δ isn't δ0, it is likely to lie on one side of δ0. How should this belief be reflected in our rejection region?

For example, imagine that we believe that δ must be greater than δ0 if it isn't equal to it. Our alternative hypothesis is H1: δ > δ0. It should be obvious that it no longer makes sense to reject the null hypothesis of δ = δ0 if d is well below δ0. No matter how ridiculously far below, any value of d less than δ0 is going to be more consistent with the null hypothesis that δ = δ0 than with the alternative hypothesis that δ > δ0. So all values of d below δ0 should be included in the acceptance region.

What about values of d above δ0? Here we have to be more careful. What should we do about the highest values of d included in the acceptance region of equation 6.14? These are the kinds of values that might naturally occur if the alternative hypothesis were true. We would therefore run a serious risk if we took them as not inconsistent with the null hypothesis. This means that, just as we have extended the acceptance region in the direction away from the alternative hypothesis, we need to shrink it in the other direction.

In other words, rather than splitting the probability indicated by the significance level α between an upper and a lower section of the rejection region, we should consolidate that probability in a single rejection region located in the direction where the null and alternative hypotheses are going to conflict. A hypothesis test with a unitary rejection region is called a one-tailed test.

Formally, when the alternative hypothesis is δ > δ0, we define the rejection region as consisting solely of values for d that are so high as to indicate that the alternative is preferable to the null. This is sometimes referred to as an upper-tailed test. The range of outcomes in this single region must therefore occur with probability α.


The corresponding modification to equation 6.7 is

$$1 - \alpha = P\left( \frac{d - \delta}{SD(d)} < t_{\alpha}^{(df)} \right). \tag{6.29}$$

The lower bound for the inequality in the parentheses of equation 6.29 is implicit, and equal to −∞. The upper bound, t_α^{(df)}, is a value from Appendix table 2 for the t distribution with df degrees of freedom. This value is the point at which α% of the probability in this distribution lies above, and (1−α)% below. The transformation of equation 6.29 that is analogous to that yielding equation 6.14 is[15]

$$1 - \alpha = P\left( d < \delta_0 + t_{\alpha}^{(df)} SD(d) \right). \tag{6.30}$$

Again, the lower bound is implicitly −∞. This hypothesis test directs us to accept the null hypothesis of δ = δ0 if we observe a value for d that is less than δ0 + t_α^{(df)}SD(d). In other words, the acceptance region for this null hypothesis is between −∞ and δ0 + t_α^{(df)}SD(d). The test directs us to reject the null hypothesis in favor of the alternative that δ > δ0 if we observe a value for d in the rejection region. This region is at or above δ0 + t_α^{(df)}SD(d).[16]

[15] We make this transformation explicitly in exercise 6.9.

[16] Exercise 6.10 directs us to construct and interpret the lower-tailed test, the one-tailed hypothesis test when the alternative hypothesis suggests a value lower than that of the null hypothesis.


[Figure 6.5: Upper-tailed hypothesis test at the α% significance level. The density function for d under H0 is centered on δ0. The acceptance region runs from −∞ to δ0 + t_α^{(df)}SD(d); the rejection region lies at or above that bound.]

In other words, the test accepts the null hypothesis automatically if d ≤ δ0. Otherwise, the test can again be evaluated with a test statistic or with a prob-value. The test statistic is still d0* from equation 6.19, but we now compare it to the critical value t_α^{(df)} rather than t_{α/2}^{(df)}. Correspondingly, the prob-value for an upper-tailed test is the probability in a single tail only, P(t(df) ≥ d0*), and we accept the null hypothesis if prob-value > α.

Once again, we can illustrate the points of this section with our continuing analysis of earnings. Let's adopt this alternative hypothesis:

$$H_1: \mu > 25{,}000. \tag{6.31}$$

This implies that we need an upper-tailed hypothesis test. Equation 6.30, rewritten with the appropriate notation, is

$$.95 = P\left( \bar{y} < \mu_0 + t_{\alpha}^{(df)} SD(\bar{y}) \right). \tag{6.32}$$

This is the general form of the test that we need. For our first sample, Appendix table 2 reports that t_{.05}^{(19)} = 1.729. Accordingly, equation 6.32 becomes

$$.95 = P\left( \bar{y} < 25{,}000 + 1.729 \frac{30{,}507}{\sqrt{20}} \right) = P\left( \bar{y} < 36{,}794 \right). \tag{6.33}$$

There are two interesting things to note about this test. First, average earnings in this sample, $28,415, are in the acceptance region and therefore consistent with the null hypothesis, even in this upper-tailed test.


[Figure 6.6: Two-tailed and one-tailed hypothesis tests with 20 observations. The density function for ȳ under H0 is centered on 25,000. The two-tailed acceptance region runs from 10,722 to 39,278, with probability .025 in each tail beyond it. The one-tailed acceptance region runs from −∞ to 36,794; the area between 36,794 and 39,278 and the area above 39,278 each have probability .025.]

Second, figure 6.6 compares the acceptance and rejection regions for the one-tailed test in equation 6.33 to those in the two-tailed test, using the same sample, in equation 6.16. As we discussed just above, the lower end of the acceptance region in the upper-tailed test includes all values below the null hypothesis. This means that it contains all of the area below $10,722, the area that was in the lower of the two rejection regions for the two-tailed hypothesis test of equation 6.16. The probability associated with this area was 2.5%.

The rejection region for the upper-tailed hypothesis test of equation 6.33 contains all of the area above $39,278, the area in the upper rejection region for the two-tailed hypothesis test of equation 6.16. However, the probability associated with this area is only 2.5%. The upper-tailed rejection region for a test at 5% significance must have probability equal to 5%. Therefore, it also incorporates the area between $36,794 and $39,278, which contributes the necessary additional probability of 2.5%.

The other two methods of evaluating this hypothesis test, naturally, yield the same result. According to equation 6.23, the test statistic is μ0* = .501 for this sample. We have just established that the critical value is t_{.05}^{(19)} = 1.729. Therefore, the critical value exceeds the test statistic. We do not reject the null hypothesis. Similarly, in equation 6.27 we found that P(t^{(19)} ≥ .501) > .25. As we said just above, this probability is equal to the prob-value for a one-tailed test. It still exceeds our chosen significance level, α = .05. Yet again, we conclude that we cannot reject the null hypothesis.[17]

[17] We can repeat these analyses with the sample of 1,000 observations on earnings in exercise 6.11.
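A sketch of this upper-tailed test in code (scipy assumed; the inputs are the text's):

```python
from scipy import stats

mu_0, sd_y, n, alpha = 25_000, 30_507, 20, 0.05
ybar = 28_415

t_crit = stats.t.ppf(1 - alpha, n - 1)    # t_.05^(19), about 1.729
bound = mu_0 + t_crit * sd_y / n ** 0.5   # near the text's 36,794
stat = (ybar - mu_0) / (sd_y / n ** 0.5)  # .501
prob_value = stats.t.sf(stat, n - 1)      # one tail only; exceeds .25

print(f"bound = {bound:.0f}, statistic = {stat:.3f}, prob-value = {prob_value:.3f}")
print("reject H0" if stat >= t_crit else "fail to reject H0")
```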

Section 6.4.3: Type I and type II errors

Now we have a thorough understanding of how to construct a hypothesis test, and what decisions to make. We next have to consider how this whole procedure could go wrong. There are two ways. First, imagine that the null hypothesis is correct. In this case, given the way that we have designed the test, the probability of getting an observed value for d in the rejection region is exactly

". As we said above, we deliberately set " to be low in order to give the benefit of the doubt to the null hypothesis. Nevertheless, if we test 100 true null hypotheses, we will typically reject 100" of

17 We can repeat these analyses with the sample of 1,000 observations on earnings in exercise 6.11.


them. For example, if α = .05, we would ordinarily expect to reject five out of every 100 true null hypotheses. In these cases of mistaken rejection, we have simply observed a value for d that is relatively rare. We have mistaken a low-probability event, generated by the null hypothesis, for an indication that the null hypothesis is false and the alternative hypothesis is true. For example, imagine that we observed a series of coin tosses that yielded ten consecutive "heads". Would we conclude that we have been extremely lucky, or that we are dealing with a crooked coin? Our procedure for hypothesis testing directs us to accept the alternative hypothesis, that the coin is crooked.18 However, we must have some concern that the null hypothesis – that the coin is fair – is true and that we have simply been fortunate to observe something that almost never occurs. This kind of mistake, rejecting the null hypothesis when it is true, is called a type I error. This is the kind of mistake that worries us most. Accordingly, we design our statistical test so as to ensure that the probability that it will occur is acceptably low. As we saw just above, we set that probability at the significance level α. We also refer to α as the size of the hypothesis test. The other mistake that we can make is accepting the null hypothesis when it is false. How does this occur? We can only accept the null hypothesis if the observed value of d lies in the acceptance region. So this type of mistake, a type II error, occurs when the alternative hypothesis is true, but happens to yield a value for the estimator that appears to validate the null hypothesis. In the example of the coin, a type II error would occur if the coin were truly crooked, but we concluded that it was fair. This could happen if, for example, the coin yielded only seven heads in ten tosses.

18 For entertainment, we might try to calculate the probability of observing this outcome for a fair coin.
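Taking up the footnote's invitation, the calculation is short. The sketch below (Python, purely illustrative) gives the probability of ten consecutive heads from a fair coin, and also the probability of at least seven heads in ten tosses, the situation discussed next:

```python
from math import comb

# Ten consecutive heads from a fair coin: about 1 chance in 1,024.
p_ten_heads = 0.5 ** 10                                        # ~.00098

# At least seven heads in ten tosses of a fair coin.
p_seven_plus = sum(comb(10, k) for k in range(7, 11)) / 2 ** 10  # ~.172

print(p_ten_heads, p_seven_plus)
```

Seven or more heads occurs roughly 17% of the time with a fair coin, which is why that outcome does not justify rejecting the null hypothesis of fairness.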


While this is a lot of heads for a fair coin, it's not that far from the five that we would ordinarily expect. And remember, we're giving the null hypothesis of a fair coin the benefit of the doubt. Accordingly, we would probably not reject it on the basis of this evidence.

Table 6.1 Hypothesis tests: Actions and consequences

Action:                                  In fact:           Consequence:
If we accept the null hypothesis...      and it's true...   we've done the right thing
                                         and it's false...  type II error
If we reject the null hypothesis...      and it's true...   type I error
                                         and it's false...  we've done the right thing

Table 6.1 provides a convenient summary of the actions we might take and the mistakes we might make in the process of testing a statistical hypothesis. Unfortunately, the names that are universally given to these two mistakes, type I and type II error, don’t help us remember what’s actually at issue. They are not especially descriptive, and they are very easy to confuse. An analogy should help us to remember the distinctions between them. Consider the criminal justice system in the United States. In general, an individual accused of criminal activity is considered “innocent until proven guilty”. In other words, the null hypothesis in this context is that of innocence. A type I error occurs when the null hypothesis of innocence is true, but rejected. In other words, a type I error is the statistical equivalent of convicting an innocent person.


This kind of mistake, when it is discovered, is usually very upsetting. Therefore, we limit the probability that it will occur by requiring that evidence demonstrate guilt "beyond reasonable doubt". This is the legal equivalent of our practice of setting a low significance level, choosing a low value for α, in a hypothesis test. However, the cost of this stringent evidentiary standard is that sometimes the evidence is not sufficient to support a conviction when it would be appropriate. In this case, a type II error occurs: The null hypothesis of innocence is accepted even though the individual is guilty. In other words, a type II error is the statistical equivalent of acquitting a guilty person. We obviously care about type II errors as well as type I errors, but not as much. That's why we design our tests specifically to control the probability of the latter, without regard to the probability of the former, at least initially. But that leaves the obvious question: what is the probability that a type II error will occur? Based on what we have said so far about H1, we can't tell. If the alternative hypothesis is true, but simply states that δ > δ0, it's possible that the true value for δ is only slightly greater than δ0. In this case, it would be easy for a value of d to occur that was in the acceptance region, and the probability of a type II error would be very high.19 However, under this alternative hypothesis it is also possible that the true value for δ is much greater than δ0. In this case, it would be very unlikely that d would nevertheless be small enough to appear within the acceptance region. The probability of a type II error here could be small and even negligible. What this discussion reveals is that, in order to identify a specific probability of type II error, we have to specify precisely the value of δ under the alternative hypothesis.

19 Of course, if the null and alternative hypotheses are this close, the substantive differences between them might be so small that it doesn't matter which one we believe.


We have to be prepared to assert H1: δ = δ1 for some explicit value δ1. Before we attempt this, we need to be sure we understand what's at stake. What's not at stake is the hypothesis test itself. The only bearing that the alternative hypothesis has on this test is to indicate whether it should be two-tailed, as in equation 6.14, or one-tailed, as in equation 6.30. Once that's decided, neither of these equations depends on the value for δ1. Therefore, the choice of a specific value for δ1 has no effect on the calculation or outcome of the hypothesis test. It affects only the probability of incorrectly concluding that the null hypothesis shouldn't be rejected. Any specific alternative hypothesis can be either above or below the value for the null hypothesis, but not both. In other words, either δ1 < δ0 or δ1 > δ0. In either case, the appropriate test is one-tailed.


Figure 6.7 Probability of type II error for an upper-tailed hypothesis test

[The figure shows the density function for d under H1, centered at δ1. The acceptance region lies below $\delta_0 + t_{\alpha}^{(df)}SD(d)$ and the rejection region above it; the area under the H1 density within the acceptance region is P(type II error).]

For example, δ1 > δ0 implies an upper-tailed test. Equation 6.30 gives the acceptance region for this test as $d \leq \delta_0 + t_{\alpha}^{(df)}SD(d)$. If the alternative hypothesis is true, the probability that d will nevertheless lie in the acceptance region for the null hypothesis is20

$$P(\text{type II error}) = P\left(d \leq \delta_0 + t_{\alpha}^{(df)}\,SD(d) \mid \delta = \delta_1\right) \tag{6.34}$$

Figure 6.7 illustrates this probability.

20 The vertical bar "|" in equation 6.34 indicates conditional probability, the probability assuming that the value of δ is δ1.
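Equation 6.34 can also be checked by simulation. In the sketch below (Python with numpy and scipy), all numerical values – δ0 = 0, δ1 = 1, SD(d) = 1, 19 degrees of freedom, α = .05 – are illustrative assumptions rather than values fixed by the text; we draw values of d under the alternative and count how often they land in the acceptance region:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative (assumed) values, not taken from the text.
delta0, delta1, sd_d, df, alpha = 0.0, 1.0, 1.0, 19, 0.05

# Upper boundary of the acceptance region under H0.
boundary = delta0 + stats.t.ppf(1 - alpha, df) * sd_d

# Under H1, (d - delta1)/SD(d) follows a t distribution with df degrees
# of freedom, so we can generate d as delta1 + SD(d) * t-draws.
d = delta1 + sd_d * rng.standard_t(df, size=200_000)

simulated = (d <= boundary).mean()
exact = stats.t.cdf((boundary - delta1) / sd_d, df)
print(simulated, exact)   # the two should agree closely (~.76 here)
```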


If the alternative hypothesis were true, δ = δ1. In this case, the correct standardization for d would be

$$d_1^* = \frac{d - \delta_1}{SD(d)} \sim t^{(df)} \tag{6.35}$$

If we subtract δ1 from both sides of the inequality in equation 6.34, and divide both sides by SD(d), we obtain

$$P(\text{type II error}) = P\left(\frac{d - \delta_1}{SD(d)} \leq \frac{\delta_0 + t_{\alpha}^{(df)}SD(d) - \delta_1}{SD(d)}\right) = P\left(t^{(df)} \leq \frac{\delta_0 + t_{\alpha}^{(df)}SD(d) - \delta_1}{SD(d)}\right) \tag{6.36}$$

The inequality in parentheses to the right of the first equality sign in equation 6.36 contains two terms. The quantity to the left of the inequality sign is simply $d_1^*$, a t random variable, as given in equation 6.35. The quantity to the right of the inequality looks pretty fearsome, but it's just a number. The value of δ0 is given by the null hypothesis H0, that of δ1 is given by the alternative hypothesis H1, $t_{\alpha}^{(df)}$ comes from Appendix table 2 for the t distribution, and SD(d) is given by our data. Therefore, the probability of a type II error is simply the probability that a t random variable will generate a value less than the known quantity

$$\frac{\delta_0 + t_{\alpha}^{(df)}SD(d) - \delta_1}{SD(d)}.$$

We locate this quantity in the body of the table for the t random variable, Appendix table 2. The probability that we're looking for is associated with this quantity in the margins of the table.21 There are only two actions we can take when the null hypothesis is false. We can either accept it, incorrectly, or reject it. The probability of the first action is given by equation 6.36. The probability of the second action is the power of the hypothesis test:

$$\text{power} = 1 - P(\text{type II error}) \tag{6.37}$$

The power of the test is the probability that the test will correctly reject the null hypothesis when it is false. For the last time, we illustrate these issues with our sample of 20 observations on earnings. First, we rewrite equation 6.36 in the appropriate notation:

$$P(\text{type II error}) = P\left(t^{(df)} \leq \frac{\mu_0 + t_{\alpha}^{(df)}SD(\bar{y}) - \mu_1}{SD(\bar{y})}\right) \tag{6.38}$$

We have values for all of the quantities to the right of the inequality with the exception of μ1. Let's now assert an alternative hypothesis: expected earnings in the population are $35,000, or

$$H_1 : \mu_1 = 35{,}000.$$

Now we can take the expression to the right of the inequality in equation 6.38 and replace all of the symbols with actual values:

21 Exercise 6.13 addresses the calculation of the probability of type II error in the context of a lower-tailed test. The probability of a type II error is occasionally represented as β. Can we see why we would want to avoid this notation? There must not be enough Greek letters to go around.


$$P(\text{type II error}) = P\left(t^{(19)} \leq \frac{25{,}000 + 1.729\left(\dfrac{30{,}507}{\sqrt{20}}\right) - 35{,}000}{\dfrac{30{,}507}{\sqrt{20}}}\right) = P\left(t^{(19)} \leq .263\right).$$

This probability doesn't appear explicitly in Appendix table 2. However, this table tells us that

$$P\left(t^{(19)} \leq .688\right) = .750.$$

In addition, we know that

$$P\left(t^{(19)} \leq 0\right) = .500,$$

because zero is the expected value of this random variable and its density function is symmetric around zero. Lastly, we know that

$$P\left(t^{(19)} \leq .688\right) > P\left(t^{(19)} \leq .263\right) > P\left(t^{(19)} \leq 0\right) \tag{6.39}$$

Equation 6.39 implies that

$$.750 > P\left(t^{(19)} \leq .263\right) > .500.$$

In other words, the probability of a type II error in our sample of 20 observations, when the null hypothesis specifies μ0 = 25,000 and the alternative hypothesis specifies μ1 = 35,000, is between 50% and 75%. Correspondingly, according to equation 6.37, the power of this test is no greater than 50%, and could be as low as 25%. It should be evident that a hypothesis test is more convincing the lower is its significance level, or size, and the higher is its power. A lower significance level means that we are less likely to reject the null hypothesis when it's true. Higher power means that we are less likely to accept the null hypothesis when it's false.
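Appendix table 2 can only bracket this probability, but its exact value is easy to compute. Here is a minimal sketch (Python with scipy, reusing the text's sample statistics; the code is illustrative, not part of the chapter's apparatus):

```python
import math
from scipy import stats

mu0, mu1, s, n = 25_000, 35_000, 30_507, 20
sd_ybar = s / math.sqrt(n)

# The known quantity on the right of the inequality in equation 6.38.
cutoff = (mu0 + stats.t.ppf(0.95, n - 1) * sd_ybar - mu1) / sd_ybar
p_type2 = stats.t.cdf(cutoff, n - 1)

print(f"cutoff = {cutoff:.3f}")              # ~.263, as in the text
print(f"P(type II error) = {p_type2:.3f}")   # ~.60, inside (.50, .75)
print(f"power = {1 - p_type2:.3f}")          # ~.40
```

The exact probability is about .60, comfortably inside the bracketing interval, so the power of this test is about .40.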


In other words, tests with lower significance levels and higher power are less likely to make either of the possible mistakes. The unfortunate truth is that, if we don't change our null hypothesis, our alternative hypothesis, or our data, reducing the significance level also reduces the power. Algebraically, reducing the significance level means reducing α. Reducing α means increasing $t_{\alpha}$. Increasing $t_{\alpha}$ means moving the boundary of the acceptance region for H0 closer to the value for δ specified by the alternative hypothesis H1. This increases the probability that, if δ1 were the correct expected value for d, we could obtain an actual value for d that we would take as consistent with δ0.
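We can see this trade-off concretely in the earnings example. In the sketch below (again Python with scipy, again illustrative), lowering α from .05 to .01 pushes the acceptance boundary toward μ1 and visibly raises the probability of a type II error:

```python
import math
from scipy import stats

mu0, mu1, s, n = 25_000, 35_000, 30_507, 20
sd_ybar = s / math.sqrt(n)

for alpha in (0.05, 0.01):
    # Acceptance boundary for the upper-tailed test at this alpha.
    boundary = mu0 + stats.t.ppf(1 - alpha, n - 1) * sd_ybar
    # Probability of accepting H0 when mu1 is the truth (equation 6.38).
    p_type2 = stats.t.cdf((boundary - mu1) / sd_ybar, n - 1)
    print(f"alpha = {alpha}: P(type II) = {p_type2:.3f}, "
          f"power = {1 - p_type2:.3f}")
```

For these numbers, cutting α from .05 to .01 raises P(type II error) from roughly .60 to roughly .85, and correspondingly lowers the power from roughly .40 to roughly .15.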


Figure 6.8 The effects of lower significance levels on the probability of type II error

[The figure shows the density function for d under H1, centered at δ1. Lowering the significance level from α0 to α1 moves the acceptance boundary from $\delta_0 + t_{\alpha_0}^{(df)}SD(d)$ out to $\delta_0 + t_{\alpha_1}^{(df)}SD(d)$, enlarging the area corresponding to P(type II error) and shrinking the power.]

Figure 6.8 demonstrates this dilemma. The original significance level is α0. The associated acceptance region is below $\delta_0 + t_{\alpha_0}^{(df)}SD(d)$. The new, lower significance level is α1, with α1 < α0.
