Sociology 6Z03 Topic 14: Hypothesis Tests

John Fox, McMaster University

Fall 2016

John Fox (McMaster University)

Soc 6Z03:Hypothesis Tests

Fall 2016

Outline: Hypothesis Tests

Introduction
An Extended Example
The Procedure of Hypothesis Testing
One-Sided and Two-Sided Tests
“Statistical Significance”
Reporting P-Values
Hypothesis Tests and Confidence Intervals
Hypothesis Testing as a Decision Problem
The Power of the Test

Introduction

In constructing a confidence interval, we specify a range of values that plausibly might be thought to contain the parameter (such as a population mean) that we want to estimate.

In hypothesis testing, the second common procedure of classical statistical inference, we assess the strength of evidence against the proposition that a parameter is equal to some specific value.

As in the case of confidence intervals, the mechanics of statistical hypothesis testing are reasonably straightforward, but the rationale for the procedure, and the proper interpretation of the results, are more complex.

An Extended Example An Imaginary Experiment

It is best to approach the logic of hypothesis testing by developing a simple example. Consider the following (made-up) experiment: An educational researcher wants to know whether a new method of teaching statistics is superior to the old method. Ten instructors who each teach two sections of an introductory statistics class are recruited into the study. Each instructor has one of his or her sections assigned at random to the new teaching method; the other section is taught by the old method. At the end of the study, the students in all sections of the course take a common exam.

The average grade on the exam in each section, along with the difference between the new and old methods, is as follows:

Instructor   New Method Class Mean   Old Method Class Mean   Difference xi
     1                94                      71                   23
     2                75                      70                    5
     3                75                      58                   17
     4                84                      80                    4
     5                79                      70                    9
     6                85                      67                   18
     7                73                      66                    7
     8                75                      80                   −5
     9                75                      72                    3
    10                83                      70                   13

The mean difference is x̄ = 9.4, and the standard deviation of the n = 10 differences is s = 8.38.
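The summary statistics for the table can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names (`new_method`, `old_method`, `differences`, `x_bar`, `s`) are illustrative choices, not part of the original slides.

```python
from statistics import mean, stdev

# Class means from the table, one entry per instructor (1 through 10).
new_method = [94, 75, 75, 84, 79, 85, 73, 75, 75, 83]
old_method = [71, 70, 58, 80, 70, 67, 66, 80, 72, 70]

# Paired differences, new minus old.
differences = [new - old for new, old in zip(new_method, old_method)]

x_bar = mean(differences)  # sample mean of the differences
s = stdev(differences)     # sample standard deviation (n - 1 in the denominator)

print(f"differences = {differences}")
print(f"mean = {x_bar:.1f}, s = {s:.2f}")
```

Note that `stdev` divides by n − 1, which is what reproduces the value of s reported on the slide.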

An Extended Example Cautions

Even a study this simple raises serious questions of interpretation: Suppose, as I show below, that there is strong evidence, beyond the expected level of chance fluctuations, that sections taught by the new method performed better. It is possible that instructors were enthusiastic about the change and consequently taught better with the new method, even if that method has no intrinsic advantage. Furthermore, although the new method has an advantage on average, some types of instructors might obtain better results with the old method; likewise, some students might learn better with the old method. Finally, if we wish to generalize from these ten instructors (and their students) to some larger population, then the instructors in the study must be “representative” of the more general population, ideally by being drawn at random from the population.

An Extended Example Establishing Whether the New Method is Better

We want to establish whether or not the data support the proposition that the new method is more effective on average than the old. Let us call the population mean difference between the two methods µ. If the new method is better than the old, then µ > 0. If, on the other hand, the two methods are equally effective, then µ = 0. Let us, for the moment, rule out the possibility that the old method is better — i.e., that µ < 0.

An Extended Example The Null and Alternative Hypotheses

Hypothesis testing proceeds by assuming, for purposes of argument, that the two methods are equally effective. This is called the null hypothesis,

H0: µ = 0

Notice that the null hypothesis specifies a particular value for the parameter µ, namely the value zero. The null hypothesis always specifies a particular value, but what that value is depends upon context.

The opposite of the null hypothesis is called the alternative hypothesis. Here, the alternative hypothesis is that the new method is superior to the old one,

Ha: µ > 0

An Extended Example Some Points About the Null and Alternative Hypotheses

Unlike the null hypothesis, the alternative hypothesis does not specify a particular value for the parameter µ. This means that we cannot test the alternative hypothesis directly. Instead, we will assess the strength of evidence against the null hypothesis. If the null hypothesis is not supported by the data, then we will reject it in favour of the alternative hypothesis. It is this “backwards” logic that makes hypothesis testing conceptually difficult.

Usually, our interest is in establishing the alternative hypothesis. Assessing the strength of evidence against the null hypothesis is an indirect way of assessing the strength of evidence in favour of the alternative hypothesis.

You may encounter some variation in terminology and notation: Sometimes the alternative hypothesis is called the research hypothesis, and is symbolized by H1.

An Extended Example Testing the Null Hypothesis

We assess the evidence against the null hypothesis by tentatively assuming that it is true. We then calculate the probability that data as discrepant as those we in fact obtained arise purely by chance. We do this by referring to the sampling distribution of the sample mean x̄, constructed assuming that the null hypothesis is true. If this probability is sufficiently small (that is, if data as extreme as ours are rare assuming the truth of H0), then we conclude that the null hypothesis is probably wrong.

If the null hypothesis is true (that is, if µ = 0), then, with repeated sampling, sample means x̄ are approximately normally distributed with an average value of µ = 0 and a standard deviation of σ/√n = σ/√10.

An Extended Example The Hypothetical Sampling Distribution of the Mean

There are two practical problems here which, for the moment, we will effectively ignore:

1. We don’t know the standard deviation σ of x in the population. We will simply take σ = s = 8.38, but this is not a good solution when, as here, the sample size is small.

2. Because the sample size is small, the sampling distribution of the means x̄ from repeated samples may not be close enough to a normal distribution if the population distribution of x is sufficiently non-normal.

Disregarding these problems, x̄ ∼ N(0, 8.38/√10) = N(0, 2.65). This hypothetical sampling distribution of x̄, calculated assuming the truth of the null hypothesis, is shown on the next slide. The observed value of x̄ = 9.4 is also shown on the graph.

[Figure: Hypothetical sampling distribution of x̄ under H0: µ = 0, with the observed value x̄ = 9.4 (z = 3.55) marked and the upper-tail area P = .0002 indicated.]

An Extended Example The P-Value for the Test

Now we calculate the probability of getting a sample mean x̄ of 9.4 or larger, assuming that H0 is true. This is simply a normal distribution calculation: Converting x̄ = 9.4 to the standard normal value z,

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{9.4 - 0}{2.65} = 3.55 \]

In this formula, µ0 is not the true population value of µ, which is unknown, but rather the value of µ specified by the null hypothesis. Then

\[ P(\bar{x} \ge 9.4) = P(z \ge 3.55) = 1 - .9998 = .0002 \]

This probability is called the P-value of the hypothesis test.
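The same calculation can be done without a printed table. The following is a minimal sketch that assumes, as the slides do, that σ = s = 8.38 and that the normal approximation is adequate; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar = 9.4    # observed mean difference
mu_0 = 0.0     # value of mu specified by the null hypothesis
sigma = 8.38   # assumption carried over from the slides: sigma taken equal to s
n = 10

standard_error = sigma / sqrt(n)      # standard deviation of the sample mean
z = (x_bar - mu_0) / standard_error   # test statistic
p_upper = 1 - NormalDist().cdf(z)     # upper-tail area beyond the observed z

print(f"standard error = {standard_error:.2f}")
print(f"z = {z:.2f}")
print(f"one-sided P-value = {p_upper:.4f}")
```

`NormalDist().cdf(z)` plays the role of the standard normal table, giving the area to the left of z.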

An Extended Example Interpretation of the P-Value

Thought Question

Indicate whether each of the following statements is (A) true or (B) false:

If the null hypothesis is true, then the probability of getting a sample mean as large as or larger than the one obtained (x̄ = 9.4) is very small: about 2 chances in 10,000.

This is a very small probability, and so the null hypothesis is probably correct and the alternative hypothesis (that the new method is better on average than the old one) is probably incorrect.

The Procedure of Hypothesis Testing The Null and Alternative Hypotheses

Specify the null hypothesis,

H0: µ = µ0

and the alternative hypothesis. There are three alternative hypotheses that could accompany H0: µ = µ0:

1. Ha: µ > µ0
2. Ha: µ < µ0
3. Ha: µ ≠ µ0

The first two of these alternative hypotheses specify a direction of departure from H0; the third alternative hypothesis is nondirectional. In any given application, only one alternative hypothesis would be used. More about this shortly.

Notice that the null and alternative hypotheses are specified in terms of the parameter of interest, here µ. It is wrong, for example, to write H0: x̄ = 0.

The Procedure of Hypothesis Testing Hypothetical Sampling Distribution, Test Statistic, and P-Value

Tentatively assuming the truth of the null hypothesis, find the hypothetical sampling distribution of the sample mean x̄,

\[ \bar{x} \sim N\left(\mu_0, \frac{\sigma}{\sqrt{n}}\right) \]

Using the hypothetical sampling distribution, find the probability P of obtaining a result as or more extreme than the one observed. This is done by calculating the test statistic

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \]

and using the standard normal table.

The Procedure of Hypothesis Testing Interpreting the P-Value

If the P-value is sufficiently small, then the null hypothesis is probably wrong and the alternative hypothesis is probably right.

Important Point It is not correct, however, to say that P is the probability that the null hypothesis is right. The null hypothesis is either right or wrong — µ is either equal to µ0 or it is not — but we do not know which. This is similar to the interpretation of a confidence interval, where the level of confidence is not the probability that the parameter µ is in our specific interval.

One-Sided and Two-Sided Tests

Testing H0: µ = 0 against the alternative hypothesis Ha: µ > 0 led to a one-sided test: To find the P-value for the test we looked in the upper tail of the hypothetical sampling distribution, beyond the observed value of the sample mean x̄. In this case, the alternative hypothesis is also called one-sided or directional; the test is sometimes called one-tailed.

If we had instead expected the new method to be worse than the old one, we would have specified the directional alternative hypothesis Ha: µ < 0, and would have found the P-value by looking to the left of the sample mean x̄ rather than to the right. For the observed value of x̄ = 9.4, the left-tail P-value is very large, P = .9998 (see the graph on the next slide). This counts as evidence in favour of the null hypothesis and against the alternative hypothesis (that the old method is better).

[Figure: Hypothetical sampling distribution of x̄ under H0: µ = 0, showing the P-value for Ha: µ < 0. The area to the left of the observed value x̄ = 9.4 (z = 3.55) is P = .9998; the area to its right is .0002.]

One-Sided and Two-Sided Tests One-Sided Tests

Thought Question

True or False: This example illustrates an important characteristic of one-sided tests: If you observe a departure from µ0 in the direction opposite to the expected one, then this counts as evidence in favour of H0 no matter how far x̄ is from µ0.

A True.
B False.
C I don’t know.

One-Sided and Two-Sided Tests Two-Sided Test

If you are not confident in your expectation that the true value of µ departs from the null value µ0 in a particular direction, then you should use a two-sided or nondirectional alternative hypothesis, which leads to a two-sided or two-tailed test:

Ha: µ ≠ µ0

In this case, the alternative hypothesis simply states that µ is different from µ0, and we reject H0 in favour of Ha for values of x̄ that are sufficiently far from µ0 in either direction.

To implement a two-sided test, we need to find the probability of getting a result as or more extreme than the one obtained in either direction. This, in effect, requires us to double the one-sided P-value; if, for example, the obtained z is positive, then

P = P(Z ≤ −z or Z ≥ z) = 2 × P(Z ≥ z)

Here, Z is a standard normal variable.

For the example, z = 3.55, so the two-sided P-value is (see the graph on the next slide)

P = P(Z ≤ −3.55 or Z ≥ 3.55) = 2 × .0002 = .0004
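The three kinds of P-value discussed so far differ only in which tail area is taken. The following is a minimal sketch; the function name `p_value` and the labels `"greater"`, `"less"`, and `"two-sided"` are hypothetical conveniences, not notation from the slides.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """P-value for a z statistic under a standard normal null distribution.

    alternative: "greater" for Ha: mu > mu_0, "less" for Ha: mu < mu_0,
    or "two-sided" for Ha: mu != mu_0.
    """
    std_normal = NormalDist()
    if alternative == "greater":
        return 1 - std_normal.cdf(z)             # area to the right of z
    if alternative == "less":
        return std_normal.cdf(z)                 # area to the left of z
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))  # both tails beyond |z|
    raise ValueError(f"unknown alternative: {alternative!r}")

for alt in ("greater", "less", "two-sided"):
    print(f"{alt:>9}: P = {p_value(3.55, alt):.4f}")
```

Using `abs(z)` in the two-sided case means the doubling works whether the obtained z is positive or negative.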

[Figure: Hypothetical sampling distribution of x̄ under H0: µ = 0, showing the P-value for the two-sided Ha: µ ≠ 0. An area of .0002 lies below z = −3.55 and an area of .0002 lies above the observed value x̄ = 9.4 (z = 3.55).]

Because the two-sided P-value is always twice as large as the one-sided value if the departure of x̄ from µ0 is in the predicted direction, you might be tempted to select the direction of Ha after examining the data. This is cheating: The P-value for a one-sided test is correct only if the direction of the alternative hypothesis is specified in advance of looking at the data.

“Statistical Significance”

We say that a result is statistically significant when the P-value is sufficiently small to warrant rejection of the null hypothesis. How small is “sufficiently small”? Conventionally, the criterion of statistical significance is a P-value smaller than .05 (one chance in 20). Sometimes, other criteria are used, such as .1, .01, or even .001.

A pre-specified value for assessing statistical significance is called the α-level (“alpha-level”) of the test, for example α = .05. Notice that the smaller the α-level of the test, the larger the departure of x̄ from µ0 needed to reject H0.

“Statistical Significance” Cautions

The term “statistical significance” is in very wide use, so you need to understand what it means, but the terminology is unfortunate: To say that a result is statistically significant means that it is unlikely to have occurred by chance alone if the null hypothesis is true. The null hypothesis is therefore probably wrong. But this does not necessarily mean that the observed result is of any practical significance or importance.

For example, in a very large sample, even a very small departure of x̄ from µ0 can prove to be statistically significant. Because the null hypothesis is unlikely to be exactly right, we are very likely to reject it given a sample that is sufficiently large, even if H0 is nearly correct.

To say that a result is “statistically significant” means that it is unlikely to be the product of chance. Performing a statistical test does not correct basic flaws in the design of a study. For example, to find a statistically significant difference between the two teaching methods does not rule out the possibility that it was simply the novelty of the new method that produced the difference.

Beware of performing many statistical tests simultaneously. Suppose that you perform 100 tests, each at the α = .05 level. Although the probability of rejecting any individual null hypothesis just by chance, even if it is correct, is 5 percent, the probability of rejecting at least one hypothesis among the 100 is much larger than 5 percent.
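To put a number on “much larger than 5 percent”, the sketch below computes the chance of at least one rejection. It rests on two assumptions that are stated here rather than in the paragraph above: all 100 null hypotheses are true, and the tests are independent. The variable names are illustrative.

```python
alpha = 0.05   # significance level of each individual test
n_tests = 100  # number of tests performed

# Assumption: every null hypothesis is true and the tests are independent,
# so each test avoids a false rejection with probability 1 - alpha.
p_no_rejections = (1 - alpha) ** n_tests
p_at_least_one = 1 - p_no_rejections

print(f"P(at least one rejection by chance) = {p_at_least_one:.3f}")
```

Under these assumptions a false rejection somewhere among the 100 tests is almost certain, which is why results from many simultaneous tests need cautious interpretation.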

Thought Question

Suppose, for example, that we performed 100 independent hypothesis tests of true null hypotheses, each at the α = .05 level. How many null hypotheses would we expect to reject by chance alone?

A 0 of the 100 hypotheses.
B 5 of the 100 hypotheses.
C 10 of the 100 hypotheses.
D All 100 hypotheses.
E I don’t know.

Reporting P-Values

With the normally distributed test statistic z, we can find a nearly exact P-value as long as the obtained value of z is in the range of the standard normal table. Alternatively, most statistical software reports the P-value for test statistics.

Sometimes, however, you may need to use a statistical table for a test statistic that only shows the “critical values” of the statistic corresponding to certain α-levels. For the normal distribution, for example,

                                   α-level
one-tail P     .1      .05     .025    .01     .005    .001    .0005
two-tail P     .2      .1      .05     .02     .01     .002    .001
critical z*    1.282   1.645   1.960   2.326   2.576   3.091   3.291

Then, having obtained z = 3.55 for a two-sided test, for example, we would report P < .001.
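The critical values in such a table are upper-tail quantiles of the standard normal distribution, so they can be regenerated in software. The following is a minimal sketch; the helper `smallest_tabled_p` is hypothetical and simply mimics reading a bound off the table. Computed values may differ from a printed table in the last digit because of rounding.

```python
from statistics import NormalDist

one_tail_levels = [0.1, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005]
std_normal = NormalDist()

# The critical value z* for one-tail area p is the (1 - p) quantile.
critical_values = {p: std_normal.inv_cdf(1 - p) for p in one_tail_levels}

def smallest_tabled_p(z, two_sided):
    """Smallest tabled P-value whose critical value |z| reaches, or None.

    For a one-sided test this assumes the departure is in the predicted direction.
    """
    reached = [p for p, z_star in critical_values.items() if abs(z) >= z_star]
    if not reached:
        return None  # z does not reach even the first column of the table
    p = min(reached)
    return 2 * p if two_sided else p

for p, z_star in critical_values.items():
    print(f"one-tail P = {p:<6}  two-tail P = {2 * p:<6}  z* = {z_star:.3f}")

bound = smallest_tabled_p(3.55, two_sided=True)
print(f"z = 3.55, two-sided: report P < {bound}")
```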

Thought Question

To take another example, suppose that z = 2.45 for a one-sided test in which the alternative hypothesis is Ha: µ > 0. Using the table of critical values of z, what is the P-value for the test?

                                   α-level
one-tail P     .1      .05     .025    .01     .005    .001    .0005
two-tail P     .2      .1      .05     .02     .01     .002    .001
critical z*    1.282   1.645   1.960   2.326   2.576   3.091   3.291

A P = .01.
B .01 > P > .005.
C .01 < P < .005.
D I don’t know.

Hypothesis Tests and Confidence Intervals

Although I have developed them separately, there is a close relationship between hypothesis tests and confidence intervals:

The hypothesis H0: µ = µ0 is acceptable against a two-sided alternative at the level α if and only if the value µ0 lies in the confidence interval for µ constructed at the level of confidence 1 − α.

For example, a two-sided hypothesis test at the α = .05 level corresponds to a 95-percent confidence interval.

We can therefore think of the 95-percent confidence interval as testing all possible hypotheses about µ at the α = .05 level: Any value of µ that lies within the confidence interval is acceptable at the .05 level; and any value of µ that is outside of the interval is unacceptable.

Thought Question

A two-sided hypothesis test at α = .01 corresponds to a confidence interval at what level of confidence?

A 1%.
B 5%.
C 95%.
D 99%.
E I don’t know.

To illustrate, recall the “educational experiment” in which ten instructors each taught two sections of an introductory statistics course, and the variable of interest gave the difference in average grades between a new and old method of instruction. In this example, n = 10, x̄ = 9.4, and s = 8.38; we took σ = s.

For the example, the 95 percent confidence interval for µ is

\[ \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}} = 9.4 \pm 1.96 \frac{8.38}{\sqrt{10}} = 9.4 \pm 5.19 \]

that is, 4.21 to 14.59.
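The interval can be reproduced numerically. The following is a minimal sketch under the same assumption as the slides (σ = s = 8.38); the helper `is_acceptable` is a hypothetical name for the check that a candidate value of µ lies inside the interval.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 9.4, 8.38, 10  # sigma taken equal to s, as on the slides
confidence = 0.95

# Two-sided critical value: leaves (1 - confidence) / 2 in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z_star * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

def is_acceptable(mu_candidate):
    """True if mu_candidate lies inside the confidence interval."""
    return lower <= mu_candidate <= upper

print(f"z* = {z_star:.2f}, margin = {margin:.2f}")
print(f"{confidence:.0%} confidence interval: {lower:.2f} to {upper:.2f}")
print(f"Is mu = 5 acceptable at the .05 level? {is_acceptable(5.0)}")
```

The value µ = 5 is used only as an arbitrary illustration of a candidate value to check against the interval.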

Thought Question

(A) True, (B) False, or (C) I don’t know?

Because µ = 0 is outside of this interval, the hypothesis H0: µ = 0 can be rejected at the α = .05 level for a two-sided test. (We already know that the two-sided P-value for this test, .0004, is much smaller than .05.)

[Figure: The relationship between the 95 percent confidence interval and a two-sided hypothesis test at α = .05. The sampling distribution under H0 has an area of .025 in each tail, beyond z = −1.96 and z = 1.96 (x̄ = −5.19 and x̄ = 5.19). The 95 percent confidence interval around the observed value x̄ = 9.4 runs from 4.21 to 14.59.]

Hypothesis Testing as a Decision Problem

One way of thinking about hypothesis testing is as a decision-making problem: We need to decide whether to accept the null hypothesis H0 or whether to reject it in favour of its alternative.

Scientific inference isn’t quite the same as decision-making, and there are few applications in which hypothesis testing and decision-making literally coincide. Nevertheless, it can illuminate the essential nature of hypothesis testing to think about it in this way.

Suppose that we want to test the null hypothesis H0 : µ = µ0 , where — as before — µ0 is some value (like 0) specified in advance. There are two possibilities: Either H0 is correct, or it is wrong. We do not know, of course, which of these “states of nature” obtains, or we would not need to test the hypothesis.

Based on sample data, we decide either to accept the null hypothesis H0 (if the absolute value of the test statistic z fails to exceed a pre-specified critical value, say 1.96 for a two-sided test at the α = .05 level), or we decide to reject H0 (if the absolute value of z exceeds the critical value). All possibilities are summarized in the following table:

                   State of nature
Decision           H0 true                H0 false
Reject H0          Type I error :(        Correct decision :)
Accept H0          Correct decision :)    Type II error :(
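A small simulation can make the two columns of the table concrete. This is only a sketch under stated assumptions: the population is taken to be normal with known σ = 8.38 and n = 10 (borrowed from the running example), the alternative value µ = 5 is an arbitrary illustration, and the function name `rejection_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejection_rate(mu, mu_0=0.0, sigma=8.38, n=10, alpha=0.05,
                   trials=20_000, seed=1):
    """Fraction of simulated samples in which a two-sided z test rejects H0."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z_star = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    standard_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample_mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        z = (sample_mean - mu_0) / standard_error
        if abs(z) > z_star:
            rejections += 1
    return rejections / trials

# H0 true (mu equals mu_0): every rejection is a Type I error.
type_i_rate = rejection_rate(mu=0.0)

# H0 false (mu = 5 is an arbitrary illustrative value): every
# failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(mu=5.0)

print(f"estimated P(Type I error)  = {type_i_rate:.3f}")
print(f"estimated P(Type II error) = {type_ii_rate:.3f}")
```

When H0 is true, the estimated rejection rate should come out close to α, up to simulation noise.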

Thought Question

                   State of nature
Decision           H0 true                H0 false
Reject H0          Type I error :(        Correct decision :)
Accept H0          Correct decision :)    Type II error :(

(A) True, (B) False, or (C) I don’t know?

1. If the null hypothesis is true and we reject it, then we commit a mistake, called a Type II error.

2. If the null hypothesis is true and we accept it, then we have made a correct decision.

3. If the null hypothesis is false and we reject it, then we have made a Type I error.

4. If the null hypothesis is false and we accept it, then we have committed a Type II error.

The Power of the Test

Figuring out the probability of a Type I error is simple: It is just the significance level of the test, α.

The probability of a Type II error (failing to reject a false H0) is less straightforward.

The power of the test is the probability of correctly rejecting a false null hypothesis: power = 1 − P(Type II error).

[Figure: The sampling distribution of x̄ under H0, centred at µ0, and the true sampling distribution of x̄, centred at µ, with α and the power of the test marked.]

The probability of a Type II error and the power of the test depend upon three factors:

1. The α-level of the test: The larger we make the value of α, the easier it is to reject H0 (whether H0 is true or false). Thus, increasing the probability of a Type I error (as on the right in the figure below) decreases the probability of a Type II error and increases the power of the test.

[Figure: Two panels, each marking µ0, µ, α, and the power of the test; the right-hand panel uses a larger α.]

Unfortunately, then, the two types of errors work at cross-purposes.

2. The true value of the parameter µ: The farther µ is from the hypothesized value µ0 (as on the right), the more likely it is that we will have evidence against H0: µ = µ0, and the less likely it is that we will commit a Type II error, which increases the power of the test.

[Figure: Two panels, each marking µ0, µ, α, and the power of the test; in the right-hand panel µ is farther from µ0.]

3. The sample size n: The larger the sample (as on the right), the easier it will be to detect a departure of a given size from the null hypothesis (because the standard deviation of x̄ is smaller in a large sample), so the probability of a Type II error goes down and the power of the test increases.

[Figure: Two panels, each marking µ0, µ, α, and the power of the test; the right-hand panel is based on a larger sample.]
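The effect of each of the three factors can be explored numerically. The sketch below is one possible formulation, not a formula taken from the slides: it assumes a one-sided test of H0: µ = µ0 against Ha: µ > µ0 with known σ and a normal sampling distribution. The function name `power_upper_tail` and the baseline values (µ = 5, σ = 8.38, n = 10, α = .05) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(mu, mu_0, sigma, n, alpha):
    """Power of a one-sided z test of H0: mu = mu_0 against Ha: mu > mu_0.

    H0 is rejected when the sample mean exceeds mu_0 + z* * sigma / sqrt(n).
    When the true mean is mu, that happens with probability
    P(Z > z* - (mu - mu_0) / (sigma / sqrt(n))).
    """
    std_normal = NormalDist()
    z_star = std_normal.inv_cdf(1 - alpha)        # one-sided critical value
    shift = (mu - mu_0) / (sigma / sqrt(n))       # true departure in standard-error units
    return 1 - std_normal.cdf(z_star - shift)

baseline = power_upper_tail(mu=5.0, mu_0=0.0, sigma=8.38, n=10, alpha=0.05)
larger_alpha = power_upper_tail(mu=5.0, mu_0=0.0, sigma=8.38, n=10, alpha=0.10)
farther_mu = power_upper_tail(mu=8.0, mu_0=0.0, sigma=8.38, n=10, alpha=0.05)
larger_n = power_upper_tail(mu=5.0, mu_0=0.0, sigma=8.38, n=40, alpha=0.05)

print(f"baseline power          = {baseline:.3f}")
print(f"with alpha = .10        = {larger_alpha:.3f}")
print(f"with mu farther from 0  = {farther_mu:.3f}")
print(f"with n = 40             = {larger_n:.3f}")
```

Changing one argument at a time while holding the others fixed isolates the contribution of each factor to the power.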

The Power of the Test Summary

Thought Question

(A) True, (B) False, or (C) I don’t know?

The probability of a Type II error goes down and the power of the test goes up:

as α gets smaller;
as µ gets farther from µ0;
as n gets smaller.
