Author: Moses Floyd
Hypothesis Testing

Lecture 4: Hypothesis Testing

Ani Manichaikul

20 April 2007

We will first discuss hypothesis testing as it applies to means of distributions for continuous variables

We will then discuss discrete data (specifically dichotomous variables)

Steps of Hypothesis Testing

Define the null hypothesis, $H_0$

Define the alternative hypothesis, $H_a$, where $H_a$ is usually of the form "not $H_0$"

Define the type I error rate, $\alpha$, usually 0.05

Calculate the test statistic

Calculate the p-value

If the p-value is less than $\alpha$, reject $H_0$; otherwise, fail to reject $H_0$

Hypothesis test for a single mean I

Assume a population of normally distributed birth weights with a known standard deviation, $\sigma = 1000$ grams

Birth weights are obtained on a sample of 10 infants; the sample mean is calculated as 2500 grams

Question: Is the mean birth weight in this population different from 3000 grams?

Set up a two-sided test of $H_0: \mu = 3000$ vs. $H_a: \mu \neq 3000$

Let $\alpha = 0.05$ denote a 5% significance level

Hypothesis test for a single mean II

Calculate the test statistic:

$$z_{obs} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{2500 - 3000}{1000/\sqrt{10}} = -1.58$$

What does this mean? Our observed mean is 1.58 standard errors below the hypothesized mean.

The test statistic is the standardized value of our data, assuming the null hypothesis is true!

Question: If the true mean is 3000 grams, is our observed sample mean of 2500 "common", or is this value unlikely to occur?

Hypothesis test for a single mean III

Calculate the p-value:

$$\text{p-value} = P(Z < -|z_{obs}|) + P(Z > |z_{obs}|) = 2 \times 0.057 = 0.114$$

If the true mean is 3000 grams, data as extreme as ours or more extreme would occur in about 11 out of 100 studies of the same size (n = 10)

In other words, in about 11 out of 100 such studies we would observe, just by chance, a sample mean of 2500 grams or one even farther from 3000 grams, even though the true mean is 3000 grams

What does this say about our hypothesis?

General guideline: if p-value < $\alpha$, then reject $H_0$

Hypothesis test for a single mean IV

Could also use the "critical region" or "rejection region" approach

Based on our significance level ($\alpha = 0.05$) and assuming $H_0$ is true, how "far" does our sample mean have to be from $H_0: \mu = 3000$ in order to reject?

Critical value $= z_c$, where $2 \times P(Z > |z_c|) = 0.05$

In our example, $z_c = 1.96$. The rejection region is any value of our test statistic that is less than $-1.96$ or greater than $1.96$.

The decision should be the same whether using the p-value or the critical / rejection region

Hypothesis test for a single mean V

An alternative approach for the two-sided hypothesis test is to calculate a $100(1-\alpha)\%$ confidence interval for the mean:

$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \;\rightarrow\; 2500 \pm 1.96 \cdot \frac{1000}{\sqrt{10}}$$

We are 95% confident that the interval (1880, 3120) contains the true population mean $\mu$

The hypothesized true mean, 3000, is a plausible value of the true mean given our data

We cannot say that the true mean is different from 3000
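The single-mean example can be reproduced in a few lines. This is a minimal sketch using only the Python standard library (`statistics.NormalDist` for the standard normal CDF and quantile); the function name `one_sample_z_test` is hypothetical, chosen for illustration. It computes the test statistic, the two-sided p-value, the critical value, the rejection-region decision, and the confidence interval, so the three equivalent approaches can be compared side by side.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Z ~ N(0, 1)


def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test for a single mean when sigma is known."""
    se = sigma / sqrt(n)                        # standard error of the sample mean
    z_obs = (xbar - mu0) / se                   # standardized distance from mu0 under H0
    p_value = 2 * STD_NORMAL.cdf(-abs(z_obs))   # P(Z < -|z_obs|) + P(Z > |z_obs|)
    z_c = STD_NORMAL.inv_cdf(1 - alpha / 2)     # critical value (about 1.96 for alpha = 0.05)
    reject = abs(z_obs) > z_c                   # rejection-region decision
    ci = (xbar - z_c * se, xbar + z_c * se)     # 100(1 - alpha)% confidence interval
    return z_obs, p_value, z_c, reject, ci


z_obs, p_value, z_c, reject, ci = one_sample_z_test(xbar=2500, mu0=3000, sigma=1000, n=10)
print(f"z_obs = {z_obs:.2f}, p-value = {p_value:.3f}, z_c = {z_c:.2f}")
print(f"reject H0: {reject}, 95% CI: ({ci[0]:.0f}, {ci[1]:.0f})")
```

All three views agree here: the p-value exceeds 0.05, the statistic is not beyond the critical value, and 3000 lies inside the interval.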

Hypothesis tests for one mean: $H_0: \mu = \mu_0$, $H_a: \mu \neq \mu_0$

| Population distribution | Sample size | Population variance | Test statistic |
|---|---|---|---|
| Normal | Any | $\sigma^2$ known | $z_{obs} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ |
| Normal | Any | $\sigma^2$ unknown, uses $s^2$, df $= n - 1$ | $t_{obs} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$ |
| Not normal / unknown | Large | $\sigma^2$ known | $z_{obs} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ |
| Not normal / unknown | Large | $\sigma^2$ unknown, uses $s^2$ | $z_{obs} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$ |
| Not normal / unknown | Small | Any | Non-parametric methods |

Determining the correct test statistic

Depends on your assumptions about $\sigma$

When $\sigma$ is known, we have a standard normal test statistic

When $\sigma$ is unknown and our sample size is relatively small, the test statistic has a t-distribution

The only change in the procedure is that the calculation of the p-value or rejection region uses a t- instead of a normal distribution

P-values

Definition: The p-value for a hypothesis test is the probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as or more extreme than the observed test statistic

The rejection region is determined by $\alpha$, the desired level of significance, or probability of committing a type I error

Reporting the p-value associated with a test gives an indication of how common or rare the computed value of the test statistic is, given that $H_0$ is true

We often use $z_{obs}$ to denote the computed value of the test statistic

Hypothesis tests for one proportion: $H_0: p = p_0$, $H_a: p \neq p_0$

| Population distribution | Sample size | Test statistic |
|---|---|---|
| Binomial | Large | $z_{obs} = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$ |
| Binomial | Small | Exact methods |
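A sketch of the large-sample test for one proportion, assuming the normal approximation to the binomial is adequate. The counts below (30 successes in 100 trials, $p_0 = 0.4$) are hypothetical, chosen only to exercise the formula, and `one_proportion_z_test` is an illustrative name. Note that the standard error is computed from the null value $p_0$, as in the table, not from $\hat{p}$.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0):
    """Large-sample two-sided z-test of H0: p = p0 (normal approximation)."""
    p_hat = x / n
    se0 = sqrt(p0 * (1 - p0) / n)                # standard error under H0 uses p0
    z_obs = (p_hat - p0) / se0
    p_value = 2 * NormalDist().cdf(-abs(z_obs))  # two-sided p-value
    return p_hat, z_obs, p_value


# Hypothetical data for illustration only: 30 successes in 100 trials, testing p0 = 0.4
p_hat, z_obs, p_value = one_proportion_z_test(x=30, n=100, p0=0.4)
print(f"p_hat = {p_hat:.2f}, z_obs = {z_obs:.2f}, p-value = {p_value:.3f}")
```

For small samples this approximation can be poor; the exact binomial methods mentioned in the table are not covered by this sketch.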

Hypothesis tests for a difference of two means: $H_0: \mu_1 - \mu_2 = \mu_0$, $H_a: \mu_1 - \mu_2 \neq \mu_0$

| Population distribution | Sample size | Population variances | Test statistic |
|---|---|---|---|
| Normal | Any | Known | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$ |
| Normal | Any | Unknown, assume $\sigma_1^2 = \sigma_2^2$, df $= n_1 + n_2 - 2$ | $t_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$ |
| Normal | Any | Unknown, assume $\sigma_1^2 \neq \sigma_2^2$, df $= \nu$ | $t_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ |

Example: Hypothesis test for two means (two independent samples) I

The EPREDA Trial: a randomized, placebo-controlled trial to determine whether dipyridamole improves the efficacy of aspirin in preventing fetal growth retardation

Pregnant women were randomized to placebo (n = 73) or to aspirin or aspirin plus dipyridamole (n = 156)

Mean birth weight was statistically significantly higher in the treated than in the placebo group: 2751 (SD 670) grams vs. 2526 (SD 848) grams

Example: Hypothesis test for two means (two independent samples) II

Test the hypothesis $H_0: \mu_{placebo} = \mu_{treated}$ vs. $H_a: \mu_{placebo} \neq \mu_{treated}$ at the 5% significance level

The data are:

| Treatment | n | Mean (grams) | SD (grams) |
|---|---|---|---|
| Placebo | 73 | 2526 | 848 |
| Treated | 156 | 2751 | 670 |

Example: Hypothesis test for two means (two independent samples) III

Calculate the test statistic (group 1 = placebo, group 2 = treated):

$$t_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{2526 - 2751}{\sqrt{\frac{848^2}{73} + \frac{670^2}{156}}} = -1.99$$

The observed difference in mean birth weight comparing the placebo to treated groups is approximately 2 standard errors below the hypothesized difference of 0

Our sample size is pretty large, so the test statistic will behave like a standard normal variable
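The EPREDA statistic can be checked from the summary data alone. This is a sketch, not the trial's own analysis; `two_sample_statistic` is a hypothetical helper. The default branch is the unequal-variance (unpooled) form used in the calculation; the `pooled=True` branch uses the equal-variance $s_p^2$ form from the table, included only for comparison.

```python
from math import sqrt


def two_sample_statistic(mean1, sd1, n1, mean2, sd2, n2, mu0=0.0, pooled=False):
    """Two-sample test statistic from summary data.

    pooled=False: SE = sqrt(s1^2/n1 + s2^2/n2)   (unequal variances)
    pooled=True:  SE = sqrt(sp^2/n1 + sp^2/n2)   (equal variances, df = n1 + n2 - 2)
    """
    if pooled:
        sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)  # pooled variance
        se = sqrt(sp2 / n1 + sp2 / n2)
    else:
        se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return ((mean1 - mean2) - mu0) / se, se


# EPREDA summary data: group 1 = placebo (n = 73), group 2 = treated (n = 156)
t_unpooled, se_unpooled = two_sample_statistic(2526, 848, 73, 2751, 670, 156)
t_pooled, se_pooled = two_sample_statistic(2526, 848, 73, 2751, 670, 156, pooled=True)
print(f"unpooled: SE = {se_unpooled:.2f}, t_obs = {t_unpooled:.2f}")
print(f"pooled:   SE = {se_pooled:.2f}, t_obs = {t_pooled:.2f}")
```

The two forms differ here (about -1.99 vs. -2.17) because the sample sizes and SDs are unequal, which is why the variance assumption matters when choosing the statistic.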

Example: Hypothesis test for two means (two independent samples) IV

What is the p-value in this example?

p-value = 0.047

What is your decision in this case?

Not straightforward. There may be a difference in birth weight comparing the two groups; we need to consider the practical implications.

Example: Hypothesis test for two means (two independent samples) V

Can also give a 95% confidence interval for the difference in the two means: (-446.13, -3.87)

Again, this is a plausible range of values for the true difference in birth weights comparing the placebo to treated groups

What is your null hypothesis? No difference!

Given this confidence interval, is "no difference" a plausible value? Almost?

Hypothesis tests for a difference of two means: $H_0: \mu_1 - \mu_2 = \mu_0$, $H_a: \mu_1 - \mu_2 \neq \mu_0$

| Population distribution | Sample size | Population variances | Test statistic |
|---|---|---|---|
| Not normal / unknown | Large | Known | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$ |
| Not normal / unknown | Large | Unknown, assume $\sigma_1^2 = \sigma_2^2$ | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$ |
| Not normal / unknown | Large | Unknown, assume $\sigma_1^2 \neq \sigma_2^2$ | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ |
| Not normal / unknown | Small | Any | Nonparametric methods |

Additional Considerations: We're not always right

| Conclusion based on data (sample) | "Truth": $H_0$ true | "Truth": $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error | Correct |
| Fail to reject $H_0$ | Correct | Type II error |
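The p-value and interval for the EPREDA example can be recomputed as below, treating the statistic as approximately standard normal because both samples are large (as the example assumes). The multiplier 1.96 is used so the interval matches the one quoted; this is a sketch, not the trial's reported analysis. Without first rounding the statistic to -1.99, the two-sided p-value comes out at about 0.046 rather than 0.047.

```python
from math import sqrt
from statistics import NormalDist

# EPREDA summary data: placebo (group 1) vs. treated (group 2)
diff = 2526 - 2751                         # observed difference in means, mu0 = 0
se = sqrt(848**2 / 73 + 670**2 / 156)      # unpooled standard error
stat = diff / se

# Large samples: use the standard normal for the p-value and the interval
p_value = 2 * NormalDist().cdf(-abs(stat))
z_c = 1.96                                 # rounded 97.5th percentile, as in the example
ci = (diff - z_c * se, diff + z_c * se)

print(f"statistic = {stat:.3f}, p-value = {p_value:.3f}")
print(f"95% CI for mu_placebo - mu_treated: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval excludes 0 only barely, which mirrors the "Almost?" question: the upper limit is just under 4 grams from zero.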

Errors in hypothesis testing α

$\alpha = P(\text{Type I error})$ = probability of rejecting a true null hypothesis = "level of significance"

Aim: to keep the Type I error rate small by specifying a small rejection region

$\alpha$ is usually set before performing a test, typically at level $\alpha = 0.05$

Errors in hypothesis testing β I

$\beta = P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$

Power $= 1 - \beta$ = probability of rejecting $H_0$ when $H_0$ is false

Aim: to keep the Type II error rate small and achieve large power

Errors in hypothesis testing β II

$\beta$ depends on the sample size, $\alpha$, and the specified alternative value

The value of $\beta$ is usually unknown, since the true mean (or other parameter) is generally unknown

Before data collection, scientists should decide on:

the test they will perform

the desired Type I error rate $\alpha$

the desired $\beta$, for a specified alternative value

After specifying this information, an appropriate sample size can be determined

[Figure: Critical Regions I]
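To make "β depends on sample size, α, and the specified alternative value" concrete, here is a sketch for the two-sided one-sample z-test with known $\sigma$. The alternative $\mu_a = 2500$ grams (against $H_0: \mu = 3000$, $\sigma = 1000$) is a hypothetical choice for illustration, and both function names are made up. The sample-size formula $n = \left((z_{\alpha/2} + z_{\beta})\sigma/|\mu_a - \mu_0|\right)^2$ is the usual approximation that ignores the far rejection tail.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def power_one_mean_z(mu0, mu_a, sigma, n, alpha=0.05):
    """Power of the two-sided one-sample z-test when the true mean is mu_a."""
    z_c = Z.inv_cdf(1 - alpha / 2)
    shift = (mu_a - mu0) / (sigma / sqrt(n))  # true mean in standard-error units
    # Under the alternative, z_obs ~ N(shift, 1); reject if z_obs < -z_c or z_obs > z_c
    return Z.cdf(-z_c - shift) + (1 - Z.cdf(z_c - shift))


def n_for_power(mu0, mu_a, sigma, alpha=0.05, power=0.80):
    """Approximate n for the desired power (ignores the far rejection tail)."""
    z_c = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return ceil(((z_c + z_beta) * sigma / abs(mu_a - mu0)) ** 2)


# Hypothetical alternative: true mean 2500 g vs. H0: mu = 3000 g, sigma = 1000 g
pw = power_one_mean_z(mu0=3000, mu_a=2500, sigma=1000, n=10)
print(f"power with n = 10: {pw:.2f}, beta = {1 - pw:.2f}")
print("n for 80% power:", n_for_power(mu0=3000, mu_a=2500, sigma=1000))
```

With only 10 infants, a true difference of 500 grams would be detected roughly a third of the time, which is one way to read the earlier "fail to reject" result.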

[Figure: Critical Regions II]

[Figure: Critical Regions III]

[Figure: Type II error]
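The critical-region and error-rate ideas can also be checked by simulation. This sketch repeatedly draws samples of size 10 from a normal population with $\sigma = 1000$ and counts how often the z statistic lands in the rejection region $|z_{obs}| > 1.96$. When the true mean equals 3000, every rejection is a Type I error; when the true mean is the hypothetical value 2500, every failure to reject is a Type II error. The seed, replicate count, and helper name are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z_C = NormalDist().inv_cdf(0.975)  # two-sided critical value for alpha = 0.05


def rejection_rate(true_mu, mu0=3000, sigma=1000, n=10, reps=20_000, seed=1):
    """Fraction of simulated studies whose z statistic falls in the rejection region."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        z_obs = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z_obs) > Z_C:
            rejections += 1
    return rejections / reps


# H0 true: the rejection rate estimates alpha (the Type I error rate)
type1 = rejection_rate(true_mu=3000)
# Hypothetical alternative mu = 2500: the rejection rate estimates power = 1 - beta
power = rejection_rate(true_mu=2500)
print(f"estimated Type I error rate: {type1:.3f}")
print(f"estimated power: {power:.3f}, estimated beta: {1 - power:.3f}")
```

The estimated Type I error rate should sit close to 0.05, and the estimated power close to the analytic value of about 0.35 for this alternative.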

Dichotomous variables

Proportions

2 × 2 tables

Study design

Hypothesis tests

Proportions and 2 × 2 tables

| Population | Success | Failure | Total |
|---|---|---|---|
| Population 1 | $x_1$ | $n_1 - x_1$ | $n_1$ |
| Population 2 | $x_2$ | $n_2 - x_2$ | $n_2$ |
| Total | $x_1 + x_2$ | $n - (x_1 + x_2)$ | $n$ |

Row 1 shows the results of a binomial experiment with $n_1$ trials; row 2 shows the results of a binomial experiment with $n_2$ trials

How do we compare these proportions?

Often, we want to compare $p_1$, the probability of success in population 1, to $p_2$, the probability of success in population 2

Usually: "Success" = Disease, Population 1 = Treatment 1

How do we compare these proportions? It depends!

Study Designs

Cross-sectional

Cohort

Case-control

Matched case-control

Cohort Studies

Application to Aceh Vitamin A Trial

25,939 pre-school children in 450 Indonesian villages in northern Sumatra

200,000 IU vitamin A given 1-3 months after the baseline census, and again at 6-8 months

Consider the 23,682 out of 25,939 who were visited on a pre-designed schedule

Trial Outcome

| Vitamin A | Alive at 12 months: No | Alive at 12 months: Yes | Total |
|---|---|---|---|
| Yes | 46 | 12,048 | 12,094 |
| No | 74 | 11,514 | 11,588 |
| Total | 120 | 23,562 | 23,682 |

Does vitamin A reduce mortality? Calculate the risk ratio, or "relative risk"

Relative risk is abbreviated as RR

Could also compare the difference in proportions, called the "attributable risk"

Relative Risk Calculation

$$RR = \frac{\text{Rate with Vitamin A}}{\text{Rate without Vitamin A}} = \frac{\hat{p}_1}{\hat{p}_2} = \frac{46/12{,}094}{74/11{,}588} = \frac{0.0038}{0.0064} = 0.59$$

The vitamin A group had about 40% lower mortality!

Confidence interval for RR

Step 1: Find the estimate of the log RR: $\log\left(\frac{\hat{p}_1}{\hat{p}_2}\right)$

Step 2: Estimate the variance of log(RR) as: $\frac{1 - p_1}{n_1 p_1} + \frac{1 - p_2}{n_2 p_2}$

Step 3: Find the 95% CI for log(RR): $\log(RR) \pm 1.96 \cdot SD(\log RR) = (\text{lower}, \text{upper})$

Step 4: Exponentiate to get the 95% CI for RR: $(e^{\text{lower}}, e^{\text{upper}})$

Confidence interval for RR from Vitamin A Trial

95% CI for log relative risk is:

$$\log(RR) \pm 1.96 \cdot SD(\log RR) = \log(0.59) \pm 1.96 \cdot \sqrt{\frac{0.9962}{46} + \frac{0.9936}{74}} = -0.53 \pm 0.37 = (-0.90, -0.16)$$

95% CI for relative risk: $(e^{-0.90}, e^{-0.16}) = (0.41, 0.85)$

Does this confidence interval contain 1?
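Steps 1-4 for the RR interval translate directly into code. This sketch uses the Aceh counts from the trial outcome table; `relative_risk_ci` is a hypothetical name. Because it does not round $\hat{p}_1$ and $\hat{p}_2$ to 0.0038 and 0.0064 first, it gives RR ≈ 0.596 and an upper limit of about 0.86, slightly different from the hand calculation; the conclusion (the interval excludes 1) is unchanged. It also prints the attributable risk (difference in proportions).

```python
from math import exp, log, sqrt


def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Relative risk p1/p2 with a 95% CI built on the log scale."""
    p1, p2 = x1 / n1, x2 / n2
    log_rr = log(p1 / p2)                                      # Step 1
    var_log_rr = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)   # Step 2
    half_width = z * sqrt(var_log_rr)                          # Step 3
    lower, upper = log_rr - half_width, log_rr + half_width
    return exp(log_rr), (exp(lower), exp(upper))               # Step 4


# Aceh trial: 46 deaths among 12,094 given vitamin A; 74 among 11,588 not given vitamin A
rr, (lo, hi) = relative_risk_ci(46, 12_094, 74, 11_588)
attributable_risk = 46 / 12_094 - 74 / 11_588   # difference in proportions
print(f"RR = {rr:.3f}, 95% CI ({lo:.2f}, {hi:.2f})")
print(f"attributable risk = {attributable_risk:.4f}")
```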

Which p1 and p2 do we use?

What if the data were from a case-control study? Recall: in case-control studies, individuals are selected by outcome status

Disease (mortality) status defines the population, and exposure status defines the success

$p_1$ and $p_2$ have a different interpretation in a case-control study than in a cohort study

Cohort: $p_1 = P(\text{Disease} \mid \text{Exposure})$, $p_2 = P(\text{Disease} \mid \text{No Exposure})$

Case-control: $p_1 = P(\text{Exposure} \mid \text{Disease})$, $p_2 = P(\text{Exposure} \mid \text{No Disease})$

⇒ This is why we cannot estimate the relative risk from case-control data!

The Odds Ratio

The odds ratio measures association in case-control studies

$$\text{Odds} = \frac{P(\text{event occurs})}{P(\text{event does not occur})}$$

The odds ratio for death given vitamin A status is the odds of death given vitamin A divided by the odds of death given no vitamin A:

$$OR = \frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}$$

Calculate OR both ways

Using "case-control" $p_1$ and $p_2$:

$$OR = \frac{(46/120)/(74/120)}{(12048/23562)/(11514/23562)} = \frac{46/74}{12048/11514} = 0.59$$

Using "cohort" $p_1$ and $p_2$:

$$OR = \frac{(46/12094)/(12048/12094)}{(74/11588)/(11514/11588)} = \frac{46/12048}{74/11514} = 0.59$$

We get the same answer either way!

Bottom Line

The relative risk cannot be estimated from a case-control study

The odds ratio can be estimated from a case-control study

The OR estimates the RR when the disease is rare

The OR is invariant to cohort or case-control designs; the RR is not
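The "same answer either way" property can be verified exactly with rational arithmetic. This sketch computes the OR from the Aceh counts under both framings using `fractions.Fraction`, so equality is exact rather than approximate. For contrast it also computes the ratio $p_1/p_2$ under each framing: only the cohort version is a relative risk, and the two ratios disagree, which illustrates why the RR cannot be recovered from case-control sampling. The helper name `odds_ratio` is illustrative.

```python
from fractions import Fraction


def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d), computed exactly."""
    return Fraction(a, b) / Fraction(c, d)


# Aceh counts: deaths/survivors with vitamin A (46, 12048) and without (74, 11514)
deaths_a, alive_a, deaths_no, alive_no = 46, 12_048, 74, 11_514

# "Cohort" framing: odds of death among exposed vs. unexposed
or_cohort = odds_ratio(deaths_a, alive_a, deaths_no, alive_no)
# "Case-control" framing: odds of exposure among deaths vs. survivors
or_case_control = odds_ratio(deaths_a, deaths_no, alive_a, alive_no)

# Ratio p1/p2 in each framing: only the cohort version is a relative risk
rr_cohort = Fraction(deaths_a, deaths_a + alive_a) / Fraction(deaths_no, deaths_no + alive_no)
rr_case_control = Fraction(deaths_a, deaths_a + deaths_no) / Fraction(alive_a, alive_a + alive_no)

print(f"OR (cohort) = {float(or_cohort):.4f}, OR (case-control) = {float(or_case_control):.4f}")
print(f"p1/p2 (cohort) = {float(rr_cohort):.4f}, p1/p2 (case-control) = {float(rr_case_control):.4f}")
```

Because death is rare here, the OR (about 0.594) is also close to the cohort RR (about 0.596).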

Confidence interval for OR

Step 1: Find the estimate of the log OR: $\log\left(\frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}\right)$

Step 2: Estimate the variance of log(OR) as: $\frac{1}{n_1 p_1} + \frac{1}{n_1 q_1} + \frac{1}{n_2 p_2} + \frac{1}{n_2 q_2}$, where $q_i = 1 - p_i$

Step 3: Find the 95% CI for log(OR): $\log(OR) \pm 1.96 \cdot SD(\log OR) = (\text{lower}, \text{upper})$

Step 4: Exponentiate to get the 95% CI for OR: $(e^{\text{lower}}, e^{\text{upper}})$

Matched-pairs case-control study design I

Samples are not independent

Cases and controls are matched on age, race, sex, etc.

The data are summarized in a different type of table

Matched-pairs case-control study design II

Results ($E$ = exposed, $E^c$ = not exposed, $N$ = total number of pairs):

| Cases | Controls: $E$ | Controls: $E^c$ | Total |
|---|---|---|---|
| $E$ | $a$ | $b$ | $a + b$ |
| $E^c$ | $c$ | $d$ | $c + d$ |
| Total | $a + c$ | $b + d$ | $N$ |

Concordant pair: same exposure

Discordant pair: different exposure

Matched-pairs case-control study design III

Concordant pairs provide little information about differences; we focus on the discordant pairs:

$EE^c$ pairs ($b$), in which the case is exposed and the control is unexposed

$E^cE$ pairs ($c$), in which the case is unexposed and the control is exposed
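Steps 1-4 for the OR interval can be sketched the same way as for the RR. This is the log-scale interval with the sum-of-reciprocals variance (often called Woolf's method), applied here to the Aceh cohort counts as an illustration; the slides do not work this particular interval, so treat the output as a worked check of the formula rather than a reported result. `odds_ratio_ci` is a hypothetical name.

```python
from math import exp, log, sqrt


def odds_ratio_ci(x1, n1, x2, n2, z=1.96):
    """Odds ratio with a 95% CI on the log scale, following Steps 1-4."""
    p1, p2 = x1 / n1, x2 / n2
    q1, q2 = 1 - p1, 1 - p2
    log_or = log((p1 / q1) / (p2 / q2))                                         # Step 1
    var_log_or = 1 / (n1 * p1) + 1 / (n1 * q1) + 1 / (n2 * p2) + 1 / (n2 * q2)  # Step 2
    half_width = z * sqrt(var_log_or)                                           # Step 3
    return exp(log_or), (exp(log_or - half_width), exp(log_or + half_width))    # Step 4


# Aceh counts: 46 deaths of 12,094 (vitamin A) vs. 74 deaths of 11,588 (no vitamin A)
or_hat, (lo, hi) = odds_ratio_ci(46, 12_094, 74, 11_588)
print(f"OR = {or_hat:.3f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Each term of the variance is one over a cell count (46, 12,048, 74, 11,514), so the rare "death" cells dominate the width of the interval.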

Matched-pairs case-control study design IV

Under the null hypothesis of no difference, $P(EE^c) = P(E^cE) = \frac{1}{2} = p$

The number of $EE^c$ discordant pairs follows a binomial distribution with mean $= np$ and variance $= npq$, where $n = b + c$ (the total number of discordant pairs)

So we can test the null hypothesis $H_0: p = \frac{1}{2}$ using the test statistic

$$z = \frac{b - \frac{n}{2}}{\sqrt{\frac{1}{2} \cdot \frac{1}{2} \cdot n}}$$

which is approximately normally distributed

Example: Estrogen and Endometrial Cancer I

$H_0: OR = 1$ vs. $H_a: OR \neq 1$

Matched pairs design:

| Cases | Controls: Estrogen | Controls: No estrogen | Total |
|---|---|---|---|
| Estrogen | 17 | 76 | 93 |
| No estrogen | 10 | 111 | 121 |
| Total | 27 | 187 | 214 pairs |
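The matched-pairs z-test uses only the discordant pairs. This sketch applies it to the estrogen table ($b = 76$, $c = 10$) and checks numerically that $z^2$ equals $(b - c)^2/(b + c)$, the McNemar form of the same statistic. `matched_pairs_z` is an illustrative name; the p-value is from the normal approximation and is only reported as being below 0.001.

```python
from math import sqrt
from statistics import NormalDist


def matched_pairs_z(b, c):
    """z statistic for H0: p = 1/2 among the b + c discordant pairs."""
    n = b + c                                   # only discordant pairs carry information
    z = (b - n / 2) / sqrt(0.5 * 0.5 * n)
    p_value = 2 * NormalDist().cdf(-abs(z))     # normal approximation, two-sided
    return z, p_value


# Estrogen example: b = 76 (case exposed, control not), c = 10 (case not, control exposed)
z, p_value = matched_pairs_z(b=76, c=10)
mcnemar = (76 - 10) ** 2 / (76 + 10)            # McNemar form (b - c)^2 / (b + c)
print(f"z = {z:.3f}, z^2 = {z**2:.2f}, McNemar statistic = {mcnemar:.2f}")
print(f"matched-pairs OR = b/c = {76 / 10}")
print(f"two-sided p-value < 0.001: {p_value < 0.001}")
```

The concordant cells (17 and 111) never enter the calculation, consistent with the point that concordant pairs carry little information about differences.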

McNemar's Test

Algebra shows that:

$$z^2 = \left(\frac{b - \frac{n}{2}}{\sqrt{\frac{1}{2} \cdot \frac{1}{2} \cdot n}}\right)^2 = \frac{(b - c)^2}{b + c} \sim \chi^2_1$$

This test statistic is much easier to look at, but always gives us the same result as our original z-test

Note that the $\chi^2_1$ distribution is defined as the distribution of $Z^2$ where $Z \sim N(0, 1)$

Example: Estrogen and Endometrial Cancer II

$OR = \frac{b}{c} = \frac{76}{10} = 7.6$, which estimates the relative risk of disease for exposed vs. unexposed

McNemar's test statistic:

$$z^2 = \frac{(b - c)^2}{b + c} = \frac{(76 - 10)^2}{76 + 10} = 50.65$$

The estimated odds of endometrial cancer among estrogen users is 7.6 times the odds of cancer among those with no estrogen exposure (p