Hypothesis Testing
Lecture 4: Hypothesis Testing
Ani Manichaikul
[email protected]
20 April 2007

We will first discuss hypothesis testing as it applies to means of distributions for continuous variables
We will then discuss discrete data (specifically dichotomous variables)
Steps of Hypothesis Testing
Define the null hypothesis, H0
Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0"
Define the Type I error, α, usually 0.05
Calculate the test statistic
Calculate the p-value
If the p-value is less than α, reject H0; otherwise, fail to reject H0

Hypothesis test for a single mean I
Assume a population of normally distributed birth weights with a known standard deviation, σ = 1000 grams
Birth weights are obtained on a sample of 10 infants; the sample mean is calculated as 2500 grams
Question: Is the mean birth weight in this population different from 3000 grams?
Set up a two-sided test of H0 : µ = 3000 vs. Ha : µ ≠ 3000
Let α = 0.05 denote a 5% significance level
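The steps above, applied to the birth-weight example, can be sketched in a few lines of Python (a minimal illustration; the variable names are my own):

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Values from the slides: known sigma, n = 10 infants, sample mean 2500 g
sigma, n, xbar, mu0, alpha = 1000, 10, 2500, 3000, 0.05

z_obs = (xbar - mu0) / (sigma / sqrt(n))   # standardized test statistic
p_value = 2 * norm_cdf(-abs(z_obs))        # two-sided p-value

print(round(z_obs, 2))   # -1.58
print(p_value < alpha)   # False, so we fail to reject H0
```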
Hypothesis test for a single mean II
Calculate the test statistic:
zobs = (X̄ − µ0) / (σ/√n) = (2500 − 3000) / (1000/√10) = −1.58
What does this mean? Our observed mean is 1.58 standard errors below the hypothesized mean
The test statistic is the standardized value of our data assuming the null hypothesis is true!
Question: If the true mean is 3000 grams, is our observed sample mean of 2500 "common" or is this value unlikely to occur?

Hypothesis test for a single mean III
Calculate the p-value:
p-value = P(Z < −|zobs|) + P(Z > |zobs|) = 2 × 0.057 = 0.114
If the true mean is 3000 grams, our data or data more extreme than ours would occur in 11 out of 100 studies (of the same size, n = 10)
In 11 out of 100 studies, just by chance, we are likely to observe a sample mean of 2500 or more extreme if the true mean is 3000 grams
What does this say about our hypothesis?
General guideline: if the p-value < α, then reject H0

Hypothesis test for a single mean IV
Could also use the "critical region" or "rejection region" approach
Based on our significance level (α = 0.05) and assuming H0 is true, how "far" does our sample mean have to be from H0 : µ = 3000 in order to reject?
Critical value = zc, where 2 × P(Z > |zc|) = 0.05
In our example, zc = 1.96
The rejection region is any value of our test statistic that is less than −1.96 or greater than 1.96
The decision should be the same whether using the p-value or the critical/rejection region approach

Hypothesis test for a single mean V
An alternative approach for the two-sided hypothesis test is to calculate a 100(1 − α)% confidence interval for the mean:
X̄ ± zα/2 · σ/√n → 2500 ± 1.96 · 1000/√10
We are 95% confident that the interval (1880, 3120) contains the true population mean µ
The hypothetical true mean 3000 is a plausible value of the true mean given our data
We cannot say that the true mean is different from 3000
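The confidence-interval version of the same test can be sketched as follows (values from the slides; rounding is mine):

```python
from math import sqrt

sigma, n, xbar, mu0 = 1000, 10, 2500, 3000
z_crit = 1.96                       # two-sided 5% critical value

half_width = z_crit * sigma / sqrt(n)
ci = (xbar - half_width, xbar + half_width)

print(round(ci[0]), round(ci[1]))   # 1880 3120
print(ci[0] < mu0 < ci[1])          # True: 3000 is plausible, fail to reject
```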
P-values
Definition: The p-value for a hypothesis test is the null probability of obtaining a value of the test statistic as or more extreme than the observed test statistic
The rejection region is determined by α, the desired level of significance, or probability of committing a Type I error
Reporting the p-value associated with a test gives an indication of how common or rare the computed value of the test statistic is, given that H0 is true
We often use zobs to denote the computed value of the test statistic

Hypothesis tests for one mean
H0 : µ = µ0 , Ha : µ ≠ µ0

Population Distribution | Sample Size | Population Variance | Test Statistic
Normal | Any | σ² known | zobs = (X̄ − µ0) / (σ/√n)
Normal | Any | σ² unknown (use s², df = n − 1) | tobs = (X̄ − µ0) / (s/√n)
Not Normal / Unknown | Large | σ² known | zobs = (X̄ − µ0) / (σ/√n)
Not Normal / Unknown | Large | σ² unknown (use s²) | zobs = (X̄ − µ0) / (s/√n)
Not Normal / Unknown | Small | Any | Non-parametric methods
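The decision table above can be written as a small helper function (a sketch; the function name and return labels are my own):

```python
def one_mean_statistic(pop_normal: bool, n_large: bool, sigma_known: bool) -> str:
    """Which reference distribution the one-mean table prescribes."""
    if pop_normal:
        # Normal population: z if sigma is known, else t with df = n - 1
        return "z" if sigma_known else "t, df = n - 1"
    if n_large:
        # Non-normal/unknown population, large n: z either way (s replaces sigma)
        return "z"
    # Non-normal/unknown population, small n
    return "non-parametric methods"

print(one_mean_statistic(True, False, False))   # t, df = n - 1
```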
Determining the correct test statistic
Depends on your assumptions about σ:
When σ is known, we have a standard normal test statistic
When σ is unknown and our sample size is relatively small, the test statistic has a t-distribution
The only change in the procedure is that the calculation of the p-value or rejection region uses a t- instead of a normal distribution

Hypothesis tests for one proportion
H0 : p = p0 , Ha : p ≠ p0

Population Distribution | Sample Size | Test Statistic
Binomial | Large | zobs = (p̂ − p0) / √(p0(1 − p0)/n)
Binomial | Small | Exact methods
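As a sketch of the large-sample one-proportion statistic (the counts here are hypothetical, not from the slides):

```python
from math import sqrt

# Hypothetical data: 60 successes in n = 100 trials, testing H0: p = 0.5
x, n, p0 = 60, 100, 0.5

p_hat = x / n
z_obs = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # note: the null value p0 in the SE

print(round(z_obs, 1))   # 2.0
```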
Hypothesis tests for a difference of two means
H0 : µ1 − µ2 = µ0 , Ha : µ1 − µ2 ≠ µ0

Population Distribution | Sample Size | Population Variances | Test Statistic
Normal | Any | Known | zobs = (X̄1 − X̄2 − µ0) / √(σ1²/n1 + σ2²/n2)
Normal | Any | unknown, assume σ1² = σ2², df = n1 + n2 − 2 | tobs = (X̄1 − X̄2 − µ0) / √(sp²/n1 + sp²/n2)
Normal | Any | unknown, assume σ1² ≠ σ2², df = ν | tobs = (X̄1 − X̄2 − µ0) / √(s1²/n1 + s2²/n2)

Example: Hypothesis test for two means (two independent samples) I
The EPREDA Trial: randomized, placebo-controlled trial to determine whether dipyridamole improves the efficacy of aspirin in preventing fetal growth retardation
Pregnant women randomized to placebo (n=73), aspirin, or aspirin plus dipyridamole (n=156)
Mean birth weight was statistically significantly higher in the treated than in the placebo group: 2751 (SD 670) grams vs. 2526 (SD 848) grams

Example: Hypothesis test for two means (two independent samples) II
Test the hypothesis:
H0 : µplacebo = µtreated vs. Ha : µplacebo ≠ µtreated
at the 5% significance level
The data are:

Treatment | n | mean | SD
Placebo | 73 | 2526 | 848
Treated | 156 | 2751 | 670

Example: Hypothesis test for two means (two independent samples) III
Calculate the test statistic:
tobs = (X̄1 − X̄2 − µ0) / √(s1²/np + s2²/nt) = (2526 − 2751) / √(848²/73 + 670²/156) = −1.99
The observed difference in mean birth weight comparing the placebo to treated groups is approximately 2 standard errors below the hypothesized difference of 0
Our sample size is pretty large, so the test statistic will behave like a standard normal variable
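The calculation in Example III can be reproduced directly (EPREDA summary statistics from the slides; using the normal approximation for the p-value, which the slides justify by the large sample sizes):

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Placebo vs. treated summary data from the EPREDA example
n1, xbar1, s1 = 73, 2526, 848
n2, xbar2, s2 = 156, 2751, 670

se = sqrt(s1**2 / n1 + s2**2 / n2)      # unequal-variance standard error
t_obs = (xbar1 - xbar2 - 0) / se        # hypothesized difference mu0 = 0
p_value = 2 * norm_cdf(-abs(t_obs))     # large samples: ~standard normal

print(round(t_obs, 2))   # -1.99
```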
Example: Hypothesis test for two means (two independent samples) IV
What is the p-value in this example?
p-value = 0.047
What is your decision in this case?
Not straightforward
There may be a difference in birth weight comparing the two groups
Need to consider the practical implications

Hypothesis tests for a difference of two means
H0 : µ1 − µ2 = µ0 , Ha : µ1 − µ2 ≠ µ0

Population Distribution | Sample Size | Population Variances | Test Statistic
Not Normal / Unknown | Large | Known | zobs = (X̄1 − X̄2 − µ0) / √(σ1²/n1 + σ2²/n2)
Not Normal / Unknown | Large | unknown, assume σ1² = σ2² | zobs = (X̄1 − X̄2 − µ0) / √(sp²/n1 + sp²/n2)
Not Normal / Unknown | Large | unknown, assume σ1² ≠ σ2² | zobs = (X̄1 − X̄2 − µ0) / √(s1²/n1 + s2²/n2)
Not Normal / Unknown | Small | Any | Nonparametric methods

Example: Hypothesis test for two means (two independent samples) V
Can also give a 95% confidence interval for the difference in the two means: (−446.13, −3.87)
Again, this is a plausible range of values for the true difference in birth weights comparing the placebo to treated groups
What is your null hypothesis? No difference!
Given this confidence interval, is "no difference" a plausible value? Almost!

Additional Considerations: We're not always right

Conclusion based on data (sample) | "Truth": H0 true | "Truth": H0 false
Reject H0 | Type I error | Correct
Fail to reject H0 | Correct | Type II error
Errors in hypothesis testing: α
α = P(Type I error)
= probability of rejecting a true null hypothesis
= "level of significance"
Aim: to keep the Type I error small by specifying a small rejection region
α is usually set before performing a test, typically at level α = 0.05

Errors in hypothesis testing: β I
β = P(Type II error) = P(fail to reject H0 given H0 is false)
Power = 1 − β
= probability of rejecting H0 when H0 is false
Aim: to keep the Type II error small and achieve large power

Errors in hypothesis testing: β II
β depends on sample size, α, and the specified alternative value
The value of β is usually unknown since the true mean (or other parameter) is generally unknown
Before data collection, scientists should decide:
the test they will perform
the desired Type I error rate α
the desired β, for a specified alternative value
After specifying this information, an appropriate sample size can be determined

Critical Regions I
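As a sketch of how β and power depend on the design, here is the power function of the two-sided one-mean z-test from the earlier slides (the specific alternative value of 2000 grams is my own choice for illustration):

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu_alt, mu0=3000, sigma=1000, n=10, z_crit=1.96):
    """P(reject H0) for the two-sided one-mean z-test at a given alternative."""
    shift = (mu_alt - mu0) / (sigma / sqrt(n))
    # Reject when z_obs < -z_crit or z_obs > z_crit
    return norm_cdf(-z_crit - shift) + 1 - norm_cdf(z_crit - shift)

beta = 1 - power(2000)                         # Type II error at mu = 2000
print(power(2000, n=20) > power(2000, n=10))   # True: power grows with n
```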
Critical Regions II

Critical Regions III

Type II error

Dichotomous variables
Proportions
2 × 2 tables
Study Design
Hypothesis tests
Proportions and 2 × 2 tables

Population | Success | Failure | Total
Population 1 | x1 | n1 − x1 | n1
Population 2 | x2 | n2 − x2 | n2
Total | x1 + x2 | n − (x1 + x2) | n

Row 1 shows the results of a binomial experiment with n1 trials
Row 2 shows the results of a binomial experiment with n2 trials

Study Designs
Cross-sectional
Cohort
Case-control
Matched case-control
How do we compare these proportions?
Often, we want to compare p1, the probability of success in population 1, to p2, the probability of success in population 2
Usually: "Success" = Disease, Population 1 = Treatment 1
How do we compare these proportions? It depends!

Cohort Studies
Application to Aceh Vitamin A Trial
25,939 pre-school children in 450 Indonesian villages in northern Sumatra
200,000 IU vitamin A given 1-3 months after the baseline census, and again at 6-8 months
Consider 23,682 out of 25,939 who were visited on a pre-designed schedule
Trial Outcome

Alive at 12 months?
Vit A | No | Yes | Total
Yes | 46 | 12,048 | 12,094
No | 74 | 11,514 | 11,588
Total | 120 | 23,562 | 23,682

Does Vitamin A reduce mortality?
Calculate the risk ratio or "relative risk" (abbreviated RR)
Could also compare the difference in proportions, called the "attributable risk"

Relative Risk Calculation
Relative Risk = Rate with Vitamin A / Rate without Vitamin A
= p̂1 / p̂2
= (46/12,094) / (74/11,588)
= 0.0038 / 0.0064
= 0.59
Vitamin A group had 40% lower mortality!

Confidence interval for RR
Step 1: Find the estimate of the log RR: log(p̂1/p̂2)
Step 2: Estimate the variance of the log(RR) as: (1 − p1)/(n1 p1) + (1 − p2)/(n2 p2)
Step 3: Find the 95% CI for log(RR): log(RR) ± 1.96 · SD(log RR) = (lower, upper)
Step 4: Exponentiate to get the 95% CI for RR: (e^lower, e^upper)

Confidence interval for RR from Vitamin A Trial
95% CI for the log relative risk:
log(RR) ± 1.96 · SD(log RR) = log(0.59) ± 1.96 · √(0.9962/46 + 0.9936/74)
= −0.53 ± 0.37
= (−0.90, −0.16)
95% CI for the relative risk: (e^−0.90, e^−0.16) = (0.41, 0.85)
Does this confidence interval contain 1?
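The four-step recipe can be sketched with the trial's counts (exact arithmetic differs slightly from the slides, which round intermediate values):

```python
from math import sqrt, log, exp

# Aceh trial: deaths and group sizes from the outcome table
x1, n1 = 46, 12094     # vitamin A
x2, n2 = 74, 11588     # no vitamin A

p1, p2 = x1 / n1, x2 / n2
rr = p1 / p2                                               # Step 1 (log taken below)

var_log_rr = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)   # Step 2
half = 1.96 * sqrt(var_log_rr)                             # Step 3
ci = (exp(log(rr) - half), exp(log(rr) + half))            # Step 4

print(round(ci[0], 2), round(ci[1], 2))   # 0.41 0.86
```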
What if the data were from a case-control study?
Recall: in case-control studies, individuals are selected by outcome status
Disease (mortality) status defines the population, and exposure status defines the success
p1 and p2 have a different interpretation in a case-control study than in a cohort study
Cohort:
p1 = P(Disease | Exposure), p2 = P(Disease | No Exposure)
Case-Control:
p1 = P(Exposure | Disease), p2 = P(Exposure | No Disease)
⇒ This is why we cannot estimate the relative risk from case-control data!

Which p1 and p2 do we use?
Calculate the OR both ways
Using "case-control" p1 and p2:
OR = [(46/120)/(74/120)] / [(12048/23562)/(11514/23562)] = (46/74) / (12048/11514) = 0.59
Using "cohort" p1 and p2:
OR = [(46/12094)/(12048/12094)] / [(74/11588)/(11514/11588)] = (46/12048) / (74/11514) = 0.59
We get the same answer either way!
The Odds Ratio
The odds ratio measures association in case-control studies
Odds = P(event occurs) / P(event does not occur)
The odds ratio for death given Vitamin A status is the odds of death given Vitamin A divided by the odds of death given no Vitamin A
OR = [p̂1/(1 − p̂1)] / [p̂2/(1 − p̂2)]

Bottom Line
The relative risk cannot be estimated from a case-control study
The odds ratio can be estimated from a case-control study
The OR estimates the RR when the disease is rare
The OR is invariant to cohort or case-control designs; the RR is not
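The OR's invariance to how the 2 × 2 table is read can be checked directly; for the interval I use the log-OR variance 1/a + 1/b + 1/c + 1/d, which is the cell-count form of the slides' variance formula:

```python
from math import sqrt, log, exp

# Trial outcome table: a, b = vit A (dead, alive); c, d = no vit A (dead, alive)
a, b, c, d = 46, 12048, 74, 11514

odds_ratio_cohort = (a / b) / (c / d)    # odds of death, exposed vs. unexposed
odds_ratio_cc = (a / c) / (b / d)        # odds of exposure, dead vs. alive
print(abs(odds_ratio_cohort - odds_ratio_cc) < 1e-12)   # True: both equal ad/bc

# 95% CI for the OR on the log scale
or_hat = (a * d) / (b * c)
half = 1.96 * sqrt(1/a + 1/b + 1/c + 1/d)
ci = (exp(log(or_hat) - half), exp(log(or_hat) + half))
print(round(or_hat, 2))   # 0.59
```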
Confidence interval for OR
Step 1: Find the estimate of the log OR: log([p̂1/(1 − p̂1)] / [p̂2/(1 − p̂2)])
Step 2: Estimate the variance of the log(OR) as: 1/(n1 p1) + 1/(n1 q1) + 1/(n2 p2) + 1/(n2 q2)
Step 3: Find the 95% CI for log(OR): log(OR) ± 1.96 · SD(log OR) = (lower, upper)
Step 4: Exponentiate to get the 95% CI for OR: (e^lower, e^upper)

Matched-pairs case-control study design I
Samples are not independent
Cases and controls matched on age, race, sex, etc.
The data are summarized in a different type of table

Matched-pairs case-control study design II
E = exposed, Ec = not exposed, N = total number of pairs
Concordant pair: same exposure
Discordant pair: different exposure

Results:
Cases \ Controls | E | Ec | Total
E | a | b | a+b
Ec | c | d | c+d
Total | a+c | b+d | N

Matched-pairs case-control study design III
Concordant pairs provide little information about differences
We focus on the discordant pairs:
EEc pairs (b), in which the case is exposed and the control is unexposed
EcE pairs (c), in which the case is unexposed and the control is exposed
Matched-pairs case-control study design IV
Under the null hypothesis of no difference:
P(EEc) = P(EcE) = 1/2 = p
The number of EEc discordant pairs follows a binomial distribution with
mean = np, variance = npq, n = b + c (the total number of discordant pairs)
So we can test the null hypothesis H0 : p = 1/2 using the test statistic
z = (b − n/2) / √((1/2) · (1/2) · n)
which is approximately normally distributed

Example: Estrogen and Endometrial Cancer I
Matched pairs design
H0 : OR = 1 vs. Ha : OR ≠ 1

Cases \ Controls | Estrogen | No estrogen | Total
Estrogen | 17 | 76 | 93
No estrogen | 10 | 111 | 121
Total | 27 | 187 | 214 pairs
McNemar's Test
Algebra shows that:
z² = [(b − n/2) / √((1/2) · (1/2) · n)]² = (b − c)² / (b + c) ∼ χ²₁
McNemar's test statistic: z² = (b − c)² / (b + c)
This test statistic is much easier to look at, but always gives the same result as our original z-test
Note that the χ²₁ distribution is defined as the distribution of Z² where Z ∼ N(0, 1)

Example: Estrogen and Endometrial Cancer II
OR = b/c = 76/10 = 7.6 = estimate of the relative risk of disease for exposed vs. unexposed
z² = (b − c)² / (b + c) = (76 − 10)² / (76 + 10) = 50.65
The estimated odds of endometrial cancer among estrogen users is 7.6 times the odds of cancer among those with no estrogen exposure (p < 0.001)
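The estrogen example's numbers can be checked in a few lines (3.84 is the standard 5% critical value of χ² with 1 df):

```python
# Discordant pairs from the estrogen/endometrial-cancer table
b, c = 76, 10           # b: case exposed only; c: control exposed only

or_hat = b / c                          # matched-pairs OR estimate
mcnemar = (b - c) ** 2 / (b + c)        # McNemar's chi-square statistic, 1 df

print(or_hat)               # 7.6
print(round(mcnemar, 2))    # 50.65
print(mcnemar > 3.84)       # True: reject H0: OR = 1 at the 5% level
```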