Nested case-control and case-cohort studies
An introduction and some new developments
Pre-course, 13th Norwegian Epidemiology Conference, Tromsø, 23-24 November 2005
Ørnulf Borgan and Sven Ove Samuelsen
Department of Mathematics, University of Oslo

Outline of course
Wednesday 23/11
  11:00-11:45  Introduction to case-control studies (SOS)
  12:00-12:45  More on traditional case-control studies (SOS)
  14:00-14:45  Introduction to Cox-regression (ØB)
  15:00-15:45  Nested case-control studies (ØB)
  16:15-17:00  Case-cohort studies (SOS)
Thursday 24/11 at Rica Ishavshotell
  8:30-9:15    Countermatching (ØB)
  9:30-10:30   Stratified case-cohort studies (SOS)

Outline for first two hours
  - In general on cohort and case-control studies
  - Relative risks and odds-ratios
  - Efficiency comparisons between case-control and cohort studies
  - Logistic regression for cohort and unmatched studies
  - Matched studies: Mantel-Haenszel and conditional logistic regression

Cohort and Case-Control
Cohort study (prospective)
  - Exposure at start of study
  - Disease after follow-up of all individuals
Case-control study (retrospective)
  - Case-status from registry
  - Exposure on cases and sample of eligible controls
Main ideas:
Cohort study: Compare disease rates between
  - exposed individuals
  - non-exposed individuals
Higher incidence among exposed points to causes of disease.
Case-control study: Compare exposure characteristics between
  - diseased individuals: cases
  - non-diseased individuals: controls
Higher level of exposure among cases also points to exposure being associated with causes of disease.

Example: Bladder cancer in suspect industry
375 cases of bladder cancer, 368 controls (no bladder cancer)

  Employed in
  suspect industry    Case    Controls
  Yes                  118        69
  No                   257       299
  Total                375       368

  Exposed among cases:     118/375 = 31.5 %
  Exposed among controls:   69/368 = 18.8 %
Exposure "suspect industry" seems to be associated with disease.

Data sources case-control
Cases:
  - Registries
  - Within cohort study
  - Patients with a specific disease in a hospital
Controls:
  - General population
  - Within same cohort study
  - Patients with other diseases in the hospital
Cases and controls should be selected from the same population.

Cohort or Case-Control?
A case-control study will be cheaper and less time-consuming than a cohort study and can provide almost as precise risk estimates.
One will typically only take a sample of all eligible non-diseased individuals.
However, case-control studies can be subject to
  - Selection bias: Cases and controls not selected from same population
  - Recall bias: Retrospective information gathered from cases and controls need not be equally reliable
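The exposure proportions quoted in the bladder cancer example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function and variable names are illustrative and not part of the course material.

```python
# Counts from the bladder cancer table (cases vs. controls).
exposed_cases, unexposed_cases = 118, 257
exposed_controls, unexposed_controls = 69, 299


def proportion_exposed(exposed, unexposed):
    """Fraction of a group that was employed in the suspect industry."""
    return exposed / (exposed + unexposed)


p_cases = proportion_exposed(exposed_cases, unexposed_cases)
p_controls = proportion_exposed(exposed_controls, unexposed_controls)

print(f"exposed among cases:    {p_cases:.1%}")
print(f"exposed among controls: {p_controls:.1%}")
```

A higher exposure proportion among cases than among controls is what the case-control comparison looks for.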
Example: Smoking and lung cancer, Doll & Hill (1951)
[Table: lung cancer cases, hospital controls and total by no. cigarettes per day (groups 1-4, 5-14, 15-24, 25-49, ...)]
The distribution of cigarettes smoked seems to be shifted towards higher values among the lung cancer cases.
We can test for differences between distributions among cases and controls in several ways, for instance
  - Chi-square test for "homogeneity" in table
  - t-test assigning a number of cigarettes to each exposure group

Lung cancer example cont.
For this table the Pearson $\chi^2$ = ... on 5 d.f., leading to a p-value ...
Assigning no. cigarettes 0, 2.5, 10, 20, 37.5, 60 to groups 0, 1-4 etc. cigarettes per day we get
  Mean cig. among lung cancer cases   20.84  (std. dev. 14.07)
  Mean cig. among control patients    15.89  (std. dev. 11.69)
which leads to a t-statistic ... and again a p-value ...
(Actually this study was matched on age and sex and somewhat different tests are more appropriate. More on matching later.)

Relative risks in cohort study (population)
Cohort (or prospective) study:
  - Exposure for all in a population at start of study
  - Disease registered during study
With dichotomous (0/1) exposure:
  E = exposed, $\bar E$ = non-exposed
  D = disease (case), $\bar D$ = non-diseased (control)
we have the rates (or probability) of disease among exposed and non-exposed as $P(D|E)$ and $P(D|\bar E)$.
Relative risk (RR) is thus given as
  $RR = \frac{P(D|E)}{P(D|\bar E)}$

Example: Bladder cancer and suspect industry
Assume we had complete population data for the bladder cancer data (we don't!) with data as in this table

  Suspect industry    Case    Non-diseased        Total
  Yes                  118    69 x 100 = 6900      7018
  No                   257    299 x 100 = 29900   30157

we would estimate disease rates
  $\hat P(D|E) = 118/7018 = 0.0168$ and $\hat P(D|\bar E) = 257/30157 = 0.0085$
leading to
  $\widehat{RR} = \frac{118/7018}{257/30157} = 1.97$
Bladder cancer twice as common in suspect industry.
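The relative risk for the artificial bladder cancer cohort follows directly from the definition RR = P(D|E) / P(D|not E). A minimal sketch, assuming the cohort counts 118/6900 (exposed) and 257/29900 (non-exposed) from the example; the function name is illustrative.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a cohort 2x2 table.

    a, b: diseased / non-diseased among the exposed
    c, d: diseased / non-diseased among the non-exposed
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Artificial cohort: suspect industry (yes/no) by bladder cancer status.
rr = relative_risk(118, 6900, 257, 29900)
print(f"RR = {rr:.2f}")
```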
Odds-ratio
Instead of relative risks we often use odds-ratios defined by means of odds.
Among exposed:
  $\mathrm{Odds}(D|E) = \frac{P(D|E)}{1 - P(D|E)} = \frac{P(D|E)}{P(\bar D|E)}$
Among unexposed:
  $\mathrm{Odds}(D|\bar E) = \frac{P(D|\bar E)}{1 - P(D|\bar E)} = \frac{P(D|\bar E)}{P(\bar D|\bar E)}$
The odds-ratio is then defined as
  $OR = \frac{\mathrm{Odds}(D|E)}{\mathrm{Odds}(D|\bar E)} = \frac{P(D|E)\, P(\bar D|\bar E)}{P(\bar D|E)\, P(D|\bar E)}$

Why Odds-ratio?
  - Approximation to relative risk when incidence is low: $OR \approx RR$
  - Parameter-interpretation in logistic regression
  - Can be estimated in case-control studies (as we will see)

Approximation $OR \approx RR$ when incidence is small
[Table: relative risk and odds-ratio for different values of $P(D|E)$ and $P(D|\bar E)$]
Thus the difference is acceptably small with disease probabilities as high as 0.20.
In general we either have
  $OR < RR < 1$, $OR = RR = 1$ or $OR > RR > 1$.
Thus RR is always closer to one.

Estimation of RR and OR in cohort study
From a 2x2 table where a = no. of subjects that are exposed and diseased, etc.

              D     $\bar D$   Total
  E           a     b          a+b
  $\bar E$    c     d          c+d

we estimate $P(D|E)$ with $\frac{a}{a+b}$ and $P(D|\bar E)$ with $\frac{c}{c+d}$. Thus the estimate for relative risk becomes
  $\widehat{RR} = \frac{a/(a+b)}{c/(c+d)}$
while the odds-ratio is estimated by
  $\widehat{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}$
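How close the odds-ratio is to the relative risk can be explored numerically. The sketch below fixes a hypothetical relative risk of 2 and lets the baseline risk among the non-exposed grow; the chosen risk values are assumptions for illustration, not the values tabulated on the slide.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


def odds_ratio_from_risks(p_exposed, p_unexposed):
    """Odds-ratio computed from the two disease probabilities."""
    return odds(p_exposed) / odds(p_unexposed)


rr = 2.0  # hypothetical relative risk
for p_unexposed in (0.001, 0.01, 0.05, 0.10):
    p_exposed = rr * p_unexposed
    or_value = odds_ratio_from_risks(p_exposed, p_unexposed)
    print(f"P(D|E)={p_exposed:.3f}  RR={rr:.2f}  OR={or_value:.3f}")
```

With a risk of 0.20 among the exposed the odds-ratio is 2.25 against a relative risk of 2, and the odds-ratio always lies on the far side of the relative risk from one.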
The artificial cohort Bladder-cancer data
We assumed we had cohort data as in the table

  Suspect industry    Case    Non-diseased        Total
  Yes                  118    69 x 100 = 6900      7018
  No                   257    299 x 100 = 29900   30157

Then
  $\widehat{OR} = \frac{118 \cdot 29900}{6900 \cdot 257} = 1.99$
whereas
  $\widehat{RR} = 1.97$ (as previously calculated).

2x2 table in un-matched case-control
In a case-control study we know the number of cases and the number of controls

              D      $\bar D$
  E           a      b
  $\bar E$    c      d
  Total       a+c    b+d

thus now the column marginals are fixed. Then we may estimate $P(E|D)$ by $\frac{a}{a+c}$ and $P(E|\bar D)$ by $\frac{b}{b+d}$.
However, without knowledge of sampling fractions, we can not estimate $P(D|E)$ and $P(D|\bar E)$ and so neither can we estimate the relative risk RR.

Odds-ratio from case-control studies
However, the odds-ratio estimate $\widehat{OR} = \frac{ad}{bc}$ is valid also in case-control studies.
This is so because the case-control study allows estimation of the odds of exposure for cases and controls.
Among cases:
  $\mathrm{Odds}(E|D) = \frac{P(E|D)}{1 - P(E|D)}$
Among controls:
  $\mathrm{Odds}(E|\bar D) = \frac{P(E|\bar D)}{1 - P(E|\bar D)}$
and since
  $OR = \frac{\mathrm{Odds}(D|E)}{\mathrm{Odds}(D|\bar E)} = \frac{\mathrm{Odds}(E|D)}{\mathrm{Odds}(E|\bar D)} = \frac{P(E|D)\, P(\bar E|\bar D)}{P(\bar E|D)\, P(E|\bar D)}$.

Why?
If you really want to verify this mathematical fact use that conditional probabilities are defined as
  $P(E|D) = \frac{P(E \text{ and } D)}{P(D)}$
and similarly for other terms involved.
This is standard algebra, although rather boring and somewhat tedious.
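The claim that the odds-ratio estimate ad/(bc) carries over from the cohort to the case-control sample can be checked on the bladder cancer numbers. A minimal sketch; the function name is illustrative, and the cell order (a, b, c, d) follows the 2x2 table with exposure in rows and disease status in columns.

```python
def odds_ratio(a, b, c, d):
    """Odds-ratio ad/(bc) from a 2x2 table.

    a, b: diseased / non-diseased among the exposed
    c, d: diseased / non-diseased among the non-exposed
    """
    return (a * d) / (b * c)


# Artificial cohort versus the case-control sample drawn from it.
or_cohort = odds_ratio(118, 6900, 257, 29900)
or_case_control = odds_ratio(118, 69, 257, 299)

print(f"OR (artificial cohort): {or_cohort:.2f}")
print(f"OR (case-control):      {or_case_control:.2f}")
```

Sampling the same fraction of exposed and non-exposed non-diseased individuals scales b and d by the same factor, which cancels in ad/(bc).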
Estimation of OR in case-control study
The estimates of exposure-odds among cases and controls are
  $\widehat{\mathrm{Odds}}(E|D) = \frac{a/(a+c)}{c/(a+c)} = \frac{a}{c}$ and $\widehat{\mathrm{Odds}}(E|\bar D) = \frac{b/(b+d)}{d/(b+d)} = \frac{b}{d}$
and this gives
  $\widehat{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}$
also in a case-control study (Cornfield, 1951).

Alternative argument
Population: 2x2 table of Disease / Not disease by Exposed / Not exp.
Case-control: 2x2 table of Case / Control by Exposed / Not exp.
With probabilities of being included in the case-control study $\pi_1$ for cases and $\pi_0$ for controls, we have that the expected case-control counts are the population counts multiplied by $\pi_1$ among the diseased and by $\pi_0$ among the non-diseased, and hence $\pi_1$ and $\pi_0$ cancel in the odds-ratio:
  $OR_{\text{case-control}} \approx OR_{\text{population}}$

Example: Lung cancer and no. cigarettes
The argument can be made for more than two exposure levels, e.g. groups of no. cigarettes.
[Table: cancer cases, controls, total and odds-ratio by no. cigarettes per day (groups 1-4, 5-14, 15-24, 25-49, ...)]
For instance the odds-ratio between non-smokers and those that smoke 1-4 cigarettes becomes OR = ...

Confidence interval for OR:
Woolf's formula: Variance estimate for log-odds-ratio
  $\widehat{\mathrm{var}}(\log \widehat{OR}) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$
and hence
  $se = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$
This gives a 95% confidence interval for OR:
  $\widehat{OR} \cdot \exp(\pm 1.96 \cdot se)$
by a normal distribution approximation when a, b, c and d are all "large".
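Woolf's variance formula and the resulting 95% confidence interval can be applied to the bladder cancer case-control table (118, 69, 257, 299). The following is a sketch under the normal approximation described in the text; the function name and return layout are illustrative.

```python
import math


def woolf_ci(a, b, c, d, z=1.96):
    """Odds-ratio ad/(bc) with Woolf's standard error and a normal-theory CI.

    The variance of log(OR) is estimated by 1/a + 1/b + 1/c + 1/d.
    """
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = or_hat * math.exp(-z * se)
    upper = or_hat * math.exp(z * se)
    return or_hat, se, lower, upper


# Bladder cancer: exposed cases, exposed controls, unexposed cases, unexposed controls.
or_hat, se, lower, upper = woolf_ci(118, 69, 257, 299)
print(f"OR = {or_hat:.2f}, se(log OR) = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is computed on the log scale and then exponentiated, so it is not symmetric around the odds-ratio estimate.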
Examples CI
Ex) Lung-cancer: 1-4 cig vs Non-smoke:
  se = ..., so the 95% CI = ...
The normal approximation this CI relies on, though, is shaky (b=2 isn't really big).
Exact methods may give a better confidence interval.

Efficiency between study designs
Assume that two designs allow estimation of the same quantity. The efficiency of design 2 relative to design 1 is then defined as
  $\text{Efficiency} = \frac{\text{Variance with design 1}}{\text{Variance with design 2}}$
Then, since variances approximately are proportional to 1/(sample size), the interpretation of an efficiency is:
  Reduction in sample size with design 1 that gives the same precision as design 2 (if design 1 is more efficient than design 2).

Efficiency case-control vs. cohort
Assume a case-control study with all $n$ cases and $K$ controls per case. Total no. controls is then $Kn$.
Assume also OR=1, thus no effect of exposure. We would then have $c \approx Ka$ and $d \approx Kb$,
where a = no. exposed cases and b = no. non-exposed cases (c and d are the corresponding numbers of controls).
By Woolf's formula the case-control variance:
  $\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} \approx \left(\frac{1}{a} + \frac{1}{b}\right)\left(1 + \frac{1}{K}\right)$
When the disease is rare the numbers of available controls in the cohort, c and d, are large compared to the numbers of cases a and b, thus the cohort variance is approximately
  $\frac{1}{a} + \frac{1}{b}$

Efficiency case-control vs. cohort, contd.
The efficiency becomes
  $\frac{\text{Cohort variance}}{\text{Case-control variance}} = \frac{1}{1 + 1/K} = \frac{K}{K+1} \to 1$
when K is large, and there is little gain by more than K = 4-5.
These efficiencies are approximately valid when OR is not very different from 1 and exposure is not very rare.
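The efficiency K/(K+1) of a case-control study with K controls per case, relative to the full cohort, is easy to tabulate. A minimal sketch; the range of K values shown is an arbitrary choice for illustration.

```python
def relative_efficiency(k):
    """Efficiency of a case-control study with k controls per case
    relative to the full cohort, valid for OR near 1 and rare disease."""
    return k / (k + 1)


for k in (1, 2, 3, 4, 5, 10):
    print(f"K = {k:2d}: efficiency = {relative_efficiency(k):.2f}")
```

The step from K = 1 to K = 4 raises the efficiency from 0.5 to 0.8, while going on to K = 10 adds comparatively little.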
Example Efficiency: Bladder cancer
Artificial cohort study:
  $\widehat{OR} = 1.99$, $\widehat{\mathrm{var}}(\log \widehat{OR}) = \frac{1}{118} + \frac{1}{6900} + \frac{1}{257} + \frac{1}{29900} = 0.01254$
Case-control study:
  $\widehat{OR} = 1.99$, $\widehat{\mathrm{var}}(\log \widehat{OR}) = \frac{1}{118} + \frac{1}{69} + \frac{1}{257} + \frac{1}{299} = 0.03020$
  $\text{Efficiency} = \frac{0.01254}{0.03020} = 0.415$, somewhat smaller than 0.5 corresponding to a 1:1 case:control ratio.

Artificial bladder-cancer data, contd.
Assumed case-control study with $K$ controls per case
[Table: Suspect industry (Yes / No) by Case (118 / 257) and Non-diseased]
For different values of $K$ we get the following efficiencies
[Table: efficiency for different values of $K$]

Logistic regression model for cohort studies
Let $D$ be an indicator for disease and $x$ a covariate for an individual. Assume that the probability the individual has the disease can be written
  $P(D = 1 | x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
Then the odds of having the disease equals
  $\mathrm{Odds}(x) = \exp(\alpha + \beta x)$
and the odds-ratio between two individuals with covariates $x_1$ and $x_0$ becomes
  $OR = \frac{\mathrm{Odds}(x_1)}{\mathrm{Odds}(x_0)} = \exp(\beta (x_1 - x_0))$

Logistic regression and binary exposure
Let $x = 1$ if an individual is exposed and $x = 0$ if the individual is un-exposed. Then the model for disease can be written as a logistic regression model with
  $\mathrm{Odds}(D|E) = \exp(\alpha + \beta)$ and $\mathrm{Odds}(D|\bar E) = \exp(\alpha)$
so
  $OR = \frac{\mathrm{Odds}(D|E)}{\mathrm{Odds}(D|\bar E)} = \exp(\beta)$.
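For a single binary exposure the logistic model has two parameters and reproduces the 2x2 table exactly, so the fitted intercept and slope have closed forms: the intercept is the log-odds among the non-exposed and the slope is the log odds-ratio. The sketch below illustrates this on the artificial bladder cancer cohort; the symbols alpha and beta and all names are assumptions for illustration.

```python
import math


def logistic_prob(alpha, beta, x):
    """P(D = 1 | x) under the logistic model with intercept alpha and slope beta."""
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))


# Artificial cohort: a, b = diseased / non-diseased exposed; c, d = non-exposed.
a, b, c, d = 118, 6900, 257, 29900

alpha_hat = math.log(c / d)              # log-odds of disease among the non-exposed
beta_hat = math.log((a * d) / (b * c))   # log odds-ratio

print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}, OR = {math.exp(beta_hat):.2f}")
print(f"fitted P(D|E)     = {logistic_prob(alpha_hat, beta_hat, 1):.4f}")
print(f"fitted P(D|not E) = {logistic_prob(alpha_hat, beta_hat, 0):.4f}")
```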
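Binary exposure and 2x2 table
This framework with binary exposure and binary outcome can be put up in a 2x2 table:

              D=1    D=0    Total
  E           a      b      a+b
  $\bar E$    c      d      c+d

Will use notation $D = 1$ instead of $D$, etc., henceforth.
In this setting the odds-ratio $\exp(\beta)$ is estimated by $\frac{ad}{bc}$.

Logistic regression for un-matched case-control studies:
Assume that
  sampling to the case-control study depends only on disease-status, not on covariates.
Then
  $P(D = 1 | x, \text{sampled}) = \frac{\exp(\alpha^* + \beta x)}{1 + \exp(\alpha^* + \beta x)}$
where
  $\alpha^* = \alpha + \log(\pi_1 / \pi_0)$, with $\pi_1 = P(\text{sampled} | D = 1)$ and $\pi_0 = P(\text{sampled} | D = 0)$.
Can estimate odds-ratio $\exp(\beta)$ from case-control data!

Proof: Logistic regression un-matched studies
Let "sampled" be the indicator for being sampled as case or control. By Bayes' rule
  $P(D = 1 | x, \text{sampled}) = \frac{P(\text{sampled} | D = 1)\, P(D = 1 | x)}{P(\text{sampled} | D = 1)\, P(D = 1 | x) + P(\text{sampled} | D = 0)\, P(D = 0 | x)}$
  $= \frac{\pi_1 \exp(\alpha + \beta x)}{\pi_1 \exp(\alpha + \beta x) + \pi_0} = \frac{\exp(\alpha^* + \beta x)}{1 + \exp(\alpha^* + \beta x)}$
where $\alpha^* = \alpha + \log(\pi_1 / \pi_0)$.
Note: The argument shows that the estimates are valid.
We actually have that the estimates are maximum likelihood, so standard errors, tests, etc. are also valid.

Multivariate logistic regression for cohort studies
Several covariates (confounders) adjusted for in model
  $P(D = 1 | x_1, \ldots, x_p) = \frac{\exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)}$
Then $\exp(\beta_j)$ is the odds-ratio for a unit increase in $x_j$ with the other covariates held fixed.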
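The Bayes' rule argument for un-matched case-control sampling says that only the intercept of the logistic model changes, by log(pi_case / pi_control). This can be verified numerically. The parameter values and sampling fractions below are hypothetical, chosen only to illustrate the identity.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1 / (1 + math.exp(-eta))


def prob_disease_given_sampled(alpha, beta, x, pi_case, pi_control):
    """P(D = 1 | x, sampled) by Bayes' rule, when sampling depends on D only."""
    p = expit(alpha + beta * x)
    numerator = pi_case * p
    return numerator / (numerator + pi_control * (1 - p))


alpha, beta = -4.0, 0.7            # hypothetical cohort parameters
pi_case, pi_control = 1.0, 0.01    # hypothetical sampling fractions
alpha_star = alpha + math.log(pi_case / pi_control)

for x in (0.0, 1.0, 2.5):
    via_bayes = prob_disease_given_sampled(alpha, beta, x, pi_case, pi_control)
    via_shift = expit(alpha_star + beta * x)
    print(f"x = {x}: Bayes = {via_bayes:.6f}, shifted intercept = {via_shift:.6f}")
```

The two columns agree for every x, which is why the slope, and hence the odds-ratio, is estimable from the sampled data.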
Multivariate logistic regression for un-matched case-control studies:
Again assume that
  sampling to the case-control study depends only on disease-status, not on covariates.
Then, just like for univariate logistic regression,
  $P(D = 1 | x_1, \ldots, x_p, \text{sampled}) = \frac{\exp(\alpha^* + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\alpha^* + \beta_1 x_1 + \cdots + \beta_p x_p)}$
where
  $\alpha^* = \alpha + \log(\pi_1 / \pi_0)$
and
  $\pi_1 = P(\text{sampled} | D = 1)$, $\pi_0 = P(\text{sampled} | D = 0)$
are sampling fractions among cases and controls.
Can estimate adjusted odds-ratios $\exp(\beta_j)$ from case-control data! Only the intercept is changed.

NB. Logistic regression for case-control requires
that the cohort model
  $P(D = 1 | x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
is valid. If for instance the model is linear (risk difference model)
  $P(D = 1 | x) = \alpha + \beta x$
the argument will not hold.
However, if the sampling fractions $\pi_1$ and $\pi_0$ are known one can do weighted regression with weights $1/\pi_1$ for cases and $1/\pi_0$ for controls. STATA with "probability weighting" will produce correct standard errors.

Ex: Dysmelia = missing fingers, parts of arm, toes, etc.
21 cases and 107 controls from Grenland and Mo i Rana.
[Table: crude odds-ratio (CrOR) with 95% CI and adjusted odds-ratio (AdjOR) with 95% CI for: Pregnant after using contraceptive pills (ref. No), Mother smokes (ref. No), High maternal education (ref. Low), Pregnant in spring time (ref. No), Prev. spontaneous abortion (ref. No)]

Matching
Logistic regression is one method for controlling confounding.
Another method is to match on (some of) the confounders:
  For each case, sample controls with the same value on the confounder as the case.
Typical matching factors: Age, sex, neighborhood, family, ...
Matching can also give some efficiency improvements and is generally a more flexible method for controlling the confounding factors. However, one can not estimate effects of the matching factor.
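The idea behind weighting by the inverse sampling fractions can be illustrated on cell counts alone: weighting each sampled case by 1/pi_case and each sampled control by 1/pi_control rebuilds the cohort table, so risks and the relative risk become estimable again. This is a simplified sketch, not the weighted regression procedure itself, and it assumes (hypothetically) that all cases and 1% of the non-diseased were sampled in the bladder cancer example.

```python
def weighted_risk(cases, controls, pi_case, pi_control):
    """Inverse-probability-weighted estimate of the disease risk in one exposure group."""
    weighted_cases = cases / pi_case
    weighted_controls = controls / pi_control
    return weighted_cases / (weighted_cases + weighted_controls)


# Assumed sampling fractions: all cases, 1% of non-diseased individuals.
pi_case, pi_control = 1.0, 0.01

risk_exposed = weighted_risk(118, 69, pi_case, pi_control)
risk_unexposed = weighted_risk(257, 299, pi_case, pi_control)
rr_weighted = risk_exposed / risk_unexposed

print(f"weighted P(D|E)     = {risk_exposed:.4f}")
print(f"weighted P(D|not E) = {risk_unexposed:.4f}")
print(f"weighted RR         = {rr_weighted:.2f}")
```

Without the weights the same counts would give risks of 118/187 and 257/556, which reflect the sampling design rather than the population.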
Matched sets
  - 1:1 matching: Select one control for each case
  - 1:K matching: Select $K$ controls for each case
  - M:K matching: Select $K$ controls for a group of $M$ cases
If $M$ and $K$ are large the design is often referred to as a stratified design.
Methods of analysis may differ with these different sizes.

Odds-ratio in matched study: Mantel-Haenszel estimate
A matched study with binary exposure can be represented by 2x2 tables for all matched sets $i$:

              D        $\bar D$
  E           $a_i$    $b_i$
  $\bar E$    $c_i$    $d_i$

where $a_i + c_i$ = no. cases in set $i$ (=1 with 1:K matching), $b_i + d_i$ = no. controls in set $i$ (=1 with 1:1 matching). Let also $n_i = a_i + b_i + c_i + d_i$. Then the odds-ratio is estimated by
  $\widehat{OR}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$

Mantel-Haenszel by 1:1 matching
Discordant pairs:
  1) Case exposed ($a_i = 1$) and control non-exposed ($d_i = 1$)
  2) Case non-exposed ($c_i = 1$) and control exposed ($b_i = 1$)
No. pairs of type 1: $n_1$. No. pairs of type 2: $n_2$.
The Mantel-Haenszel estimate then becomes
  $\widehat{OR}_{MH} = \frac{n_1}{n_2}$
i.e. the ratio of the no. discordant pairs. Furthermore the variance estimate of $\log \widehat{OR}_{MH}$ is given by
  $\frac{1}{n_1} + \frac{1}{n_2}$

Conditional logistic regression with 1:K matching
Logistic regression model: $D$ = disease-indicator for an individual in set no. $i$:
  $P(D = 1 | x, \text{set } i) = \frac{\exp(\alpha_i + \beta x)}{1 + \exp(\alpha_i + \beta x)}$
where $\alpha_i$ differ between sets (nuisance parameters).
Can "condition out" the nuisance $\alpha_i$ and estimate $\beta$ by maximizing the conditional likelihood
  $L(\beta) = \prod_i P(\text{the observed case is the case} \mid \text{one case in set } i) = \prod_i \frac{\exp(\beta x_{i,\text{case}})}{\sum_{j \in \text{set } i} \exp(\beta x_{ij})}$
which is on the same form as a Cox-likelihood.
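The Mantel-Haenszel estimate over matched sets, and its reduction to the ratio of discordant pairs under 1:1 matching, can be sketched as follows. The pair counts are hypothetical and serve only to illustrate the formula; each table is given as (a, b, c, d) = (exposed cases, exposed controls, non-exposed cases, non-exposed controls).

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds-ratio: sum(a*d/n) / sum(b*c/n) over matched sets."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical 1:1 matched pairs.
pairs = (
    [(1, 0, 0, 1)] * 30    # type 1: case exposed, control non-exposed
    + [(0, 1, 1, 0)] * 12  # type 2: case non-exposed, control exposed
    + [(1, 1, 0, 0)] * 8   # concordant: both exposed
    + [(0, 0, 1, 1)] * 20  # concordant: neither exposed
)

or_mh = mantel_haenszel_or(pairs)
print(f"Mantel-Haenszel OR = {or_mh:.2f}")
```

Concordant pairs contribute zero to both sums, so only the 30 and 12 discordant pairs determine the estimate.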
Conditional logistic 1:K matching and Cox-regression
Actually the conditional likelihood is on the same form as a stratified Cox-regression. May fit the model with a program for Cox-regression where
  - Status variable is indicator for case
  - For time variable use a common arbitrary value, e.g. 1 for all individuals
  - Covariates as in Cox-regression
  - Variable that represents matched set is used as stratum variable
Estimates and tests from Cox-regression are valid!

Analysing M:K matched data
  - More complex conditional likelihood. Can not use Cox-regression.
  - Special programs available: Egret, Epicure, LogXact, SAS, ....
  - With stratified studies, M and K large, use standard logistic regression with stratum as categorical covariate
  - Estimates for usual covariates are almost unbiased
  - Estimates for the stratum variable are confounded with sampling fractions in the stratum and not interpretable.
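The conditional likelihood for 1:K matched sets, which the Cox-regression trick maximizes, can also be maximized directly for a single covariate. The sketch below uses a plain Newton-Raphson iteration in pure Python; the data are hypothetical 1:1 pairs with binary exposure, for which the estimate should equal the log of the ratio of discordant pairs. It is an illustration under these assumptions, not a replacement for the dedicated programs named in the text.

```python
import math


def score_and_information(beta, matched_sets):
    """First derivative and negative second derivative of the conditional log-likelihood.

    matched_sets: list of (x_case, [x_control, ...]) with one case per set.
    """
    score, information = 0.0, 0.0
    for x_case, x_controls in matched_sets:
        xs = [x_case] + list(x_controls)
        weights = [math.exp(beta * x) for x in xs]
        total = sum(weights)
        mean = sum(w * x for w, x in zip(weights, xs)) / total
        mean_sq = sum(w * x * x for w, x in zip(weights, xs)) / total
        score += x_case - mean
        information += mean_sq - mean ** 2
    return score, information


def fit_conditional_logistic(matched_sets, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson maximization of the conditional likelihood for one covariate."""
    for _ in range(max_iter):
        score, information = score_and_information(beta, matched_sets)
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical 1:1 pairs: 30 and 12 discordant pairs, 28 concordant pairs.
sets = [(1, [0])] * 30 + [(0, [1])] * 12 + [(1, [1])] * 8 + [(0, [0])] * 20
beta_hat = fit_conditional_logistic(sets)
print(f"beta = {beta_hat:.4f}, OR = {math.exp(beta_hat):.2f}")
```

The same functions accept sets with several controls per case, since each set only needs the case covariate and a list of control covariates.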