Nested case-control and case-cohort studies

Author: Abel Freeman
An introduction and some new developments

Pre-course 13. Norwegian Epidemiology Conference
Tromsø 23-24. November 2005

Ørnulf Borgan and Sven Ove Samuelsen
Department of Mathematics, University of Oslo

Outline of course

Wednesday 23/11
  11:00-11:45  Introduction to case-control studies (SOS)
  12:00-12:45  More on traditional case-control studies (SOS)
  14:00-14:45  Introduction to Cox-regression (ØB)
  15:00-15:45  Nested case-control studies (ØB)
  16:15-17:00  Case-cohort studies (SOS)

Thursday 24/11 at Rica Ishavshotell
  8:30-9:15    Countermatching (ØB)
  9:30-10:30   Stratified case-cohort studies (SOS)

Outline for first two hours

In general on cohort and case-control studies
Relative risks and odds-ratios
Efficiency comparisons between case-control and cohort studies
Logistic regression for cohort and unmatched studies
Matched studies: Mantel-Haenszel and conditional logistic regression

Cohort and Case-Control

Cohort study (prospective)
  Exposure at start of study
  Disease after follow-up of all individuals

Case-control study (retrospective)
  Case-status from registry
  Exposure on cases and sample of eligible controls

Main ideas:

Cohort study: Compare disease rates between
  exposed individuals
  non-exposed individuals
Higher incidence among exposed points to causes of disease.

Case-control study: Compare exposure characteristics between
  diseased individuals: cases
  non-diseased individuals: controls
Higher level of exposure among cases also points to exposure being associated with causes of disease.

Example: Bladder cancer in suspect industry

375 cases of bladder cancer
368 controls (no bladder cancer)

  Employed in suspect industry   Cases   Controls
  Yes                             118       69
  No                              257      299
  Total                           375      368

  31.5 % = 118/375 exposed among cases
  18.8 % = 69/368 exposed among controls

Exposure "suspect industry" seems to be associated with disease.

Data sources case-control

Cases:
  Registries
  Within cohort study
  Patients with a specific disease in a hospital
Controls:
  General population
  Within same cohort study
  Patients with other diseases in the hospital

Cases and controls should be selected from the same population.

Cohort or Case-Control?

One will typically only take a sample of all eligible non-diseased individuals.

Case-control study will be cheaper and less time-consuming than a cohort study and can provide almost as precise risk estimates.

However, case-control studies can be subject to
  Selection bias: Cases and controls not selected from same population
  Recall bias: Retrospective information gathered from cases and controls need not be equally reliable
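The two percentages in the bladder cancer example can be reproduced directly from the table counts. The following is a minimal sketch; the dictionary layout and the function name are illustrative choices, not part of the course material.

```python
# Counts from the bladder cancer table (exposure = employed in suspect industry).
cases = {"exposed": 118, "unexposed": 257}
controls = {"exposed": 69, "unexposed": 299}


def exposed_fraction(group):
    """Fraction of a group (cases or controls) that was exposed."""
    total = group["exposed"] + group["unexposed"]
    return group["exposed"] / total


p_cases = exposed_fraction(cases)        # 118 / 375
p_controls = exposed_fraction(controls)  # 69 / 368

print(f"exposed among cases:    {100 * p_cases:.1f} %")
print(f"exposed among controls: {100 * p_controls:.1f} %")
```

A higher exposed fraction among cases than among controls is what the case-control comparison looks for.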

Example: Smoking and lung cancer, Doll & Hill (1951)

[Table: lung cancer cases and hospital controls by no. cigarettes per day, in the groups 0, 1-4, 5-14, 15-24, 25-49 and 50+; the counts are not recoverable]

The distribution of cigarettes smoked seems to be shifted towards higher values among the lung cancer cases.

We can test for differences between distributions among cases and controls in several ways, for instance
  Chi-square test for "homogeneity" in table
  t-test assigning a number of cigarettes to each exposure group

Lung cancer example cont.

For this table the Pearson $\chi^2 = \ldots$ on 5 d.f., with p-value $\ldots$

Assigning no. cigarettes 0, 2.5, 10, 20, 37.5, 60 to groups 0, 1-4 etc. cigarettes per day we get

  Mean cig. among lung cancer cases   20.84   (std. dev. 14.07)
  Mean cig. among control patients    15.89   (std. dev. 11.69)

which leads to a t-statistic $\ldots$ and again a p-value $\ldots$

(Actually this study was matched on age and sex and somewhat different tests are more appropriate. More on matching later.)

Relative risks in cohort study (population)

Cohort (or prospective) study:
  Exposure for all in a population at start of study
  Disease registered during study

With dichotomous (0/1) exposure:
  $E$ = exposed, $\bar{E}$ = non-exposed
  $D$ = disease (case), $\bar{D}$ = non-diseased (control)
we have the rates (or probability) of disease among exposed and non-exposed as $P(D|E)$ and $P(D|\bar{E})$.

Relative risk (RR) is thus given as
  $\mathrm{RR} = \frac{P(D|E)}{P(D|\bar{E})}$

Example: Bladder cancer and suspect industry

Assume we had complete population data for the bladder cancer data (we don't!) with data as in this table

  Suspect industry   Case   Non-diseased   Total
  Yes                 118       6900        7018
  No                  257      29900       30157

we would estimate disease rates
  $\hat{P}(D|E) = \frac{118}{7018} = 0.0168$ and $\hat{P}(D|\bar{E}) = \frac{257}{30157} = 0.0085$
leading to
  $\widehat{\mathrm{RR}} = \frac{118/7018}{257/30157} = 1.97$

Bladder cancer twice as common in suspect industry.
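The relative risk in the artificial bladder cancer cohort can be checked with a few lines of code. This is a sketch only; the function name and the argument order (diseased and non-diseased counts among exposed, then among non-exposed) are assumptions made for illustration.

```python
def relative_risk(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Relative risk from cohort counts: risk among exposed / risk among non-exposed."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return risk_exposed / risk_unexposed


# Artificial cohort: 118 of 7018 exposed and 257 of 30157 non-exposed are cases.
rr = relative_risk(118, 6900, 257, 29900)
print(f"RR = {rr:.2f}")
```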

Odds-ratio

Instead of relative risks we often use odds-ratios defined by means of Odds.

Among exposed:
  $\mathrm{Odds}(D|E) = \frac{P(D|E)}{1-P(D|E)} = \frac{P(D|E)}{P(\bar{D}|E)}$
Among unexposed:
  $\mathrm{Odds}(D|\bar{E}) = \frac{P(D|\bar{E})}{1-P(D|\bar{E})} = \frac{P(D|\bar{E})}{P(\bar{D}|\bar{E})}$

The Odds-ratio is then defined as
  $\mathrm{OR} = \frac{\mathrm{Odds}(D|E)}{\mathrm{Odds}(D|\bar{E})} = \frac{P(D|E)\,P(\bar{D}|\bar{E})}{P(\bar{D}|E)\,P(D|\bar{E})}$

Why Odds-ratio?

  Approximation to relative risk when incidence is low: $\mathrm{OR} \approx \mathrm{RR}$
  Parameter-interpretation in logistic regression
  Can be estimated in case-control studies (as we will see)

Approximation $\mathrm{OR} \approx \mathrm{RR}$ when incidence is small

[Table: relative risk and odds-ratio for a range of values of $P(D|E)$ and $P(D|\bar{E})$; the entries are not recoverable]

Thus the difference is acceptably small with … as high as 0.20.

In general we either have
  $\mathrm{OR} = \mathrm{RR} = 1$, $\mathrm{OR} < \mathrm{RR} < 1$ or $\mathrm{OR} > \mathrm{RR} > 1$.
Thus RR is always closer to one.

Estimation of RR and OR in cohort study

From 2x2 table over a = no. of subjects that are exposed and diseased, etc.

               $D$    $\bar{D}$    Total
  $E$           a       b          a+b
  $\bar{E}$     c       d          c+d

we estimate $P(D|E)$ with $\frac{a}{a+b}$ and $P(D|\bar{E})$ with $\frac{c}{c+d}$. Thus the estimate for relative risk becomes
  $\widehat{\mathrm{RR}} = \frac{a/(a+b)}{c/(c+d)}$
while the odds-ratio is estimated by
  $\widehat{\mathrm{OR}} = \frac{a/b}{c/d} = \frac{ad}{bc}$
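To see how the odds-ratio drifts away from the relative risk as disease becomes more common, one can tabulate the OR implied by a fixed RR at several baseline risks. The sketch below uses a hypothetical RR of 2 and hypothetical baseline risks; these numbers are not taken from the slides.

```python
def odds(p):
    """Odds corresponding to a probability p (0 < p < 1)."""
    return p / (1.0 - p)


def odds_ratio(p_exposed, p_unexposed):
    """OR = Odds(D|E) / Odds(D|not E)."""
    return odds(p_exposed) / odds(p_unexposed)


def or_for_given_rr(p_unexposed, rr):
    """OR implied by a baseline risk and a relative risk (needs rr * p_unexposed < 1)."""
    return odds_ratio(rr * p_unexposed, p_unexposed)


hypothetical_rr = 2.0
for p0 in (0.001, 0.01, 0.05, 0.10, 0.20):
    implied_or = or_for_given_rr(p0, hypothetical_rr)
    print(f"P(D|not E) = {p0:.3f}   RR = {hypothetical_rr:.1f}   OR = {implied_or:.3f}")
```

In every row the OR lies further from one than the RR, and the gap shrinks as the baseline risk goes to zero.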

The artificial cohort Bladder-cancer data

We assumed we had cohort data as in the table

  Suspect industry   Case   Non-diseased   Total
  Yes                 118       6900        7018
  No                  257      29900       30157

Then
  $\widehat{\mathrm{OR}} = \frac{118 \cdot 29900}{257 \cdot 6900} = 1.99$
whereas
  $\widehat{\mathrm{RR}} = 1.97$ (as previously calculated).

2x2 table in un-matched case-control

In a case-control study we know the number of cases and the number of controls, thus now the column marginals are fixed.

               $D$    $\bar{D}$
  $E$           a       b
  $\bar{E}$     c       d
  Total        a+c     b+d

Then we may estimate $P(E|D)$ by $\frac{a}{a+c}$ and $P(E|\bar{D})$ by $\frac{b}{b+d}$.

However, without knowledge of sampling fractions, we can not estimate $P(D|E)$ and $P(D|\bar{E})$ and so neither can we estimate the relative risk RR.

Odds-ratio from case-control studies

However, the odds-ratio estimate $\widehat{\mathrm{OR}} = \frac{ad}{bc}$ is valid also in case-control studies.

This is so because the case-control study allows estimation of the odds of exposure for cases and controls
  Among cases: $\mathrm{Odds}(E|D) = \frac{P(E|D)}{1-P(E|D)}$
  Among controls: $\mathrm{Odds}(E|\bar{D}) = \frac{P(E|\bar{D})}{1-P(E|\bar{D})}$
and since
  $\mathrm{OR} = \frac{\mathrm{Odds}(D|E)}{\mathrm{Odds}(D|\bar{E})} = \frac{\mathrm{Odds}(E|D)}{\mathrm{Odds}(E|\bar{D})}$.

Why?

If you really want to verify this mathematical fact use that conditional probabilities are defined as
  $P(E|D) = \frac{P(E \text{ and } D)}{P(D)}$
and similarly for other terms involved. Both ratios of odds then reduce to the same expression in the joint probabilities,
  $\frac{P(E \text{ and } D)\,P(\bar{E} \text{ and } \bar{D})}{P(\bar{E} \text{ and } D)\,P(E \text{ and } \bar{D})}$.
This is standard algebra, although rather boring and somewhat tedious.
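The claim that the odds-ratio survives control sampling while the relative risk does not can be checked numerically on the two bladder cancer tables: the artificial cohort (118, 6900, 257, 29900) and the case-control sample (118, 69, 257, 299). The sketch below assumes the cell order a, b, c, d = exposed cases, exposed non-cases, non-exposed cases, non-exposed non-cases; the function names are illustrative.

```python
def disease_odds_ratio(a, b, c, d):
    """Odds(D|E) / Odds(D|not E) = (a/b) / (c/d)."""
    return (a / b) / (c / d)


def exposure_odds_ratio(a, b, c, d):
    """Odds(E|D) / Odds(E|not D) = (a/c) / (b/d)."""
    return (a / c) / (b / d)


def naive_relative_risk(a, b, c, d):
    """Row-wise 'risk' ratio; only meaningful when the rows are a true cohort."""
    return (a / (a + b)) / (c / (c + d))


cohort = (118, 6900, 257, 29900)
case_control = (118, 69, 257, 299)

print("OR, cohort:            ", round(disease_odds_ratio(*cohort), 3))
print("OR, case-control:      ", round(exposure_odds_ratio(*case_control), 3))
print("RR, cohort:            ", round(naive_relative_risk(*cohort), 3))
print("'RR', case-control:    ", round(naive_relative_risk(*case_control), 3))
```

The two odds-ratios agree, whereas the row-wise ratio computed from the case-control table is not an estimate of the cohort RR.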

Estimation of OR in case-control study

The estimates of exposure-odds among cases and controls are
  $\widehat{\mathrm{Odds}}(E|D) = \frac{a/(a+c)}{c/(a+c)} = \frac{a}{c}$ and $\widehat{\mathrm{Odds}}(E|\bar{D}) = \frac{b/(b+d)}{d/(b+d)} = \frac{b}{d}$
and this gives
  $\widehat{\mathrm{OR}} = \frac{a/c}{b/d} = \frac{ad}{bc}$
also in a case-control study (Cornfield, 1951).

Alternative argument

Population:
               Disease    Not disease
  Exposed      $N_a$        $N_b$
  Not exp.     $N_c$        $N_d$

Case-control:
               Case       Control
  Exposed        a           b
  Not exp.       c           d

With probabilities of being included in case-control study $\pi_1$ for cases and $\pi_0$ for controls we have
  $a \approx \pi_1 N_a$, $c \approx \pi_1 N_c$ and $b \approx \pi_0 N_b$, $d \approx \pi_0 N_d$
and hence
  $\widehat{\mathrm{OR}} = \frac{ad}{bc} \approx \frac{\pi_1 N_a\,\pi_0 N_d}{\pi_0 N_b\,\pi_1 N_c} = \frac{N_a N_d}{N_b N_c}$

Example: Lung cancer and no. cigarettes

The argument can be made for more than two exposure levels, f.ex. groups of no. cigarettes.

[Table: cancer cases, controls, exposure odds and odds-ratio by no. cigarettes per day; the values are not recoverable]

For instance the odds-ratio between non-smokers and those that smoke 1-4 cigarettes becomes $\ldots$

Confidence interval for OR:

Woolf's formula: Variance estimate for log-odds-ratio
  $\widehat{\mathrm{var}}(\log \widehat{\mathrm{OR}}) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$ and $\mathrm{se} = \sqrt{\widehat{\mathrm{var}}(\log \widehat{\mathrm{OR}})}$

This gives a 95% confidence interval for OR:
  $\widehat{\mathrm{OR}} \cdot \exp(\pm 1.96\,\mathrm{se})$
by a normal distribution approximation when a, b, c and d are all "large".
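As an illustration of the confidence interval for the odds-ratio, the variance formula for the log-odds-ratio (the sum of the reciprocal cell counts) can be applied to the bladder cancer case-control table. This is a sketch; the function name and the default multiplier 1.96 for a 95% interval are choices made here.

```python
import math


def log_or_confidence_interval(a, b, c, d, z=1.96):
    """Odds-ratio ad/(bc) with a normal-approximation CI on the log scale.

    a, b = exposed cases / exposed controls
    c, d = non-exposed cases / non-exposed controls
    """
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = or_hat * math.exp(-z * se)
    upper = or_hat * math.exp(z * se)
    return or_hat, se, lower, upper


# Bladder cancer case-control data: 118/69 exposed, 257/299 non-exposed.
or_hat, se, lower, upper = log_or_confidence_interval(118, 69, 257, 299)
print(f"OR = {or_hat:.2f}, se(log OR) = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

All four cells are large here, so the normal approximation is reasonable; with a very small cell the interval should be treated with caution.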

Examples CI

Ex) Lung-cancer: 1-4 cig vs Non-smoke:
  $\widehat{\mathrm{OR}} = \ldots$, $\mathrm{se} = \ldots$, so the 95% CI = $\ldots$

The normal approximation this CI relies on, though, is shaky (b=2 isn't really big).

Exact methods may give a better confidence interval.

Efficiency between study designs

Assume that two designs allow estimation of the same quantity. The efficiency of design 2 relative to design 1 is then defined as
  $\text{Efficiency} = \frac{\text{Variance with design 1}}{\text{Variance with design 2}}$

Then since variances approximately are proportional to $1/(\text{sample size})$, the interpretation of an efficiency is:
  Reduction in sample size with design 1 that gives the same precision as design 2 (if design 1 more efficient than design 2).

Efficiency case-control vs. cohort

Assume a case-control study with all cases and with $K$ controls per case. Total no. controls is then $K$ times the number of cases.

Assume also OR=1, thus no effect of exposure. We would then have $c \approx K a$ and $d \approx K b$, where a = no. exposed cases, b = no. non-exposed cases, c = no. exposed controls and d = no. non-exposed controls.

By Woolf's formula the case-control variance:
  $\frac{1}{a} + \frac{1}{b} + \frac{1}{Ka} + \frac{1}{Kb} = \left(1 + \frac{1}{K}\right)\left(\frac{1}{a} + \frac{1}{b}\right)$

Efficiency case-control vs. cohort, contd.

When the disease is rare the numbers of available exposed and non-exposed controls in the cohort are large compared to the numbers of cases a and b, thus the cohort variance is approximately
  $\frac{1}{a} + \frac{1}{b}$
where a = no. exposed cases and b = no. non-exposed cases.

The efficiency becomes
  $\frac{\text{Cohort variance}}{\text{Case-control variance}} = \frac{1}{1 + 1/K} = \frac{K}{K+1}$
which tends to 1 when K large, and there is little gain by more than K = 4-5.

These efficiencies are approximately valid when OR not very different from 1 and exposure not very rare.
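The efficiency K/(K+1) of a case-control study with K controls per case, relative to the full cohort, can be tabulated and checked against the variance expressions it was derived from. A minimal sketch follows; the counts a = 40 exposed and b = 160 non-exposed cases are hypothetical and only used to confirm that the ratio does not depend on them.

```python
def efficiency(k):
    """Approximate efficiency of K controls per case relative to the full cohort."""
    return k / (k + 1)


def efficiency_from_counts(a, b, k):
    """Same quantity computed from the variances, assuming OR = 1 and a rare disease.

    a, b = exposed / non-exposed cases; controls are then about k*a and k*b.
    """
    cohort_variance = 1 / a + 1 / b
    case_control_variance = 1 / a + 1 / b + 1 / (k * a) + 1 / (k * b)
    return cohort_variance / case_control_variance


for k in (1, 2, 3, 4, 5, 10):
    print(f"K = {k:2d}   efficiency = {efficiency(k):.2f}")
```

Going from K = 1 to K = 4 raises the efficiency from 0.50 to 0.80, while going from 5 to 10 adds less than 0.08.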

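Example Efficiency: Bladder cancer

Artificial cohort study
  $\widehat{\mathrm{OR}} = 1.99$, $\widehat{\mathrm{var}}(\log \widehat{\mathrm{OR}}) = \frac{1}{118} + \frac{1}{6900} + \frac{1}{257} + \frac{1}{29900} = 0.01254$

Case-control study
  $\widehat{\mathrm{OR}} = 1.99$, $\widehat{\mathrm{var}}(\log \widehat{\mathrm{OR}}) = \frac{1}{118} + \frac{1}{69} + \frac{1}{257} + \frac{1}{299} = 0.03020$

  $\text{Efficiency} = \frac{0.01254}{0.03020} = 0.415$, somewhat smaller than 0.5 corresponding to a 1:1 case:control ratio.

Artificial bladder-cancer data, contd.

Assumed case-control study with $K$ controls per case

  Suspect industry   Case   Non-diseased
  Yes                 118       …
  No                  257       …

For different values of $K$ we get the following efficiencies

[Table: efficiency by $K$; the values are not recoverable]

Logistic regression model for cohort studies

Let $D$ be an indicator for disease and $x$ a covariate for an individual. Assume that the probability the individual has the disease can be written
  $P(D=1|x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$

Then the odds of having the disease equals
  $\mathrm{Odds}(x) = \frac{P(D=1|x)}{1 - P(D=1|x)} = \exp(\alpha + \beta x)$
and the odds-ratio between two individuals with covariates $x_1$ and $x_0$ becomes
  $\mathrm{OR} = \frac{\mathrm{Odds}(x_1)}{\mathrm{Odds}(x_0)} = \exp(\beta (x_1 - x_0))$

Logistic regression and binary exposure

Let $x = 1$ if an individual is exposed and $x = 0$ if the individual is un-exposed. Then the model for disease can be written as a logistic regression model
  $P(D=1|x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
with
  $\mathrm{Odds}(1) = \exp(\alpha + \beta)$ and $\mathrm{Odds}(0) = \exp(\alpha)$
so
  $\mathrm{OR} = \frac{\mathrm{Odds}(1)}{\mathrm{Odds}(0)} = \exp(\beta)$.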
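With a single binary exposure the logistic regression fit has a closed form, which ties the regression coefficient back to the 2x2 table. The sketch below writes the model as logit P(D=1|x) = alpha + beta*x (this notation is an assumption made here) and checks that alpha = log(c/d), beta = log(ad/(bc)) make the likelihood score vanish for the artificial bladder cancer cohort.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def score(alpha, beta, a, b, c, d):
    """Gradient of the binomial log-likelihood for logit P(D=1|x) = alpha + beta*x.

    a, b = diseased / non-diseased among exposed (x = 1)
    c, d = diseased / non-diseased among non-exposed (x = 0)
    """
    p1 = expit(alpha + beta)
    p0 = expit(alpha)
    d_alpha = (a - (a + b) * p1) + (c - (c + d) * p0)
    d_beta = a - (a + b) * p1
    return d_alpha, d_beta


a, b, c, d = 118, 6900, 257, 29900   # artificial bladder cancer cohort
alpha_hat = math.log(c / d)                # log-odds of disease among non-exposed
beta_hat = math.log((a * d) / (b * c))     # log odds-ratio

print("score at closed-form estimate:", score(alpha_hat, beta_hat, a, b, c, d))
print(f"exp(beta_hat) = {math.exp(beta_hat):.2f}")
```

The exponentiated coefficient is the cross-product ratio ad/(bc), which is why logistic regression output can be read as odds-ratios.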

Binary exposure and 2x2 table

This framework with binary exposure and binary outcome can be put up in a 2x2 table:

               $D$    $\bar{D}$    Total
  $E$           a       b          a+b
  $\bar{E}$     c       d          c+d

Will use notation $D = 1$ instead of $D$, etc., henceforth.

Logistic regression for un-matched case-control studies:

Assume that
  the logistic model $P(D=1|x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$ holds in the cohort
  sampling to case-control study does only depend on disease-status, not on covariates
Then
  $P(D=1|x, \text{sampled}) = \frac{\exp(\alpha^* + \beta x)}{1 + \exp(\alpha^* + \beta x)}$
where
  $\alpha^* = \alpha + \log(\pi_1/\pi_0)$
and where
  $\pi_1 = P(\text{sampled}\,|\,D=1)$ and $\pi_0 = P(\text{sampled}\,|\,D=0)$.

Can estimate odds-ratio $\mathrm{OR} = \exp(\beta)$ from case-control data!

Proof: Logistic regression un-matched studies

Let $S$ be the indicator for being sampled as case or control. In this setting $P(S=1|D=1,x) = \pi_1$ and $P(S=1|D=0,x) = \pi_0$. By Bayes' rule
  $P(D=1|x, S=1) = \frac{P(S=1|D=1,x)\,P(D=1|x)}{P(S=1|D=1,x)\,P(D=1|x) + P(S=1|D=0,x)\,P(D=0|x)}$
  $= \frac{\pi_1 \exp(\alpha + \beta x)}{\pi_1 \exp(\alpha + \beta x) + \pi_0} = \frac{\exp(\alpha^* + \beta x)}{1 + \exp(\alpha^* + \beta x)}$

Note: The argument shows that the estimates are valid. We actually have that the estimates are maximum likelihood, so standard error, tests, etc. are also valid.

Multivariate logistic regression for cohort studies

Several covariates (confounders) adjusted for in model
  $P(D=1|x_1, \ldots, x_p) = \frac{\exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p)}$
where … may be … and other ….
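The Bayes' rule step for un-matched case-control sampling can be verified numerically: if sampling depends only on disease status, the disease probability among sampled individuals is again logistic with the same slope and a shifted intercept. In this sketch the parameter values and the sampling fractions `pi_case` and `pi_control` are hypothetical, and the names are illustrative.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def p_disease_given_sampled(x, alpha, beta, pi_case, pi_control):
    """P(D=1 | x, sampled) by Bayes' rule, with sampling depending on D only."""
    p = expit(alpha + beta * x)            # cohort model P(D=1 | x)
    numerator = pi_case * p
    denominator = pi_case * p + pi_control * (1.0 - p)
    return numerator / denominator


def shifted_logistic(x, alpha, beta, pi_case, pi_control):
    """Logistic model with intercept alpha + log(pi_case / pi_control), same slope."""
    alpha_star = alpha + math.log(pi_case / pi_control)
    return expit(alpha_star + beta * x)


# Hypothetical values: rare disease, all cases and 1% of non-cases sampled.
alpha, beta = -4.0, 0.7
pi_case, pi_control = 1.0, 0.01

for x in (0.0, 1.0, 2.5):
    lhs = p_disease_given_sampled(x, alpha, beta, pi_case, pi_control)
    rhs = shifted_logistic(x, alpha, beta, pi_case, pi_control)
    print(f"x = {x}:  Bayes = {lhs:.6f}   shifted logistic = {rhs:.6f}")
```

Only the intercept absorbs the sampling fractions, so the slope, and hence the odds-ratio, is estimable from the sampled data.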

Multivariate logistic regression for un-matched case-control studies:

Again assume that
  sampling to case-control study does only depend on disease-status, not on covariates
Then, just like for univariate logistic regression,
  $P(D=1|x_1, \ldots, x_p, \text{sampled}) = \frac{\exp(\alpha^* + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\alpha^* + \beta_1 x_1 + \cdots + \beta_p x_p)}$
where
  $\alpha^* = \alpha + \log(\pi_1/\pi_0)$
and $\pi_1$ and $\pi_0$ are sampling fractions among cases and controls.

Can estimate adjusted odds-ratios $\exp(\beta_j)$ from case-control data! Only the intercept is changed.

Ex: Dysmeli = missing fingers, parts of arm, toes, etc

21 cases and 107 controls from Grenland and Mo i Rana.

[Table: crude odds-ratios (CrOR) and adjusted odds-ratios (AdjOR), each with 95% CI, for the covariates: Pregnant after using p-pills, Prev. spontaneous abortion, Pregnant in spring time, High maternal education, Mother smokes; the values are not recoverable]

NB. Logistic regression for case-control requires

that the cohort model
  $P(D=1|x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
is valid. If for instance the model is linear (risk difference model)
  $P(D=1|x) = \alpha + \beta x$
the argument will not hold.

However, if the sampling fractions $\pi_1$ and $\pi_0$ are known one can do weighted regression with weights $1/\pi_1$ for cases and $1/\pi_0$ for controls. STATA with "probability weighting" will produce correct standard errors.

Matching

Logistic regression is one method for controlling confounding.

Another method is to match on (some of) the confounders:
  For each case sample controls with same value on confounder as the case
  Typical matching factors: Age, sex, neighborhood, family, ...

Matching can also give some efficiency improvements and is generally a more flexible method for controlling the confounding factors. However, one can not estimate effects of the matching factor.
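When the sampling fractions are known, weighting each sampled individual by the inverse of its sampling probability rebuilds the cohort table, so that quantities such as the risk difference and the relative risk become estimable again. The sketch below applies this to the bladder cancer case-control table under the assumption that all cases and 1% of non-cases were sampled; those fractions are an illustrative assumption, and the code gives point estimates only, not the standard errors that survey-weighted software would add.

```python
def inverse_probability_weighted(a, b, c, d, pi_case, pi_control):
    """Weight case cells (a, c) by 1/pi_case and control cells (b, d) by 1/pi_control."""
    return (a / pi_case, b / pi_control, c / pi_case, d / pi_control)


def risks(a, b, c, d):
    """Row-wise disease proportions: among exposed and among non-exposed."""
    return a / (a + b), c / (c + d)


sampled = (118, 69, 257, 299)        # case-control table: a, b, c, d
pi_case, pi_control = 1.0, 0.01      # assumed sampling fractions

weighted = inverse_probability_weighted(*sampled, pi_case, pi_control)
risk_exposed, risk_unexposed = risks(*weighted)
risk_difference = risk_exposed - risk_unexposed
relative_risk = risk_exposed / risk_unexposed

print(f"weighted risk difference = {risk_difference:.4f}")
print(f"weighted relative risk   = {relative_risk:.2f}")
```

Without the weights, the row-wise proportions in the sampled table (0.63 and 0.46) reflect the sampling design rather than the disease risks.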

Matched sets

1:1 matching: Select one control for each case
1:K matching: Select $K$ controls for each case
M:K matching: Select $K$ controls for a group of $M$ cases

If $M$ and $K$ are large the design is often referred to as a stratified design.

Methods of analysis may differ with these different sizes.

Odds-ratio in matched study: Mantel-Haenszel estimate

A matched study with binary exposure can be represented by 2x2 tables for all matched sets $j$:

               $D$      $\bar{D}$
  $E$          $a_j$     $b_j$
  $\bar{E}$    $c_j$     $d_j$

where
  $a_j + c_j$ = no. cases in set $j$ (=1 with 1:K matching),
  $b_j + d_j$ = no. controls in set $j$ (=1 with 1:1 matching).
Let also $n_j = a_j + b_j + c_j + d_j$. Then the odds-ratio is estimated by
  $\widehat{\mathrm{OR}}_{MH} = \frac{\sum_j a_j d_j / n_j}{\sum_j b_j c_j / n_j}$

Mantel-Haenszel by 1:1 matching

Discordant pairs:
  1) Case exposed ($a_j = 1$) and control non-exposed ($d_j = 1$)
  2) Case non-exposed ($c_j = 1$) and control exposed ($b_j = 1$)

No. pairs of type 1: $n_{10}$
No. pairs of type 2: $n_{01}$

The Mantel-Haenszel estimate then becomes
  $\widehat{\mathrm{OR}}_{MH} = \frac{n_{10}}{n_{01}}$
i.e. the ratio of the no. discordant pairs. Furthermore the variance estimate of $\log \widehat{\mathrm{OR}}_{MH}$ is given by
  $\frac{1}{n_{10}} + \frac{1}{n_{01}}$.

Conditional logistic regression with 1:K matching

Logistic regression model: $D_{ij}$ disease-indicator for an individual $i$ in set no. $j$
  $P(D_{ij} = 1|x_{ij}) = \frac{\exp(\alpha_j + \beta x_{ij})}{1 + \exp(\alpha_j + \beta x_{ij})}$
where $\alpha_j$ differ between sets (nuisance parameters).

Can "condition out" nuisance $\alpha_j$:
  $P(\text{individual } i \text{ is the case}\,|\,\text{one case in set } j) = \frac{\exp(\beta x_{ij})}{\sum_{l \in \text{set } j} \exp(\beta x_{lj})}$
and estimate $\beta$ by maximizing conditional likelihood
  $L(\beta) = \prod_j \frac{\exp(\beta x_{\text{case},j})}{\sum_{l \in \text{set } j} \exp(\beta x_{lj})}$
which is on same form as a Cox-likelihood.
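The Mantel-Haenszel estimate over matched sets, and its reduction to a ratio of discordant pairs under 1:1 matching, can be sketched as follows. The cell order per set (exposed case count, exposed control count, non-exposed case count, non-exposed control count) and the pair counts are hypothetical choices made for this illustration.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds-ratio over matched sets.

    Each table is (a, b, c, d) with
    a, b = exposed cases / exposed controls in the set
    c, d = non-exposed cases / non-exposed controls in the set.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical 1:1 matched data.
type_1 = [(1, 0, 0, 1)] * 30     # case exposed, control non-exposed
type_2 = [(0, 1, 1, 0)] * 12     # case non-exposed, control exposed
both_exposed = [(1, 1, 0, 0)] * 40
none_exposed = [(0, 0, 1, 1)] * 50

pairs = type_1 + type_2 + both_exposed + none_exposed
or_mh = mantel_haenszel_or(pairs)
var_log_or = 1 / len(type_1) + 1 / len(type_2)

print(f"Mantel-Haenszel OR = {or_mh:.2f}")
print(f"var(log OR)        = {var_log_or:.4f}")
```

Concordant pairs contribute zero to both sums, so only the 30 and 12 discordant pairs determine the estimate.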





Conditional logistic 1:K matching and Cox-regression

Actually the conditional likelihood is on the same form as the likelihood of a stratified Cox-regression.

May fit model with program for Cox where
  Status variable is indicator for case
  For time variable use a common arbitrary value, f.ex. 1 for all individuals
  Covariates as in Cox-regression
  Variable that represents matched set used as stratum variable

Estimates and tests from Cox-regression are valid!
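For readers without a Cox-regression program at hand, the conditional likelihood for sets with one case each can also be maximized directly. The sketch below uses a plain Newton-Raphson iteration for a single covariate; the data layout, the function names and the hypothetical 1:1 matched data are assumptions made here, and the iteration has no step-size safeguards.

```python
import math


def conditional_loglik_derivatives(beta, sets):
    """First and second derivative of the conditional log-likelihood.

    sets: list of (x_case, [x_control, ...]) with exactly one case per set.
    Each set contributes exp(beta*x_case) / sum over the set of exp(beta*x).
    """
    gradient = 0.0
    hessian = 0.0
    for x_case, x_controls in sets:
        xs = [x_case] + list(x_controls)
        weights = [math.exp(beta * x) for x in xs]
        s0 = sum(weights)
        s1 = sum(w * x for w, x in zip(weights, xs))
        s2 = sum(w * x * x for w, x in zip(weights, xs))
        mean = s1 / s0
        gradient += x_case - mean
        hessian -= s2 / s0 - mean * mean
    return gradient, hessian


def fit_beta(sets, iterations=25):
    """Newton-Raphson from beta = 0 (a sketch, not a robust optimizer)."""
    beta = 0.0
    for _ in range(iterations):
        gradient, hessian = conditional_loglik_derivatives(beta, sets)
        if hessian == 0.0:
            break
        beta -= gradient / hessian
    return beta


# Hypothetical 1:1 matched data with binary exposure x.
sets = (
    [(1, [0])] * 30      # case exposed, control non-exposed
    + [(0, [1])] * 12    # case non-exposed, control exposed
    + [(1, [1])] * 40    # concordant pairs carry no information
    + [(0, [0])] * 50
)
beta_hat = fit_beta(sets)
print(f"exp(beta_hat) = {math.exp(beta_hat):.3f}")
```

With 1:1 matching and binary exposure the maximizer reproduces the ratio of discordant pairs, here 30/12.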

Analysing M:K matched data

More complex conditional likelihood. Can not use Cox-regression.

Special programs available: Egret, Epicure, LogXact, SAS, ....

With stratified studies, N and K large, use standard logistic regression with stratum as categorical covariate
  Estimates for usual covariates are almost unbiased
  Estimates for the stratum variable are confounded with sampling fractions in the stratum and not interpretable.
