The Analysis of Case-Cohort Studies

Usha Seshadri

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science

Department of Mathematics and Statistics

McGill University, Montreal, Quebec, Canada
© Usha Seshadri, March 1993

Introduction

Over the past decade there has been a vast amount of biostatistical literature on the analysis of survival time data in epidemiology. Generally such data involve a cohort of individuals who are followed up in time in order to investigate a relationship between failure rates (i.e. disease rates or death rates due to a specific cause of interest) and preceding covariate histories. In environmental or occupational cohorts, one is interested in quantifying the effects of exposure on disease incidence or death rates. In order to obtain meaningful results a cohort study usually involves the follow-up of several thousand subjects over a number of years. However, often only a small subset of these subjects will develop the disease of interest during the prescribed follow-up period. In the case of disease-free individuals, information gathered can be largely redundant. This process tends to be both expensive and time consuming since covariate histories need to be assessed for each individual in the cohort. In order to overcome these difficulties, designs which involve sampling the full cohort have been implemented. Such designs reduce the amount of covariate information necessary to carry out meaningful analyses.

Alternatives to the full cohort have been proposed under the heading of synthetic retrospective studies (Mantel 1973). Today this sampling scheme is known as the nested case-control study. In this approach, each subject developing disease (referred to as a case) is matched to one or more subjects without disease (referred to as a control) surviving to the same point in time. Covariate histories need only be assessed for cases and their matched controls. Recently, the case-cohort design has been suggested to be a preferable way of sampling the full cohort (Prentice 1986). This sampling includes all cases within the full cohort and a simple random sample or a stratified random sample of the full cohort, called the sub-cohort. Covariate information is gathered on all cases and all members of the sub-cohort. The idea of a sub-cohort was first investigated under the heading 'hybrid retrospective design' (Kupper, McMichael & Spirtas 1975) and later as the 'case-base' design (Miettinen 1982). These authors considered a binary failure rather than a time to failure, and a binary covariate.

The advantage in using the nested case-control and case-cohort designs over the full cohort is the reduced requirement for obtaining covariate histories. Both designs are considerably cheaper. There are, however, further advantages derived from using the case-cohort design: it enables the selection of, and data collection on, the sub-cohort prior to identifying the cases in the full cohort, and the same sub-cohort can be used to study more than one disease. Furthermore it is possible to obtain approximate standardized mortality or incidence ratios using external hazard rates (Wacholder and Boivin 1987). The disadvantage, on the other hand, is that the analysis is more complex.

The purpose of this thesis is to discuss methods for analyzing data from the case-cohort design under the assumptions of a proportional hazards model. Chapter one will describe a proportional hazards model for the full cohort and give a heuristic derivation of a likelihood function together with a brief summary of estimation in this context. In chapter two the case-cohort design and the associated "pseudo-likelihood" function are described. Estimation of the parameters of interest will be presented. A computer intensive estimate of variance for the parameter estimates and a quicker but cruder estimate which is easier to compute will be described.

In chapter three we apply the methods described in chapter two to a specific example using data taken from an occupational case-cohort study. The data are used to compare the computer intensive and crude estimates of variance. The covariate of interest is studied as a categorical variable in an exponential model and as a continuous variable in an exponential and a linear model.

To illustrate a practical application of the case-cohort design we introduce the following example, which is discussed in detail in chapter three. We shall frequently be referring to this example throughout this exposition.

Example 1.0 A cohort of 16,297 male workers in the aluminium production industry were followed from the start of employment to December 31, 1989 to identify deaths from lung cancer. A total of 338 subjects were diagnosed as having died of lung cancer. A case-cohort design was implemented by selecting a random sample (sub-cohort) stratified by year of birth from the cohort. The sampling fraction within each stratum varied so as to reflect the overall distribution of cases in the full cohort. The sub-cohort consisted of 1,138 subjects of which 62 were cases. The objective of this study was to estimate the relationship between exposure to coal tar pitch volatiles, estimated as benzene soluble material, and lung cancer among aluminium production workers.

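The reduction in covariate assessment implied by the counts in Example 1.0 can be worked out directly. The short Python sketch below uses only the figures quoted in the example; the variable names are illustrative and are not part of the study.

```python
# Counts quoted in Example 1.0
cohort_size = 16297        # male workers in the full cohort
total_cases = 338          # lung cancer deaths in the full cohort
subcohort_size = 1138      # subjects sampled into the sub-cohort
subcohort_cases = 62       # sub-cohort members who were also cases

# Cases that were not sampled into the sub-cohort still need covariate histories
cases_outside_subcohort = total_cases - subcohort_cases

# Everyone whose covariate history must be assessed under the case-cohort design
subjects_with_covariates = subcohort_size + cases_outside_subcohort

# Share of the full cohort that this represents
fraction_assessed = subjects_with_covariates / cohort_size

print(cases_outside_subcohort, subjects_with_covariates, round(fraction_assessed, 3))
```

On these counts, covariate histories are needed for 1,414 of the 16,297 workers, a little under 9% of the full cohort.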
Abstract

Epidemiologic cohort studies often require following up, and obtaining exposure and confounder information for, several thousand subjects. This process can be time consuming and expensive. A sampling scheme which reduces costs is the case-cohort design. It involves obtaining exposure and confounder information for all cases and a random sample of the remainder of the cohort. In this thesis we consider the analysis of data from a case-cohort sample under the assumptions of a proportional hazards model. Specifically we review and describe techniques for point and interval estimation of parameters and tests of hypotheses, first for a full cohort, then for a case-cohort sample. We apply the methods to a real example involving data from an occupational case-cohort study.

Résumé

Epidemiological cohort studies consist of following a population while obtaining information on the exposure factor and the confounding factors. This process is both lengthy and costly. One type of sampling that reduces this cost, as well as this duration, is the case-cohort epidemiological study. The aim of such a study is to obtain information on the exposure factor and the confounding factors for every case in the cohort and for a random sample of the cohort. In this thesis we present methods for analysing a case-cohort study assuming a Cox proportional hazards model. In particular, we describe techniques for point estimation and confidence intervals for the model parameters, as well as hypothesis tests, for cohort and case-cohort studies. As an example, we present the analysis of data from an occupational case-cohort study.

Acknowledgements

I would like to thank my thesis supervisor Prof. Ben Armstrong for inspiring me to pursue my interests in the case-cohort design. I thank him for all the time, patience and advice he gave me long distance whilst working on this thesis. I wish to thank Prof. David Wolfson for stimulating my interest through a reading course on Survival Analysis and for his guidance. Lastly I thank my father Prof. V. Seshadri and my husband Michael Kreaden for their constant support and encouragement.

Table of Contents

Chapter 1 A Proportional Hazards Model for Cohort Studies
    Figure 1 Full Cohort
    1.1 The Model
    1.2 Censoring
    1.3 The Likelihood Function
    1.4 Types of Covariates
    1.5 Stratification
    1.6 Estimation
    1.7 Hypothesis Testing
    1.8 Confidence Regions
Chapter 2 Sampling From The Full Cohort
    2.0 Nested Case-Control
    Figure 2 Nested Case-Control
    2.1 Case-Cohort
    Figure 3 Case-Cohort
    2.2 Stratified Sub-Cohort
    2.3 Estimation
    2.4 Testing Hypotheses and Confidence Regions
    2.5 Computational Issues
Chapter 3 An Application
    3.0 Description of the Study
    3.1 Standard Analysis
    3.2 An Alternative Method For Case-Cohort Analysis
    3.3 An Empirical Investigation of the Adequacy of the Unadjusted Variance Estimator
    3.4 Small Group Size
Chapter 4 Conclusion
Bibliography

Chapter 1: A Proportional Hazards Model for Cohort Studies

The classical description of a cohort is a group of people who share some common characteristic and are followed through time for the detection of new cases of disease. We refer back to Example 1.0, in which a group of people employed in an aluminium production plant were followed up to find whether and when each had died of lung cancer. We are generally, as in this example, dealing with a large number of subjects and relatively few cases.

In epidemiological studies, the usual time scale used is age (Breslow et al 1983). This distinguishes the epidemiological context from the simpler statistical one in which the primary measure of time in a survival analysis study begins at zero for all individuals (i.e. time is measured as time from entry into the cohort). The age of an individual's entry into the cohort need not necessarily be the same for each subject.

Figure (1) illustrates a hypothetical cohort study in which each horizontal line represents a study subject. Nine subjects are followed up until age eighty and three of these subjects are ascertained as dying from lung cancer. The remaining six subjects died from other causes, or were alive but lost to follow-up.

Figure 1: Full Cohort. (Subjects labelled 1 to 9 on the vertical axis, age on the horizontal axis; symbols distinguish death from lung cancer from death from other causes or loss to follow-up.)

Associated with each subject (i.e. cases and controls) is a label $i$, where $i = 1, \ldots, n$. The $d$ cases are denoted by the labels $i_1, \ldots, i_d$, so that in the figure above the cases are represented by the set of labels $\{i_1, i_2, i_3\}$, with $i_1 = 9$, $i_2 = 5$ and $i_3 = 3$.

1.1 The Model

The cohort is followed in time until the occurrence of an event, for example, death due to lung cancer. The outcome of interest is not only the binary occurrence (yes, no) of the event, but also the time $T$ to (age of) that occurrence. The time to occurrence may be thought of as a survival time. We shall assume that the survival times are continuous so as to avoid the complexity of tied values.

Definition 1.1 Let $T$ be a continuous random variable such that $F(t) = P\{T < t\}$, where $F(t)$ is the distribution function of $T$. The survivor function corresponding to time $T$ is $S(t)$, such that

\[ S(t) = 1 - F(t) = P\{T \ge t\}. \]

Definition 1.2 The hazard function $\lambda(t)$ at time $t$ is the conditional probability of failure (i.e. death or disease incidence), per unit time, in a vanishingly small interval $t$ to $t + \Delta t$ given survival to time $t$:

\[ \lambda(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P\{t \le T < t + \Delta t \mid T \ge t\}. \]

The hazard function is sometimes referred to as the age specific failure rate and serves as a way of quantifying the population disease frequency during a specified time period. The hazard and survivor functions share the relationship

\[ \lambda(t) = \frac{f(t)}{S(t)}, \]

where $f(t)$ is the density of the survival time $T$.
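
The relationship between the hazard, the density and the survivor function can be checked numerically. The Python sketch below is an illustration only: it assumes a Weibull survival time (a choice made here for convenience, not one taken from the thesis) and compares the ratio $f(t)/S(t)$ with a finite-difference version of the limit in Definition 1.2. The function names and parameter values are hypothetical.

```python
import math

def weibull_survivor(t, shape, scale):
    """S(t) for a Weibull survival time (illustrative assumption)."""
    return math.exp(-((t / scale) ** shape))

def weibull_density(t, shape, scale):
    """f(t), the derivative of F(t) = 1 - S(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survivor(t, shape, scale)

def hazard_from_ratio(t, shape, scale):
    """Hazard computed as f(t) / S(t)."""
    return weibull_density(t, shape, scale) / weibull_survivor(t, shape, scale)

def hazard_from_limit(t, shape, scale, dt=1e-6):
    """Approximates (1/dt) * P{t <= T < t + dt | T >= t} for a small dt."""
    s_t = weibull_survivor(t, shape, scale)
    s_next = weibull_survivor(t + dt, shape, scale)
    return (s_t - s_next) / (dt * s_t)

# Hazard at age 60 for hypothetical shape and scale parameters
ratio_value = hazard_from_ratio(60.0, 2.0, 70.0)
limit_value = hazard_from_limit(60.0, 2.0, 70.0)
print(ratio_value, limit_value)
```

The two values agree up to the error of the finite-difference approximation, which is what the definition and the ratio formula together imply.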

Of usual interest is the effect of a covariate on the survival time. The hazard function is the means by which this relationship is studied.

Definition 1.3 Let $z$ be a $p \times 1$ dimensional vector of covariates. The conditional hazard is

\[ \lambda(t \mid z) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P\{t \le T < t + \Delta t \mid T \ge t, z\}. \]

[...]

(equation 1.12)

and

(equation 1.13)

An alternative procedure for finding maximum likelihood estimates is via the method of scoring, which uses the matrix of expected second partial derivatives called Fisher's information matrix. The $p \times p$ matrix with $(u, v)$th element

\[ I_{uv}(\beta) = E\left\{ -\frac{\partial^2}{\partial \beta_u \, \partial \beta_v} l(\beta) \right\} \qquad \text{(equation 1.14)} \]

is the Fisher information matrix. Analogous to the observed information matrix $I_o(\beta)$, we denote the Fisher information matrix by $I_e(\beta)$.

The estimate $\hat{\beta}$ does not depend on whether the observed information matrix $I_o(\beta)$ or the Fisher information $I_e(\beta)$ is used. For exponential models the two forms are identical (Thomas 1981). However, for other models this is not the case. Convergence to the maximum likelihood estimates may be somewhat slower when using the observed information matrix (Thomas 1981). An estimate of variance of $\hat{\beta}$ is given by $[I_o(\hat{\beta})]^{-1}$ or $[I_e(\hat{\beta})]^{-1}$.
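
The iteration described above can be sketched for a single parameter. The Python example below is an illustration under stated assumptions, not the procedure used in the thesis: it assumes an exponential relative risk $r(z;\beta) = \exp(z\beta)$, a likelihood in which each failure contributes $r$ for the failing subject divided by the sum of $r$ over its risk set, and entirely made-up risk sets. The function names are hypothetical. With an exponential relative risk the observed and Fisher information coincide, as noted in the text, so the same update serves for both.

```python
import math

def score_and_information(beta, risk_sets):
    """Score U(beta) and information I(beta) for a scalar beta.

    Each element of risk_sets is (z_case, z_at_risk), where z_at_risk lists
    the covariate of everyone at risk, including the failing subject.
    """
    score, info = 0.0, 0.0
    for z_case, z_at_risk in risk_sets:
        weights = [math.exp(z * beta) for z in z_at_risk]
        total = sum(weights)
        mean = sum(w * z for w, z in zip(weights, z_at_risk)) / total
        mean_sq = sum(w * z * z for w, z in zip(weights, z_at_risk)) / total
        score += z_case - mean          # observed minus weighted mean covariate
        info += mean_sq - mean ** 2     # weighted variance of the covariate
    return score, info

def newton_raphson(risk_sets, beta=0.0, tol=1e-10, max_iter=50):
    """Iterate beta <- beta + U(beta) / I(beta) until the step is negligible."""
    for _ in range(max_iter):
        score, info = score_and_information(beta, risk_sets)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    variance = 1.0 / score_and_information(beta, risk_sets)[1]
    return beta, variance

# Hypothetical data: a binary covariate and four failures
risk_sets = [
    (1, [1, 0, 1, 0, 0, 1]),
    (0, [0, 1, 0, 0, 1]),
    (1, [1, 0, 0, 1]),
    (1, [0, 0, 1]),
]
beta_hat, variance_hat = newton_raphson(risk_sets)
print(beta_hat, variance_hat)
```

The returned variance is the inverse of the information evaluated at the estimate, matching the variance estimate described in the text for the scalar case.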

[...] $t_1$), then the successive restricted risk sets under case-cohort sampling would be

\[ \tilde{R}_1 = \{9, 6, 5, 4\}, \quad \tilde{R}_2 = \{5, 4\} \quad \text{and} \quad \tilde{R}_3 = \{3, 2\}. \]

In constructing the restricted risk sets we note that subjects who appear as controls earlier on are re-used in all later risk sets until death or censoring. For example, the subject in figure 3 with label $i = 2$ appears in risk sets $\tilde{R}_1$, $\tilde{R}_2$ and $\tilde{R}_3$. Subjects who fail from outside the sub-cohort will be considered in only one risk set, as seen in $\tilde{R}_3$.
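
The construction of restricted risk sets can be sketched in Python. The example below is a minimal illustration with made-up subjects; it does not reproduce Figure 3. It assumes age as the time scale, that a subject is at risk at age $t$ when entry age < $t$ ≤ exit age, and that each restricted risk set consists of the failing subject together with the sub-cohort members at risk at that age. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Subject:
    label: int
    entry_age: float      # age at entry into the cohort
    exit_age: float       # age at failure or censoring
    is_case: bool         # True if the subject failed from the disease of interest
    in_subcohort: bool    # True if sampled into the sub-cohort

def restricted_risk_sets(subjects):
    """Return [(case label, failure age, labels at risk)] ordered by failure age."""
    cases = sorted((s for s in subjects if s.is_case), key=lambda s: s.exit_age)
    result = []
    for case in cases:
        t = case.exit_age
        # Sub-cohort members under observation at age t
        at_risk = {s.label for s in subjects
                   if s.in_subcohort and s.entry_age < t <= s.exit_age}
        # The failing subject always belongs to its own risk set
        at_risk.add(case.label)
        result.append((case.label, t, at_risk))
    return result

# Hypothetical cohort: subject 5 is a case inside the sub-cohort,
# subject 3 is a case outside it, subject 6 is never used.
subjects = [
    Subject(1, 30, 75, False, True),
    Subject(2, 25, 80, False, True),
    Subject(3, 35, 70, True, False),
    Subject(4, 40, 65, False, True),
    Subject(5, 20, 60, True, True),
    Subject(6, 30, 50, False, False),
]
risk_sets = restricted_risk_sets(subjects)
for label, age, members in risk_sets:
    print(label, age, sorted(members))
```

In this made-up cohort, sub-cohort members 1 and 2 are re-used in both risk sets, while subject 3, who fails from outside the sub-cohort, appears only in the risk set at its own failure.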

Prentice (1986) proposed a likelihood for case-cohort data assuming a proportional hazards model in which the contribution of the $j$th failure to the likelihood is the probability that subject $i_j$ fails at $t_j$, given the censoring and covariate information up to time $t_j$, and that one subject fails from the restricted risk set $\tilde{R}_j$. The likelihood for the complete case-cohort sample is thus

\[ L(\beta) = \prod_{j=1}^{d} \frac{r(z_{i_j}; \beta)}{\sum_{k \in \tilde{R}_j} r(z_k; \beta)}. \qquad \text{(equation 2.1)} \]

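Equation 2.1 can be evaluated directly once the restricted risk sets are known. The Python sketch below is an illustration under assumptions made here: an exponential relative risk $r(z;\beta) = \exp(z\beta)$, a scalar covariate fixed in time, and hypothetical labels, covariate values and risk sets. It returns the logarithm of the product in equation 2.1.

```python
import math

def log_pseudo_likelihood(beta, failures, covariate):
    """Log of equation 2.1 with r(z; beta) = exp(z * beta).

    failures  : list of (case label, set of labels in the restricted risk set)
    covariate : dict mapping each label to its scalar covariate value
    """
    total = 0.0
    for case_label, risk_set in failures:
        log_numerator = covariate[case_label] * beta
        denominator = sum(math.exp(covariate[k] * beta) for k in risk_set)
        total += log_numerator - math.log(denominator)
    return total

# Hypothetical covariates and restricted risk sets for two failures
covariate = {1: 0.0, 2: 1.0, 3: 1.0, 4: 0.0, 5: 1.0}
failures = [(5, {1, 2, 4, 5}), (3, {1, 2, 3})]

value_at_zero = log_pseudo_likelihood(0.0, failures, covariate)
value_at_half = log_pseudo_likelihood(0.5, failures, covariate)
print(value_at_zero, value_at_half)
```

At $\beta = 0$ every subject has relative risk one, so each failure contributes one over the size of its restricted risk set; this gives a simple check on the implementation.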
The selection of subjects into the sub-cohort in general affects many restricted risk sets. The restricted risk sets at distinct failure times are not independent given prior censoring, failure, and entry information. One possible disadvantage of the case-cohort approach occurs when there is substantial censoring; there may be no members of the sub-cohort to compare with later occurring cases (Prentice 1986, Wacholder et al 1991). This problem can be minimized by sampling one or more additional sub-cohorts. The case-cohort design is thus unlike the full cohort or nested case-control designs, where the risk sets are independent conditional on prior censoring, failure and entry information. Prentice (1986) referred to the likelihood function as a "pseudo-likelihood" rather than a partial likelihood. We denote the "pseudo-likelihood" function for $\beta$ by $L(\beta)$.

2.2 Stratified Sub-Cohort

It is useful to extend the corresponding "pseudo-likelihood" to a stratified sub-cohort. It may be derived similarly, from the partial likelihood for the stratified full cohort. Given a cohort $\Omega$ of size $n$ with strata sizes $n_1, n_2, \ldots, n_q$, let the sub-cohort $\tilde{\Omega}$ of size $\tilde{n}$ be a stratified random sample without replacement, with strata sizes $\tilde{n}_1, \tilde{n}_2, \ldots, \tilde{n}_q$, in $\Omega$, with $d_1, \ldots, d_q$ failures such that [...]