The Analysis of Case-Cohort Studies




Usha Seshadri

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Science

Department of Mathematics and Statistics

McGill University
Montreal, Quebec, Canada
© Usha Seshadri, March 1993







Introduction



Over the past decade there has been a vast amount of biostatistical literature on the analysis of survival time data in epidemiology. Generally such data involve a cohort of individuals who are followed up in time in order to investigate a relationship between failure rates (i.e. disease rates or death rates due to a specific cause of interest) and preceding covariate histories. In environmental or occupational cohorts, one is interested in quantifying the effects of exposure on disease incidence or death rates. In order to obtain meaningful results a cohort study usually involves the follow-up of several thousand subjects over a number of years. However, often only a small subset of these subjects will develop the disease of interest during the prescribed follow-up period. In the case of disease-free individuals, information gathered can be largely redundant. This process tends to be both expensive and time consuming since covariate histories need to be assessed for each individual in the cohort. In order to overcome these difficulties, designs which involve sampling the full cohort have been implemented. Such designs reduce the amount of covariate information necessary to carry out meaningful analyses.

Alternatives to the full cohort have been proposed under the heading of synthetic retrospective studies (Mantel 1973). Today this sampling scheme is known as the nested case-control study. In this approach, each subject developing disease (referred to as a case) is matched to one or more subjects without disease (referred to as controls) surviving to the same point in time. Covariate histories need only be assessed for cases and their matched controls. Recently, the case-cohort design has been suggested to be a preferable way of sampling the full cohort (Prentice 1986). This sampling includes all cases within the full cohort and a simple random sample or a stratified random sample of the full cohort, called the sub-cohort. Covariate information is gathered on all cases and all members of the sub-cohort. The idea of a sub-cohort was first investigated under the heading 'hybrid retrospective design' (Kupper, McMichael & Spirtas 1975) and later as the 'case-base' design (Miettinen 1982). These authors considered a binary failure rather than a time to failure, and a binary covariate.



The advantage in using the nested case-control and case-cohort designs over the full cohort is the reduced requirement for obtaining covariate histories. Both designs are considerably cheaper. There are, however, further advantages derived from using the case-cohort design: it enables the selection and data collection on the sub-cohort prior to identifying the cases in the full cohort, and the same sub-cohort can be used to study more than one disease. Furthermore it is possible to obtain approximate standardized mortality or incidence ratios using external hazard rates (Wacholder and Boivin 1987). The disadvantage, on the other hand, is that the analysis is more complex.

The purpose of this thesis is to discuss methods for analyzing data from the case-cohort design under the assumptions of a proportional hazards model. Chapter one will describe a proportional hazards model for the full cohort and give a heuristic derivation of a likelihood function together with a brief summary of estimation in this context. In chapter two the case-cohort design and the associated "pseudo-likelihood" function are described. Estimation of the parameters of interest will be presented. A computer intensive estimate of variance for the parameter estimates and a quicker but cruder estimate which is easier to compute will be described.





In chapter three we apply the methods described in chapter two to a specific example using data taken from an occupational case-cohort study. The data are used to compare the computer intensive and crude estimates of variance. The covariate of interest is studied as a categorical variable in an exponential model and as a continuous variable in an exponential and a linear model.

To illustrate a practical application of the case-cohort design we introduce the following example, which is discussed in detail in chapter three. We shall frequently be referring to this example throughout this exposition.



Example 1.0 A cohort of 16,297 male workers in the aluminium production industry were followed from the start of employment to December 31, 1989 to identify deaths from lung cancer. A total of 338 subjects were diagnosed as having died of lung cancer. A case-cohort design was implemented by selecting a random sample (sub-cohort) stratified by year of birth from the cohort. The sampling fraction within each stratum varied so as to reflect the overall distribution of cases in the full cohort. The sub-cohort consisted of 1,138 subjects of which 62 were cases. The objective of this study was to estimate the relationship between exposure to coal tar pitch volatiles, estimated as benzene soluble material, and lung cancer among aluminium production workers.
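A quick arithmetic check shows the scale of the saving in this example. The sketch below uses only the counts quoted in Example 1.0; the variable names are illustrative and not part of the study.

```python
# Counts quoted in Example 1.0.
cohort_size = 16_297      # male workers in the full cohort
total_cases = 338         # lung cancer deaths in the full cohort
subcohort_size = 1_138    # subjects sampled into the sub-cohort
subcohort_cases = 62      # cases that happen to fall inside the sub-cohort

# Covariate histories are needed for every case plus every sub-cohort
# member; cases inside the sub-cohort must not be counted twice.
cases_outside = total_cases - subcohort_cases
subjects_assessed = subcohort_size + cases_outside

fraction_assessed = subjects_assessed / cohort_size

print(subjects_assessed)            # 1414
print(round(fraction_assessed, 3))  # 0.087
```

Under the case-cohort design, then, covariate histories are required for roughly 9% of the cohort rather than for all 16,297 workers.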




Abstract



Epidemiologic cohort studies often require follow-up of several thousand subjects and the collection of exposure and confounder information for each of them. This process can be time consuming and expensive. A sampling scheme which reduces costs is the case-cohort design. It involves obtaining exposure and confounder information for all cases and a random sample of the remainder of the cohort. In this thesis we consider the analysis of data from a case-cohort sample under the assumptions of a proportional hazards model. Specifically we review and describe techniques for point and interval estimation of parameters and tests of hypotheses, first for a full cohort, then for a case-cohort sample. We apply the methods to a real example involving data from an occupational case-cohort study.




Résumé

Epidemiological cohort studies consist of following a population while obtaining information on the exposure factor and on confounding factors. This process is both lengthy and costly. A type of sampling that reduces both the cost and the duration is the case-cohort epidemiological study. The aim of such a study is to obtain information on the exposure factor and on the confounding factors for each case in the cohort and for a random sample of the cohort. In this thesis, we present methods for the analysis of a case-cohort study, assuming a Cox proportional hazards model. In particular, we describe techniques for point estimation and confidence intervals for the model parameters, as well as tests of hypotheses, for cohort and case-cohort studies. By way of example, we present the analysis of data from an occupational case-cohort study.



Acknowledgements

I would like to thank my thesis supervisor Prof. Ben Armstrong for inspiring me to pursue my interests in the case-cohort design. I thank him for all his time, patience and the advice he gave me long distance whilst working on this thesis. I wish to thank Prof. David Wolfson for stimulating my interest through a reading course on Survival Analysis and for his guidance. Lastly I thank my father Prof. V. Seshadri and my husband Michael Kreaden for their constant support and encouragement.




Table of Contents

Chapter 1 A Proportional Hazards Model for Cohort Studies
    Figure 1 Full Cohort
    1.1 The Model
    1.2 Censoring
    1.3 The Likelihood Function
    1.4 Types of Covariates
    1.5 Stratification
    1.6 Estimation
    1.7 Hypothesis Testing
    1.8 Confidence Regions

Chapter 2 Sampling From The Full Cohort
    2.0 Nested Case-Control
    Figure 2 Nested Case-Control
    2.1 Case-Cohort
    Figure 3 Case-Cohort
    2.2 Stratified Sub-Cohort
    2.3 Estimation
    2.4 Testing Hypotheses and Confidence Regions
    2.5 Computational Issues

Chapter 3 An Application
    3.0 Description of the Study
    3.1 Standard Analysis
    3.2 An Alternative Method For Case-Cohort Analysis
    3.3 An Empirical Investigation of the Adequacy of the Unadjusted Variance Estimator
    3.4 Small Group Size

Chapter 4 Conclusion

Bibliography

Chapter 1: A Proportional Hazards Model for Cohort Studies



The classical description of a cohort is a group of people who share some common characteristic and are followed through time for the detection of new cases of disease. We refer back to example 1.0, in which a group of people who were employed in an aluminium production plant were followed up to find whether and when each had died of lung cancer. We are generally, as in this example, dealing with a large number of subjects and relatively few cases.

In epidemiological studies, the usual time scale used is age (Breslow et al 1983). This distinguishes the epidemiological context from the simpler statistical one in which the primary measure of time in a survival analysis study begins at zero for all individuals (i.e. time is measured as time from entry into the cohort). The age of an individual's entry into the cohort need not necessarily be the same for each subject.

Figure (1) illustrates a hypothetical cohort study in which each horizontal line represents a study subject. Nine subjects are followed up until age eighty and three of these subjects are ascertained as dying from lung cancer. The remaining six subjects died from other causes, or were alive but lost to follow-up.

Figure 1: Full Cohort (nine subjects, labelled 1 to 9, each drawn as a horizontal line against age; a line ends either in death from lung cancer, or in death from other causes or loss to follow-up)

Associated with each subject (i.e. cases and controls) is a label $i$, where $i = 1, \ldots, n$. The $d$ cases are denoted by the labels $l_i$, $i = 1, \ldots, d$, such that in the figure above the cases are represented by the set of labels $\{l_1, l_2, l_3\}$; thus $l_1 = 9$, $l_2 = 5$, $l_3 = 3$.

1.1 The Model

The cohort is followed in time until the occurrence of an event, for example, death due to lung cancer. The outcome of interest is not only the binary occurrence (yes, no) of the event, but also the time $T$ to (age of) that occurrence. The time to occurrence may be thought of as a survival time. We shall assume that the survival times are continuous so as to avoid the complexity of tied values.

Definition 1.1 Let $T$ be a continuous random variable such that

$$F(t) = P\{T < t\},$$

where $F(t)$ is the distribution function of $T$. The survivor function corresponding to time $T$ is $S(t)$, such that

$$S(t) = 1 - F(t) = P\{T > t\}.$$

Definition 1.2 The hazard function $\lambda(t)$ at time $t$ is the conditional probability of failure (i.e. death or disease incidence) per unit time in a vanishingly small interval $t$ to $t + \Delta t$, given survival to time $t$:

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P\{t \le T < t + \Delta t \mid T \ge t\}.$$

The hazard function is sometimes referred to as the age specific failure rate and serves as a way of quantifying the population disease frequency during a specified time period. The hazard and survivor functions share the relationship

$$\lambda(t) = \frac{f(t)}{S(t)},$$

where $f(t)$ is the density of the survival time $T$.

Of usual interest is the effect of a covariate on the survival time. The hazard function is the means by which this relationship is studied.
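The relationship between the limit in Definition 1.2 and the ratio $f(t)/S(t)$ can be checked numerically. The sketch below assumes a Weibull survival time purely for illustration; the distribution, the parameter values and the function names are hypothetical and are not taken from the thesis.

```python
import math

def weibull_survivor(t, shape, scale):
    """S(t) = P{T > t} for a Weibull survival time."""
    return math.exp(-((t / scale) ** shape))

def weibull_density(t, shape, scale):
    """f(t), the derivative of F(t) = 1 - S(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survivor(t, shape, scale)

def hazard_from_limit(t, shape, scale, dt=1e-6):
    """Approximate Definition 1.2 directly: P{t <= T < t + dt | T >= t} / dt."""
    s_t = weibull_survivor(t, shape, scale)
    s_next = weibull_survivor(t + dt, shape, scale)
    return (s_t - s_next) / (s_t * dt)

# Hypothetical values, chosen only for illustration.
shape, scale, age = 2.0, 70.0, 60.0

ratio = weibull_density(age, shape, scale) / weibull_survivor(age, shape, scale)
limit = hazard_from_limit(age, shape, scale)
closed_form = (shape / scale) * (age / scale) ** (shape - 1)  # known Weibull hazard

print(ratio, limit, closed_form)
```

The three printed values agree to several decimal places, which is the content of the identity: the conditional probability of failing in a short interval, divided by its length, tends to $f(t)/S(t)$.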




Definition 1.3 Let $z$ be a $p \times 1$ dimensional vector of covariates. The conditional hazard is

$$\lambda(t \mid z) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P\{t \le T < t + \Delta t \mid T \ge t, z\}.$$

(equation 1.12)

and

(equation 1.13)

An alternative procedure for finding maximum likelihood estimates is via the method of scoring, which uses the matrix of expected second partial derivatives called Fisher's information matrix. The $p \times p$ matrix with $(u, v)$th element

$$[I_e(\beta)]_{uv} = E\left\{ -\frac{\partial^2}{\partial \beta_u \, \partial \beta_v} \, l(\beta) \right\} \qquad \text{(equation 1.14)}$$

is the Fisher information matrix. Analogous to the observed information matrix $I_o(\beta)$, we denote the Fisher information matrix by $I_e(\beta)$.

The estimate $\hat{\beta}$ does not depend on whether the observed information matrix $I_o(\beta)$ or the Fisher information $I_e(\beta)$ is used. For exponential models the two forms are identical (Thomas 1981). However, for other models this is not the case. Convergence to the maximum likelihood estimates may be somewhat slower when using the observed information matrix (Thomas 1981). An estimate of the variance of $\hat{\beta}$ is given by

$$[I_o(\hat{\beta})]^{-1} \quad \text{or} \quad [I_e(\hat{\beta})]^{-1}.$$
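For orientation, the two iterative schemes can be written side by side, with $I_o$ and $I_e$ denoting the observed and Fisher information matrices. The symbol $U(\beta)$ for the score vector (the vector of first partial derivatives of the log-likelihood) and the iteration index $k$ are introduced here for illustration only and are assumptions about notation.

```latex
\begin{align*}
\text{Newton-Raphson (observed information):}\quad
  \beta^{(k+1)} &= \beta^{(k)} + \left[ I_o\!\left(\beta^{(k)}\right) \right]^{-1} U\!\left(\beta^{(k)}\right), \\
\text{Method of scoring (Fisher information):}\quad
  \beta^{(k+1)} &= \beta^{(k)} + \left[ I_e\!\left(\beta^{(k)}\right) \right]^{-1} U\!\left(\beta^{(k)}\right).
\end{align*}
```

Both iterations stop changing only where $U(\beta) = 0$, which is why the estimate itself is the same under either choice; the two schemes differ in the path taken to reach it and in the variance estimate produced at convergence.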

... then the successive restricted risk sets under case-cohort sampling would be

$$\tilde{R}_1 = \{9, 6, 5, 4\}, \quad \tilde{R}_2 = \{5, 4\} \quad \text{and} \quad \tilde{R}_3 = \{3, 2\}.$$

In constructing the restricted risk sets we note that subjects who appear as controls earlier on are re-used in all later risk sets until death or censoring. For example, the subject in figure 3 with label 2 appears in risk sets $\tilde{R}_1$, $\tilde{R}_2$ and $\tilde{R}_3$. Subjects who fail from outside the sub-cohort will be considered in only one risk set, as seen in $\tilde{R}_3$.
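The construction of restricted risk sets can be sketched in a few lines. The data below are hypothetical toy values (they are not the subjects of figure 3), and the convention that a subject is at risk at age t when entry < t <= exit is an assumption made for this illustration.

```python
def restricted_risk_sets(subjects, subcohort):
    """Return [(failure_age, case_label, risk_set), ...] ordered by failure age.

    subjects  : dict label -> (entry_age, exit_age, is_case)
    subcohort : set of labels sampled into the sub-cohort
    """
    # Every case in the full cohort contributes one failure time.
    cases = sorted(
        (exit_age, label)
        for label, (_, exit_age, is_case) in subjects.items()
        if is_case
    )
    result = []
    for failure_age, case_label in cases:
        # Sub-cohort members still under observation at the failure age.
        at_risk = {
            label
            for label in subcohort
            if subjects[label][0] < failure_age <= subjects[label][1]
        }
        # The failing subject always belongs to its own risk set,
        # even when it was not sampled into the sub-cohort.
        at_risk.add(case_label)
        result.append((failure_age, case_label, at_risk))
    return result

# Hypothetical toy data: label -> (entry age, exit age, died of the disease?)
subjects = {
    1: (20, 75, False),
    2: (30, 80, False),
    3: (25, 70, True),   # case from outside the sub-cohort
    4: (22, 65, True),   # case inside the sub-cohort
    5: (40, 68, False),
}
subcohort = {2, 4, 5}

for failure_age, case_label, risk_set in restricted_risk_sets(subjects, subcohort):
    print(failure_age, case_label, sorted(risk_set))
# 65 4 [2, 4, 5]
# 70 3 [2, 3]
```

In this toy cohort, subject 2 is re-used in both risk sets, subject 3 (a case outside the sub-cohort) appears only at its own failure time, and subject 1 never appears, so its covariate history is never needed.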

Prentice (1986) proposed a likelihood for case-cohort data assuming a proportional hazards model in which the contribution of the $i$th failure to the likelihood is the probability that subject $l_i$ fails at $t_i$, given the censoring and covariate information up to time $t_i$, and that one subject fails from the restricted risk set $\tilde{R}_i$. The likelihood for the complete case-cohort sample is thus

$$\tilde{L}(\beta) = \prod_{i=1}^{d} \frac{r(z_{l_i}; \beta)}{\sum_{k \in \tilde{R}_i} r(z_k; \beta)} \qquad \text{(equation 2.1)}$$
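As a minimal sketch of how equation 2.1 is evaluated, the function below computes its logarithm for a single scalar covariate. It assumes the exponential relative risk form r(z; beta) = exp(z * beta) and time-fixed covariates; both are simplifying assumptions for illustration, and the data and names are hypothetical.

```python
import math

def log_pseudo_likelihood(beta, failures, covariates):
    """Log of equation 2.1 with r(z; beta) = exp(z * beta), an assumed form.

    failures   : list of (case_label, restricted_risk_set) pairs, one per failure
    covariates : dict label -> scalar covariate z
    """
    total = 0.0
    for case_label, risk_set in failures:
        numerator = math.exp(covariates[case_label] * beta)
        denominator = sum(math.exp(covariates[k] * beta) for k in risk_set)
        total += math.log(numerator / denominator)
    return total

# Hypothetical toy data: two failures with their restricted risk sets.
failures = [(4, {2, 4, 5}), (3, {2, 3})]
covariates = {2: 0.0, 3: 1.0, 4: 1.0, 5: 0.0}   # 1 = exposed, 0 = unexposed

at_zero = log_pseudo_likelihood(0.0, failures, covariates)
at_one = log_pseudo_likelihood(1.0, failures, covariates)
print(at_zero, at_one)
```

At beta = 0 every member of a risk set is equally likely to be the one who fails, so the value is log(1/3) + log(1/2) = -log 6. Because both toy cases are exposed and all other risk-set members are not, the value increases as beta increases, which is why `at_one` exceeds `at_zero`.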


The selection of subjects into the sub-cohort in general affects many restricted risk sets. The restricted risk sets at distinct failure times are not independent given prior censoring, failure, and entry information. One possible disadvantage of the case-cohort approach occurs when there is substantial censoring; there may be no members of the sub-cohort to compare with later occurring cases (Prentice 1986, Wacholder et al 1991). This problem can be minimized by sampling one or more additional sub-cohorts. The case-cohort design is thus unlike the full cohort or nested case-control designs, where the risk sets are independent conditional on prior censoring, failure and entry information.

Prentice (1986) referred to the likelihood function as a "pseudo-likelihood" rather than a partial likelihood. We denote the "pseudo-likelihood" function for $\beta$ by $\tilde{L}(\beta)$.

2.2 Stratified Sub-Cohort

It is useful to extend the corresponding "pseudo-likelihood" to a stratified sub-cohort. It may be derived similarly, from the partial likelihood for the stratified full cohort. Given a cohort of size $n$ with strata sizes $n_1, n_2, \ldots, n_Q$, let the sub-cohort of size $\tilde{n}$ be a stratified random sample without replacement with strata sizes $\tilde{n}_1, \tilde{n}_2, \ldots, \tilde{n}_Q$ and with $d_1, \ldots, d_Q$ failures such that

-

q

=

Il.