Smoking, Genes, and Health: Evidence from the Health and Retirement Study

Smoking, Genes, and Health: Evidence from the Health and Retirement Study∗

Daniel Benjamin, Andrew Caplin†, David Cesarini, Kevin Thom, and Patrick Turley

September 20, 2015

PRELIMINARY

Abstract

Genetic understanding of smoking is advancing rapidly. Three recent studies discovered variants in nicotinic receptor genes that impact measured smoking behavior. We document associations between these variants and multiple smoking and health outcomes available in the Health and Retirement Study (HRS). Although these variants are associated with relatively modest differences in measures of past smoking intensity, we find substantial effects on later-life health and mortality outcomes. To understand this set of reduced-form patterns, we develop and estimate a dynamic model of smoking, health, and mortality that explicitly incorporates genetic heterogeneity. Structural estimates will allow us to understand the mechanisms by which these genes operate (preferences vs. addiction dynamics), shedding light on how policies differentially affect individuals by genotype. The estimated model will permit counterfactual simulations assessing the consequences of genetic testing interventions that provide individuals with more information about their predisposition for addiction.

∗ We thank Laura Beirut and Li-Shiun Chen for their invaluable contributions. We are grateful for helpful comments and feedback from Jeff Smith, Chris Taber, and all of the participants at the 2015 IRP Summer Research Workshop at the University of Wisconsin.

† Center for Experimental Social Science and Department of Economics, New York University and National Bureau of Economic Research.

1 Introduction

Smoking behaviors are a central focus of biological and epidemiological research due to their profound health consequences. As a result, our understanding of the neurophysiological effects of smoking has grown rapidly. Specific nicotinic and dopaminergic channels have been identified that may contribute to the addictive nature of smoking behaviors. The importance of these channels has been confirmed by three recent studies that discovered genetic variants (single nucleotide polymorphisms, or SNPs) in nicotinic receptor genes that impact measured smoking behavior.¹

While the genetics literature has produced a set of credible gene-smoking associations, little is known about how the biological processes affected by these genes map into the behavioral mechanisms that determine an individual’s incentives to smoke, increase or decrease smoking intensity, and eventually quit. Following the seminal work of Becker and Murphy (1988) on the consumption of addictive goods, the economic literature on smoking has produced a set of dynamic models that clearly delineate such mechanisms and richly capture the life-cycle of smoking, cessation, and later life health outcomes (e.g. Arcidiacono et al. 2007, Darden 2014). Two of the most important behavioral mechanisms in these models are baseline preferences for nicotine, and the addiction process that makes it difficult for individuals to reduce their level of cigarette consumption.

We develop a dynamic life-cycle model of smoking and health that explicitly incorporates genetic heterogeneity. The model allows genes to impact smoking through both preferences and the addictive process (specifically the cost of reduction relative to past consumption). While individuals are assumed to be fully aware of their preferences for nicotine, they are uncertain about their future reduction costs. Identifying the mechanisms through which these genes operate is important for understanding how the effects of policies might systematically differ across individuals of different genotypes.
Furthermore, the model allows us to simulate the effects of a unique counterfactual policy. If knowledge about one’s own genotype reduces uncertainty about the future cost of quitting, then early genetic screening could emerge as a policy tool for altering smoking behavior. The estimated model will allow us to evaluate the consequences of such an intervention.

¹ The three studies were related and published in the same issue of Nature Genetics (The Tobacco and Genetics Consortium 2010, Liu, Tozzi, Waterworth, Pillai, Muglia, Middleton, Berrettini, Knouff, Yuan, Waeber, et al. 2010, Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller, Sulem, Rafnar, Esko, Walter, et al. 2010).

We estimate the model using a recently genotyped subsample of the Health and Retirement Study (HRS). We focus on four SNPs identified by the existing genetic literature. The panel structure of the HRS enables us to evaluate how these SNPs impact smoking behaviors and smoking-related health outcomes. A key finding is that certain SNPs affect smoking-related diseases and mortality risk on a substantially larger scale than they affect measured smoking. Indeed, while smoking-related SNPs explain a large amount of variation relative to SNPs found for other behavioral traits (Rietveld, Conley, Eriksson, Esko, Medland, Vinkhuyzen, Yang, Boardman, Chabris, Dawes, et al. 2014), their joint explanatory power remains modest.² Yet the effects on health are of an altogether different scale. For example, two of our four SNPs are associated with 30% greater risk of chronic obstructive pulmonary disease (COPD). Amongst non-smokers, we find no evidence that the SNPs increase disease or mortality risk, suggesting that the SNPs are influencing health through their impact on smoking behavior rather than other causal channels.

Our paper illustrates interdisciplinary gains from trade. Panel data sets of the form that are used to fit structural models, such as the HRS, involve samples that are too small for de novo gene discovery. That is why our research builds upon successful discovery in the literature on genetic epidemiology. These discoveries have generally stood the test of time, being based on far larger samples, albeit at the expense of crude behavioral measurement (see section 2). We leverage these findings in the HRS and use them to identify links between health beliefs, smoking behavior, and health outcomes in a formal structural model. Inclusion of biological factors in structural economic models will become increasingly important as genetic discovery advances.
Our work suggests that behavioral understanding and genetically-informed policy making rest on further such interdisciplinary work.

The paper is structured as follows. Section 2 provides a brief review of recent findings on genes, smoking, and health, and details our procedure for selecting the SNPs we study. In section 3 we introduce the data and present our basic findings on how the SNPs impact smoking and health. In section 4 we introduce the structural model.

² Small effect sizes have caused some researchers to argue that indexes of SNPs will be necessary for “genoeconomic” research to be of practical import.

2 Background to Study

In this section, we detail our procedure for selecting a set of “smoking-associated SNPs” on which the ensuing analyses rest. We first provide the briefest of primers on DNA, the biology of smoking, and genetic association studies of smoking behavior. We then describe the precise method we used to identify the SNPs for which the evidence of a relationship to smoking behavior is particularly strong. We also summarize what is known about the biological function of these smoking-associated SNPs.

2.1 DNA

Human DNA is composed of a sequence of about 3 billion pairs of nucleotide molecules (spread across 23 chromosomes), each of which can be indexed by its location in the sequence. The sequence is comprised of about 25,000 subsequences called “genes,” which code for proteins that have specific functions in the human body, and regions in between genes, which help to regulate when certain genes are transcribed into proteins. At the overwhelming majority of DNA locations, there is virtually no variation in the nucleotides across individuals. The segments of DNA where individuals do differ are called “polymorphisms.” The most common polymorphisms are called single-nucleotide polymorphisms (SNPs). SNPs are locations in the DNA sequence where individuals differ from one another in terms of a single nucleotide. At the vast majority of SNP locations, there are only two possible nucleotides that occur. Each type of nucleotide is referred to as an allele, and an individual inherits one allele from each biological parent. A person’s genotype for a particular SNP is then defined by designating one of these two alleles as the “reference allele” and counting the number of reference alleles (0,1 or 2) the person is endowed with. A typical gene contains hundreds or thousands of SNPs, and there are also many SNPs in the intergenic regions. Because entire segments of DNA are transmitted from parent to child, SNPs (more precisely, SNP genotypes) tend to be highly correlated with other SNPs in the same region of the genome. Such correlated SNPs are said to be in “linkage disequilibrium.”
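
As a small illustration of the genotype coding described above, the following Python sketch counts reference alleles for a single SNP. The allele letters, the choice of "G" as the reference allele, and the function name `genotype` are hypothetical and chosen only for this example; they do not correspond to any SNP in the paper.

```python
def genotype(allele_pair, reference_allele):
    """Count copies (0, 1, or 2) of the reference allele among the two inherited alleles."""
    if len(allele_pair) != 2:
        raise ValueError("expected exactly two alleles, one from each biological parent")
    return sum(1 for allele in allele_pair if allele == reference_allele)


# Hypothetical SNP with possible alleles "A" and "G"; "G" is designated the reference allele.
people = {
    "person_1": ("A", "A"),
    "person_2": ("A", "G"),
    "person_3": ("G", "G"),
}
genotypes = {name: genotype(pair, "G") for name, pair in people.items()}
print(genotypes)
```

Designating the other allele as the reference simply maps each genotype g to 2 - g, which flips the sign of any regression coefficient on the genotype without changing its magnitude.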


2.2 Smoking Biology

Cigarette smoke contains thousands of chemically distinct particulates, one of which is nicotine. When nicotine is inhaled through smoking, it is absorbed into the blood stream via the lungs and is delivered to the brain within a few seconds (Benowitz 1990). Nicotine’s addictive properties come primarily from its molecular similarities to acetylcholine, which cause it to bind with the body’s nicotinic acetylcholine receptors (NAChRs). Acetylcholine is an important neurotransmitter that plays a role in a wide variety of biological processes, including muscle contraction, sweating, REM sleep, memory, attention, arousal, and reward. Normally, when acetylcholine binds with the NAChRs, it opens an ion channel, causing sodium and potassium ions to pass through, leading to a net flow of positive ions into the cell. This triggers the cell to release other neurotransmitters related to the processes listed above.

Nicotine is believed to cause addiction mainly by triggering the dopaminergic neurons located in the ventral tegmental area (VTA) of the brain, which release dopamine into the nucleus accumbens core (NAC) (Glimcher 2011). The presence of dopamine in the NAC is associated with cognition, motivation, and positive reward prediction errors. Reinforcing this effect, nicotine also causes the release of other neurotransmitters (e.g. glutamate, serotonin, and norepinephrine), which may increase the responsiveness of the NAC to dopamine and also may contribute to an independent addictive effect (Centers for Disease Control and Prevention 2010, p. 136). Counterbalancing the effect of dopamine, nicotine also triggers the GABA system and the medial habenula, which play an inhibitory role, but it appears that the response of these systems to nicotine decays more quickly than does the response of the dopaminergic system (Fowler 2011).

2.3 Molecular Genetics and Smoking

There is a large body of work that infers the heritability of a behavioral trait (the fraction of variance accounted for by genetic factors taken as a whole) by studying twins or adoptees. Smoking is one of several traits that have been studied in this literature (Gilbert 2011, Li et al. 2003). Though there is now strong evidence that genetic factors taken altogether influence various aspects of smoking behavior, researchers are only now beginning to reliably identify specific genetic variants that underlie the heritability of smoking behavior.

Most early molecular genetic studies were candidate gene studies, which studied variation in genes in biological systems known to play an important role in nicotine addiction. For a comprehensive review of these early candidate gene studies, which were conducted beginning in the mid-1990s, see the Surgeon General’s Report (Centers for Disease Control and Prevention 2010, chap. 4). The early studies focused almost exclusively on nicotine-metabolizing genes (primarily CYP2A6), nicotinic receptor genes (such as CHRNA4 and CHRNB2), and a handful of other genes, most prominently the dopamine receptor D2 gene DRD2 and the serotonin-transporter-linked region 5HTT. The replication record of these early studies turned out to be disappointing, and the estimates of the effect sizes were often highly heterogeneous across studies. An influential review (Munafo, Clark, Johnstone, Murphy, and Walton 2004, p. 583) concluded that the “evidence for a contribution of specific genes to smoking behavior remains modest.” Today, it is understood that a major contributing factor to the inconsistent replication record of early candidate genes for smoking (and to an even greater extent, for other behavioral traits) is that the studies relied on sample sizes far too small to ensure adequate power (Rietveld, Conley, Eriksson, Esko, Medland, Vinkhuyzen, Yang, Boardman, Chabris, Dawes, et al. 2014).

Beginning around 2005, medical-genetics research began to undergo a paradigm shift, moving away from candidate-gene studies to what are called genome-wide-association (GWA) studies. In these studies, researchers regress the outcome of interest on each of the (typically millions of) measured SNPs, one at a time, and test each for association. It was only recently that these studies became feasible, as genotyping technologies with dense coverage of common SNPs across the entire genome became available at modest costs.
Because of the large number of hypotheses tested in a GWAS, a SNP association is considered established only if it reaches the “genome-wide significance” threshold of $p < 5 \times 10^{-8}$. Adequate statistical power at this stringent significance threshold requires very large samples. Since individual samples are generally too small, many GWA studies are conducted within research consortia that meta-analyze results from multiple samples. Empirically, it is now well established that results from such GWA studies replicate very consistently (Visscher 2012). There are several reasons for the robustness of GWAS findings (see Rietveld et al. 2013 for a discussion). One important reason is that, even if such a study has only modest statistical power to detect an association at the genome-wide significance level, it follows from Bayes’ rule that conditional on finding such an association, it is likely to be true (see Benjamin 2012 for heuristic calculations).
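
The Bayes' rule argument above can be made concrete with a short Python sketch. The prior probability (1 in 100,000) and the power (10%) used below are illustrative assumptions, not figures from this paper or from Benjamin (2012); only the threshold of 5e-8 comes from the text.

```python
def posterior_true(prior, power, alpha):
    """P(association is real | test is significant), computed by Bayes' rule.

    prior: probability that a randomly chosen SNP is truly associated
    power: probability of a significant result when the association is real
    alpha: probability of a significant result when there is no association
    """
    true_positive = power * prior
    false_positive = alpha * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Illustrative (assumed) inputs: prior of 1 in 100,000 and only 10% power.
prior, power = 1e-5, 0.10
genome_wide = posterior_true(prior, power, alpha=5e-8)
conventional = posterior_true(prior, power, alpha=0.05)
print(f"genome-wide threshold: {genome_wide:.3f}")
print(f"conventional 5% threshold: {conventional:.6f}")
```

Under these assumed inputs, a hit at the genome-wide threshold is real with probability of about 0.95 even though power is low, whereas the same hit at the conventional 5% threshold would almost certainly be a false positive.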

2.4 SNPs Selected For This Paper

A landmark event in the study of the genetics of smoking was the publication of three GWA studies in the May 2010 issue of Nature Genetics (The Tobacco and Genetics Consortium 2010, Liu, Tozzi, Waterworth, Pillai, Muglia, Middleton, Berrettini, Knouff, Yuan, Waeber, et al. 2010, Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller, Sulem, Rafnar, Esko, Walter, et al. 2010). The three papers represented the culmination of the work of three separate consortia: the Tobacco and Genetics (TAG) Consortium, the European Network of Genetic and Genomic Epidemiology (ENGAGE) Consortium and the Oxford-GlaxoSmithKline (Ox-GSK) Consortium. The studies examined a range of smoking outcome variables, including age at initiation, cigarettes smoked per day while smoking, and whether the smoker had succeeded in quitting (cessation). Because the consortium studies pooled data from multiple sources, the definition of cigarettes smoked per day varied across the twelve cohorts that contributed to the meta-analysis: some cohorts used the maximum number of cigarettes per day, whereas others used an average constructed from panel data. We refer to the hybrid measure as CCPD.

Prior to publication, each consortium shared its results with the other two. Two of the published papers (The Tobacco and Genetics Consortium 2010, Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller, Sulem, Rafnar, Esko, Walter, et al. 2010) meta-analyzed the results for CCPD from all three consortia. Consequently, there is much overlap between the conclusions of the two papers, and we focus our discussion below on the findings from these overall meta-analyses. All analyses of CCPD were restricted to samples of individuals who smoked regularly at some point in their life (ever-smokers).
Because GWA studies attempt to find associations with SNPs scattered fairly evenly across the entire genome, it was not obvious a priori that the analyses would identify SNPs in or near genes implicated in the biological systems already understood to be relevant. However, the majority of SNPs that reached the genome-wide significance threshold were in fact in or near genes in biological systems that were known to play an important role in nicotine addiction.

To select the SNPs we study in our analysis, we proceeded in two steps. First, we sought to determine the total number of independent genetic signals identified in the two studies, by using the software SNAP (Johnson, Handsaker, Pulit, Nizzari, O’Donnell, and de Bakker 2008) to compute the linkage disequilibrium between all pairs of genome-wide significant SNPs reported in the meta-analyses. In the genetic literature, the standard measure of linkage disequilibrium between two SNPs is the $R^2$ obtained from the regression of one SNP genotype on the other SNP genotype. Following the literature, we assume that any pair of SNPs whose linkage disequilibrium exceeds 0.4 reflects a single genetic signal. This criterion leaves us with five genomic regions with at least one SNP that reached genome-wide significance in at least one of the studies: the nicotinic receptor cluster on chromosome 15 (CHRNA3/CHRNA5/CHRNB4), the cluster on chromosome 8 (CHRNB3/CHRNA6), two distinct regions near the nicotine-metabolizing gene CYP2A6, and the chromosome 10 region. One of the regions near CYP2A6 contains SNPs that are not in close linkage disequilibrium with any genetic variables available in the HRS data (the best proxy only explains 22.6% of the variation in rs4105144).

From each of the remaining four regions, we selected the SNP with the lowest p-value, with one exception. The exception is the CHRNA3/CHRNA5/CHRNB4 cluster on chromosome 15, for which we retained the SNP with the second lowest p-value, rs16969968 ($p < 6 \times 10^{-72}$, compared to $p < 5 \times 10^{-73}$ for rs1051730). We focus on this SNP, known colloquially among researchers as “Mr. Big,” because there is reason to believe it is the biologically relevant “causal” variant (with other, nearby SNPs reaching genome-wide significance due to their correlation with it). In particular, it is known to cause an amino acid change in the alpha-5 subunit of the nicotinic receptors, and experiments have found that this change alters the responsiveness of the nicotinic receptors to nicotine (Wang 2009, Falvella 2009). Studies have also found that the SNP influences the expression of the CHRNA5 gene in brain and lung tissue (Wang 2009, Falvella 2009). In practice, the linkage disequilibrium between Mr. Big and rs1051730 is nearly perfect, so all of our results are substantively identical if rs1051730 is used instead.

This process leaves us with four SNPs:

• rs16969968 (in the gene CHRNA5 in the CHRNA3/CHRNA5/CHRNB4 cluster)
• rs13280604 (in the gene CHRNB3 in the CHRNB3/CHRNA6 cluster)
• rs7937 (near the nicotine-metabolizing gene CYP2A6)
• rs1329650 (in the chromosome 10 region with unknown functional significance)

In what follows, we refer to these as our smoking-associated SNPs. Figure XXX shows graphical illustrations, one for each SNP (pictured with a large orange diamond), of the genes located in the proximity of the SNPs.
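
The first selection step, grouping genome-wide significant SNPs into independent signals by their pairwise linkage disequilibrium, can be sketched in Python as follows. This is not the SNAP software used above: the greedy pruning rule, the function names, and the toy genotype vectors and p-values are all assumptions made for illustration. Only the 0.4 threshold and the definition of linkage disequilibrium as the regression $R^2$ come from the text.

```python
def r_squared(x, y):
    """Squared Pearson correlation, equal to the R^2 from regressing one genotype on the other."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length genotype vectors")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("linkage disequilibrium is undefined for a SNP with no variation")
    return sxy * sxy / (sxx * syy)


def independent_signals(genotypes, p_values, threshold=0.4):
    """Greedy pruning (an assumed rule): visit SNPs from lowest to highest p-value and
    keep a SNP only if its r^2 with every SNP already kept is at or below the threshold."""
    kept = []
    for snp in sorted(genotypes, key=lambda name: p_values[name]):
        if all(r_squared(genotypes[snp], genotypes[k]) <= threshold for k in kept):
            kept.append(snp)
    return kept


# Toy data with hypothetical SNP names: snp_b closely tracks snp_a, snp_c is unrelated.
genotypes = {
    "snp_a": [0, 0, 1, 1, 2, 2],
    "snp_b": [0, 0, 1, 2, 2, 2],
    "snp_c": [0, 2, 1, 1, 0, 2],
}
p_values = {"snp_a": 1e-20, "snp_b": 1e-18, "snp_c": 1e-10}
signals = independent_signals(genotypes, p_values)
print(signals)
```

A pruning rule of this kind always keeps the lowest-p-value SNP in each group. The paper departs from that rule once, retaining rs16969968 rather than rs1051730 on biological grounds.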

3 Data and Reduced Form Findings

3.1 HRS and Genomic Data

The data for our analysis come from the Health and Retirement Study (HRS), which is a nationally representative longitudinal survey of Americans over 50 years of age and their spouses. The initial HRS sample was collected in 1992 and included individuals born between 1931 and 1941. The survey is administered every two years with only minor adjustments from wave to wave. More cohorts have been added over time, making the current HRS sample representative of individuals born between 1890 and 1954 who survived until the sample period. From 2006 to 2008, 12,507 HRS respondents were genotyped from saliva samples. To avoid detecting spurious genetic associations due to genotyping errors, it is important to analyze data that have undergone quality control filtering (see Beauchamp, Cesarini, Johannesson, van der Loos, Koellinger, Groenen, Fowler, Rosenquist, Thurik, and Christakis (2011) for discussion). We work with the public-release version of the genotypic data which has been quality controlled by researchers at the University of Washington (XXX 2012). We further restrict our sample to Caucasians, since the genetic associations that motivate our study are largely found in all-Caucasian samples. Our final genotyped sample consists of 68,288 person-year observations on 8,122 unique individuals. The Appendix offers a complete discussion of the criteria used to select this sample. Table 1 presents some basic cross-sectional characteristics of the individuals in our sample, with the variables measured as of an individual’s most recent appearance in the panel. Following the consortium studies, we refer to individuals who smoked at some point in their life as “ever smokers,” and others as “never smokers.” As indicated in Table 1, about 57% of our sample report having ever smoked, and conditional on smoking, the average maximum number of cigarettes consumed per day is just over 25.


In the descriptive analyses that follow, we estimate regressions of the following form:

$$ y_i = \beta_0 + \sum_j \beta_j \, \mathrm{SNP}_{ji} + X_i \gamma + \epsilon_i \tag{1} $$

where $\mathrm{SNP}_{ji} \in \{0, 1, 2\}$ is the genotype of individual $i$ at SNP $j$ and $X_i$ is a vector of controls. In all analyses, we define the reference allele to be the allele that reduces the level of smoking, so when the dependent variable is smoking or a health outcome impaired by smoking, we expect negative coefficients. Controlling for potential confounds that may be correlated with genotype is critical in order to avoid spurious findings. In practice, the most common concern is confounding due to population stratification: different groups within the sample differ in allele frequencies and also differ in their outcome for non-genetic reasons.³ For this reason, it is common practice in genetic association studies to include as control variables the first 10 or 20 principal components of all the genotypes measured in the dense SNP chip. These principal components seem to pick up much of the subtle genetic structure within a population (Price 2006). Our analyses therefore control for the first ten principal components, provided by the HRS. We also include a dummy for Male gender, a full set of age dummies, and interactions between the Male and Age dummies. Reported standard errors are clustered at the person level in all specifications where the unit of observation is a person-year.
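
To make the structure of equation (1) concrete, the sketch below fits a regression of this form by ordinary least squares in plain Python. It is a minimal illustration under strong assumptions, not the estimation code behind the tables: the data are made up and noise-free, there are only two hypothetical SNPs and a single Male dummy, and the principal components, age dummies, interactions, and clustered standard errors described above are omitted.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Made-up coefficients: intercept, two SNP effects (negative, since the reference
# allele is defined as the one that reduces smoking), and a Male dummy.
true_beta = [20.0, -1.5, -0.5, 3.0]
# Each tuple is (genotype at SNP 1, genotype at SNP 2, male) for one hypothetical person.
sample = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (0, 2, 1),
          (1, 1, 0), (2, 2, 1), (0, 1, 1), (1, 2, 0)]
rows = [[1.0, s1, s2, male] for s1, s2, male in sample]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in rows]

beta_hat = ols(rows, y)
print([round(b, 6) for b in beta_hat])
```

Because the toy outcome is generated without an error term, the fitted coefficients reproduce the assumed ones. With real data, each $\beta_j$ would be read as the change in the outcome per additional copy of the reference allele at SNP $j$.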

3.2 Genes, Smoking, and Health Outcomes

We begin by examining how each of the four smoking-associated SNPs is associated with smoking behavior and health outcomes in the HRS sample. To maximize comparability with the consortium studies, all analyses in this section are based on the sample of ever-smokers in the HRS.

³ A famous illustration of stratification is the “chopsticks effect” (Lander & Schork, 1994). Imagine a study that tries to identify genetic markers for chopstick use by comparing an Asian population (cases) to a Caucasian population (controls). Without controlling for population stratification, any markers which differ appreciably in frequency between the Caucasian and Asian populations will be found to be associated with chopstick use, but those associations are of course spurious. This example might seem to suggest that a simple fix would be to control for race or ethnicity. Indeed, it is standard practice to restrict a genetic association study to subjects of a common ethnic background, as we do here. It has been found, however, that allele frequencies can differ substantially even within ethnically homogeneous populations, such as different regions within Iceland (Price et al., 2009).

Table 2 presents estimates of the relationship between our SNPs of interest and CPD_MAX: the number of cigarettes at peak consumption as measured in the HRS. In Columns 1-4, we consider the influence of each SNP separately. The estimated effect of each SNP is negative, as expected, and the estimated effect sizes are never statistically distinguishable from the effects reported in the consortium studies. For example, an additional copy of the protective allele of Mr. Big reduces the number of cigarettes smoked at maximum consumption by 1.32 (s.e. = 0.39), an effect similar to what TAG reported. For two of our SNPs, Mr. Big and rs7937, the estimated effect is statistically distinguishable from zero at the 1% level. We find no statistically significant association between rs1329650 and CPD_MAX and a borderline significant association between rs13280604 and CPD_MAX. Column 5 shows that the coefficient estimates do not change appreciably in a model which includes all four SNPs.

Though informative, CPD_MAX is only one facet of life-cycle smoking. Reduction or cessation represents another critical feature of smoking behavior. Table 3 therefore shows the results from panel regressions that exploit the longitudinal nature of the HRS data. In Column 1, our outcome variable is an indicator for smoking a non-zero quantity. We find no statistically significant association between later-life smoking on the extensive margin and any of the smoking-associated SNPs. In Column 2, the dependent variable is an indicator for not smoking. In these regressions, we restrict the sample to person-wave observations in which the person smoked a positive quantity in the previous wave and hence could quit. There are hints that rs13280604 is associated with quitting behavior. Finally, columns 3 and 4 show results from a panel specification in which the dependent variable is defined as intensive-margin smoking in each of the survey waves (CPD_CONT).
We report one specification in which the sample is restricted to person-wave observations with positive values of CPD_CONT and one specification that includes zeros. An important message from Table 3 is that there are no strong relationships between any of our SNPs and contemporaneous smoking quantities.

A major advantage of the HRS for this analysis is the availability of life-cycle data on health outcomes and mortality. This allows us to directly estimate the relationship between specific SNPs related to cigarette consumption and major illnesses associated with smoking. We are particularly interested in non-cancerous lung disease, heart disease, and cancer, since these are the major conditions directly linked to smoking. Note that the cancer measure is non-specific: the HRS only asks if an individual has ever been diagnosed with any cancer, regardless of the type. HRS respondents are asked about their current health status in each of these three categories, along with a series of follow-up questions. For example, the first question about lung disease asks subjects if they have ever been told by a doctor that they have a lung condition such as “chronic bronchitis or emphysema.” In subsequent surveys, respondents are asked if their medical condition is improving or deteriorating, and information is also collected about any treatment received or medications prescribed.

Tables 4-6 report estimates of linear probability models explaining health outcomes as a function of the SNPs and the maximum number of cigarettes smoked per day. The dependent variables are indicators for the incidence of (non-cancerous) lung illness, heart illness, and cancer. These are cross-sectional regressions with samples restricted to include only the most recent person-year observation in the HRS. The samples for these regressions only include those individuals that have reported smoking at some point in their lives. To obtain a baseline association between cigarette consumption and lung health risks, Column 1 presents an estimate of the relationship between the maximum reported number of cigarettes smoked per day and the incidence of major non-cancerous lung illness. These estimates suggest a positive and significant relationship: a one-unit increase in CPD_MAX is associated with an increase in the probability of lung illness of about 0.4 percentage points. In Column 2, we regress the lung illness indicator against the SNPs. We find large, statistically significant coefficients on Mr. Big and rs1329650, but no significant relationship with our other two SNPs. The coefficients on Mr. Big and rs1329650 are consistent with the direction of their association with smoking behavior.
The coefficient estimate of 2.9 for Mr. Big implies that amongst smokers, those with 2 copies of the allele that increases smoking are 5.8 percentage points more likely than individuals with 0 copies to be diagnosed with lung disease. Since the baseline probability of lung illness is 18%, the implied effect on the risk of lung illness is roughly 30%.

In Tables 5-6, we conduct similar analyses for two other health indicators: the incidence of a major heart illness, and the diagnosis of cancer. Although the maximum number of cigarettes per day is associated with elevated risks for heart disease and cancer, we generally find small, statistically insignificant relationships between our SNPs of interest and these outcomes. For the cancer outcome, this is partially explained by the fact that we are unable to specifically isolate lung cancer. Since smoking is less strongly associated with other cancers, the lack of a strong relationship is not surprising.

Finally, we investigate the relationship between our SNPs and mortality in Table 7. Specifically, we estimate a linear probability model to explain death in the next year. We pool all person-year observations for ever-smokers from 2006 onwards. The year restriction is imposed because the individual had to survive until 2006 in order to be genotyped. As shown in Column 1, the coefficients are always in the predicted direction, with estimates suggesting that each copy of the reference allele reduces one-year mortality risk by 0.1 to 0.3 percentage points, but no single estimate is statistically distinguishable from zero. One concern about these estimates is that the sample could be selected because it is restricted to individuals who survived until 2006 (when genotyping began). Endogenous attrition due to mortality is unlikely to be a major source of bias in the younger HRS respondents. Reassuringly, when we split the HRS sample into two cohorts, we find stronger evidence that the protective SNPs reduce mortality in the younger cohorts. Specifically, we find that rs16969968, rs13280604 and rs7937 are all associated with reduced mortality risk, with point estimates suggesting that each reference allele reduces mortality risk by 0.3 to 0.5 percentage points.
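
The arithmetic behind the implied effect of Mr. Big on lung illness reported in this section can be checked with a few lines of Python. The inputs (a coefficient of 2.9 percentage points per allele, a comparison of 2 versus 0 copies, and an 18% baseline) are taken from the text; the function name is hypothetical.

```python
def implied_relative_effect(coef_pp_per_allele, copies, baseline_pct):
    """Return the percentage-point gap and that gap as a share of the baseline rate."""
    gap_pp = coef_pp_per_allele * copies
    return gap_pp, gap_pp / baseline_pct


# Mr. Big and lung illness: 2.9 pp per allele, 2 vs. 0 copies, 18% baseline incidence.
gap_pp, relative = implied_relative_effect(2.9, copies=2, baseline_pct=18.0)
print(f"{gap_pp:.1f} percentage points, {relative:.0%} of the baseline rate")
```

The ratio is 5.8 / 18, or about 0.32, which the text rounds to 30%.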

3.2.1 Interpreting the Reduced Form Evidence: Questions and Puzzles

One challenge in interpreting the gene-health associations is that our SNPs could work through channels other than smoking. For example, if Mr. Big affects both smoking and other biological processes related to lung health or mortality (e.g. fragility of lung tissue), it becomes difficult to credibly model the causal chain running through genes, smoking, and health. If our SNPs operate through non-smoking channels, we expect that the SNPs should be associated with health outcomes also amongst never-smokers. To test this hypothesis, we ran placebo tests in which we re-estimated our basic health and mortality specifications using the sample of genotyped never-smokers. As shown in Table 8, we find no statistically significant relationships between our SNPs and these outcomes among never-smokers. The fact that the SNPs are not predictive of health in never-smokers suggests that the gene-health associations documented in the previous section are driven primarily by differences in smoking behavior.

The collection of reduced form evidence presented here suggests a complicated set of relationships between individual SNPs, smoking behavior, and health. For example, we find strong effects of Mr. Big and rs7937 on CPD_MAX and mortality, but not on late-in-life quitting. Mr. Big is robustly associated with lung health, as is rs1329650, despite the fact it was not strongly related to cigarette quantity. And rs13280604 appears to be related to quitting, but not CPD_MAX. The magnitude of the association between rs13280604 and lung health is particularly noteworthy, as it is much stronger than would be predicted by naively multiplying the estimated relationship between rs13280604 and CPD_MAX by the estimated relationship between CPD_MAX and lung illness. Specifically, such a calculation suggests a relationship of roughly 0.5, less than one fifth of our point estimates. An analogous calculation for Mr. Big yields similar conclusions.

How can we reconcile the modest effects on CPD_MAX with the substantial effects that two of our SNPs appear to have on lung health? One possibility rests on the insufficiency of a simple metric like CPD_MAX as a measurement of life-cycle smoking behavior. SNPs that have large life-cycle differences may have only modest effects on CPD_MAX, as the latter is only a highly imperfect proxy of the total accumulated damage that an individual has sustained over their lifetime due to smoking (which is captured in the health variables). Since rs13280604 has no known relationship with the functioning of lung tissue, the association with health could emerge because different SNPs differently impact life-cycle smoking patterns. An individual’s health is a function of not only maximum smoking intensity, but also the total length of time spent smoking. The lung health association might better reflect the total life-cycle effect of rs16969968 on cumulative smoking behavior than the observed associations between rs16969968 and maximum cigarettes.
Indeed as shown in Figure 1, individuals who continuously smoke in the NLSY on average experience a substantial reduction in the quantity of cigarettes that they smoke per day over their life-cycle. It is possible that a SNP like rs16969968 not only affects peak quantity but also the evolution of quantity over time. The operation of dynamic behavioral channels can also potentially explain the set of associations observed for rs13280604. For example, it appears that rs13280604 affects the ease of cessation. It is possible that this association is independent of the behavioral channels that affect the maximum quantity consumed (e.g. one’s preference for nicotine). The results here highlight the promise of using GWAS results as a starting point for the further exploration of genetic relationships with behavior and health outcomes. Although the Consortium data were not sufficiently rich to investigate the health impacts of these SNPs, the results on rs16969968 and rs1329650 suggested natural hypotheses on health which could be tested in a 14

smaller but richer data set like the HRS. Rationalizing the collected associations between genes, smoking, and health requires the development of a unified dynamic model, a task to which we now turn.
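The "naive multiplication" discussed above is a simple product of two regression coefficients. As an illustration only, the sketch below applies it to the rs16969968 coefficients reported in Tables 2 and 4; the choice of SNP and of table columns is an assumption made for this example, not necessarily the exact calculation behind the figures quoted in the text.

```python
# Coefficients copied from Tables 2 and 4 (ever-smoker sample).
SNP_ON_CPDMAX = -1.320   # Table 2: rs16969968 on maximum cigarettes per day
CPDMAX_ON_LUNG = 0.004   # Table 4: maximum cigarettes per day on ever lung illness
SNP_ON_LUNG = -0.029     # Table 4: rs16969968 on ever lung illness


def implied_effect(snp_on_smoking: float, smoking_on_health: float) -> float:
    """Health effect implied if the SNP acted only through the smoking measure."""
    return snp_on_smoking * smoking_on_health


implied = implied_effect(SNP_ON_CPDMAX, CPDMAX_ON_LUNG)
share = implied / SNP_ON_LUNG

print(f"implied effect via CPDMAX: {implied:.4f}")
print(f"direct estimate:           {SNP_ON_LUNG:.4f}")
print(f"share of direct estimate:  {share:.2f}")
```

Under these inputs the implied effect is a small fraction of the directly estimated association, which is the gap that motivates the dynamic model.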

4 Model

Here we develop a dynamic structural model of life-cycle smoking behavior. A sizable existing literature uses the theory of rational addiction (Becker and Murphy 1988) to organize the empirical analysis of smoking. Chaloupka (1991) and Becker et al. (1994) present evidence in favor of the model's prediction that both past and future cigarette prices should affect current consumption (see Chaloupka (2000) for a survey). Chaloupka (1991) also finds indirect evidence that less educated and younger individuals are more myopic because their contemporaneous cigarette demand is less related to future consumption and prices. Gilleskie and Strumpf (2005) find evidence for state dependence in cigarette consumption, consistent with the notion of habit-formation present in the Becker-Murphy model.

A smaller, fully structural literature jointly models smoking decisions along with health and mortality processes. This approach allows for a rigorous quantification of how health risks (or beliefs about health risks) alter the incentives to smoke over the life-cycle. Arcidiacono et al. (2007) develop and estimate one of the first structural models of smoking, health, and mortality in a sample of mature adults from the Health and Retirement Study (HRS). They find evidence in favor of forward looking behavior and support for habit formation in the form of substantial quitting costs. Darden (2013) develops and estimates a structural model of smoking decisions and focuses on the role of individual (Bayesian) learning about the health risks of smoking. He finds evidence that smokers quit in response to the onset of chronic illnesses, but are less likely to respond to new information about individual health markers such as blood pressure and high-density lipoprotein. Our model builds on the basic framework presented in Arcidiacono et al. (2007) and Darden (2013).

4.1 Choice Set and Addiction Stocks

We model smoking as a discrete choice. Each period, individuals choose one of $J + 1$ levels of smoking, $\{c_0, c_1, \ldots, c_J\}$, where $c_0 = 0$ represents the non-smoking option and, more generally, $c_j$ represents the quantity of cigarettes consumed per day under option $j$. We allow smokers to choose one of four intensities: $\{0, 5, 20, 30\}$. Let $C_{it}$ represent individual $i$'s cigarette consumption in period $t$. We assume that smoking is associated with two kinds of persistent effects. First, current cigarette consumption fuels an addiction to nicotine that makes it difficult to reduce cigarette consumption in the future. The intensity of this addiction is captured by the addiction stock $S^a_{it}$. We assume that this stock evolves deterministically according to the following law of motion:

$$
S^a_{it+1} =
\begin{cases}
(1 - \delta_{a1}) S^a_{it} + \delta_{a1} C_{it}, & \text{if } C_{it} > S^a_{it}; \\
(1 - \delta_{a2}) S^a_{it} + \delta_{a2} C_{it}, & \text{if } C_{it} \le S^a_{it}.
\end{cases}
\tag{2}
$$

That is, the addiction stock in the next period is equal to a weighted average of the prior addiction stock, $S^a_{it}$, and the current level of smoking, $C_{it}$. The weight is allowed to differ depending on whether an individual is consuming more or less than their addiction stock. This flexibly allows for differences in addiction dynamics between build-up and reduction phases. In addition to fueling a behavioral habit, smoking may also have a persistent effect on an individual's health. We assume that such effects are related to a separate stock, $S^h_{it}$, which reflects the latent potential for past smoking to induce negative health events. We refer to this as the smoking health stock, and it evolves deterministically according to the following law of motion:

$$
S^h_{it+1} = (1 - \delta_h) S^h_{it} + \zeta_h C_{it}
\tag{3}
$$

Here $\delta_h$ represents the annual depreciation rate for the health stock, and $\zeta_h$ represents the rate at which cigarette consumption builds this stock.
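The two laws of motion in Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names and the parameter values in the example path are hypothetical and are not the paper's estimates.

```python
def next_addiction_stock(s_a, c, delta_a1, delta_a2):
    """Equation 2: weighted average of the old stock and current smoking.

    The weight is delta_a1 when consumption exceeds the stock (build-up)
    and delta_a2 otherwise (reduction).
    """
    delta = delta_a1 if c > s_a else delta_a2
    return (1.0 - delta) * s_a + delta * c


def next_health_stock(s_h, c, delta_h, zeta_h):
    """Equation 3: depreciate the health stock, then add current smoking."""
    return (1.0 - delta_h) * s_h + zeta_h * c


# Hypothetical path: five periods at 20 cigarettes per day, then five at zero.
s_a, s_h = 0.0, 0.0
history = []
for c in [20] * 5 + [0] * 5:
    s_a = next_addiction_stock(s_a, c, delta_a1=0.4, delta_a2=0.05)
    s_h = next_health_stock(s_h, c, delta_h=0.4, zeta_h=1.0)
    history.append((c, s_a, s_h))

for c, a, h in history:
    print(f"c={c:2d}  addiction stock={a:6.2f}  health stock={h:6.2f}")
```

With a small reduction-phase weight, the addiction stock falls slowly after quitting, which is how the asymmetric weights generate persistent reduction costs.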

4.2 The Health Process

We assume that individuals can enter into two kinds of bad health states: chronic, non-cancerous conditions related to the lungs, and all other conditions. Individuals can experience both bad health conditions simultaneously. Even though both events play important roles in influencing health behaviors and mortality, the medical literature suggests that smoking most directly affects the pulmonary system. Furthermore, we would like to explain the reduced-form patterns that we observe between our SNPs of interest and lung health, so we treat such illness as a separate category. Let $B^S_{it} \in \{0, 1\}$ indicate that individual $i$ experiences a bad health state related to the lungs in period $t$, and let $B^O_{it} \in \{0, 1\}$ indicate a bad health state related to other conditions. Since our data on lung illness indicate whether an individual has ever experienced a chronic lung condition, we model lung illness as an absorbing state, so $B^S_{it+1} = 1$ if $B^S_{it} = 1$. For an individual who has never been diagnosed with a chronic lung condition, we model the joint distribution of $B^S_{it}$ and $B^O_{it}$ through a bivariate probit specification. Let $b^{S*}_{it}$ and $b^{O*}_{it}$ be continuous indices reflecting an individual's propensity to fall into the various bad-health states. We assume that:

$$
b^{S*}_{it} = \beta^s_0 + \beta^s_{age} Age + \beta^s_{age2} Age^2/100 + \beta^s_h S^h_{it} + \beta^s_{ageh} Age\, S^h_{it} + \varepsilon^S_{it}
\tag{4}
$$

$$
b^{O*}_{it} = \beta^o_0 + \beta^o_{age} Age + \beta^o_{age2} Age^2/100 + \beta^o_h S^h_{it} + \beta^o_{ageh} Age\, S^h_{it} + \beta^o_{bo} B^O_{i,t-1} + \beta^o_{boa} Age\, B^O_{i,t-1} + \varepsilon^O_{it}
\tag{5}
$$

Here $(\varepsilon^S_{it}, \varepsilon^O_{it}) \sim N(0, \Sigma)$, $\Sigma$ is a variance-covariance matrix, and we allow $\sigma_{12} \neq 0$. Then $B^S_{it} = 1$ if $b^{S*}_{it} > 0$, and $B^S_{it} = 0$ otherwise. Similarly, $B^O_{it} = 1$ if $b^{O*}_{it} > 0$, and $B^O_{it} = 0$ otherwise. If $B^S_{it-1} = 1$, then the process determining $B^O_{it}$ collapses to the single-equation probit specified by Equation 5. At the beginning of a period, before $B^S_{it}$ and $B^O_{it}$ are determined, an individual dies with probability:

$$
\pi^D_{it} = \Phi(\beta^d X^d_{it})
\tag{6}
$$

Here $\Phi$ is the standard normal c.d.f., and $X^d_{it}$ is a vector of regressors that includes a quadratic in age, $B^S_{it-1}$, $B^O_{it-1}$, and $S^h_{it}$. The survival probability is given by $\pi^S_{it} = (1 - \pi^D_{it})$.
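One period of the health and mortality process in Equations 4-6 can be simulated as sketched below. This is a minimal sketch under stated assumptions: the two shocks are given unit variances (the usual probit normalization, which the text does not spell out), the linear indices are passed in precomputed, and all function names are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def death_prob(beta_d, x_d):
    """Equation 6: probit death probability for regressors x_d."""
    return norm_cdf(sum(b * x for b, x in zip(beta_d, x_d)))


def draw_health(index_s, index_o, rho, had_lung_illness, rng):
    """Draw (B_S, B_O) given the deterministic parts of Equations 4 and 5.

    rho is the correlation of the two shocks (sigma_12 under unit variances).
    Lung illness is absorbing: once diagnosed, B_S stays at 1.
    """
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    eps_s = z1
    eps_o = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # correlated with eps_s
    b_s = 1 if had_lung_illness else int(index_s + eps_s > 0.0)
    b_o = int(index_o + eps_o > 0.0)
    return b_s, b_o


rng = random.Random(0)
print(death_prob([-4.0, 0.3], [1.0, 1.0]))          # hypothetical coefficients
print(draw_health(-1.5, -0.5, 0.37, False, rng))
```

When lung illness has already occurred, only the other-conditions index matters, which mirrors the collapse to a single-equation probit described above.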

4.3 Period Utility

Here we describe the model of the behavior of ever-smokers, or individuals who have already decided to smoke for at least one period in their lives. Later we will describe how we model the initiation process. Let $Z_{it}$ refer to the set of state variables for individual $i$'s decision problem in period $t$, excluding transitory shocks to utility. This vector contains the addiction and smoking health stocks, as well as age, the current health states, and a reduction cost draw ($\varepsilon^{cost}_{it}$): $Z_{it} = \{S^a_{it}, S^h_{it}, B^O_{it}, B^S_{it}, \varepsilon^{cost}_{it}, t\}$. The period utility associated with choosing option $j$ from the choice set is given by $\tilde{u}_j(Z_{it}, \varepsilon_{jit}) = u_j(Z_{it}) + \varepsilon_{jit}$. That is, period utility for choice $j$ is the sum of a component that depends on the state variables, $u_j$, and a random shock, $\varepsilon_{jit}$, with the state-dependent component specified as:

$$
\begin{aligned}
u_j(Z_{it}) ={}& \left(\alpha_{0i} + \alpha_0^S 1\{B^S_{it} = 1\}\right) \ln(1 + c_j) + \ln\left(y_{it} - e(p_{it}, c_j)\right) \\
& - \exp\left(\alpha_{1i} + \varepsilon^{cost}_{it}\right) \left(S^a_{it} - C_{it}\right)^{1 + \alpha_2} 1\left(C_{it} < S^a_{it}\right) \\
& + \alpha_3 1\{B^S_{it} = 1\} + \alpha_4 1\{B^O_{it} = 1\}
\end{aligned}
\tag{7}
$$

Here the $\left(\alpha_{0i} + \alpha_0^S 1\{B^S_{it} = 1\}\right) \ln(1 + c_j)$ term represents the part of period utility that an individual receives from smoking at level $c_j$. Note that the marginal utility of cigarette consumption is influenced both by the term $\alpha_{0i}$, which is heterogeneous in the population, and by $\alpha_0^S 1\{B^S_{it} = 1\}$, which allows for the marginal utility of cigarette consumption to differ depending on the lung health of the individual.[4] Utility from all other consumption goods is reflected in the term $\ln(y_{it} - e(p_{it}, c_j))$. Income in period $t$ is given by $y_{it}$, and $e(p_{it}, c_j)$ represents expenditure on cigarettes, which depends on both the level of cigarette consumption, $c_j$, and $p_{it}$, the cigarette price that individual $i$ faces in period $t$. The term $-\exp\left(\alpha_{1i} + \varepsilon^{cost}_{it}\right)\left(S^a_{it} - C_{it}\right)^{1 + \alpha_2}$ captures the disutility that an individual receives when deciding to smoke less than the current level of their addiction stock. Notice that this is potentially nonlinear in the distance between current consumption and the addiction stock, so that larger reductions generate increasingly greater disutility. The parameter $\alpha_2$ governs the curvature of this disutility term. The reduction cost also depends on a stochastic component $\varepsilon^{cost}_{it}$. The $\varepsilon_{jit}$ terms are shocks to each choice that are assumed to be i.i.d. across individuals, choices, and time periods and are drawn from a Type I extreme value distribution. Finally, the parameters $\alpha_3$ and $\alpha_4$ indicate the flow-disutility associated with entering the bad health smoking state and the bad health other state, respectively.
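The state-dependent part of period utility in Equation 7 can be written as a single function. The sketch below is illustrative only: the expenditure function is assumed to be price times quantity (the text does not specify the form of $e(p, c)$), consumption under option $j$ is taken to equal $c_j$, and the function name and argument names are hypothetical.

```python
import math


def period_utility(c_j, s_a, b_s, b_o, income, price, eps_cost,
                   alpha0, alpha0_s, alpha1, alpha2, alpha3, alpha4):
    """Deterministic period utility u_j(Z_it) for smoking level c_j."""
    # Utility from smoking, shifted by lung illness through alpha0_s.
    smoking = (alpha0 + alpha0_s * b_s) * math.log(1.0 + c_j)
    # Utility from other consumption; assumes e(p, c) = p * c.
    other = math.log(income - price * c_j)
    # Reduction cost applies only when smoking below the addiction stock.
    reduction_cost = 0.0
    if c_j < s_a:
        reduction_cost = math.exp(alpha1 + eps_cost) * (s_a - c_j) ** (1.0 + alpha2)
    # Flow disutility of the two bad health states.
    health = alpha3 * b_s + alpha4 * b_o
    return smoking + other - reduction_cost + health


# Hypothetical inputs: cutting from a stock of 20 to 5 cigarettes per day.
params = dict(alpha0=0.07, alpha0_s=-0.11, alpha1=-1.9, alpha2=0.02,
              alpha3=-0.16, alpha4=-0.24)
print(period_utility(5, 20.0, 0, 0, income=100.0, price=0.2, eps_cost=0.0, **params))
print(period_utility(5, 5.0, 0, 0, income=100.0, price=0.2, eps_cost=0.0, **params))
```

Comparing the two calls shows the reduction cost at work: the same consumption level yields lower utility when the addiction stock is above it.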

4.4 Stochastic Reduction Costs and Learning

The stochastic component of reduction costs, $\varepsilon^{cost}_{it}$, is assumed to be drawn i.i.d. across time periods from a mean-zero normal distribution with variance $\sigma^2_{cost}$. This means that the term $\exp\left(\alpha_{1i} + \varepsilon^{cost}_{it}\right)$ is log-normally distributed with location parameter $\alpha_{1i}$. It is assumed that $\alpha_{1i}$ is heterogeneous in the population, and that individuals are uncertain about their own value. Specifically, we assume that there are two types in the population, $\{\underline{\alpha}_1, \overline{\alpha}_1\}$. Let $\pi^{low}$ refer to the probability that an individual is a low-cost type.

Let $\eta_{it} = \exp\left(\alpha_{1i} + \varepsilon^{cost}_{it}\right)$ refer to the cost parameter drawn by individual $i$ in period $t$. When an individual is informed about their own type, they correctly believe that $\eta_{it}$ is log-normally distributed with the true location parameter for their type. However, when individuals are uninformed, they believe that $\eta_{it}$ is drawn from a mixture of the two log-normal distributions in the population. That is, individuals believe that each period they will receive a draw from the low-cost distribution with probability $\pi^{low,b}$, and a draw from the high-cost distribution with probability $(1 - \pi^{low,b})$. We allow $\pi^{low,b}$ to be a belief parameter that is not necessarily equal to the true mixing probability $\pi^{low}$.

We assume that individuals are initially unaware of their exact cost type, and only come to learn the true value of $\alpha_{1i}$ through experiential learning. Let $Info^{cost}_{it}$ be a dummy variable that indicates whether or not individual $i$ is informed of their own type in period $t$. We assume that individuals start out life uninformed ($Info^{cost}_{i0} = 0$). However, after every period in which an individual smokes, there is some probability $\pi^{learn}$ that the individual learns their true type. Let $\Omega_{it}$ refer to an individual's information set at time $t$. Information about an individual's true cost type is one element of this set: $Info^{cost}_{it} \in \Omega_{it}$.

[4] This is consistent with suggestive evidence that the enjoyment of smoking may decline as respiratory function worsens.

We have proposed a rather crude learning mechanism.
A rational, Bayesian agent would update their belief about the distribution of $\eta_{it}$ on the basis of the observed sequence of past cost draws. However, Bayesian learning dynamics would greatly complicate the numerical solution of the model by necessitating an additional state variable: the history of past cost draws. To avoid excessive computational burden, we make the starker assumption that individuals randomly transition from the uninformed to the informed state with probability $\pi^{learn}$ (denoted $\pi^{\ell}$ in Table 9) after every period during which they smoke.
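The belief structure described in this section can be sketched as follows. The example computes the mean reduction-cost multiplier an individual expects, using the standard log-normal mean $\exp(\mu + \sigma^2/2)$, and applies the random transition to the informed state. All function names are hypothetical and the numerical inputs are illustrative.

```python
import math
import random


def lognormal_mean(alpha1, sigma_cost):
    """Mean of exp(alpha1 + eps) when eps ~ N(0, sigma_cost^2)."""
    return math.exp(alpha1 + 0.5 * sigma_cost ** 2)


def believed_mean_cost(informed, own_alpha1, alpha1_low, alpha1_high,
                       pi_low_b, sigma_cost):
    """Expected cost multiplier under the individual's current beliefs."""
    if informed:
        return lognormal_mean(own_alpha1, sigma_cost)
    # Uninformed: mixture of the two type distributions with belief pi_low_b.
    return (pi_low_b * lognormal_mean(alpha1_low, sigma_cost)
            + (1.0 - pi_low_b) * lognormal_mean(alpha1_high, sigma_cost))


def update_informed(informed, smoked, pi_learn, rng):
    """Informed status is absorbing; learning can occur only after smoking."""
    if informed:
        return True
    return bool(smoked) and rng.random() < pi_learn


rng = random.Random(1)
informed = False
for period in range(10):
    informed = update_informed(informed, smoked=True, pi_learn=0.16, rng=rng)
print("informed after 10 smoking periods:", informed)
```

Because informed status is a single binary state, the solution needs only two belief regimes rather than a full history of cost draws.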


4.5 Decision Problem

The individual's decision problem can be expressed as:

$$
\max_{j \in \{0, 1, 2, 3\}} \; \tilde{u}_j(Z_{it}, \varepsilon_{jit}) + \beta \pi^S_{it+1}(Z_{it}, S^h_{it+1}) \, E\left[V_{t+1}(Z_{it+1}) \mid \Omega_{it}\right]
\tag{8}
$$

Here we recognize that the probability of dying between periods $t$ and $t + 1$ depends on the period $t$ state variables, $Z_{it}$, as well as the updated smoking health stock $S^h_{it+1}$. Also, $V_{t+1}(Z_{it+1})$ is the value of the decision problem in period $t + 1$ given the state vector $Z_{it+1}$. The expectation of $V_{t+1}(Z_{it+1})$ is taken with respect to the random state vector $Z_{it+1}$ and the vector of shocks $\varepsilon_{it}$, conditional on survival, $Z_{it}$, and choice $j$. Note also that the expectation depends on the current information set $\Omega_{it}$. Following Rust (1987), if we assume that the shocks $\varepsilon_{jit}$ are additively separable, satisfy the conditional independence assumption, and follow a Type I extreme value distribution, then the expected value of the next period's value function (conditional on survival) can be expressed as:

$$
E\left[V_{t+1}(Z_{it+1}) \mid \Omega_{it}\right] = E\left[\ln\left(\sum_j \exp\{\nu_{jt+1}(Z_{it+1}, \Omega_{it})\}\right)\right]
\tag{9}
$$

Here $\nu_{jt}(Z_{it}, \Omega_{it})$ is the conditional value function associated with making choice $j$ in time period $t$. This is the expected value of making the choice, net of the $\varepsilon_{jit}$ shock:

$$
\nu_{jt}(Z_{it}, \Omega_{it}) = u_j(Z_{it}) + \beta \pi^S_{it+1}(Z_{it}, S^h_{it+1}) \, E\left[\ln\left(\sum_j \exp\{\nu_{jt+1}(Z_{it+1}, \Omega_{it})\}\right)\right]
\tag{10}
$$

In the terminal period $t = T$, the conditional value functions reduce to $\nu_{jT}(Z_{iT}, \Omega_{iT}) = u_j(Z_{iT})$. The individual's decision problem can thus be expressed as:

$$
\max_{j \in \{0, 1, 2, 3\}} \; \nu_{jt}(Z_{it}, \Omega_{it}) + \varepsilon_{jit}
\tag{11}
$$

The conditional choice probabilities associated with this optimization problem can be expressed as:

$$
Prob(j = j' \mid Z_{it}, \Omega_{it}) = \frac{\exp(\nu_{j't}(Z_{it}, \Omega_{it}))}{\sum_j \exp(\nu_{jt}(Z_{it}, \Omega_{it}))}
\tag{12}
$$
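Equations 9, 10, and 12 reduce to a log-sum-exp and a logit formula, which can be sketched as below. This is a simplified illustration, not the authors' solver: each choice is assumed to lead to a known next state (so the outer expectation in Equation 10 is dropped), the shock scale is normalized to one as in the equations as printed, and all names and numbers are hypothetical.

```python
import math


def emax(values):
    """Log-sum-exp of next-period conditional values (Equation 9), computed stably."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def conditional_values(flow_utils, beta, survival, next_values):
    """Equation 10 for one state, assuming a known next state for each choice.

    flow_utils[j]  : u_j(Z_it)
    survival[j]    : survival probability after choice j
    next_values[j] : conditional values at the state reached by choice j
    """
    return [u + beta * s * emax(nv)
            for u, s, nv in zip(flow_utils, survival, next_values)]


def choice_probs(values):
    """Logit conditional choice probabilities (Equation 12)."""
    top = max(values)
    weights = [math.exp(v - top) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-period example with four smoking options.
terminal = [0.0, 0.2, 0.1, -0.3]  # nu_jT equals u_j in the terminal period
nu_t = conditional_values(
    flow_utils=[0.0, 0.3, 0.2, -0.1],
    beta=0.88,
    survival=[0.99, 0.99, 0.98, 0.97],
    next_values=[terminal] * 4,
)
print(choice_probs(nu_t))
```

Subtracting the maximum before exponentiating leaves the result unchanged and avoids overflow when value functions are large, which matters in long backward recursions.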

4.6 Parameter Heterogeneity and Genes

In the population of ever-smokers, we assume that there are $J = 3$ cigarette preference types characterized by distinct values of $\alpha_{0i}$. Let $\tau^p_i$ indicate an individual's preference type. The probability of being preference type 1 or 2, conditional on being a smoking type and on an individual's genotype $G_i$, is given by

$$
P(\tau^p_i = \tau \mid G_i) = \frac{\exp(\theta_{p\tau} X_{pi})}{1 + \exp(\theta_{p1} X_{pi}) + \exp(\theta_{p2} X_{pi})}, \quad \tau \in \{1, 2\}
\tag{13}
$$

where $X_{pi}$ contains a constant and allele counts for each of our four SNPs of interest. The probability of preference type 3 is given by $P(\tau^p_i = 3 \mid X_{pi}) = 1 - P(\tau^p_i = 1 \mid X_{pi}) - P(\tau^p_i = 2 \mid X_{pi})$. In addition to preference type $\tau^p_i$, individuals also possess an addiction type $\tau^a_i \in \{1, 2\}$. Different addiction types possess different values of the location parameter $\alpha_{1i}$ in the reduction cost process. The probability of addiction type 1 is given by:

$$
Prob(\tau^a_i = 1 \mid G_i, \tau^p_i) = \frac{\exp(\theta_a X_{ai})}{1 + \exp(\theta_a X_{ai})}
\tag{14}
$$

Here $X_{ai}$ includes allele counts for each SNP, and dummies for preference types 2 and 3. That is, we allow for addiction types and preference types to be correlated, but impose the above nested structure for the type probabilities. Equations 13-14 describe how genes enter the structural model. Allele counts for each of our four SNPs of interest affect the linear indices that determine an individual's preference type probability, and the addiction type probability, conditional on preference type. That is, different genotypes are associated with different distributions of the pair $\langle \alpha_{0i}, \alpha_{1i} \rangle$ in the population.
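The nested type probabilities in Equations 13 and 14 are a multinomial logit followed by a binary logit. A minimal sketch follows, with hypothetical function names and illustrative coefficient vectors; the ordering of covariates inside each vector is an assumption made for the example.

```python
import math


def dot(theta, x):
    return sum(t * v for t, v in zip(theta, x))


def preference_type_probs(x_p, theta_p1, theta_p2):
    """Equation 13: probabilities of preference types 1, 2, and 3 (the base)."""
    e1 = math.exp(dot(theta_p1, x_p))
    e2 = math.exp(dot(theta_p2, x_p))
    denom = 1.0 + e1 + e2
    return [e1 / denom, e2 / denom, 1.0 / denom]


def addiction_type1_prob(x_a, theta_a):
    """Equation 14: logit probability of addiction type 1."""
    return 1.0 / (1.0 + math.exp(-dot(theta_a, x_a)))


# Hypothetical genotype: allele counts at the four SNPs of interest.
alleles = [2, 1, 0, 1]
x_p = [1.0] + alleles                      # constant plus allele counts
probs = preference_type_probs(x_p,
                              theta_p1=[1.0, 0.1, 0.0, 0.1, 0.0],
                              theta_p2=[0.2, 0.0, 0.1, 0.0, 0.0])
x_a = alleles + [1.0, 0.0]                 # allele counts plus type 2/3 dummies
print(probs, addiction_type1_prob(x_a, [0.0, 0.0, 0.0, 0.3, -0.2, 0.1]))
```

Joint type probabilities follow by multiplying the preference type probability by the addiction type probability conditional on that preference type.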

4.7 The Initiation Process

The behavioral model and distribution of model parameters discussed so far apply to the population of ever-smokers. We choose to separately model the process of initiation. We do this because the SNPs studied here have not been linked to initiation. However, in any forward-looking economic model of smoking, changes in preference for nicotine or in the cost of reduction should affect the probability that an individual ever becomes a smoker. The presence of robust associations between our SNPs of interest and various smoking outcomes, and the lack of any such correlation with initiation, suggests that individuals are not aware of their preference or cost types when making initiation decisions. That is, the initiation decision can be thought of as largely separate from the processes that determine consumption and cessation later in life, at least for the channels through which these SNPs operate. We approximate the underlying behavioral model that drives initiation by assuming that it is random and uncorrelated with our SNPs of interest. With some probability $\pi^{NoSmoke}$, an individual is a non-smoking type that will always abstain from cigarettes. With probability $(1 - \pi^{NoSmoke})$, an individual is a possible smoker, and the distribution of $\langle \alpha_{0i}, \alpha_{1i} \rangle$ within this population is determined by the type distribution in the previous section. Within the population of possible smokers, individuals start life at age 10 with zero values of the stocks $S^a_{it}$ and $S^h_{it}$. The probability that an individual starts smoking for the first time is governed by an exogenous initiation process. Specifically, we assume a probit initiation process where $I^*_{it}$ represents a latent initiation index:

$$
I^*_{it} = \gamma_0 + \gamma_1 Age_{it} + \gamma_2 Age^2_{it} + \gamma_3 YearBorn_i + \varepsilon^{Init}_{it}
\tag{15}
$$

Here $\varepsilon^{Init}_{it}$ is distributed i.i.d. standard normal, and if a never-smoker draws $I^*_{it} > 0$, then they receive draws for the random components of utility and solve the problem in Equation 8. If they choose to not smoke, they continue to be a never-smoker and will receive a draw for $I^*_{it+1}$ in the next period. If they choose to smoke, then behavior is determined based on the smoker's problem outlined in the previous section. For never-smokers, health outcomes are also determined by the system in Equations 4-6. We have assumed an exogenous initiation process for convenience and to avoid adding an extra layer of uncertainty regarding an individual's preference for cigarettes. The cost of this approach is that our initiation process will not be policy invariant. Thus, any counter-factual policy analysis performed with the estimated model is subject to the limitation that the interventions might affect the initiation process in an un-modeled manner.
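Because the initiation shock in Equation 15 is standard normal, the per-period probability of drawing $I^*_{it} > 0$ is the normal c.d.f. evaluated at the deterministic part of the index. The sketch below illustrates this; the function name is hypothetical, and the scaling of the year-of-birth variable is left to the caller because the text does not state it.

```python
import math


def initiation_draw_prob(age, year_born, gamma0, gamma1, gamma2, gamma3):
    """P(I*_it > 0) for a never-smoker under the probit in Equation 15."""
    index = gamma0 + gamma1 * age + gamma2 * age ** 2 + gamma3 * year_born
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))


# Hypothetical coefficients, chosen only to produce a hump-shaped age profile.
for age in (12, 18, 25, 40):
    p = initiation_draw_prob(age, year_born=0.0,
                             gamma0=-5.0, gamma1=0.4, gamma2=-0.01, gamma3=0.0)
    print(age, round(p, 4))
```

Note that a positive draw only opens the choice problem in Equation 8; whether the individual actually starts smoking then depends on the conditional choice probabilities.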


4.8 Information about the Health Risks of Smoking

Our sample consists of individuals born between 1920 and 1959. These cohorts reached maturity and made smoking decisions during a period of tremendous change in society's understanding of the health risks associated with smoking. Although there were concerns about the health consequences of smoking, the health risks of smoking were not entirely recognized by the medical establishment. As smoking rates rose in the 1930s and 1940s, cigarette advertisements often featured doctors, promoting the idea that smoking was safe (Gardner and Brandt 2006). A key turning point in the public perception of the health risks of smoking was the issuance of the Surgeon General's Report on Smoking and Health in 1964. The report marshalled epidemiological evidence and precipitated a decline in smoking rates for many groups (De Walque 2010). Failure to account for this large, population-wide change in information on the health risks of smoking could bias our estimates of parameters related to the life-cycle consumption of cigarettes. To address this, we introduce a state variable to the information set, $Surg_t \in \Omega_{it}$, which is an indicator for the years 1964 and later. $Surg_t$ affects beliefs about the health risks of smoking, and therefore alters the way that people evaluate the conditional expectation in Equation 8. Specifically, we assume that before 1964, optimal behavior is determined under the assumption that there are no health risks associated with smoking, so that $\beta^s_h$, $\beta^s_{ageh}$, $\beta^o_h$, $\beta^o_{ageh}$, and $\beta^d_h$ are all assumed to be zero. During and after 1964, individuals form expectations based on the true values of these health parameters. Practically, this means solving for two sets of value functions: one under the assumption of no health risks, and one under the assumption of the true health risks. Behavior is then simulated using the $Surg_t = 0$ value functions before 1964, and with the $Surg_t = 1$ value functions thereafter.
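The switch in beliefs at 1964 amounts to choosing which set of health parameters enters the agent's expectations. A minimal sketch follows; the dictionary keys and function name are hypothetical labels for the five smoking-related health coefficients named above.

```python
# Hypothetical labels for the coefficients that link smoking to health and death.
SMOKING_RISK_KEYS = {"beta_h_s", "beta_ageh_s", "beta_h_o", "beta_ageh_o", "beta_h_d"}


def believed_health_params(year, true_params):
    """Return the health parameters individuals use to form expectations.

    Before 1964 (Surg_t = 0) every smoking-related coefficient is set to zero;
    from 1964 onward (Surg_t = 1) the true values are used.
    """
    if year >= 1964:
        return dict(true_params)
    return {key: (0.0 if key in SMOKING_RISK_KEYS else value)
            for key, value in true_params.items()}


true_params = {"beta_0_s": -3.5, "beta_h_s": 0.026, "beta_ageh_s": 0.003,
               "beta_h_o": -0.16, "beta_ageh_o": 0.037, "beta_h_d": 0.002}
print(believed_health_params(1950, true_params))
print(believed_health_params(1970, true_params))
```

In a solver, the two parameter sets would each generate their own value functions, and simulated behavior would use whichever set matches the calendar year.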

5 Empirical Implementation and Estimation Results

We estimate the parameters of the structural model using the Method of Simulated Moments. We solve and simulate the model for each distinct combination of preference and addiction types for a number of birth cohorts, and search for model parameters that best match a set of moments from the empirical data. Let $S_P = \{1, 2, 3, 4\}$ refer to the set of preference types, with preference type 4 denoting the never-smoking type introduced in Section 4.7. Similarly, let $S_A = \{1, 2\}$ refer to the set of addiction types, and let $S_{BC} = \{1925, 1930, 1935, 1940, 1945, 1950\}$ refer to the set of birth cohorts for which the model is simulated.[5] Finally, let $S_G$ refer to the set of genotypes formed by all relevant combinations of the SNPs rs16969968, rs13280604, rs7937, and rs1329650.[6] Let $S = S_P \times S_A \times S_{BC} \times S_G$ refer to the combined set of distinct simulation groups. For each group $f \in S$, we simulate 1,000 histories of smoking behavior, health, and mortality. Let $M_\ell$ refer to the empirical sample average for the $\ell$th moment, and let $\widetilde{M}_\ell$ refer to the corresponding simulated average. The $\ell$th simulated moment is constructed as:

$$
\widetilde{M}_\ell = \frac{\sum_{f \in S} \omega_f N_{\ell f} \, \widetilde{m}_{\ell f}}{\sum_{f \in S} \omega_f N_{\ell f}}
\tag{16}
$$

Here $\widetilde{m}_{\ell f}$ represents the average value of moment $\ell$ calculated from simulated observations from group $f$. $N_{\ell f}$ indicates how many simulation observations contributed to the group $f$ average for moment $\ell$, and $\omega_f$ represents the population weight assigned to group $f$. The group weight $\omega_f$ is determined by:

$$
\omega_f =
\begin{cases}
Freq^{BC}_f \, Freq^{G}_f \, \pi^{NoSmoke}, & \text{if never-smoker type;} \\
Freq^{BC}_f \, Freq^{G}_f \, (1 - \pi^{NoSmoke}) \, P(\tau^p_i = \tau^p_f \mid G_f) \, P(\tau^a_i = \tau^a_f \mid G_f, \tau^p_f), & \text{otherwise.}
\end{cases}
\tag{17}
$$
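Equations 16 and 17 describe a weighted average across simulation groups, and the estimator then compares simulated and empirical moments through a weighted sum of squared distances. The sketch below illustrates both steps under simplifying assumptions; the function names are hypothetical and the inputs are made-up numbers, with the frequency terms standing for the relative birth-cohort and genotype frequencies of each group.

```python
def group_weight(freq_bc, freq_g, pi_nosmoke, never_smoker, p_pref=1.0, p_add=1.0):
    """Equation 17: population weight of a simulation group."""
    if never_smoker:
        return freq_bc * freq_g * pi_nosmoke
    return freq_bc * freq_g * (1.0 - pi_nosmoke) * p_pref * p_add


def simulated_moment(groups):
    """Equation 16: groups is an iterable of (weight, n_obs, group_mean)."""
    groups = list(groups)
    numerator = sum(w * n * m for w, n, m in groups)
    denominator = sum(w * n for w, n, _ in groups)
    return numerator / denominator


def msm_objective(empirical, simulated, moment_weights):
    """Weighted sum of squared distances between empirical and simulated moments."""
    return sum(w * (e - s) ** 2
               for e, s, w in zip(empirical, simulated, moment_weights))


# Two hypothetical groups contributing to a single smoking-rate moment.
w_never = group_weight(0.2, 0.1, pi_nosmoke=0.3, never_smoker=True)
w_smoker = group_weight(0.2, 0.1, pi_nosmoke=0.3, never_smoker=False,
                        p_pref=0.5, p_add=0.5)
sim = simulated_moment([(w_never, 1000, 0.0), (w_smoker, 1000, 0.4)])
print(sim, msm_objective([0.15], [sim], [10.0]))
```

Weighting by both the group weight and the number of contributing observations keeps groups with high simulated mortality from being over-represented in moments measured at older ages.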

Here $Freq^{BC}_f$ measures the relative frequency of group $f$'s birth cohort, and $Freq^{G}_f$ measures the relative frequency of group $f$'s genotype.[7]

[5] Note that $S_{BC}$ does not include all birth cohorts in the empirical sample. Since calendar time is a state variable in the model, every birth cohort that is simulated requires distinct value functions, increasing computational expense. Although we use all birth cohorts when calculating our empirical moments, we only simulate the model for an evenly spaced subset of the birth years spanned by our sample.

[6] There are 81 possible genotypic combinations (four SNPs and three possible values for the allele counts at each SNP), and we observe 80 in our sample. To cut down the computational expense of searching for the type mixing parameters, we also exclude from the simulated model genotypic combinations that are extremely rare. Specifically, we exclude the 22 smallest genotypic groups in our sample. These groups together account for about 2% of our sample. We use data on all individuals for constructing the empirical moments.

[7] The $Freq^{BC}_f$ measure is based on the sizes of these birth cohorts in U.S. Census data. Using IPUMS data from the 1960 and 1980 U.S. Censuses, we sum up the sampling weights of all individuals born in each cohort. For birth cohorts between 1920-1940, we use the 1960 Census, and for those cohorts between 1941-1960, we use the 1980 Census. We split the cohorts in this way to make sure that mortality does not bias our calculation of relative cohort sizes for older birth cohorts. The relative genotypic frequencies $Freq^{G}_f$ are directly calculated from the HRS sample.

Our estimator minimizes the weighted sum of squared distances between simulated and empirical values for the 171 moments described in Appendix Section 7.1.[8] These moments include age-specific smoking rates, the frequency of intensity categories conditional on smoking, the fractions of individuals who are in bad health or have a major lung illness, and annual death rates. Moments based on maximum cigarettes ever smoked and the lung illness indicator are evaluated conditional on genotype, matching the descriptive regressions examined earlier.

Tables 9-10 present the structural parameter estimates. We estimate three distinct preference types for the parameter $\alpha_0$: 0.008, 0.068, and 0.231. Conditional on being an ever-smoker type, these preferences occur with probabilities 0.72, 0.17, and 0.12, respectively. Individuals are further differentiated by addiction types. We estimate substantial differences in the two assumed addiction types, which take $\alpha_1$ values of -1.88 and 0.26, respectively. We estimate that 69 percent of the ever-smoking population is the low-cost type, and the freely estimated belief parameter of 0.59 is quite close to this true proportion.

To assess how well the model fits the data, Table 11 compares several simulated moments with their empirical counterparts. In general, the model fits the data quite well, with some noteworthy exceptions. The model seems to under-predict binary smoking among individuals in their late 50s (0.22 vs. 0.28 in the data), and over-predict smoking later in life. The model also over-predicts light smoking and under-predicts heavy smoking at older ages. However, the model matches the decline in smoking as individuals pass through their 60s and 70s, and it matches the distribution of maximum cigarettes per day at age 55 fairly well.

Table 12 displays the type probabilities associated with different genotypes. These probabilities directly inform us about the channels through which these SNPs operate. Our estimates suggest that rs16969968 has a large effect on the distribution of cigarette preference parameters.
About 15 percent of individuals with no copies of the protective reference allele at rs16969968 fall into the highest preference category, while this is true for only 4 percent of individuals with two copies. Conversely, while 44 percent of individuals with zero copies are in the low preference category, this number rises to 51 percent for those with two copies. We find no clear relationship between rs16969968 and an individual's addiction type. The results for SNP rs7937 follow a similar pattern, with extra copies being associated with a smaller probability for the highest preference category, and a larger probability for the lowest preference category. Our estimates suggest no relationship between this SNP and the addiction type. For rs1329650, the estimates suggest the opposite pattern. While extra copies of the reference allele at rs1329650 do not shift the distribution of preference types, they seem to reduce the probability that an individual is in the high reduction-cost category. Thus it appears that rs1329650 may be working through an addiction channel distinct from the other SNPs under study. Finally, we note that our estimates for rs13280604 are difficult to interpret neatly. It appears that extra copies of the reference allele at this location are associated with a smaller probability of being in the highest preference category, and a much higher probability of being in the high-cost addiction type. Taken together, the results in Table 12 demonstrate the feasibility of using observational data to map genotypic heterogeneity into the parameters of a dynamic model of smoking behavior. Furthermore, the results suggest that SNPs such as rs16969968 and rs1329650 might operate through distinct channels.

[8] We weight all moments equally, with the exception of the extensive margin smoking moments, which receive 10 times the weight of other moments.


Table 1: Cross Sectional Characteristics in HRS (At Last Observation)

Variable              Mean     Std. Dev.   N
Age                   73.74     7.63       8140
Male                   0.43     0.50       8140
Ever Smoked            0.57     0.49       8140
Max. Cigs Per Day     25.68    17.90       4603
Ever Lung Illness      0.18     0.38       8060
Ever Heart Illness     0.38     0.48       8140
Ever Cancer            0.23     0.42       8140
Bad Health             0.27     0.45       8134

Table 2: Maximum Cigarettes Per Day (MaxCigs>0)

                (1)          (2)          (3)          (4)          (5)
rs16969968      -1.320***                                           -1.326***
                (0.389)                                             (0.389)
rs13280604                   -0.744*                                -0.757*
                             (0.434)                                (0.433)
rs7937                                    -1.008***                 -1.030***
                                          (0.362)                   (0.361)
rs1329650                                              -0.129       -0.136
                                                       (0.401)      (0.401)
Observations    4603         4603         4603         4603         4603
R2              0.097        0.096        0.097        0.095        0.100


Table 3: Contemporaneous Smoking Outcomes - Ever-Smokers, All Person-Year Obs

                Smoke        Quit         Quant.         Quant.
                                          (w/ zeros)     (given smoking)
rs16969968      0.013+       0.005        0.060          -0.680*
                (0.008)      (0.007)      (0.183)        (0.349)
rs13280604      -0.014+      -0.015*      -0.390*        -0.017
                (0.009)      (0.008)      (0.201)        (0.409)
rs7937          0.008        0.006        -0.060         -0.725**
                (0.008)      (0.007)      (0.176)        (0.323)
rs1329650       -0.008       -0.007       -0.098         0.102
                (0.008)      (0.008)      (0.193)        (0.376)
Observations    38782        8060         35114          9061
R2              0.067        0.020        0.058          0.093

Table 4: Lung Illness

                 (1)          (2)          (3)
rs16969968       -0.029***                 -0.024**
                 (0.010)                   (0.010)
rs13280604       0.003                     0.005
                 (0.011)                   (0.011)
rs7937           -0.004                    -0.000
                 (0.009)                   (0.009)
rs1329650        -0.026***                 -0.027***
                 (0.010)                   (0.010)
MaxCigsPerDay                 0.004***     0.003***
                              (0.000)      (0.000)
Observations     4603         4633         4603
R2               0.045        0.029        0.048

Table 5: Heart Illness

                 (1)          (2)          (3)
rs16969968       0.017+                    0.020*
                 (0.011)                   (0.011)
rs13280604       0.021*                    0.022*
                 (0.012)                   (0.012)
rs7937           -0.007                    -0.003
                 (0.010)                   (0.010)
rs1329650        0.003                     0.003
                 (0.011)                   (0.011)
MaxCigsPerDay                 0.002***     0.002***
                              (0.000)      (0.000)
Observations     4603         4633         4603
R2               0.070        0.068        0.071

Table 6: Ever Cancer

                 (1)          (2)          (3)
rs16969968       -0.003                    -0.001
                 (0.010)                   (0.010)
rs13280604       0.012                     0.014
                 (0.011)                   (0.011)
rs7937           -0.010                    -0.006
                 (0.009)                   (0.009)
rs1329650        -0.012                    -0.011
                 (0.010)                   (0.010)
MaxCigsPerDay                 0.002***     0.002***
                              (0.000)      (0.000)
Observations     4603         4633         4603
R2               0.044        0.041        0.045

Table 7: Mortality (One-Year Death Rate), Linear Probability

                All Cohorts    Born 1920-1939    Born 1940-1949
rs16969968      -0.003         -0.002            -0.004**
                (0.002)        (0.003)           (0.002)
rs13280604      -0.003+        -0.001            -0.005***
                (0.002)        (0.003)           (0.002)
rs7937          -0.002+        -0.002            -0.003*
                (0.002)        (0.002)           (0.002)
rs1329650       -0.001         -0.001            -0.002
                (0.002)        (0.003)           (0.002)
Observations    17566          11069             6497
R2              0.019          0.016             0.009

Table 8: Health Outcomes - Never Smokers

                Ever Lung    Ever Heart    Ever Cancer    Bad Health
rs16969968      -0.011       -0.002        -0.008          0.005
                (0.008)      (0.012)       (0.010)        (0.011)
rs13280604       0.012        0.008        -0.010          0.012
                (0.009)      (0.013)       (0.012)        (0.012)
rs7937           0.006        0.007         0.013          0.003
                (0.007)      (0.012)       (0.010)        (0.010)
rs1329650       -0.003        0.009        -0.007         -0.002
                (0.008)      (0.013)       (0.011)        (0.011)
Observations     3427         3427          3427           3426
R2               0.031        0.064         0.040          0.083

Figure 1: Smoking Intensity by Age for Continuous Smokers - NLSY


Table 9: Parameter Estimates - Period Utility and Stocks

Utility Parameters
  α0, Pref. Type 1            0.0082
  α0, Pref. Type 2            0.0677
  α0, Pref. Type 3            0.2309
  α0S                        -0.1115
  α3 (Smoking Illness)       -0.1569
  α4 (Bad Health)            -0.2446
  log(σ)                     -0.9708

Reduction Cost Params
  α1, Add. Type 1            -1.8752
  α1, Add. Type 2             0.2627
  log(σ^cost)                -0.4029
  log(α2)                    -4.1629

Initiation
  γ0                         -5.3384
  γ1 (Age)                    0.4292
  γ2 (Age Sq.)               -0.0107
  γ3 (Year Born)             -0.0121

Other Parameters
  δa1                         0.3874
  δa2                         0.0500
  δh                          0.4125
  β                           0.8801
  π^{low,b}                   0.5906
  π^ℓ                         0.1636

Avg. Type Probabilities
  Non-Smoking Type            0.3098
  Pref 1                      0.7168
  Pref 2                      0.1662
  Pref 3                      0.1171
  Add 1                       0.6922
  Add 2                       0.3078

Table 10: Parameter Estimates - Health Processes

Death Process
  β^d_0                      -4.0472
  β^d_age                    -0.0081
  β^d_age2                    0.0295
  β^d_s                       0.3357
  β^d_o                       1.6855
  β^d_h                       0.0017

Lung Illness Process
  β^s_0                      -3.4998
  β^s_age                     0.0133
  β^s_age2                    0.0006
  β^s_h                       0.0259
  β^s_ageh                    0.0033

Bad Health Process
  β^o_0                      -1.3899
  β^o_age                    -0.0001
  β^o_age2                    0.0002
  β^o_h                      -0.1589
  β^o_ageh                    0.0365
  β^o_b                       0.0092
  β^o_ba                      0.0007

Health Process Correlation
  σ12                         0.3700

Table 11: Empirical and Simulated Moments

Smoking
             Smoking (Binary)     Smoke Light          Smoke Heavy
Ages         Emp.      Sim.       Emp.      Sim.       Emp.      Sim.
55-59        0.2789    0.2231     0.3904    0.3636     0.2358    0.3183
60-64        0.2317    0.1983     0.4243    0.4375     0.2022    0.2309
65-69        0.1944    0.1756     0.4765    0.5347     0.1773    0.1327
70-74        0.1459    0.1542     0.5236    0.6020     0.1293    0.0724
75-79        0.1051    0.1366     0.6256    0.6379     0.0869    0.0486
80-84        0.0805    0.1312     0.6945    0.6423     0.0576    0.0414
85-89        0.0627    0.1412

Smoke Categories Age 55
Category     Emp.      Sim.
Medium       0.2174    0.2387
Heavy        0.2714    0.2701

Bad Health
             Bad Health           Bad Health if Smoke
Ages         Emp.      Sim.       Emp.      Sim.
55-59        0.2114    0.2159     0.2982    0.2521
60-64        0.2291    0.2448     0.3198    0.2896
65-69        0.2504    0.2746     0.3393    0.3164
70-74        0.2750    0.3041     0.3590    0.3613
75-79        0.3159    0.3279     0.4214    0.4003
80-84        0.3428    0.3455     0.4479    0.4096
85-89        0.3782    0.3359     0.4179    0.3905

Lung Illness
             Ever Lung Illness    Ever Lung if Smoke
Ages         Emp.      Sim.       Emp.      Sim.
55-59        0.0485    0.0814     0.1680    0.1594
60-64        0.0779    0.1038     0.2256    0.2084
65-69        0.0988    0.1277     0.2728    0.2523
70-74        0.1326    0.1478     0.2724    0.2914
75-79        0.1597    0.1664     0.3195    0.3250
80-84        0.1742    0.1862     0.3549    0.3577
85-89        0.1930    0.2059     0.4706    0.3722

Death
             Death                Death if Smoke
Ages         Emp.      Sim.       Emp.      Sim.
55-59        0.0030    0.0088     0.0094    0.0110
60-64        0.0064    0.0125     0.0152    0.0162
65-69        0.0107    0.0199     0.0278    0.0258
70-74        0.0231    0.0292     0.0418    0.0383
75-79        0.0390    0.0449     0.0572    0.0598
80-84        0.0632    0.0627     0.1069    0.0808
85-89        0.0960    0.0863     0.0909    0.1094

Table 12: Type Probabilities by Genotype

Genotype        Prob. Pref 1   Prob. Pref 2   Prob. Pref 3   Prob. Add. 1   Prob. Add. 2
rs16969968
  0 copies      0.4372         0.1014         0.1516         0.4725         0.2177
  1 copy        0.4784         0.0939         0.1178         0.5137         0.1765
  2 copies      0.5113         0.1355         0.0433         0.4418         0.2483
rs13280604
  0 copies      0.4988         0.0973         0.0941         0.5242         0.1660
  1 copy        0.4921         0.1348         0.0633         0.4252         0.2650
  2 copies      0.4670         0.1741         0.0491         0.3114         0.3788
rs7937
  0 copies      0.4825         0.1079         0.0999         0.4781         0.2121
  1 copy        0.4972         0.1157         0.0773         0.4777         0.2124
  2 copies      0.5071         0.1225         0.0606         0.4772         0.2130
rs1329650
  0 copies      0.4920         0.1240         0.0742         0.4354         0.2548
  1 copy        0.4967         0.1072         0.0863         0.5140         0.1762
  2 copies      0.5012         0.0937         0.0952         0.5633         0.1269
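The type probabilities in Table 12 can be checked against the average non-smoking type probability in Table 9. The following is a minimal sketch, not part of the estimation code: it assumes that, within each genotype, the three preference-type probabilities and the two addiction-type probabilities should each sum to one minus the non-smoking share (1 - 0.3098). The dictionary layout and the function name `max_deviation` are hypothetical; the numbers are transcribed from the two tables.

```python
# Average non-smoking type probability, transcribed from Table 9.
NON_SMOKING = 0.3098

# Table 12 rows keyed by SNP and allele count.
# Each tuple is (Pref 1, Pref 2, Pref 3, Add. 1, Add. 2).
TABLE_12 = {
    "rs16969968": {
        0: (0.4372, 0.1014, 0.1516, 0.4725, 0.2177),
        1: (0.4784, 0.0939, 0.1178, 0.5137, 0.1765),
        2: (0.5113, 0.1355, 0.0433, 0.4418, 0.2483),
    },
    "rs13280604": {
        0: (0.4988, 0.0973, 0.0941, 0.5242, 0.1660),
        1: (0.4921, 0.1348, 0.0633, 0.4252, 0.2650),
        2: (0.4670, 0.1741, 0.0491, 0.3114, 0.3788),
    },
    "rs7937": {
        0: (0.4825, 0.1079, 0.0999, 0.4781, 0.2121),
        1: (0.4972, 0.1157, 0.0773, 0.4777, 0.2124),
        2: (0.5071, 0.1225, 0.0606, 0.4772, 0.2130),
    },
    "rs1329650": {
        0: (0.4920, 0.1240, 0.0742, 0.4354, 0.2548),
        1: (0.4967, 0.1072, 0.0863, 0.5140, 0.1762),
        2: (0.5012, 0.0937, 0.0952, 0.5633, 0.1269),
    },
}


def max_deviation(table, non_smoking):
    """Largest gap between a row's type totals and (1 - non_smoking)."""
    target = 1.0 - non_smoking
    worst = 0.0
    for by_copies in table.values():
        for row in by_copies.values():
            pref_total = sum(row[:3])  # Pref 1 + Pref 2 + Pref 3
            add_total = sum(row[3:])   # Add. 1 + Add. 2
            worst = max(worst, abs(pref_total - target), abs(add_total - target))
    return worst


if __name__ == "__main__":
    print(f"largest gap: {max_deviation(TABLE_12, NON_SMOKING):.4f}")
```

Under this reading, every row agrees with the Table 9 share up to rounding in the fourth decimal place, which suggests the entries in Table 12 are unconditional probabilities rather than probabilities conditional on being a smoking type.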


6 Appendix

6.1 Moments used in Estimation

A total of 171 moments are used in the estimation, which we itemize below. Unless otherwise noted, we condition on age by considering means across the following seven age groups: 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, and 85-89.

• Initiation Moments: Average age at start, fraction of starts occurring before age 15, and fraction of starts occurring after age 30. (3 moments)

• Smoking Extensive Margin: Fraction of individuals who are smoking, by age group. (7 more moments to 10)

• Smoking Intensive Margin: Fraction of smokers choosing categories 1 and 3, by age group. We do not calculate these moments for the oldest age group due to concerns over small sample sizes. (12 more moments to 22)

• Quitting: Fraction of smokers who have quit two years later, by age group. We do not calculate this for the oldest age group. (6 more moments to 28)

• Death Rates: Fraction of individuals that die each year, by age group. These are computed unconditionally, as well as conditional on current (binary) smoking status. (14 more moments to 42)

• Lung Illness: Fraction of individuals that have ever reported a major lung illness, by age group. These are also computed conditional on current (binary) smoking status. (14 more moments to 56)

• Bad Health Status: Fraction of individuals that are currently in bad health, by age group. These are also computed conditional on current (binary) smoking status. (14 more moments to 70)

• Early Death Rate: Fraction of individuals that die each year between the ages of 36 and 40. (1 more moment to 71)

• Ever Smoking Rates (Age ≈55): Fraction of individuals at age 55 that have ever smoked. This is calculated conditional on inclusion in each of 5 birth year groups: 1930-1934, 1935-1939, 1940-1944, 1945-1949, and 1950-1954. (5 more moments to 76)

• Maximum Cigarettes Per Day: Fraction of ever-smokers whose maximum past consumption was equal to the second and third intensity categories, respectively. This is evaluated at age 55. (2 more moments to 78)

• Smoking Category Transitions: We look at two-year transition rates between smoking categories. For each of the three intensity categories, we calculate the fraction of smokers in that category who end up smoking in each of the three categories two years later. We do not calculate these for the oldest two age categories. (45 more moments to 123)

• Conditional Death Rates: Fraction of individuals that die each year conditional on lung illness state and bad health state. We do this for each age category with the exception of the oldest group. (12 more moments to 135)

• Genotype Specific Moments: For each of the four SNPs of interest, we calculate the fraction of ever-smokers whose maximum consumption at age 55 was equal to the second and third intensity categories, conditional on having 0, 1, or 2 copies of the reference allele for that SNP. By allele counts for each SNP, we also calculate the fraction of ever-smokers who have been diagnosed with a lung illness. (36 more moments to 171)
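The running totals in the list above can be tallied mechanically. The sketch below is illustrative only: the block names and counts are transcribed from the list, while the variable and function names are hypothetical, and the two decompositions in the comments reflect one reading of how the counts arise.

```python
# Moment counts per block, transcribed from the itemized list.
MOMENT_BLOCKS = [
    ("Initiation", 3),
    ("Smoking extensive margin", 7),
    ("Smoking intensive margin", 12),
    ("Quitting", 6),
    ("Death rates", 14),
    ("Lung illness", 14),
    ("Bad health status", 14),
    ("Early death rate", 1),
    ("Ever smoking rates (age 55)", 5),
    ("Maximum cigarettes per day", 2),
    ("Smoking category transitions", 45),
    ("Conditional death rates", 12),
    ("Genotype-specific moments", 36),
]


def running_totals(blocks):
    """Return (name, count, cumulative total) for each block in order."""
    total = 0
    rows = []
    for name, count in blocks:
        total += count
        rows.append((name, count, total))
    return rows


TOTALS = running_totals(MOMENT_BLOCKS)

# One reading of two of the larger counts (an assumption, not stated as
# a formula in the text):
#   transitions: 3 origin categories x 3 destination categories x 5 age groups
#   genotype:    4 SNPs x 3 allele counts x (2 intensity shares + 1 lung illness share)
TRANSITION_COUNT = 3 * 3 * 5
GENOTYPE_COUNT = 4 * 3 * (2 + 1)

if __name__ == "__main__":
    for name, count, total in TOTALS:
        print(f"{name:32s} {count:3d} {total:4d}")
```

The final cumulative total matches the 171 moments stated at the start of the subsection.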

References

Beauchamp, J. P., D. Cesarini, M. Johannesson, M. J. H. M. van der Loos, P. D. Koellinger, P. J. F. Groenen, J. H. Fowler, J. N. Rosenquist, A. R. Thurik, and N. A. Christakis (2011): “Molecular Genetics and Economics,” Journal of Economic Perspectives, 25(4), 57–82.

Centers for Disease Control and Prevention (2010): How Tobacco Smoke Causes Disease: The Biology and Behavioral Basis for Smoking-Attributable Disease: A Report of the Surgeon General. Centers for Disease Control and Prevention, Atlanta, Georgia.

Johnson, A. D., R. E. Handsaker, S. L. Pulit, M. M. Nizzari, C. J. O’Donnell, and P. I. de Bakker (2008): “SNAP: A Web-Based Tool for Identification and Annotation of Proxy SNPs Using HapMap,” Bioinformatics, 24(24), 2938–2939.

Liu, J. Z., F. Tozzi, D. M. Waterworth, S. G. Pillai, P. Muglia, L. Middleton, W. Berrettini, C. W. Knouff, X. Yuan, G. Waeber, et al. (2010): “Meta-Analysis and Imputation Refines the Association of 15q25 with Smoking Quantity,” Nature Genetics, 42(5), 436–440.

Munafo, M. R., T. G. Clark, E. C. Johnstone, M. F. Murphy, and R. T. Walton (2004): “The Genetic Basis for Smoking Behavior: A Systematic Review and Meta-Analysis,” Nicotine & Tobacco Research, 6(4), 583–597.

Rietveld, C. A., D. Conley, N. Eriksson, T. Esko, S. E. Medland, A. A. Vinkhuyzen, J. Yang, J. D. Boardman, C. F. Chabris, C. T. Dawes, et al. (2014): “Replicability and Robustness of Genome-Wide-Association Studies for Behavioral Traits,” Psychological Science, 25(11), 1975–1986.

The Tobacco and Genetics Consortium (2010): “Genome-wide Meta-analyses Identify Multiple Loci Associated with Smoking Behavior,” Nature Genetics, 42, 441–449.

Thorgeirsson, T. E., D. F. Gudbjartsson, I. Surakka, J. M. Vink, N. Amin, F. Geller, P. Sulem, T. Rafnar, T. Esko, S. Walter, et al. (2010): “Sequence Variants at CHRNB3-CHRNA6 and CYP2A6 Affect Smoking Behavior,” Nature Genetics, 42(5), 448–453.
