Flexible Matching Strategies to Increase Power and Efficiency to Detect and Estimate Gene-Environment Interactions in Case-Control Studies

American Journal of Epidemiology Copyright © 2002 by the Johns Hopkins Bloomberg School of Public Health All rights reserved

Vol. 155, No. 7 Printed in U.S.A.

Flexible Matching Strategies to Detect Interactions Stürmer and Brenner

ORIGINAL CONTRIBUTIONS

Flexible Matching Strategies to Increase Power and Efficiency to Detect and Estimate Gene-Environment Interactions in Case-Control Studies

Til Stürmer1,2 and Hermann Brenner1,2

Lack of power is a pertinent problem in many case-control studies of gene-environment interactions. The authors recently introduced the concept of flexible matching strategies with varying proportions of a matching factor among selected controls (degree of matching) to increase the power and efficiency of case-control studies. In this study, they extended the concept of flexible matching strategies to the field of gene-environment interactions. They assessed the power and efficiency of such studies to detect and estimate gene-environment interactions under a variety of assumptions regarding the prevalence and effects of the environmental exposure and the genetic susceptibility as well as their association in the population. For each set of parameters, 10,000 case-control studies were simulated using varying degrees of matching. Traditional frequency matching increased the power and precision in most scenarios, but even greater gains were often obtained by increasing the prevalence of the environmental exposure in controls above the one in cases. The authors concluded that flexible matching strategies can increase the power and efficiency of case-control studies to detect and estimate gene-environment interactions compared with traditional frequency matching and therefore might help to alleviate the notorious lack of power of these studies in specific situations. Am J Epidemiol 2002;155:593–602.

case-control studies; confidence intervals; efficiency; environmental exposure; epidemiologic methods; genetic predisposition; research design; sample size

Case-control studies are widely used to assess the impact of risk factors on the occurrence of disease. Interactions, especially gene-environment interactions, are of increasing interest, but power is a major concern (1–3). Matching for strong risk factors is often used to enhance the power of case-control studies, but the impact on power to detect interactions may be limited (4, 5). Nevertheless, matching for the environmental factor may often enhance power and efficiency to disclose gene-environment interactions in situations likely to be encountered in case-control studies of gene-environment interactions (6). Unmatched and frequency-matched studies are but two distinct possibilities of control selection according to the prevalence of the environmental exposure. We recently introduced the concept of varying the proportion of an environmental exposure (the matching factor) among controls over a wide range between and beyond the proportion among cases (matched design) and among the population (unmatched design), and we concluded that it might be worthwhile to evaluate the optimum degree of matching for specific settings in the design of case-control studies (7). This study extends these findings regarding main effects to the increasingly important detection and estimation of multiplicative interactions for disease risk between an environmental exposure and a genetic susceptibility in case-control studies.

Received for publication March 13, 2001, and accepted for publication June 26, 2001.
Abbreviations: CCRATIO, control-to-case ratio; DM, degree of matching; INT, gene-environment interaction; ORED|g, odds ratio of exposure-disease association in the absence of the genetic susceptibility; ORED|G, odds ratio of exposure-disease association in the presence of the genetic susceptibility; OREG, odds ratio of exposure-genetic susceptibility association in the population; ORGD|e, odds ratio of genetic susceptibility-disease association in the absence of exposure; PE, prevalence of environmental exposure; PG, prevalence of genetic susceptibility.
1 Department of Epidemiology, German Centre for Research on Ageing, Heidelberg, Germany.
2 Department of Epidemiology, University of Ulm, Ulm, Germany.
Correspondence to Dr. Til Stürmer, Department of Epidemiology, German Centre for Research on Ageing, Bergheimer Str. 20, 69115 Heidelberg, Germany (e-mail: [email protected]).

MATERIALS AND METHODS

We assessed a wide range of scenarios regarding prevalence and association of a dichotomous environmental exposure and a dichotomous genetic susceptibility in the population and the association of each of these factors with disease (table 1). The basic scenario assumed a prevalence of the environmental exposure (PE) of 10 percent and of the genetic susceptibility (PG) of 20 percent in the population, independence of the exposure and the genetic susceptibility in the population (odds ratio, OREG = 1.0), an odds ratio of 2.0 for the exposure-disease association in nonsusceptibles (ORED|g), an odds ratio of 6.0 for the exposure-disease association in susceptibles (ORED|G) (corresponding to a gene-environment interaction (INT) = ORED|G/ORED|g of 3.0), an odds ratio of 1.0 for the genetic susceptibility-disease association in unexposed persons (ORGD|e), and a control-to-case ratio (CCRATIO) of 1. These values represent a typical situation of gene-environment interaction assessed in large case-control studies. To evaluate the impact of each parameter on power and efficiency to detect and estimate the interaction between an environmental exposure and a genetic susceptibility by matching on the environmental exposure, we sequentially varied parameters one at a time while keeping all other parameters constant at the values given in the basic scenario. PE was varied from 5 to 50 percent and PG from 10 to 75 percent. OREG, ORED|g, ORGD|e, and INT were varied from 0.5 to 5.0, 0.5 to 5.0, 0.2 to 5.0, and 0.5 to 5.0, respectively. Finally, CCRATIO was varied from 0.5 to 5.

TABLE 1. Parameters and their values in the basic and alternative scenarios

Notation   Basic scenario   Alternative scenarios   Meaning
PE         0.10             0.05–0.50               Prevalence of environmental exposure
PG         0.20             0.10–0.75               Prevalence of genetic susceptibility
OREG       1.00             0.50–5.00               Odds ratio of exposure-genetic susceptibility association in the population
ORED|g     2.00             0.50–5.00               Odds ratio of exposure-disease association in the absence of the genetic susceptibility
ORED|G     6.00             1.00–10.0               Odds ratio of exposure-disease association in the presence of the genetic susceptibility
ORGD|e     1.00             0.20–5.00               Odds ratio of genetic susceptibility-disease association in the absence of exposure
INT        3.00             0.50–5.00               Gene-environment interaction (ORED|G/ORED|g)
CCRATIO    1.00             0.50–5.00               Control-to-case ratio

We recently defined the degree of matching (DM) as DM = (PE0 – PE)/(PE1 – PE), where PE1, PE0, and PE are the proportions of the environmental exposure (the matching factor) among cases, among controls, and in the population, respectively (7). Defined in this way, the degree of matching equals 1 with traditional frequency matching (i.e., if PE0 = PE1). DM equals 0 for an unmatched study (i.e., if PE0 = PE). DM values between 0 and 1 would reflect “incomplete matching.” Values of more than 1 would reflect situations in which controls are selected so that the proportion of the environmental exposure is even higher than the one in cases. For each scenario, we selected 13 different control groups reflecting DM values of 0.0, 0.5, 0.75, 0.9, 0.95, 1.0, 1.05, 1.1, 1.25, 1.5, 2.0, 2.5, and 3.0. Doing so enabled us to assess the impact of “less-than-perfect” matching as well as of increasing the prevalence of the matching factor in controls beyond the one in cases.
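The definition of the degree of matching can be made concrete with a short calculation. The following is a minimal sketch in Python (not the software used in the study); the function names are hypothetical, and the values PE = 0.10 and PE1 = 0.24 are illustrative inputs chosen to resemble the basic scenario.

```python
def control_exposure_prevalence(dm, pe1, pe):
    """Exposure prevalence among controls (PE0) implied by a degree of matching.

    Rearranges DM = (PE0 - PE) / (PE1 - PE) into PE0 = PE + DM * (PE1 - PE).
    """
    pe0 = pe + dm * (pe1 - pe)
    if not 0.0 <= pe0 <= 1.0:
        raise ValueError(f"DM={dm} implies PE0={pe0:.3f}, outside [0, 1]")
    return pe0


def degree_of_matching(pe0, pe1, pe):
    """Degree of matching, DM = (PE0 - PE) / (PE1 - PE)."""
    if pe1 == pe:
        raise ValueError("DM is undefined when PE1 equals PE")
    return (pe0 - pe) / (pe1 - pe)


# The 13 degrees of matching evaluated in the text.
DM_GRID = [0.0, 0.5, 0.75, 0.9, 0.95, 1.0, 1.05, 1.1, 1.25, 1.5, 2.0, 2.5, 3.0]

# Illustrative inputs (assumed): PE = 0.10 in the population, PE1 = 0.24 in cases.
pe0_by_dm = {dm: control_exposure_prevalence(dm, pe1=0.24, pe=0.10) for dm in DM_GRID}
```

With these inputs, DM = 0 returns the population prevalence and DM = 1 returns the prevalence among cases. The range check reflects the constraint that some combinations of parameters and DM imply a control prevalence outside 0.0–1.0 and therefore cannot be realized.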
For all scenarios, we calculated the expected joint distributions of the environmental exposure and the genetic susceptibility in the population, in cases, and in frequency-matched controls with varying degrees of matching (refer to the Appendix). After we determined these expected distributions, 10,000 case-control studies with 400 cases and 400 controls (except when CCRATIO was the parameter varied) each were then simulated by using random numbers generated by the SAS software function RANUNI (version 6.12; SAS Institute, Inc., Cary, North Carolina) and the expected proportions as cutpoints to assign each of the cases and controls to one of the four possible categories of environmental exposure and genetic susceptibility.

Each simulated case-control study was analyzed by using unconditional logistic regression (the method of choice for both frequency-matched and unmatched studies) with the following independent variables: a binary variable reflecting the environmental exposure, a binary variable reflecting the genetic susceptibility, and their product term (interaction). If the absolute value of any of the regression coefficients was greater than 5, nonconvergence of the model was assumed and the model was excluded from the analysis. If 2 percent or more of the simulated studies met this criterion of nonconvergence, the corresponding scenario was not evaluated further.

The power of the studies to detect an interaction on disease risk was calculated by determining the proportion of simulated studies in which the two-sided p value (from the Wald statistic) for the null hypothesis regarding the estimated interaction effect was smaller than 0.05. For each degree of matching, the relative efficiency of estimation (in percent) compared with the unmatched design was obtained by dividing the variance of the logarithm of INT (i.e., the regression coefficient of interaction) across replications in the unmatched studies by the variance observed in the studies with the corresponding degree of matching (multiplied by 100).

RESULTS

The expected joint distributions of the environmental exposure and the genetic susceptibility among cases, unmatched controls (DM = 0), matched controls (DM = 1), and controls with higher proportions of the environmental exposure than those in cases (DM = 2 and DM = 3) under the basic scenario and three alternative scenarios are presented in table 2. Table 2 also shows the relative efficiency of the matched designs compared with the unmatched designs that was obtained from the simulation study. In the basic scenario, the stratum of controls with the environmental exposure and the genetic susceptibility would be expected to be very small, including only 2 percent (0.02) of unmatched controls. In traditional frequency matching with a degree of matching of 1.0, this percentage is more than twice as high (5 percent), leading to a relative efficiency compared with the unmatched design (100 percent) of 176 percent. In scenarios in which controls are selected to increase the prevalence of the environmental exposure beyond the one observed in cases (DM = 2.0 and DM = 3.0), 7 and 10 percent of controls can be expected to have the environmental exposure and the genetic susceptibility, and relative efficiency increases to 205 and 211 percent, respectively.

TABLE 2. Examples of the expected joint distributions of the exposure and the genetic susceptibility in the cases and controls with varying degrees of matching under the basic and the three alternative scenarios*

                                                  Controls: degree of matching
                            Cases          0.0 (Unmatched)  1.0 (“Matched”)  2.0            3.0
Genetic susceptibility      0      1       0      1         0      1         0      1       0      1

Basic scenario†
  Exposure 0                0.61   0.15    0.72   0.18      0.61   0.15      0.50   0.13    0.39   0.10
  Exposure 1                0.14   0.10    0.08   0.02      0.19   0.05      0.30   0.07    0.41   0.10
  RE‡                                      100%             176%             205%           211%

Alternative scenario I: PE§ = 0.05¶
  Exposure 0                0.70   0.17    0.76   0.19      0.70   0.17      0.63   0.16    0.57   0.14
  Exposure 1                0.07   0.06    0.04   0.01      0.10   0.03      0.17   0.04    0.23   0.06
  RE                                       100%             184%             239%           264%

Alternative scenario II: PG§ = 0.75¶
  Exposure 0                0.16   0.48    0.23   0.67      0.16   0.48      0.10   0.29    0.03   0.10
  Exposure 1                0.04   0.32    0.02   0.08      0.09   0.27      0.15   0.46    0.22   0.65
  RE                                       100%             167%             168%           121%

Alternative scenario III: ORED|g§ = 0.50¶
  Exposure 0                0.74   0.19    0.72   0.18      0.74   0.19      0.76   0.19    0.79   0.20
  Exposure 1                0.04   0.03    0.08   0.02      0.06   0.01      0.04   0.01    0.01   0.00
  RE                                       100%             85%              65%

* Expected cell frequencies have been rounded.
† Basic scenario using the parameter values presented in table 1.
‡ RE, relative efficiency compared with the unmatched design (100%) from 10,000 simulated case-control studies (refer to table 3).
§ PE, prevalence of environmental exposure; PG, prevalence of genetic susceptibility; ORED|g, odds ratio of exposure-disease association in the absence of the genetic susceptibility.
¶ Parameter changed; all other parameters correspond to the values for the basic scenario presented in table 1.
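The expected proportions reported for table 2 can be approximated with a short calculation. The Appendix is not reproduced in this section, so the following Python sketch rests on assumptions that should be read as such: the disease is rare, so that unmatched controls reflect the population; odds ratios act multiplicatively on the population cell proportions to give the distribution among cases; and controls sampled at a given DM keep the population distribution of the genetic susceptibility within each exposure stratum. All function names are hypothetical.

```python
import math


def population_cells(pe, pg, or_eg):
    """Population proportions keyed by (exposure, genetic susceptibility)."""
    if or_eg == 1.0:
        p11 = pe * pg  # independence
    else:
        # Solve p11 * p00 / (p10 * p01) = or_eg for p11 given the margins.
        a = or_eg - 1.0
        b = 1.0 + a * (pe + pg)
        p11 = (b - math.sqrt(b * b - 4.0 * a * or_eg * pe * pg)) / (2.0 * a)
    return {(0, 0): 1.0 - pe - pg + p11, (0, 1): pg - p11,
            (1, 0): pe - p11, (1, 1): p11}


def case_cells(pop, or_ed_g, or_gd_e, interaction):
    """Expected proportions among cases (rare-disease assumption)."""
    odds = {(0, 0): 1.0, (0, 1): or_gd_e, (1, 0): or_ed_g,
            (1, 1): or_ed_g * or_gd_e * interaction}
    weights = {cell: pop[cell] * odds[cell] for cell in pop}
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def control_cells(pop, cases, dm):
    """Expected proportions among controls selected with degree of matching dm."""
    pe = pop[(1, 0)] + pop[(1, 1)]
    pe1 = cases[(1, 0)] + cases[(1, 1)]
    pe0 = pe + dm * (pe1 - pe)
    if not 0.0 < pe0 < 1.0:
        raise ValueError("degree of matching not compatible with 0 < PE0 < 1")
    out = {}
    for (e, g), p in pop.items():
        stratum = pe if e == 1 else 1.0 - pe      # population share of the stratum
        target = pe0 if e == 1 else 1.0 - pe0     # share of the stratum in controls
        out[(e, g)] = p / stratum * target
    return out


# Basic scenario of table 1.
pop = population_cells(pe=0.10, pg=0.20, or_eg=1.0)
cases = case_cells(pop, or_ed_g=2.0, or_gd_e=1.0, interaction=3.0)
matched = control_cells(pop, cases, dm=1.0)
```

Under these assumptions, the basic-scenario proportions agree with the rounded entries of table 2 to within rounding, which suggests the assumptions are at least close to those used for the published calculation.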

The alternative scenarios presented were selected to represent three different patterns of the relative efficiency with increasing degree of matching. In the scenario in which prevalence of the exposure is low (PE = 0.05), relative efficiency increases monotonically with increasing degree of matching from 100 percent (DM = 0, reference scenario) to 184 percent (DM = 1), 239 percent (DM = 2), and 264 percent (DM = 3). In the scenario in which prevalence of the genetic susceptibility in the population is high (PG = 0.75), a different pattern is observed; the highest relative efficiency is achieved with a DM of between 1.0 (167 percent) and 2.0 (168 percent), followed by a decline in relative efficiency with increasing degrees of matching (DM = 3: relative efficiency = 121 percent). Finally, in the scenario in which the environmental exposure is protective in nonsusceptibles (ORED|g = 0.50), relative efficiency declines monotonically with increasing DM. However, in this scenario, the proportion of the environmental exposure in controls declines with increasing DM because of the inverse association between the environmental exposure and the disease, and a scenario in which DM is 3.0 could not be evaluated since less than 1 percent of controls had the environmental exposure and the genetic susceptibility. In all scenarios presented, relative efficiency is highest when the number of controls expected in the smallest of the four cells is the largest.

In table 3 we present results regarding power to detect a multiplicative interaction between the environmental exposure and the genetic susceptibility regarding risk of disease and the relative efficiency of estimating this interaction compared with the unmatched design for different degrees of matching (0.0–3.0) from 10,000 simulated case-control studies under various assumptions about the parameter values. In some scenarios, either a degree of matching of 2.0 or higher would not have been compatible with possible values of 0.0–1.0 for the prevalence of the matching factor in selected controls or logistic regression did not converge in 2 percent or more of the studies; therefore, results for these scenarios are not presented. In this table, the parameter values for the basic scenario are italicized and were held constant at the given level, when one of the other parameters was varied. The median estimated INT and the coverage of its 95 percent confidence interval were very close to the nominal level in all scenarios presented (data not shown).
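The power and relative-efficiency figures summarized in table 3 come from the simulation procedure described under Materials and Methods. A minimal sketch of that procedure in Python follows; it is not the SAS program used in the study. Three simplifications are assumptions of the sketch: Python's `random` module stands in for RANUNI, only 2,000 replications are run per design to keep the example fast, and the logistic model is fitted in closed form, which is possible because a model with two binary variables and their product term is saturated (its coefficients equal the corresponding log odds ratios from the cell counts, and the Wald variance of the interaction term is the sum of the reciprocal cell counts). Whether the nonconvergence rule also covered the intercept is not stated; the sketch includes it.

```python
import bisect
import math
import random
import statistics

Z_CRIT = 1.959964  # two-sided 5 percent critical value of the standard normal


def draw_counts(props, n, rng):
    """Assign n subjects to cells using cumulative proportions as cutpoints."""
    cuts = [sum(props[: i + 1]) for i in range(len(props) - 1)]
    counts = [0] * len(props)
    for _ in range(n):
        counts[bisect.bisect_right(cuts, rng.random())] += 1
    return counts


def fit_saturated(case_counts, control_counts):
    """Closed-form fit; cells ordered (exposure, gene) = 00, 01, 10, 11.

    Returns (log INT, Wald standard error), or None when the study is treated
    as nonconvergent (an empty cell or any |coefficient| > 5).
    """
    if min(case_counts + control_counts) == 0:
        return None
    log_odds = [math.log(a / b) for a, b in zip(case_counts, control_counts)]
    intercept = log_odds[0]
    b_gene = log_odds[1] - log_odds[0]
    b_exposure = log_odds[2] - log_odds[0]
    b_int = log_odds[3] - log_odds[2] - log_odds[1] + log_odds[0]
    if max(abs(c) for c in (intercept, b_gene, b_exposure, b_int)) > 5:
        return None
    se = math.sqrt(sum(1.0 / n for n in case_counts + control_counts))
    return b_int, se


def simulate(case_props, control_props, n_cases=400, n_controls=400,
             reps=2000, seed=1):
    """Power, variance of log INT, and nonconvergence rate for one design."""
    rng = random.Random(seed)
    estimates, significant, failed = [], 0, 0
    for _ in range(reps):
        fit = fit_saturated(draw_counts(case_props, n_cases, rng),
                            draw_counts(control_props, n_controls, rng))
        if fit is None:
            failed += 1
            continue
        b_int, se = fit
        estimates.append(b_int)
        if abs(b_int / se) > Z_CRIT:
            significant += 1
    return {"power": significant / len(estimates),
            "variance": statistics.variance(estimates),
            "nonconvergence": failed / reps}


# Basic scenario: expected proportions under a rare-disease assumption.
cases = [0.72 / 1.18, 0.18 / 1.18, 0.16 / 1.18, 0.12 / 1.18]
unmatched = [0.72, 0.18, 0.08, 0.02]                       # DM = 0
pe1 = cases[2] + cases[3]
matched = [(1 - pe1) * 0.8, (1 - pe1) * 0.2, pe1 * 0.8, pe1 * 0.2]  # DM = 1

res_unmatched = simulate(cases, unmatched)
res_matched = simulate(cases, matched)
relative_efficiency = 100.0 * res_unmatched["variance"] / res_matched["variance"]
```

Because the number of replications and the random number generator differ from those of the study, the resulting power and relative efficiency should be expected to resemble, not reproduce, the basic-scenario entries of table 3.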
In most scenarios, the relative efficiency of estimation compared with the unmatched design can be increased by traditional frequency matching (DM = 1.0). In many of the scenarios assessed, even greater gains can be achieved by using degrees of matching greater than 1.0. This additional gain particularly applies to scenarios in which the prevalence of the exposure is very low (PE = 0.05), in which there is an inverse association between the environmental exposure and the genetic susceptibility in the population (OREG = 0.5), or in which only 200 controls are sampled for 400 cases (CCRATIO = 0.5). The latter situation might arise, for example, when blood samples can be obtained easily from cases but only at great expense from controls.

To better illustrate these patterns, figures 1–7 present the results regarding the relative efficiency of estimation compared with the unmatched design (DM = 0.0) from table 3. With respect to the prevalence of the environmental exposure in the population (PE, figure 1), efficiency gains by matching for this exposure are most pronounced in the scenarios in which the exposure is rare and less pronounced or even absent when the exposure is more prevalent. Increasing the prevalence of the environmental exposure in selected controls beyond the one observed in cases (DM > 1.0) further increases efficiency when the exposure is rare. In the scenario in which PE = 0.05, efficiency increases monotonically with increasing degree of matching. In the basic scenario in which PE = 0.10, optimum DM seems to be about 2.5, with a slight decline in relative efficiency with a DM of 3.0.

Figure 2 shows the results for the scenarios in which we varied the prevalence of the genetic susceptibility in the population (PG). In all scenarios assessed, matching increases the efficiency of estimation. Efficiency increases monotonically with increasing DM in the scenario in which prevalence of the genetic susceptibility is low (PG = 0.10). With increasing prevalence of the genetic susceptibility, the optimum DM approaches 1.0 but is still 2.0 in the scenario in which PG = 0.50 and is about 1.5 in the scenario in which the genetic susceptibility is very prevalent (PG = 0.75).

The stronger the positive association between the environmental exposure and the genetic susceptibility in the population (OREG, figure 3), the less pronounced are the efficiency gains obtained by using a degree of matching greater than 1.0 compared with 1.0. Whereas efficiency increases monotonically with increasing DM in the scenario with a moderate, inverse association (OREG = 0.5), the highest efficiency gains are achieved with a DM of about 1.5 in the scenario with a strong positive association (OREG = 5.0).

With respect to the association between the environmental exposure and the disease in nonsusceptibles (ORED|g, figure 4), the patterns are less uniform. In the scenario with an inverse association (exposure protective in nonsusceptibles, ORED|g = 0.50), efficiency declines steadily with increasing DM (also refer to table 2). However, in the scenario in which there is no association (ORED|g = 1.00), efficiency increases steadily with increasing DM; the highest efficiency is obtained with a DM of 3.0.
In the scenario in which the association is stronger (ORED|g = 5.0), even greater efficiency gains can be expected by matching, but relative efficiency declines sharply with degrees of matching greater than approximately 1.0 (despite a further increase in the number of exposed susceptible controls, the number of unexposed susceptible controls becomes very small as DM increases).

In the scenarios we assessed in which there was an inverse association between the genetic susceptibility and the disease (ORGD|e = 0.2 or 0.5, figure 5), efficiency increases monotonically with increasing DM. With an increasing positive association between the genetic susceptibility and the disease, optimum DM approaches 2.0 in the scenarios assessed, and the decline in efficiency with higher degrees of matching becomes more pronounced.

With increasing positive interaction (INT, figure 6), efficiency gains obtainable by matching increase in the scenarios we assessed. The optimum DM is about 2.0 in the scenario in which the interaction is strong (INT = 5.0). In the scenarios with an inverse interaction, no interaction, or a moderate positive interaction, relative efficiency increases monotonically with increasing DM.

Finally, the efficiency gains obtained by matching in the scenarios assessed are most pronounced when CCRATIO is low (figure 7). When one control per case (CCRATIO = 1.0) or only half the number of controls as cases (CCRATIO = 0.5) can be recruited, increasing the degree of matching above 1.0 leads to a further increase in efficiency in the scenarios assessed. With increasing number of controls per case, efficiency gains obtained by matching are less pronounced, and there is little additional gain with increasing DM in the scenarios assessed.

TABLE 3. Results regarding the power and relative efficiency to detect and estimate an interaction compared with the unmatched design from 10,000 simulated case-control studies with 400 cases, as a function of the parameters and degree of matching

Each cell shows Power (%)/RE‡ (%) for the given degree of matching (DM); "-" marks scenarios for which no results are presented.

Parameter  PE1   PE
varied*    (%)†  (%)†  DM 0.0    DM 0.5    DM 0.75   DM 1.0    DM 1.25   DM 1.5    DM 2.0    DM 2.5    DM 3.0

PE
  0.05     13    5     35.3/100  54.2/142  58.3/164  62.2/184  64.9/202  67.5/216  70.5/239  72.6/253  74.0/264
  0.1      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  0.2      41    20    82.8/100  87.9/122  89.4/129  89.9/134  90.2/137  89.7/135  88.5/131  84.1/116  73.3/88
  0.3      55    30    88.0/100  89.4/109  89.1/110  88.9/110  88.1/108  86.9/104  78.5/86   55.0/45   -
  0.4      65    40    87.9/100  88.0/101  87.7/102  85.8/98   83.6/90   78.7/82   58.5/47   -         -
  0.5      74    50    85.4/100  84.1/99   82.1/95   78.8/89   74.2/78   67.1/65   -         -         -

PG
  0.1      21    10    36.7/100  50.6/123  54.0/138  56.7/150  58.8/165  60.7/172  63.3/189  64.8/199  65.2/205
  0.2      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  0.3      26    10    72.8/100  85.3/147  88.0/159  90.0/172  91.3/181  92.0/186  92.6/195  92.5/195  92.5/197
  0.5      31    10    75.8/100  89.3/146  91.4/159  93.0/168  93.9/174  94.3/178  94.7/179  93.8/174  92.5/160
  0.75     36    10    57.7/100  74.6/147  78.0/164  80.4/167  82.6/172  82.8/172  82.4/168  79.2/153  68.5/121

OREG
  0.5      22    10    46.2/100  60.5/138  65.1/155  68.3/174  70.4/187  72.5/201  75.4/221  76.8/236  77.6/245
  1.0      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  2.0      27    10    70.9/100  83.0/141  85.7/153  87.5/164  88.8/172  89.0/176  89.5/177  88.7/175  87.6/172
  5.0      31    10    70.9/100  83.3/139  85.9/150  87.1/156  87.7/160  87.9/160  86.7/154  84.4/145  79.8/128

ORED|g
  0.5      7     10    47.4/100  44.2/91   42.8/86   40.4/82   38.8/78   35.0/73   -         -         -
  1.0      14    10    56.8/100  60.7/113  62.2/118  63.3/120  64.1/122  66.4/133  68.1/141  69.9/148  71.1/154
  2.0      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  5.0      44    10    66.0/100  86.8/192  89.4/212  89.8/224  89.5/223  88.3/218  80.5/178  44.6/65   -

ORGD|e
  0.2      20    10    44.5/100  49.7/121  51.8/128  53.1/133  53.9/137  55.1/140  56.4/145  57.5/148  57.6/150
  0.5      21    10    57.4/100  67.2/135  70.1/147  72.5/157  73.7/164  74.9/169  76.5/177  77.4/185  78.0/189
  1.0      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  2.0      27    10    65.4/100  80.9/159  84.3/179  86.4/193  87.6/202  89.0/212  89.3/223  89.0/222  88.2/219
  5.0      32    10    64.6/100  81.2/166  84.4/182  86.5/195  87.6/204  88.0/209  87.7/212  85.7/204  80.3/177

INT
  0.5      17    10    21.6/100  23.3/116  24.1/122  24.4/128  25.0/132  25.2/136  26.3/140  27.6/144  28.2/146
  1.0      18    10    4.4/100   4.6/124   4.7/133   4.7/142   5.1/148   4.7/153   4.8/162   4.8/167   4.6/172
  2.0      21    10    27.5/100  34.3/137  37.0/152  39.0/163  41.0/171  42.0/178  43.7/188  44.9/199  45.7/202
  3.0      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  5.0      29    10    94.2/100  98.8/161  99.3/181  99.6/194  99.6/204  99.7/214  99.7/218  99.5/214  99.3/210

CCRATIO
  0.5      24    10    36.3/100  54.0/141  59.9/162  63.2/180  66.1/193  67.5/208  70.1/230  71.3/237  70.9/238
  1.0      24    10    63.7/100  76.1/147  79.7/164  81.6/176  83.2/185  84.6/192  86.0/205  86.6/212  86.3/211
  2.0      24    10    82.9/100  89.9/128  91.4/136  92.4/143  93.2/149  93.6/153  94.1/157  94.4/162  94.4/164
  3.0      24    10    90.0/100  93.8/121  94.8/129  95.4/133  95.7/136  96.0/139  96.4/142  96.4/144  96.3/143
  5.0      24    10    94.1/100  96.2/115  96.6/120  97.0/122  97.3/126  97.4/128  97.6/130  97.5/131  97.7/131

* For the definitions of these parameters, refer to table 1. Only one parameter is varied at a time, whereas all other parameters are kept constant at the level of the basic scenario (italic typeface) presented in table 1.
† Prevalence of environmental exposure (the matching factor) among cases (PE1) and in the population (PE).
‡ RE, relative efficiency.

FIGURE 1. Relative efficiency of estimation of interaction according to the degree of matching and the prevalence of the environmental exposure (the matching factor) in the population (PE).

FIGURE 2. Relative efficiency of estimation of interaction according to the degree of matching and the prevalence of the genetic susceptibility in the population (PG).

FIGURE 3. Relative efficiency of estimation of interaction according to the degree of matching and the association (odds ratio) of the environmental exposure and the genetic susceptibility in the population (OREG).

DISCUSSION

To our knowledge, this is the first study examining power and efficiency to detect and estimate gene-environment interactions according to the proportion of the environmental exposure (the matching factor) in selected controls, that is, the degree of matching in case-control studies with frequency matching for the environmental exposure. Starting from a basic scenario in which matching would usually be considered, we simulated a large number of case-control studies with a wide range of parameters and found that, whereas traditional frequency matching increased power and efficiency to detect and estimate multiplicative interactions in most scenarios assessed, even greater gains in power and efficiency can often be obtained by increasing the degree of matching above 1, that is, by increasing the prevalence of the environmental exposure in selected controls beyond the one observed in cases. Thus, the recently introduced concept of flexible matching strategies (7) may be particularly useful for assessing gene-environment interactions.

FIGURE 4. Relative efficiency of estimation of interaction according to the degree of matching and the association (odds ratio) of the environmental exposure with the disease in nonsusceptibles (ORED|g).

FIGURE 5. Relative efficiency of estimation of interaction according to the degree of matching and the association (odds ratio) of the genetic susceptibility with the disease in unexposed persons (ORGD|e).

Gene-environment interactions are of increasing interest in epidemiologic research, but adequate power is a major concern (1–3). We recently showed that matching on the environmental exposure may considerably increase the power and efficiency of case-control studies to detect and estimate interactions (6). Our present results indicate that this gain may be enhanced substantially by oversampling exposed controls in many situations. Since the validity of case-control studies is not influenced by the matching process (assuming that the matching factor is adequately controlled for in the analysis) (8), the highest efficiency at the lowest cost should be the criterion for choosing the optimum distribution of a matching factor in controls.

Our observation that the highest efficiency is often obtained by using a degree of matching greater than 1.0 will of course depend on our assumptions of the basic scenario, particularly a rare exposure that is a strong risk factor for the disease. If the prevalence of the exposure is high (e.g., PE = 0.5; refer to figure 1) or the exposure is protective in nonsusceptibles (ORED|g = 0.5; refer to figure 4), matching with a positive DM may also decrease efficiency (albeit in these situations, matching with a negative DM might be considered to increase efficiency).

Given the strong dependence of the power and efficiency gains by matching on the multiple parameters, general recommendations as to the best degree of matching in all settings are difficult, if not impossible. However, if some assumptions are made regarding the parameters presented, the expected distribution of the environmental exposure and genetic susceptibility in controls can be easily calculated for varying degrees of matching (refer to the Appendix), and analyses such as those presented in this paper may help to find the optimum DM for a specific setting. Although in practice not all of these parameters will always be available when the study is designed or controls are enrolled, approximative values for PE, PG, and ORED are usually available, and OREG can often be assumed to be 1 (independence of the environmental exposure from the genetic susceptibility in the population). Furthermore, our results suggest that the optimum degree of matching seems to be fairly independent of the remaining parameters usually unknown, that is, the quality and quantity of the gene-environment interaction (INT) as well as of the association between the genetic susceptibility and the disease (ORGD).

The assessment of interactions is dependent on the scale of the model, that is, additive or multiplicative (9, 10). Unfortunately, matching eliminates the ability to examine additive risk interactions in case-control studies (since the odds ratios of the exposure-disease association cannot be estimated unless further information regarding the population disease rate within factor levels or relative case-control sampling fractions within the factor levels are available) in contrast to multiplicative interactions, which can still be estimated. Whether justified or not, the multiplicative scale of interaction seems to be the one used most often in case-control studies of gene-environment interactions and is also the basis for the proposed case-only study design to detect gene-environment interactions (11).

FIGURE 6. Relative efficiency of estimation of interaction according to the degree of matching and the magnitude of interaction (INT).

FIGURE 7. Relative efficiency of estimation of interaction according to the degree of matching and the control-to-case ratio (CCRATIO).

We looked at matching for only a dichotomous variable. The situation might be more complex when matching for a continuous variable, for example, age. We also dichotomized the genetic susceptibility, although separate categorization of homozygotes and heterozygotes might be warranted in certain situations. Despite the wide range of parameters assessed, the generalizability of our results nevertheless is limited. Although we used 400 cases, we were unable, for example, to look at environmental exposures or genetic susceptibilities with a prevalence of less than 5 and 10 percent, respectively, since the percentage of unmatched studies without model convergence was too high. Nevertheless, a DM of 3.0 compared favorably with traditional frequency matching (DM = 1.0) in all of these “rare exposure” scenarios (data not shown), indicating that the trend toward higher efficiency with higher degrees of matching for small values of PE and PG observed may not be limited to the parameter range presented.

Finally, we did not consider general pitfalls or additional costs of matching, since they are either well known or depend strongly on the specific situation at hand. The decision concerning whether to match for an environmental factor in a specific case-control study should consider both pitfalls and costs. There might be situations in which the costs per control differ according to the presence or absence of the matching factor (12), which is likely to influence the optimum degree of matching. Other sampling strategies for controls, such as countermatching (13–17) or multistage sampling (18–22), might also offer the potential for substantial efficiency gains when surrogates of the exposure and/or the genetic susceptibility are available.
The efficiency of countermatching for both the environmental exposure and the genetic susceptibility in studies of gene-environment interactions was recently assessed for a variety of situations (17). Under the assumption that a sensitive and specific surrogate for the genetic susceptibility is available, the calculated efficiency gains were of the same magnitude as those observed when higher degrees of matching were used in our simulations for many parameter constellations. Whereas this assumption might often be realistic for family history and rare major susceptibility genes, it might be rarely met for common, low-penetrance genes. Thus, a degree of matching over 1 might be an interesting alternative to countermatching for the genetic susceptibility specifically in case-control studies of gene-environment interactions for complex chronic diseases.

In conclusion, our analyses suggest that traditional frequency matching on strong environmental risk factors for disease can increase the power and efficiency to detect and estimate gene-environment interactions in case-control studies and that even greater gains can often be anticipated by increasing the prevalence of the matching factor in controls beyond the one observed in cases. Flexible matching strategies, that is, matching strategies in which degrees of matching other than 1.0 are used, should be considered as possible means to enhance power and efficiency of case-control studies on gene-environment interactions beyond the level that can be achieved by using traditional study designs.

REFERENCES

1. Hwang SJ, Beaty TH, Liang KY, et al. Minimum sample size estimation to detect gene-environment interaction in case-control designs. Am J Epidemiol 1994;140:1029–37.
2. Foppa I, Spiegelman D. Power and sample size calculations for case-control studies of gene-environment interactions with a polytomous exposure variable. Am J Epidemiol 1997;146:596–604.
3. García-Closas M, Lubin HL. Power and sample size calculations in case-control studies of gene-environment interactions: comments on different approaches. Am J Epidemiol 1999;149:689–92.
4. Smith PG, Day NE. The design of case-control studies: the influence of confounding and interaction effects. Int J Epidemiol 1984;13:356–65.
5. Thomas DC, Greenland S. The efficiency of matching in case-control studies of risk-factor interactions. J Chronic Dis 1985;38:569–74.
6. Stürmer T, Brenner H. Potential gain in power to detect gene-environment interactions by matching in case-control studies. Genet Epidemiol 2000;18:63–80.
7. Stürmer T, Brenner H. Degree of matching and gain in power and efficiency in case-control studies. Epidemiology 2001;12:101–8.
8. Miettinen OS. Matching and design efficiency in retrospective studies. Am J Epidemiol 1970;91:111–18.
9. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic research. New York, NY: Van Nostrand Reinhold, 1982.
10. Rothman KJ, Greenland S, eds. Modern epidemiology. 2nd ed. Philadelphia, PA: Lippincott-Raven, 1998.
11. Khoury MJ, Flanders WD. Nontraditional epidemiologic approaches in the analysis of gene-environment interaction: case-control studies with no controls! Am J Epidemiol 1996;144:207–13.
12. Nam JM, Fears TR. Optimum allocation of samples in strata-matching case-control studies when cost per sample differs from stratum to stratum. Stat Med 1990;9:1475–83.
13. Langholz B, Clayton D. Sampling strategies in nested case-control studies. Environ Health Perspect 1994;102(suppl 8):47–51.
14. Langholz B, Borgan O. Counter-matching: a stratified nested case-control sampling method. Biometrika 1995;82:69–79.
15. Steenland K, Deddens JA. Increased precision using countermatching in nested case-control studies. Epidemiology 1997;8:238–42.
16. Cologne JB. Counterintuitive matching. (Editorial). Epidemiology 1997;8:227–9.
17. Andrieu N, Goldstein AM, Thomas DC, et al. Counter-matching in studies of gene-environment interaction: efficiency and feasibility. Am J Epidemiol 2001;153:265–74.
18. Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol 1988;128:1198–206.
19. Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Stat Med 1991;10:739–47.
20. Zhao P, Lipsitz S. Designs and analysis of two-stage studies. Stat Med 1992;11:769–82.
21. Schill W, Drescher K. Logistic analysis of studies with two-stage sampling: a comparison of four approaches. Stat Med 1997;16:117–32.
22. Whittemore AS, Halpern J. Multi-stage sampling in genetic epidemiology. Stat Med 1997;16:153–67.

APPENDIX

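Let $p_{ij}$, $p_{ijc}$, and $p_{ijm}$ be the proportion of persons with level $i$ of the environmental exposure (the matching factor; $i = 1$ if the exposure is present, $i = 0$ if it is absent) and level $j$ of the genetic susceptibility ($j = 1$ if the susceptibility is present, $j = 0$ if it is absent) in the population, among cases, and in frequency-matched controls, respectively.

Solving the following system of four equations with four unknown parameters

$$
\begin{aligned}
1 &= p_{11} + p_{00} + p_{10} + p_{01} \\
P_E &= p_{11} + p_{10} \\
P_G &= p_{11} + p_{01} \\
\mathrm{OR}_{EG} &= (p_{11} \times p_{00}) / (p_{10} \times p_{01})
\end{aligned}
$$

leads to the expected values of $p_{ij}$ in the population:

$$
p_{11}\ (\mathrm{OR}_{EG} = 1) = P_E \times P_G
$$

$$
p_{11}\ (\mathrm{OR}_{EG} > 1) = -\frac{1 - (1 - \mathrm{OR}_{EG})(P_E + P_G)}{2\,(1 - \mathrm{OR}_{EG})} - \sqrt{\left[\frac{1 - (1 - \mathrm{OR}_{EG})(P_E + P_G)}{2\,(1 - \mathrm{OR}_{EG})}\right]^2 + \frac{\mathrm{OR}_{EG} \times P_E \times P_G}{1 - \mathrm{OR}_{EG}}}
$$

$$
p_{11}\ (\mathrm{OR}_{EG} < 1) = -\frac{1 - (1 - \mathrm{OR}_{EG})(P_E + P_G)}{2\,(1 - \mathrm{OR}_{EG})} + \sqrt{\left[\frac{1 - (1 - \mathrm{OR}_{EG})(P_E + P_G)}{2\,(1 - \mathrm{OR}_{EG})}\right]^2 + \frac{\mathrm{OR}_{EG} \times P_E \times P_G}{1 - \mathrm{OR}_{EG}}}
$$

$$
\begin{aligned}
p_{01} &= P_G - p_{11} \\
p_{10} &= P_E - p_{11} \\
p_{00} &= 1 - (p_{01} + p_{10} + p_{11}).
\end{aligned}
$$

Solving the following system of four equations with four unknown parameters

$$
\begin{aligned}
1 &= p_{11c} + p_{00c} + p_{10c} + p_{01c} \\
\mathrm{OR}_{ED|g} &= (p_{10c} \times p_{00}) / (p_{00c} \times p_{10}) \\
\mathrm{OR}_{ED|G} &= (p_{11c} \times p_{01}) / (p_{01c} \times p_{11}) \\
\mathrm{OR}_{GD|e} &= (p_{01c} \times p_{00}) / (p_{00c} \times p_{01})
\end{aligned}
$$

then leads to the expected values of $p_{ijc}$ in cases:

$$
\begin{aligned}
p_{00c} &= 1 / \left[(\mathrm{OR}_{ED|G} \times p_{11}/p_{01} + 1) \times \mathrm{OR}_{GD|e} \times p_{01}/p_{00} + \mathrm{OR}_{ED|g} \times p_{10}/p_{00} + 1\right] \\
p_{01c} &= p_{00c} \times \mathrm{OR}_{GD|e} \times p_{01}/p_{00} \\
p_{10c} &= p_{00c} \times \mathrm{OR}_{ED|g} \times p_{10}/p_{00} \\
p_{11c} &= p_{01c} \times \mathrm{OR}_{ED|G} \times p_{11}/p_{01}.
\end{aligned}
$$

Let $P_{E1} = p_{10c} + p_{11c}$ be the prevalence of the environmental exposure among cases, let $P_{E0} = p_{10m} + p_{11m}$ be the prevalence of the exposure among matched controls, and let $\mathrm{DM} = (P_{E0} - P_E)/(P_{E1} - P_E)$ be the degree of matching. Solving the following system of four equations with four unknown parameters

$$
\begin{aligned}
\mathrm{OR}_{GD|e} &= (p_{01c} \times p_{00m}) / (p_{00c} \times p_{01m}) \\
\mathrm{DM} &= \left[(p_{10m} + p_{11m}) - P_E\right] / \left[(p_{10c} + p_{11c}) - P_E\right] \\
\mathrm{OR}_{EG} &= (p_{11m} \times p_{00m}) / (p_{10m} \times p_{01m}) \\
1 &= p_{11m} + p_{00m} + p_{10m} + p_{01m}
\end{aligned}
$$

then leads to the expected values of $p_{ijm}$ in controls:

$$
\begin{aligned}
p_{00m} &= \mathrm{OR}_{GD|e} \times p_{00c} \times \left\{1 - \left[\mathrm{DM} \times (p_{10c} + p_{11c} - P_E)\right] - P_E\right\} / (p_{01c} + \mathrm{OR}_{GD|e} \times p_{00c}) \\
p_{01m} &= 1 - \left\{\left[\mathrm{DM} \times (p_{10c} + p_{11c} - P_E)\right] + P_E\right\} - p_{00m} \\
p_{11m} &= \mathrm{OR}_{EG} \times p_{01m} \times (1 - p_{00m} - p_{01m}) / (\mathrm{OR}_{EG} \times p_{01m} + p_{00m}) \\
p_{10m} &= 1 - (p_{11m} + p_{00m} + p_{01m}).
\end{aligned}
$$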
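
The closed-form solutions above can be checked numerically. The following Python sketch is illustrative only and is not the authors' simulation code: the function names (`population_cells`, `case_cells`, `control_cells`) and argument names (`or_ed_g0`, `or_ed_g1`, `or_gd_e0` for the odds ratios among the nonsusceptible, the susceptible, and the unexposed, respectively) are hypothetical, and the example parameter values are arbitrary. It assumes that the root of the quadratic for p11 is the one that yields valid proportions, and it rejects degrees of matching that would imply an exposure prevalence among controls outside the interval (0, 1).

```python
import math


def population_cells(pe, pg, or_eg):
    """Expected population proportions, keyed by (i, j) = (exposure, susceptibility)."""
    if math.isclose(or_eg, 1.0):
        p11 = pe * pg  # independence of exposure and susceptibility
    else:
        c = 1.0 - or_eg
        h = (1.0 - c * (pe + pg)) / (2.0 * c)
        root = math.sqrt(h * h + or_eg * pe * pg / c)
        # Pick the root that gives a proportion between 0 and min(pe, pg).
        p11 = -h - root if or_eg > 1.0 else -h + root
    p01 = pg - p11
    p10 = pe - p11
    p00 = 1.0 - (p01 + p10 + p11)
    return {(1, 1): p11, (1, 0): p10, (0, 1): p01, (0, 0): p00}


def case_cells(p, or_ed_g0, or_ed_g1, or_gd_e0):
    """Expected proportions among cases, given population cells p and the odds ratios."""
    ratio_g = or_gd_e0 * p[0, 1] / p[0, 0]   # p01c / p00c
    ratio_e = or_ed_g0 * p[1, 0] / p[0, 0]   # p10c / p00c
    ratio_eg = or_ed_g1 * p[1, 1] / p[0, 1]  # p11c / p01c
    p00c = 1.0 / ((ratio_eg + 1.0) * ratio_g + ratio_e + 1.0)
    p01c = p00c * ratio_g
    p10c = p00c * ratio_e
    p11c = p01c * ratio_eg
    return {(1, 1): p11c, (1, 0): p10c, (0, 1): p01c, (0, 0): p00c}


def control_cells(pc, pe, or_eg, or_gd_e0, dm):
    """Expected proportions among controls for degree of matching dm."""
    pe1 = pc[1, 0] + pc[1, 1]      # exposure prevalence among cases
    pe0 = dm * (pe1 - pe) + pe     # exposure prevalence among controls
    if not 0.0 < pe0 < 1.0:
        raise ValueError("degree of matching implies an invalid exposure prevalence")
    p00m = or_gd_e0 * pc[0, 0] * (1.0 - pe0) / (pc[0, 1] + or_gd_e0 * pc[0, 0])
    p01m = 1.0 - pe0 - p00m
    p11m = or_eg * p01m * (1.0 - p00m - p01m) / (or_eg * p01m + p00m)
    p10m = 1.0 - (p11m + p00m + p01m)
    return {(1, 1): p11m, (1, 0): p10m, (0, 1): p01m, (0, 0): p00m}


if __name__ == "__main__":
    # Arbitrary illustrative parameters, not taken from the paper's scenarios.
    pop = population_cells(pe=0.3, pg=0.2, or_eg=2.0)
    cases = case_cells(pop, or_ed_g0=2.0, or_ed_g1=4.0, or_gd_e0=1.5)
    for dm in (0.0, 1.0, 1.5):
        controls = control_cells(cases, pe=0.3, or_eg=2.0, or_gd_e0=1.5, dm=dm)
        print(dm, {k: round(v, 4) for k, v in controls.items()})
```

Under these formulas, DM = 0 reproduces the population proportions among controls (no matching), and DM = 1 sets the exposure prevalence among controls equal to that among cases (traditional frequency matching), which provides two simple sanity checks on any implementation.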
