Air pollution-mortality models: A demonstration of the effects of random measurement error

Quality and Quantity, 21: 37-48 (1987) © Martinus Nijhoff Publishers (Kluwer), Dordrecht - Printed in the Netherlands

KENNETH A. BOLLEN ¹ & RICHARD C. SCHWING ²
¹ University of North Carolina at Chapel Hill; ² General Motors Research Laboratories

Abstract

Errors of measurement have long been recognized as a chronic problem in statistical analysis. Although there is a vast statistical literature of multiple regression models estimating the air pollution-mortality relationship, this problem has been largely ignored. It is well known that pollution measures contain error, but the consequences of this error for regression estimates are not known. We use Lave and Seskin's air pollution model to demonstrate the consequences of random measurement error. We assume that between 0% and 50% of the variance of the pollution measures is due to error. We find large differences in the estimated effects on mortality of the pollution variables, as well as of the other explanatory variables, once this measurement error is taken into account. These results cast doubt on the usual regression estimates of the mortality effects of air pollution. More generally, our results demonstrate the consequences of random measurement error in the explanatory variables of a multiple regression analysis and the misleading conclusions that may result in policy research if this error is ignored.

Introduction

A crucial step in policy research is to estimate the impact of the "target" variable, which is to be changed, on the dependent variable. For instance, your dependent variable might be automobile fatalities, which you hope to diminish through the target variable of reduced alcohol consumption by drivers. Before instituting policies to reduce alcohol consumption, the influence of intoxicated drivers on fatalities must be estimated. Or, to take another example, air pollution policies need to be based on the mortality and health effects of the specific pollutants targeted for control. Virtually all of these effects must be estimated in a nonexperimental situation where multivariate relations are typical. Regression techniques have proved to be a powerful tool for statistically controlling the potentially confounding factors involved in these nonexperimental settings.

Despite the widespread use of regression analysis in policy research, more often than not these analyses fail to consider the impact that random measurement error in the explanatory variables has on regression estimates. For example, in the relationship between health status and air pollution, it has been necessary to use pollution variables which contain both random and nonrandom measurement error. Deficiencies in technique, chemical interferences, random weather fluctuations, and human error all detract from the quality of pollution measurements. The consequences of measurement error for regression models using these deficient data are a natural concern.

The bivariate regression model with the explanatory variable contaminated by random measurement error has been well studied in the general statistical literature. In air pollution/health research, Lave and Seskin (1977) showed that random measurement error leads to underestimates of the coefficient relating pollution and human health for the two variable case. The more realistic situation of multiple regression models with one or more explanatory variables measured with error has received less attention (cf. Pickles, 1982).

This paper has a dual purpose. The more general one is to demonstrate the consequences of random measurement error in multiple regression models used in policy research. Although some researchers are aware of the consequences as presented analytically, there are few illustrations of these results with actual data. We show the changes in regression estimates that can occur when one or two variables in a multiple regression are measured with random error. Our second purpose is more specific. It is directed toward research on the air pollution-health relationship. Our illustration of the sometimes large effects of random measurement error on the estimates of the associations between pollution and health points to the need for further research on the extent of measurement error in air pollution measures. Continued neglect of this error will generate potentially misleading evaluations of the health effects of air pollution.

Consequences of random measurement error

The two major types of measurement error are nonrandom and random. Some forms of nonrandom error are surely present in air pollution variables (cf. EPA, 1980) and these can seriously distort analyses.
However, to evaluate the effects of nonrandom measurement error we need to know which of the many forms it is likely to take (e.g., see Blalock, 1970). For our purposes we ignore nonrandom errors and demonstrate the significant disruptive effects that random error may have. Our measurement model relates the true, unobserved (latent) variable ($\xi$) to the observed variable ($X$) as:

$$X = \xi + \delta \qquad [1]$$

(Here and throughout the paper we have omitted the observation index for all variables. For example, we write $X$ instead of $X_j$ with $j = 1, 2, \ldots, N$, where $N$ is the total number of observations.) In [1] all variables including $\xi$ are random variables. The $\delta$ term represents random measurement error and it is assumed to be nonautocorrelated, homoskedastic, and to have an expected value of zero. In our case $X$ is a pollution measure, $\xi$ is the actual pollution level, and $\delta$ is a random measurement error that leads $X$ to be less than perfectly correlated with $\xi$. We assume that $\xi$ and $\delta$ are independently distributed. To simplify the algebra, $\xi$ is written in deviation form so that it has a mean of zero and so that the constant term in [1] is implicitly zero. The relationship between the random explanatory variable, $\xi$, and a dependent variable, $\eta$, is:

$$\eta = \gamma \xi + \zeta \qquad [2]$$

where $\eta$ = true dependent variable, $\xi$ = true independent variable, $\zeta$ = random disturbance, $\gamma$ = regression coefficient. Equation [2] formulates the relationship between the true (or latent) variables of the model unobscured by random measurement error. The random variables $\eta$ and $\xi$ are the perfectly measured (latent) dependent and independent variables. The random disturbance (i.e., $\zeta$) between the latent variables is not one of measurement error but rather one of "error in the equation". That is, the random disturbance represents the nondeterministic nature of most relationships that exists even if the errors of measurement are completely removed. As usual we assume that the expected value of $\zeta$, the error in the equation, is zero, and that it is nonautocorrelated and homoskedastic. We also assume that $\xi$, $\zeta$, and $\delta$ are independently distributed. The $\eta$ variable, like $\xi$, is written in deviation form so that the constant term of [2] is implicitly zero.

Equations [1] and [2] combined represent a simple case of a "latent variable model", described in [2], along with a "measurement model", detailed in [1]. A generalization of these two types of models, referred to as "LISREL models", is well known in sociology and the other social sciences. The interested reader may refer to Jöreskog (1973) and Jöreskog and Sörbom (1978) for further details. For now we stay with the simple model presented in [1] and [2]. In practice most researchers implicitly assume that the observed variables, $X$ and $Y$, are perfect measures of $\xi$ and $\eta$. So instead of estimating equation [2] to obtain an estimate of $\gamma$, equation [3] is used:

$$Y = \gamma^* X + \zeta^* \qquad [3]$$

Random measurement error in $Y$ alone does not affect the consistency of the regression estimate $\hat{\gamma}^*$, so to simplify we assume no measurement error in $Y$ (i.e., $Y = \eta$). The critical question is how "close" $\hat{\gamma}^*$ is to $\gamma$. As is well known (cf. Johnston, 1972: 282-83), as $N$ goes to infinity $\hat{\gamma}^*$ converges to:

$$\mathrm{plim}\, \hat{\gamma}^* = \gamma\, b_{\xi X} \qquad [4]$$

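where $b_{\xi X}$ is the regression coefficient of $\xi$ on $X$ and plim designates the probability limit. In this case $b_{\xi X}$ is equal to the "reliability" ($\rho_{XX}$) of $X$, where

$$\rho_{XX} = \frac{\mathrm{Var}(\xi)}{\mathrm{Var}(X)} \qquad [5]$$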

The reliability of $X$ is the ratio of the true score variance ($\mathrm{Var}(\xi)$) to the observed score variance ($\mathrm{Var}(X)$). This can also be interpreted as the squared correlation between the perfectly measured variable ($\xi$) and the observed measure ($X$): the greater the correlation, the higher the reliability. Also note that the observed score variance ($\mathrm{Var}(X)$) is the sum of the true score variance and the error variance (i.e., $\mathrm{Var}(\xi) + \mathrm{Var}(\delta)$). The reliability coefficient must vary between 0 and 1. This means that $b_{\xi X}$ from [4] is between 0 and 1 and that $\hat{\gamma}^*$ tends to underestimate $\gamma$. Thus, if the $X$ variable contains random error we have an inconsistent estimate of $\gamma$ that is "biased" toward zero. In this simple case dividing $\hat{\gamma}^*$ by the reliability of $X$ corrects for the inconsistency (i.e., asymptotically $(1/\rho_{XX})\hat{\gamma}^* = \gamma$).

A more realistic situation is one in which multiple regression is used. For virtually all epidemiological models more than one explanatory variable is needed. The consequences of measurement error in multiple regression are complicated (cf. Blalock 1965; Johnston 1972: 281-83), yet some generalizations are possible. Consider the case where all the explanatory variables but one are perfectly measured. The equation of interest with perfectly measured variables is:

$$\eta = \gamma_1 \xi_1 + \gamma_2 \xi_2 + \gamma_3 \xi_3 + \cdots + \zeta \qquad [6]$$

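The estimated equation with $X_1$ as an imperfect measure of $\xi_1$, but all other variables perfectly measured, is:

$$\eta = \gamma_1^* X_1 + \gamma_2^* \xi_2 + \gamma_3^* \xi_3 + \cdots + \zeta^* \qquad [7]$$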

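The first question of interest is how the $\hat{\gamma}_1^*$ associated with the $X_1$ measure compares to $\gamma_1$, the coefficient associated with the true variable. Based on Theil's (1957; 1971) and Griliches' (1957) specification analysis techniques, it can be shown that the relationship of $\hat{\gamma}_1^*$ to $\gamma_1$ is:

$$\mathrm{plim}\, \hat{\gamma}_1^* = \gamma_1\, b_{\xi_1 X_1 \cdot \xi_2 \xi_3 \ldots} \qquad [8]$$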

In [8] the $b_{\xi_1 X_1 \cdot \xi_2 \xi_3 \ldots}$ represents the partial regression coefficient associated with $X_1$ when $\xi_1$ is regressed on all the explanatory variables in [7]. Levi (1973) shows that under our conditions the absolute value of this coefficient lies between 0 and 1. This means that in a multiple regression where all but one of the variables are perfectly measured, the coefficient of the proxy variable with random error is asymptotically biased toward zero, and hence $\hat{\gamma}_1^*$ underestimates $\xi_1$'s effect on $\eta$. In general, $b_{\xi_1 X_1 \cdot \xi_2 \xi_3 \ldots}$ will not equal the reliability of $X_1$. Thus, we cannot expect that $\hat{\gamma}_1^*$ divided by $\rho_{X_1 X_1}$ will correct the inconsistency of $\hat{\gamma}_1^*$, as is true in the two variable case.

If the impact of random measurement error in this situation were limited only to the coefficient associated with the proxy variable ($\hat{\gamma}_1^*$), then the problem might be manageable. However, unless the omitted true variable $\xi_1$ is uncorrelated with the other $\xi$'s, the $\hat{\gamma}^*$'s associated with the perfectly measured $\xi$'s may also be asymptotically biased. Since in most epidemiological data it is extremely unlikely that $\xi_1$ is uncorrelated with the other $\xi$'s, our estimates of the effects of the errorless variables are also distorted. To illustrate, consider $\hat{\gamma}_2^*$ estimated from equation [7]. Its large sample value is:

$$\mathrm{plim}\, \hat{\gamma}_2^* = \gamma_2 + \gamma_1\, b_{\xi_1 \xi_2 \cdot X_1 \xi_3 \ldots} \qquad [9]$$

Rather than $\hat{\gamma}_2^*$ estimating $\gamma_2$, it will tend to reflect $\gamma_2$ plus the relationship of $\xi_1$ to $\xi_2$ times $\gamma_1$. The direction of this asymptotic bias is difficult to predict in advance since it depends on the overall magnitude and sign of the second term on the right-hand side of [9]. An analogous situation exists for the other $\hat{\gamma}^*$'s. The practical implications of these results are illustrated in the next section.
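The large-sample results of this section can be checked numerically. The following is a minimal simulation sketch, not part of the original analysis: the sample size, the correlation of 0.6 between the two latent variables, the true coefficients of 1.0, and the function names `ols` and `simulate` are all illustrative assumptions. It regresses a dependent variable on an error-laden proxy for the first latent variable, first alone and then together with a perfectly measured second variable, and compares the estimates with the limits implied by the attenuation and specification-bias arguments above.

```python
import numpy as np

def ols(X, y):
    """Least-squares slopes of y on the columns of X (all variables are mean-zero)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate(n=200_000, reliability=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Latent xi1, xi2: unit variances, correlation 0.6 (hypothetical values).
    xi = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
    xi1, xi2 = xi[:, 0], xi[:, 1]
    # Measurement model: X1 = xi1 + delta, with Var(delta) chosen so that
    # Var(xi1) / Var(X1) equals the requested reliability.
    var_delta = (1.0 - reliability) / reliability
    x1 = xi1 + rng.normal(0.0, np.sqrt(var_delta), n)

    # Two-variable case: eta = 1.0 * xi1 + zeta, regressed on X1 only.
    eta_biv = xi1 + rng.normal(0.0, 1.0, n)
    g_biv = ols(x1[:, None], eta_biv)[0]

    # Multiple regression: eta = 1.0 * xi1 + 1.0 * xi2 + zeta, regressed on X1 and xi2.
    eta = xi1 + xi2 + rng.normal(0.0, 1.0, n)
    g1, g2 = ols(np.column_stack([x1, xi2]), eta)

    # Population regression of xi1 on (X1, xi2) gives the partial coefficients
    # that scale gamma1 in the large-sample expressions.
    S = np.array([[1.0 + var_delta, 0.6], [0.6, 1.0]])  # covariance of (X1, xi2)
    b = np.linalg.solve(S, np.array([1.0, 0.6]))        # covariances of (X1, xi2) with xi1
    return {"bivariate": g_biv, "gamma1_star": g1, "gamma2_star": g2,
            "limit_gamma1": 1.0 * b[0], "limit_gamma2": 1.0 + 1.0 * b[1]}

if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed values the population limits are 0.5 for the two-variable slope (the reliability itself), 0.64/1.64, or about 0.39, for the proxy's coefficient in the multiple regression, and 1 + 0.6/1.64, or about 1.37, for the perfectly measured variable. The second figure shows why dividing by the reliability no longer corrects the proxy's coefficient, and the third shows the bias spreading to a variable that contains no error.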

An example

One of the most widely known models of air pollution's association with mortality is that of Lave and Seskin (1970). This seminal study has had immense impact in that it motivated a number of exploratory statistical studies (see McDonald and Schwing 1973; Whittenberger 1976; Tukey 1976; Gibbons and McDonald 1980) as well as attracted the attention and study of policy analysts from the National Commission on Air Quality Benefits Panel and the National Academy of Sciences (see Ricci and Wyzga 1979; Graves, Krumm, and Violette 1979; Thibodeau et al. 1980; Lipfert 1980). None of the studies cited above has included a quantification of the effects of random measurement error.

We examine Lave and Seskin's (1977) seven variable model. The dependent variable is the Crude Death Rate (sometimes called the Total Mortality Rate), while the independent variables are SMIN (smallest biweekly sulfate reading), PMEAN (arithmetic mean of biweekly suspended particulate readings), PM2 (SMSA population density), GE65 (% of SMSA population at least 65 years old), PNOW (% of nonwhites in SMSA population), POOR (% of SMSA families with incomes below poverty line), and LOGPOP (log of SMSA population). SMIN through LOGPOP correspond to $\xi_1$ through $\xi_7$ in the notation used in equation [6]. The crude death rate is $\eta$. SMSA stands for Standard Metropolitan Statistical Area. Air pollution is indicated by SMIN and PMEAN. We use the sample of cases used by Gibbons and McDonald

0.064 to 0.017 if half the variance in $X_1$ is attributable to random measurement error. The general observation from Table 1 is that even random error in one explanatory variable can affect the regression coefficients of many of the other variables, sometimes quite seriously and in opposite directions.

Table 2 is constructed in the same manner as Table 1 except that we now assume that $X_1$ is a perfect measure of SMIN but $X_2$ as a measure of PMEAN has varying reliabilities from 1.0 to 0.5. We find that the regression coefficient ($\hat{\gamma}_2$) for PMEAN increases as the degree of measurement error increases and this measurement error is taken into consideration in the estimation. For instance, the coefficient for $X_2$ obtained when measurement error is not considered ($\hat{\gamma}_2^* = 0.090$) is 13.5% less than the coefficient obtained ($\hat{\gamma}_2 = 0.104$) when a 0.9 reliability is taken into account. If $X_2$'s reliability is 0.5, the uncorrected coefficient underestimates the corrected coefficient of PMEAN ($\hat{\gamma}_2 = 0.257$) by 65%. Table 2 also shows that the effect of random measurement error in one variable is not isolated to the variable with error. The effect is dispersed across all the variables. The regression coefficient ($\hat{\gamma}_5$) for PNOW, for instance, increases from 0.370 when all variables are assumed perfectly measured to 0.439 when $X_2$ is assumed to have a reliability of 0.5. The coefficient of SMIN diminishes from 0.107 to 0.062 when $X_2$'s reliability is assumed to move from 1.0 to 0.5. We see, then, that even random measurement error in a single variable can seriously affect the estimates of many of the coefficients. As was true in the previous case, the squared multiple correlation coefficient changes only slightly when the random measurement error is considered. It moves from a low of 0.839 to a high of 0.851 when $X_2$'s reliability is assumed to be 0.5.

The most likely situation is that measurement error is present not in only one of the pollution measures, but in both $X_1$ and $X_2$. This raises the question of the impact that allowing for measurement error in both has on the estimated effects. This more complicated question is briefly examined in Table 3, where the reliability of $X_1$ and $X_2$ is simultaneously varied from 1.0 to 0.5, while assuming uncorrelated errors of measurement. With measurement error in both variables, Table 3 shows that the estimated effects (i.e., $\hat{\gamma}_1$ and $\hat{\gamma}_2$) of SMIN and PMEAN both increase. This differs from Tables 1 and 2, where the variable assumed to be measured with error had an increased effect while the effect of the other pollution measure diminished. These results suggest that if both pollution variables have random measurement error then the effects on mortality of both variables are underestimated when this error is ignored. Note also that the coefficients of the other variables are affected by measurement error in the two pollution variables. In some cases the effect is small (e.g., POOR) while in other cases the change is more substantial (e.g., PM2). Also, these other effects can be either toward zero or away from zero. Once again a slight increase in $R^2$ occurs when measurement error is considered. The $R^2$ has a maximum of 0.857 when $X_1$ and $X_2$ have reliabilities of 0.5 and a minimum of 0.839 if their reliabilities are assumed to be 1.0.
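The text does not spell out how an assumed reliability is taken into account in the estimation. One common approach, sketched here as an assumption rather than as the authors' procedure, is to subtract the implied error variance, (1 - reliability) times Var(X), from the diagonal of the covariance matrix of the explanatory variables before solving the normal equations. The function names `corrected_coefficients` and `synthetic_data`, and the synthetic data themselves, are hypothetical; the data are not the Lave and Seskin sample.

```python
import numpy as np

def corrected_coefficients(X, y, reliabilities):
    """Regression slopes after removing assumed error variance from each Var(X_j).

    reliabilities[j] is the assumed ratio of true to observed variance for
    column j; a value of 1.0 leaves that column untouched (ordinary OLS).
    Measurement errors are assumed uncorrelated with each other.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n = len(yc)
    S_xx = Xc.T @ Xc / (n - 1)          # observed covariance of explanatory variables
    s_xy = Xc.T @ yc / (n - 1)          # covariance of each explanatory variable with y
    rel = np.asarray(reliabilities, dtype=float)
    error_var = (1.0 - rel) * np.diag(S_xx)
    S_true = S_xx - np.diag(error_var)  # implied covariance of the latent variables
    return np.linalg.solve(S_true, s_xy)

def synthetic_data(n=200_000, seed=1):
    """Hypothetical data: two correlated latent variables, the first observed with error."""
    rng = np.random.default_rng(seed)
    xi = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
    y = xi[:, 0] + xi[:, 1] + rng.normal(0.0, 1.0, n)  # true coefficients 1.0 and 1.0
    X = xi.copy()
    X[:, 0] += rng.normal(0.0, 1.0, n)                 # error variance 1, reliability 0.5
    return X, y

if __name__ == "__main__":
    X, y = synthetic_data()
    print("ignoring error: ", corrected_coefficients(X, y, [1.0, 1.0]))
    print("reliability 0.5:", corrected_coefficients(X, y, [0.5, 1.0]))
```

In this synthetic case the correction raises the coefficient of the error-laden variable and lowers that of the errorless one, the same qualitative pattern reported for Tables 1 and 2. The correction is only as good as the assumed reliabilities, which is why the tables vary them over a range.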


Conclusions

Regression-based models of air pollution's effects on mortality have received considerable attention. The results of these analyses have contributed to the formulation of policy decisions and regulations. Our analysis has raised a problem that has been largely ignored in these models. Although it is known that pollution measures contain random error, the consequences of this error for the regression estimates have not been previously examined. Our paper has demonstrated that random measurement error can alter our evaluation of pollution effects. These results were described analytically as well as illustrated with a well-known air pollution model by Lave and Seskin (1977). We find that assuming plausible amounts of random measurement error in Lave and Seskin's pollution measures can lead to large increases in the estimated effect of pollution on mortality for the variable measured with error, while leading to large decreases in the estimated effect of pollution on mortality for a pollution variable measured without error. Assuming both pollution variables are measured with error leads to the finding that the estimated effect of pollution on mortality can increase for both variables. In addition, the coefficients for the remaining explanatory variables not containing error are altered. These results are disturbing since they weaken our confidence in the usual regression estimates of the pollution/mortality relationship.

We caution the reader that our results are meant to illustrate the effects of random measurement error and not to present definitive estimates of pollution's effects on mortality. It would be erroneous to conclude that measurement error in air pollution variables has led to either the consistent underestimation or overestimation of the coefficients relating pollution to mortality. The direction of the bias could be determined if: (1) measurement error were absent from the nonpollution variables, (2) nonrandom error were absent, and (3) the Lave and Seskin model were an adequate specification of the pollution-mortality relation. Each of these three assumptions may be questioned, so we cannot say with confidence that pollution's effects are either greater or less than usually estimated.

Our study makes clear the need to determine the extent of measurement error in the commonly used pollution measures as well as the nonpollution variables. If estimates of the extent of measurement error are derived, then this information should be taken into consideration when estimating the mortality effects of air pollution. Such knowledge would bring us closer to evaluating the health effects of various pollutants. If measurement error in the explanatory variables is ignored, biased estimates of the impacts of pollution will continue to be generated.

More generally, our results illustrate some of the statistical consequences of using explanatory variables containing random measurement error in multiple regression models used in policy research. If only a single variable is measured with random error then its regression coefficient is asymptotically biased toward zero. The direction of bias of the coefficients for the remaining explanatory variables measured without error will depend on the interrelationship of the explanatory variables, as we described in the text. Generalizations of the asymptotic bias of estimated regression coefficients are more difficult to make if more than one explanatory variable is measured with error. Researchers would be well advised to investigate the measurement properties of the explanatory variables used in their analyses and, if at all possible, to incorporate this information into their estimation procedures.

Acknowledgements

We wish to thank Diane Gibbons, Dana Kamerud and Gary McDonald of the General Motors Research Laboratories for their helpful comments on an earlier draft of this paper.

Notes

1. A light scattering device called a nephelometer, for example, is frequently used to measure particulate pollution. Nephelometers, however, have been found to contain numerous sources of error (see Horvath and Charlson 1969). Yet, in some cases, nephelometer readings are the only sources of particulate pollution data available.
2. Derivations of this and the other results in these pages are available from the authors.

References

Blalock, H.M., Jr. (1970), A causal approach to nonrandom measurement errors, American Political Science Review, 64, 1099-1111.
Blalock, H.M., Jr. (1965), Some implications of random measurement error for causal inferences, American Journal of Sociology, 71, 37-47.
Charlson, R.J., Ahlquist, N.C., and Horvath, H. (1968), On the Generality of Correlation of Atmospheric Aerosol Mass Concentration and Light Scatter, Atmospheric Environment, 2, 455-464.
Crocker, T.D., W. Schulze, S. Ben-David, and A. Kneese (1979), Method Development for Assessing Air Pollution Control Benefit - Volume I, EPA-600/5-79-001a. Washington, D.C.: Environmental Protection Agency.
Gibbons, D.I. and G.C. McDonald (1980), Examining regression relationships between air pollution and mortality, General Motors Research Publication GMR-3278. Warren, MI: General Motors Research Laboratories.
Graves, P.E., J. Krumm, and D.M. Violette (1979), Estimating the benefits of improved air quality, Prepared for National Commission on Air Quality.
Griliches, Zvi (1957), Specification bias in estimates of production functions, Journal of Farm Economics, 39, 8-20.
Horvath, Helmuth and Charlson, R.J. (1969), The Direct Optical Measurement of Atmospheric Pollution, American Industrial Hygiene Journal, 30, 500-509.
Johnston, J. (1972), Econometric Methods, New York: McGraw-Hill.
Jöreskog, K.G. (1973), A General Method for Estimating A Linear Structural Equation System, In A.S. Goldberger and O.D. Duncan (eds), Structural Equation Models in the Social Sciences, New York: Seminar Press.
Jöreskog, K.G. and D. Sörbom (1978), LISREL IV: Analysis of Linear Structural Relationships by the Method of Maximum Likelihood, Chicago: National Educational Resources.
Lave, L.B. and E.P. Seskin (1977), Air Pollution and Human Health, Baltimore: Johns Hopkins University Press.
Lave, L.B. and E.P. Seskin (1970), Air pollution and human health, Science, 169, 723-33.
Lipfert, F.W. (1980), Sulfur oxides, particulates, and human mortality: Synopsis of statistical correlations, Journal of the Air Pollution Control Association, 30, 366-71.
Levi, M.D. (1973), Errors in the variables bias in the presence of correctly measured variables, Econometrica, 41, 985-86.
McDonald, G.C. and Schwing, R.C. (1973), Instabilities of regression estimates relating air pollution to mortality, Technometrics, 15, 463-481.
Madansky, A. (1959), The Fitting of Straight Lines When both Variables are Subject to Error, Journal of the American Statistical Association, 54, 173-205.
Pickles, J.H. (1982), Air Pollution Estimation Error and What It Does to Epidemiological Analysis, Atmospheric Environment, 16, 2241-45.
Reilly, P.M. and H. Patino-Leal (1981), The Bayesian Study of the Error-in-Variables Model, Technometrics, 23, 221-31.
Ricci, P.F. and R.E. Wyzga (1979), The health effects of reduced air pollution, Prepared for the National Commission on Air Quality.
Theil, Henri (1971), Principles of Econometrics, New York: John Wiley & Sons.
Theil, Henri (1957), Specification errors and the estimation of economic relationships, Review of the International Statistical Institute, 25, 41-51.
Tukey, J.W. (1976), Discussion of the paper by Lave and Seskin, Presented at the Fourth Symposium on Statistics and the Environment, Washington, D.C.
Whittenberger, J.L. (1976), Discussion of the paper by Lave and Seskin, Presented at the Fourth Symposium on Statistics and the Environment, Washington, D.C.
