Bivariate Correlation with Spatial Data

Robert Haining Bivariate Correlation with Spatial Data The paper examines the use of the Pearson product moment correlation coefficient and the Spea...
0 downloads 1 Views 930KB Size
Robert Haining

Bivariate Correlation with Spatial Data

The paper examines the use of the Pearson product moment correlation coefficient and the Spearman rank correlation coefficient on map data that are spatially correlated. Research has shown that in this case standard estimators underestimate the true sampling variance of the Pearson coeffwient. The Clifford and Richardson “effective sample size” solution to this problem is described, methods for implementing the procedure are c&red, and properties of the technique examined. The method is also used to adjust the Spearman test. Two applications are given and the method of Clifford and Richardson contrasted with the method of data “prewhitening. ”

Correlation coefficients are calculated to measure the linear association between two variables ( X and Y). When data are collected on X and Y for n points or areas on a map, such coefficients measure the association between the two sets of map observations but the measures are aspatial in that they disregard the spatial distribution of the pairs of values. Any reallocation of the pairs of values, denoted { x i , y i } where i references the ith location, over the map will yield the same correlation value. The standard hypothesis testing procedures for parametric and nonparametric tests of association assume that the pairs of observations are independent. A parametric test such as for the significance of the Pearson product moment correlation coefficient further assumes the observations are drawn from the same, approximately bivariate normal, distribution with constant expected value (mean) and finite variance. A statistically significant test of association does not necessarily indicate a causal link between X and Y, but in the case of spatial data there is an additional problem in interpreting an apparently significant measure of association between two data sets. Values for either X or Y or both may be drawn from n pairs of random variables with means that vary across the map. Also, the random variables may be spatially correlated and hence not independent. In the first case, data values may show evidence of trend; in the second, similar-sized values may cluster together on the map. Both properties may be evident simultaneously. In these two situations Part of the research for this paper was carried out while the author was in receipt of a Nuffield Social Science Research Fellowship. The author wishes to thank the referees for many helpful comments. Robert Haining is Re&r

in geography, University of Shefield.

Geographical Analysis, Vol. 23, No. 3 (July 1991) 0 1991 Ohio State University Press Revised version accepted 9/90.

Robert Haining / 211 the assumptions underlying the standard theory of inference have been violated and it is not clear what meaning should be attached to the results of any inferential analysis. What is constructed as a test for the presence of intrasite second order covariation between X and Y is contaminated by larger-scale patterns in the data. Further, if the areas differ in size and so involve different amounts of aggregation, the variances may not be the same for all areas. Correlation coefficients are widely used to measure association between variables measured across a set of spatial locations. There has been particular interest in their use in geographical epidemiology for exploring links between environmental factors and the incidence of certain diseases. (For brief reviews see Clifford, Richardson, and Hemon 1989; Lazar 1981.) Typically, areal units are large and irregular and the measure is between pairs of values, each member of the pair being at the same location. Cross-correlation measures can also be obtained by correlating pairs of values that are separated by fixed lags or distances. This would seem particularly useful in studies involving small areas and where effects (such as environmental pollutants) can spill over between areas. In the earth sciences the cross-semivariogram is preferred for exploring associations between variables (for example, McBratney and Webster 1983). This approach reflects the adoption in the earth sciences of the semivariogram for characterizing spatial variation (for example, Burgess and Webster 1980). In the next section research findings on the problem of testing for significant bivariate correlation using spatially correlated data are reviewed. It is shown that standard inferential methods underestimate the true sampling variance of the Pearson correlation coefficient if both data sets are positively spatially correlated. In such cases type I error levels are larger than the analyst has specified. Clifford and Richardson (1985) have suggested a method of adjusting the Pearson coefficient although there have been few applications of the method. In section 2 we discuss implementation of the method, extend it to the Spearman rank correlation coefficient,and consider its properties using simulated data. Section 3 applies the Clifford and Richardson method and the method of prewhitening to specific data sets. 1. BIVARIATE CORRELATION WITH SPATIAL DATA

The sample Pearson product moment correlation coefficientis denoted r^ and

where 5 and X are the respective sample means and n is the sample size. When X and Y are measured at the ordinal scale and ranks are distinct, then

where di is the difference in the ranks of yf and xi. This is the Spearman rank correlation coefficient and a derivation of (2) from (1) is given in Siege1 (1956, pp.

203-204). Assuming independent observations, the significance of the Pearson correlation coefficient is tested by using Fisher's 2 transformation which behaves as a

212 /

Geographical Analysis

standard normal deviate. An alternative test statistic is given by

( n - 2) 1/2131(1 - i 2 ) - ll2

(3)

which under the null hypothesis (r = 0) is t-distributed with (n - 2) degrees of freedom. The t-test is also recommended for Spearman’s coefficient when n > 10 (Siegal 1956, p. 212). Correlation coefficients are measures of second-order covariation in a set of data. “Spurious” correlation arises when the effects of intervening variables or the effects of other influential exogeneous variables are not allowed for in measuring the association between two variables. When data series refer to observations in time or space, then another problem may arise. In the case of time series data, a time-dependent mean may be present in one or both of the series. Computing (1) without first removing this time-dependent mean (detrending the data) can give rise to what Yule (1926) terms “nonsense” correlations in which the measure of correlation reflects covariation of the first-order properties of the data. The same difficulty can arise with spatial data. Measures of correlation may reflect covariation in the location-dependent properties of the means of the two series. A common approach to removing these influences in time series is to difference the data, but higher-order differencing is not a well-defined-operation in spatial analysis since space, unlike time, has no natural ordering (Ripley 1988). Continuity effects associated with the structured nature of time (and space) may show up in a data series in another way. Neighboring observations in a series may be correlated. The presence of spatial correlation in X and Y affects the distribution of 3. Suppose X and Y are Gaussian processes (in space or time) with constant (zero) means, unit variances, and covariance matrices Z, and Z, respectively. If X and Y are independent of each other (Clifford and Richardson 1985),

E[i] = 0 and Var[P] = trace(ZXZ,)/n2, Supposing X and Y to be stationary lattice processes (n = n, X n,), then,

where C x ( k l ,k,) and C,,(k,, k,) are the covariances at lag (kl, k,) for X and Y, respectively. The above result shows how the variance of 3 increases if both series show positive correlation. If only one of the series is correlated, then the variance of i is unaffected since for the uncorrelated series C(k,,k,) = 0 unless k, = k, = 0. In an early paper Bartlett (1935) showed that if both X and Y are stationary first-order temporal autoregressions with parameters p, and p,, then for large samples,

He notes that lack of independence appears to be a more serious problem than lack of normality in carrying out significance tests. For spatial data Richardson and Hemon (1981) obtain the asymptotic (large lattice) variance of 3 when sampling from two mutually independent processes ( X and Y), which are both stationary autonormal fields (Ripley 1981). In a later paper

RobertHaining / 213 Richardson and Hemon (1982) give an asymptotic formula for Martin’s doubly geometric process (a product of independent orthogonal time series autoregressions in the plane) and for Whittle’s stationary autoregressive process. This paper also tabulates values of the asymptotic sampling variance for stationary autonormal and autoregressive processes for different spatial parameters and different lag-one spatial correlations. Bivand (1980) and Haining (1980) report on the behavior of F where X and Y are small-lattice (5 x 5,7 x 7) realizations of Gaussian spatial autoregressive models. (These processes are not covariance stationary due to edge effects.) Again the variance of F increases when both X and Y are spatially correlated. Clifford, Richardson, and Hemon (1989) report the results of a series of simulation experiments on larger regular (12 x 12 up to 20 X 20) and irregular (n = 82) lattices. They show that the actual type I errors for a 5 percent test can be as large as 60 percent when both series are strongly correlated, but are significantly larger than 5 percent even for quite low levels of spatial correlation. The problem is rather more serious in the case of irregular lattices than regular lattices, a point also noted by Bivand (1980). While indicating the nature of the problem, none of these studies identify appropriate remedial action. Usually the underlying generating process for spatial data is unknown. The asymptotic results of Richardson and Hemon relate to very special classes of (stationary) models. One approach to the problem of determining the effect of spatial correlation on the properties of statistical estimators is to consider the “information loss” incurred by dependent data relative to an equal-sized sample of independent observations (see for example, Haining 1988). Another route is to identify for any sample (n) of that contain dependent observations the number of independent observations (N’) the equivalent amount of statistical information. Clifford and Richardson (1985) call N‘ the “effective sample size.” N‘ will depend on the correlation structure in the data and for positive spatial correlation N’ < n. For the Pearson correlation coefficient Clifford and Richardson (1985) show, under fairly mild conditions, that

where 8 , and 8 , are the estimated n X n spatial correlation matrices for the two processes. If either X or Y are spatially uncorrelated trace (R,R,) = n and N’ n. The ith diagonal entry in (R,R,) is the term-by-term product of the ith row of R, with the ith column of R, ,associated with the ith site. If both X and Y are spatially uncorrelated, each diagonal entry will be equal to one. If both X and Y have monotonically declining spatial correlation functions, then the largest contribution to trace (8,8,) will be from the highly connected interior sites, and the contribution of any site will be smaller the closer it is to the edge of the spatial system. So in such cases it is the interior sites that contribute most to the reduction in n to N’.This effect will be more marked the slower the decline in the correlation function. Assuming approximate normality, and null hypothesis (H,)T = 0, then H, is rejected if

exceeds the critical value of the t-statistic with N’- 2 degrees of freedom. (N‘will usually not be an integer.) In the next section, implementation of (6) and (7) is examined and properties of the method investigated.

214

/

Geographical Analysis

Finally, note that any spatial (or temporal) series may represent a realization of a process with a nonconstant mean, and, in addition, the deviations from the mean may be correlated as well, thereby simultaneously raising both the problems described in this section. We consider a possible treatment of this case for spatial series in section 3 of the paper in which both data series are detrended first and the Clifford-Richardson procedure carried out on the detrended series. 2. IMPLEMENTATION OF THE CLIFFORD-RICHARDSON METHOD

A. Determining the Effective Sample Size Determination of N' requires the evaluation of R, and R,. We consider three problems: (i) finding appropriate estimators for the elements of R, and R,; (ii) determination of the number of elements to estimate; (iii) deciding whether to replace individual estimates by a model or correlation function cstimated from the data. Suppose data are available on a regular lattice. An estimator for the spatial covariance of a variable, Y, at lags j and k in the two directions of the lattice, assuming second-order stationarity of deviations from the mean is given by

where (sl, s,) denotes a lattice site, @ is the mean value of Y and the lattice dimensions are p X 9. M, = min(p,p - j ) ; M, = min(g,q - k); m, = max(1, 1 - j); m2 = max(l,1 - k). In the case of small lattices, the estimator

allows for boundary effects and is usually preferred. If the-mean value of the surface is known, both estimators are unbiased; however, Cy(j,k) has a large variance, particularly for long lags, and-may no! give a positive definite function. Neighboring values of both estimators, C , and C,, are strongly correlated. Usually, of course, the mean (which may not be locatio: invariant) is-not known and has to be estimated from the data, In this case both C(, j , k) and Cy(j, k) are downward biased, but the bias in -C,( j, k) is (pg/(p - jX9 - k)) times larger than for 6,(j, k). The estimator C , may be more sensitive to a misspecified mean. These results are discussed further in Ripley (1981, p. SO), Mardia and Marshall (1984), and Cressie (1984), and a method of estimating the mean in data series where spatial correlation is also present is described in Haining (1987). The estimator for spatial correlation assuming second-order stationarity of differences from the mean is given by

In the case of observations arising from a set of irregular areas, spatial covariance estimates are obtained by considering pairs of areas separ%ted by a specified number of intervening areas or by a specified distance band. C,(g) is obtained by considering all pairs of areas separated by a shortest-distance path of ( g - 1) intervening areas and then dividing through by the number of such pairs (see Agterberg 1970; Cliff and Ord 1981, pp. 118-19; Ripley 1988).

Robert Haining / 215 Given the downward bias and large variance of the covariance estimators at larger distances, it does not seem advisable to include many estimates of the correlation structure at larger distances. Clifford and Richardson (1985) set g = 4, although in their 1989 paper they estimate all lags; in the work ;eported below, g = 6. The number of pairs of observations available to estimate C , or 6, at first increases and later, at larger distance, declines due to boundary effects. Another criterion might be to stop after correlation estimates have consistently fallen to near zero. Correlation estimates may be used to specify a theoretical correlation model. Although model selection can be difficult and time consuming (Ripley 1981, pp. 57-58), this approach is sometimes used in preference to working with the correlation estimates directly because the sampling variation of these estimates can be large. This approach is often taken in the geostatistical literature, for example, where a theoretical model is fitted to a sample variogram [see, for example, Davis and Borgman (1979, 1982) for a discussion of the behavior of the sample variogram and Myers et al. (1982) and Webster (1985) for applications of model fitting]. Some trials were performed using the correlation matrix of the autoregressive model and estimating the unknown parameter of the model from the data, rather than the correlations per se. Results suggested that (even when the underlying model is a spatial autoregression) the Clifford-Richardson test procedure does not perform any better than when correlation estimates are substituted into the matrices. This model-based approach usually overcorrected and gave values of N' that were too low. This may be due to overestimation of spatial correlatiocs at lon_ger distances. In the work reporte! here, the crude estimates based on C, and C, are used up to lag six in R, and R,. Application of (6) does not require identification of any model to represent spatial correlation.

B. Adjustments to the Critical Value The test of hypothesis on P is carried out using (7) rather than (3). Clifford, Richardson, and Hemon (1989) simulated first-order Gaussian isotropic simultaneous autoregressive fields for X and Y five hundred times on a 26 X 26 lattice (border values set to zero). They then restricted the analysis t? 12 x 1,2, 16 X 16, and 20 X 20 sublattices in order to lessen the edge effects. R, and R , were obtained by including spatial correlations for all lags. They found that the actual Type I errors for a 5 percent test ranged from 4.1 percent to 5.9 percent using (7). Application of (7) to an irregular lattice situation ( n = 82) where again the data were Gaussian produced Type I errors for a 5 percent test that ranged from 3.0 percent to 7.0 percent. The improvement, though evident, was less consistent.' Further evidence has been given in Haining (1990). First-order spatial autoregressive fields for X and Y were simulated three hundred times on 11 x 11 and 7 X 7 lattices (Haining, Griffith, and Bennett 1983). Border values were set to the expected value of theAprocessA (zero) so the fields were not second-order stationary (Haining 1988). R, and R, were obtained using spatial correlations up to lag six. The adjustment improved the correspondence between the selected (5 percent) and actual significance level of the test. The latter ranged from 4 percent to 5.6 percent (for the 11 X 11 case) and from 4.2 percent to 7.7 percent (for the

'

For finite lattice realizations of these processes, neither the variances nor the spatial covariances are location invariant (Haining 1988). Unlike the case of time series autorypessions, edge effects are even more pervasive in the case of s atial autoregressions. Cutting off a buffer" zone will h e n edge effects but it will not eliminate &em for this class of models. Given that the denominator of 3 is an estimate of the variances of X and Y the im lication is that i should only be used where X and Y are both second-order stationary. Where site vafues are areal a egates of differing sizes, variances may not be identical and a data transformation may be necessary%aining 1990; Weisberg 1985).

216

/

Geographical Analysis

7 x 7 case). Haining (1990) also shows that the Clifford-Richardson adjustment improves the large sample test for the Spearman rank correlation coefficient. The simulated surfaces were converted to rank data and bivariate correlations computed using (2). The critical value obtained from (7) (with N’ obtained as described above, computing spatial correlations from the ranked data) was compared with that obtained from (3). As with the Pearson correlation, using (3) can lead to serious underestimation of the actual type I error level. For the 11 x 11 lattice case, using (7), type I errors ranged from 3.3 percent to 6.6 percent and for the 7 x 7 case type I errors ranged from 4.0 percent to 7.0 percent. The main effort required, in order to implement the Clifford-Richardson adjustment, is in the estimation of the “effective sample size,” N‘. Although, in a thorough analysis, there can be no shortcut to estimating N’, there may be occasions when a quick test will be helpful. Haining (1990) gives some results on the location and spread characteristics of NI for the cases examined. 3. TWO APPLICATIONS

A. Cancer Mortality and Social Class in Glasgow: Pearsun Correlation Data (see Table 1) were obtained on standardized mortality rates (SMRs) for cancer for eighty-seven community medicine areas ( C M h ) in Glasgow (1980-1982). SMRs are obtained by dividing the observed number of deaths for the ith area (Oil by the number expected given the age and sex composition of the area (Ei) and multiplying by one hundred. For each of the eighty-seven areas the percentage of households living in poor accommodation lacking basic amenities (2) was also available from the 1981 census. Stress has been suggested as a possible contributory factor to the incidence of cancer, that is, stress induced by a low quality of life (poor housing, unemployment, etc.). The observed number of deaths, Oi, is a Poisson variate so the mean and variance are related. The SMRs were log transformed (Y). This transformation stabilizes the variances and makes the distribution of rates more nearly normal. This judgment was based on a visual inspection of the rankit plot. The observations are unlikely to be independent so more formal tests are inappropriate. The distribution of households living in poor-quality housing is highly skewed with large numbers of CMAs having very low percentages. Three data transformations were tried:

x = 21/2+ ( Z +

1)y

x=

log(Z

+ 1);

x=

1/(Z

+ 1)

Each is more severe than the one before and the addition of the value 1 is because of zero values. Subsequent results are for the logarithmic transformation of the housing quality data which improved the symmetry of the distribution of values. The unadjusted Pearson correlation coefficient between Y and X is 0.252 with a t-value of 2.40. The relationship appears to be significant. However, both series show visual evidence of declining levels with increasing distance from the center. There are two possible routes to obtaining a measure of bivariate correlation that allows for possible spatially varying means and spatial correlation in the two data series: (1) prewhiten the two data series; (2) apply the Clifford-Richardson adjustment after detrending each data series. Data prewhitening means finding univariate models for X and Y that enable the mean and correlation to be extracted from each series leaving normal residuals that are (almost) independent (“noise”). The “noise” elements of X and Y are then correlated.

Robert Haining

/ 217

Prewhitening The two data sets were, independently, subjected to the data analysis described in Haining (1987). The analysis simultaneously estimates the mean and spatial correlation elements of a univariate series by the method of iterated generalized least squares. The representation for the mean was a trend surface model and the correlation model was a first-order spatial autoregression, which is reasonable given the correlation structure of the “signal” (see table 5b). The model, therefore, was defined as (for Y)

+ uy

Y

=

AF,

uy

=

pyWuy

+ ey

(8)

where A is the matrix of CMA coordinates and depends on the order of the surface; 8, is the vector of trend surface parameters and W is an unscaled binary connectivity matrix where w i j = 1 if CMAs i and j share a common boundary; 0, otherwise The autoregressive parameter p E [ - 0.32,0.16] and e y is IN (0,u;I). The (log) cancer and (log) housing quality data are both described by secondorder trend surface models with correlated errors. Details of the fit are summarized in Table 2A. The maximum points of these surfaces lie near the center of the city (see Figure 1). The surfaces are oriented with a long axis running from the southwest to the northeast possibly reflecting frame effects (Haining 1990). The Pearson correlation between the two estimated noise series, 6, and 6, is 0.115 which has a t-value of 1.06 (n = 87). The relationship between X and Y is not significant. Figure 2 shows the rankit plots for the two residual series indicating that the distributions are symmetric. Of course, since the data series are regression residuals, they are not independent. The exact effects, both on formal tests for normality and the significance of the correlation coefficient of this form of dependency, are not known (Wetherill 1986, p. 196). Cltfford-Richardson Adjustment The Clifford-Richardson adjustment is carried out by estimating the spatial correlation structure in 1, and ii,. The correlations are given in Table 2B and these are substituted into (6) to give N‘ = 40.73. The Pearson correlation is computed for the series ii, and ii, and the estimate is 0.134. The (adjusted) t-value is 0.847 (N’= 41). Again the relationshjp i s not significant. Table 2C describes properties of the diagonal elements of (R,R ,) and Figure 3 shows their spatial distribution. Those CMAs with large values contribute most to the reduction in the degrees of freedom for the test. Figure 3 shows the plot of the diagonal score for CMA i (i = 1, . . . ,871 against its numbers of neighbors. The correlation between these two variables is 0.927. The CMAs that contribute most to the reduction in the degrees of freedom are the highly connected interior sites. Finally, Figure 4 shows the rankit plots for 5, and iiy. These residuals are spatially correlated SO the plot should not be interpreted as firm evidence of normality. The rankit plot for ii, is the more worrying. However, the evidence of both analyses is that the significant correlation previously obtained was a consequence of large-scale trends and patterns in the data series rather than second-order covariation. Once

17.580 13.380 13.210 21.531 4.820 7.791 17.910 12.560 14.337 12.176 18.580 3.083 3.401 7.333 13.515 18.341 10.856 18.206 4.651 9.483

11. 12. 13. 14. 15. 16. 17. 18. 19. 20.

21.

~~

25. 26. 27. 28. 29. 30.

24.

23.

22.

~~

10.444 7.862 11.557 7.953 9.075 7.196 11.370 16.535

104.071 109.086 123.652 101.169 128.096 103.705 127.659 117.403 111.948 72.362

91.773

83.923

101.187 89.572 97.417 95.324 98.327 73.222 78.775 82.029

86.463 79.686 97.378 101.540

85.533

108.443 95.071 76.951 97.257 96.814

7.411

1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

6.233

SMR

Population

CMA

TABLE 1 Glasgow Data

3.89 6.71 0.56 6.36 4.33 0.75 10.62 0.68 0.19 1.29

5.72 0.13 0.21 0.94 5.56 0.26 0.48 0.13 0.21 0.09

17.36 7.09 7.21 8.34 18.23 1.18 8.39 0.92 3.51 0.09

lacking amenities

%households

0.38 0.45 0.44 0.49 0.47 0.54 0.60 0.70 0.69

0.43

0.24 0.23 0.25 0.19 0.13 0.21 0.12 0.30 0.35 0.30

0.41 0.43 0.41 0.40 0.34 0.36 0.30 0.37 0.33 0.28

E-W

Digitized Coordinates

N-S

0.69 0.68 0.63 0.58 0.62 0.68 0.60 0.63 0.81 0.83

0.63 0.74 0.71 0.69 0.71 0.78 0.83 0.78 0.73 0.87

0.54 0.56 0.59 0.57 0.57 0.59 0.61 0.63 0.66 0.66

3,8,19,22,26,3.5 8,9,19,21 24,25,26 2,3,23,25,26 23.24.26.27.41 21; 23; 24; 25; 27,28,33,35 25,26,28,41,43 26,27,33,34,39,43,48,55 30.31 29,31,35,36,39

7,10,14,80 13,14,16,18,19,20 10,12,14,19 10, 11,12,13,15,16,17 14,17 14.17.20 14,15,16,20 12,19,20 9,10,12,13,18,u),21,22 12,16,17,18,19,35

2,4,5,40,84 1,3,4,24,40,41 2,4,6,8,21,24 1,2,3,5,6 1,4,6,7,80 3.4.5.7.8 5; 6;8; 9; 10,11,80 3,6,7,9,21,22 7,8,10,19,22 7,9,11,13,14,19

CMAs

Contiguous

8.974 8.431 3.845 10.573 8.607 15.996 8.062 2.862 8.178 9.352 10.903 ~. ... 8.129 16.848 12.496 19.554 16.394 14.634 12.488 22.805 21.423

41. 42.

59. 60.

58.

55. 56. 57.

54.

53.

50. 51. 52.

48. 49.

46. 47.

45.

44.

43.

6.181 6.481 11.458 11.856 6.iig 7.585 5.357 6.556 8.175 3.629

31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 114.191 146.401 140.519 117.230 145.028 102.550 95.897 119.363 118.599 101.650 96.913 .. . ~ . 74.337 107.088 95.905 115.673 122.531 93.809 130.448 103.382 86.860

86.398 86.576 78.998 105.982 79.578 94.385 88.527 106.200 82.980 100.121 0.29 7.04 1.67 16.60 9.24 5.77 0.29 0.51 6.19 5.89 9.66 3.27 0.14 1.84 0.21 0.32 1.50 ~ 1.03 5.92 0.30

0.30 0.14 0.17 0.08 0.96 0.52 0.68 0.00 0.43 12.85

.

0.50

0.61 0.60 0.62 0.61 0.65 0.69 0.69 0.68 0.70 0.79 0.76 0.49 0.48

0.55

0.49 0.47 0.47 0.37

0.55

0.51 0.58

0.54

0.48 0.44

0.46

0.53 0.55 0.56 0.51

0.57 0.50 0.46

0.57

0.49

0.90 0.66 0.73 0.69 0.52

0.80 0.76 0.74 0.68 0.94

0.54 0.54 0.56 0.55 0.56

0.73 0.67 0.52 0.57 0.52 0.71 0.81 0.82 0.69 0.49

59,61,62,64,65,66,67,70,74

2,24,25,27,40,42,43 40,41,43,44,46,48 27,28,41,42,48 40,42,45,46,49,50,61 40,44,58,61 42,44,47,49 46,48,49,53,55 28,42,47,55 44,46,47,50,51,53,54 44,49.51,52,61 49.50.52.54 50; 51; 54;57,61,63 47,49,54,55,56,57 49,51,52,53,57 28,37,39,47,48,53,56 37,53,55,57 52.53.54.56 40;45; 59; 61,84,87 58,60,61,64,83,87

29,30,32,36,39 31,36,39 26.28.34.35.39 28: 33: 39 19;20; 21,26,30,33,36,39 30,31,32,35,38,39 38,39,55,56 36,37,39 28.30.31.32.33.34.35.~,37,38,~

81. 82. 83. 84. 85. 86. 87.

78. 79. 80.

77.

71. 72. 73. 74. 75. 76.

69. 70.

68.

65. 66. 67. ..

64.

7.009 15.981 6.648 16.489 17.781 15.623 7.247

4.793 7.867 7.997 9.857 7.811 10.158 11.057 10.939 13.088 13.905

96.044

18.379 13.834 21.193 22.369 15.042 13.809 9.938 8.733 7.591 7.417

61. 62. 63.

118.660 119.862 123.551 92.429

84.286

123.275 108.630

0.24 0.22 1.57 11.57 0.18 0.09 9.36

0.18 0.23 0.08 0.43 0.57 0.02 0.32 0.20 0.24 6.31

75.269 93.768 65.359 84.797 76.098 102.234 93.972 118.235 82.826 122.038

76.000

1.67 0.13 0.42 0.30

0.25

6.33 0.19 0.90 6.75 0.17

lacking amenities

% households

77.086 77.245

97.090 89.003 96.967 111.477 111.485 84.785

SMR

Population

CMA

TABLE 1 Continued

0.24 0.29 0.42 0.41 0.29 0.27 0.44

0.34 0.39 0.30

0.37 0.22 0.36 0.44 0.38 0.29 0.24

0.57 0.57 0.69 0.43 0.53 0.49 0.43 0.39 0.34 0.43

E-W

Digitized Coordinates

~~

0.50 0.42 0.43 0.50 0.32 0.37 0.47

0.25 0.44 0.49 0.45 0.46 0.37 0.54

0.03 0.15 ._ _ 0.21

~

0.34 0.36 0.40 0.32 0.31 0.36 0.32 0.29 0.30

N-S 0.40

76 75,77.78.79.86 59: 64:75: 87

76; 78; 82; 86 75,76,77,80,82,84 64,67,68,69,75,82,85,86 5,7,11,76,78,81,84

72,73 69-73

44,45,50,52,58,59,60,62, 60,61,63,65 52,61,62 59,60,67,75,79,83 60,62,66,74 60,65,74 60,64,68,70,79 67,69,70,74,79 68,72,73,74,79,85 60,67,68,74

Contiguous CMAE

~

Robert Haining / 221

FIG. 1. Map of Diagonal Elements of (8,8,), Also Indicating Maximum Points of the Two Trend Surfaces

these large-scale features are removed or controlled there is no significant association between these two variables.

Area Data with Variable Support The CMAs are not of equal size. The smallest CMA has a population ( p ) of 2,862, the largest 22,805. In general, measurements relating to large areas

A. Univariate Models for the Cancer (Y)and Social Class ( X )

Variables 1. LungCancer f = 4.58 - 0.55AE 0.79AN - 0.27A; - 1.35Ak 1.46AEAN A, = East-West axis variable A, = North-South axis variable p, = 0.11 (significant) Re (of fit) X 100 = 40.7% 2. Housing Quality f = - 1.26 4.03AE 4.83AN - 3.19A; - 3.13AL - 2.38AEAN px = 0.12 (Significant) Re (of fit) X 100 = 46.84%

+

+

+

+

B. Correlation Structure in the Residuals from the Trend Surface (0)

If, u,

No. of

NO)

&I)

AtZ,

tic9

R(4)

Rr.5,

R(6)

1.0 1.0

0.31 0.43

0.00 0.14

- 0.06 - 0.00

- 0.05

-0.11

-0.06 - 0.20

- 0.07 - 0.17

228

471

650

738

654

481

Pain

C. Properties of Diagonal Elements of (fixfi,) Mean = 2.189 Standard deviation = 0.235 Minimum value = 1.614 Maximum value = 2.795 Distribution of values: Midpoint Count

1.6 1

1.7 1

1.8 6

1.9 7

2.0 9

2.1 11

2.2 16

2.3 17

2.4 9

.2

2.5 4

2.6 4

....

222

3232 -3 333 4

...

243 ,23343.2222

2..

- 1.0

.20

0.0

10

2.0

Normal s0Jr.s

22.2.

.. .

2322

23 .34.

-1342

032 033

*322

.....

22.

.2.0

. 1.0

2.2

0.0

1.o

2.0

Nond Scares

FIG. 2. Rankit Plots of 8, and 8 ,

2.7 1

2.8 1

Robert Haining / 223

. . . . . . 2

2.45

*

..

2 4

0

A

(Bx R")

' 4 5 5 2 2

2 10..

2

. 2 5

2 6 5 4 . =

6 0 2 =

.

2

2

175..

I

FIG. 3. Plot of Diagonal Elements of Each CMA

(8,8,) against

Corresponding Numbers of Neighbors for

.2

...

..

22

1.0.

32322

."

3

_U.

2

032 0.0.

22

. . ...

232 4332 2223233

2.2

-1.0..

025.. .22.2

.. ...

-32.

.3. 33342

;1, ow..

2343. 34.

.0.25..

......

23 223

..22

FIG. 4. Rankit Plots of G, and ti,

encompass more (intra-area) variation than measurements relating to small areas. (The overall effect of increasing area size is to reduce variation between measurements.) The effect of regularization, as it is called, in the estimation of the semivariogram and kriging has been studied in the geostatistical literature; see, for example, Webster (1985). The estimate of the spatial correlation structure is conditional on the areal partition.

224

/ Geographical Analysis

The estimated bivariate correlations will also be conditional on this partition and, in particular, the variability that has been smoothed by the partition. Where, as here, areas in the partition differ in size (variable support), some differential weighting may be appropriate across the set of paired observations, either to reflect differences in the degree of reliability the analyst places on each individual observation, or to reflect differences in the variance of the associated random variables, for example (Weisberg 1985, p. 83).2 Figure 5 (top) shows a plot of 6 , against population size. Although not conclusive (there are more smaller CMAs than larger ones), there is evidence that the variance of 6 , decreases as population size ( p ) increases. This relationship is not evident in the case of 6 , [Figure 5 (bottom)]. The correlation between the vector 6 , and the vector Bp, where

BP,

=

( e',p,, e',p,,

..., e',P")

9

which eliminates this variance size relationship, is 0.103 (t-value 0.954) suggesting a still weaker relationship between X and Y. A plot of fiy against p also suggests that residual variance decreases with increasing population. The correlation between ii, and fipy is 0.115 with an adjusted t-value of 0.769 (N'= 45.78).3

B. Residential Burglury in Santu Barbara, Cal$mia: S p e a m n Rank Currelution Hubert et al. (1985) report a study of per capita residential burglary levels in sixty-four police reporting districts in Santa Barbara, California, in 1980 and 1981. The data are in ranked form and the Spearman rank correlation between the two sets of sixty-four observations is 0.66 which is significant at the 0.01 level using conventional test procedures. The authors further show, using a form of Tjostheims index of association, that areas with similar ranks are found close together in both years (treating each year as a univariate data set). So, police reporting districts with high levels of reported burglary in 1980 will tend to have high levels in 1981 as will their neighboring districts (in 1980 and 1981). The significance levels for the univariate index of association in both years are derived under randomization of pairs of observations. This gives a measure of spatial association that is conditional on a given level of area-to-area (or bivariate) association. The test of bivariate association does not, however, allow for the spatial correlation in the two data series. The same test conclusions would be obtained for any permutation of the data pairs across the sixty-four police districts. It is at least arguable that just as an index of spatial pattern is needed that is not "contaminated' by different levels of area-to-area association, so an index of bivariate association is needed that allows for spatial correlation. Table 3 reports the spatial correlation structure in the two data sets. (Note that the 1981 data show only weak evidence of spatial correlation.) The effective sample size N' is much smaller than n but there is still strong evidence of pairwise correlation between the two data sets. 'Such differential weighting may also be used to estimate parameters in the model defined by (8). This has not been done here but the procedure is described in Haining (1990). 3Nonconstantvariance (across the set of regions) may arise for reasons other than differences in the size (for example, areal extent or population) of areas. Even if all the areas are of the same size the variance of the random variable association with any given area may be related to the locotron of the area relative to the edge of the regional system across which the spatial process, that is generating the set of values, is defined. In this case the process must be nonstationary (see Haining 1988). So variances may differ across the set of areas as a function of area size and relative location. In the first instance the underlying process may be stationary and the nonconstant variance is the consequence of different amounts of areal aggregation. In the second instance the process is nonstationary.

Robert Haining / 225

. ... .. . . . . . ....... ..... .. .... ... .. . . ... . .. .. . . ..

.. .

0.20..

3" 0.00..

2

9

.

2

2

- 0.20

..

.. . . .

.2

. . . ... . .. . .. ... .. .. . . . .. .. . . .. . .. .... .. .. . .. 2

.2

2.

.

2

2

3 .

2.

. ?

I

4.0

80

12 0

20 0

16.0

24 0 PopyIaIwn sue

FIG.5. Plots of EIy and 8, against Population Size TABLE 3 Spearman's Rank Correlation Measure for the 1980 and 1981 Burglary Data Adjusted for the Spatial Structure of the Data: Lag Correlations 1980 1981 No. of data pairs

A@)

NO)

R(1)

R(3)

A(4)

A(5)

A(6)

1.0 1.o

0.275 0.055

0.111 -0 . m

0.054 0.059

0.015 0.010

- 0.067 - 0.044

- 0.273

64

146

286

384

390

329

249

-0.111

Spearman rank correlation = 0.66 n = 64 adjusted n (A") = 47.37 t-transform: 6.98 (no adjustment) 5.97 (Clifford-Richardson adjustment) 4. CONCLUSIONS

The presence of spatial correlation in map data undermines the standard statistical theory of inference. This paper has reviewed one aspect of this problem concerned with the measurement and assessment of bivariate correlation. The central finding is that the presence of either a nonconstant mean or spatial correlation (or both) increases the risk of incorrectly rejecting the null hypothesis, that is, concluding that a significant second-order association exists when such a conclusion is not warranted at the stated significance level. The apparent significant association is the result of large scale variation or pattern in the two data sets.

226

/ Geographical Analysis

This paper has discussed the Clifford-Richardson approach to adjusting for spatial correlation. It has been noted that before applying the method it may be necessary to detrend the data. A method has been suggested and applied although it does require simultaneous estimation of the two components of spatial variation. Attempting to detrend simply by fitting a standard trend surface by ordinary least squares will lead to biased estimates of the trend and possible order misspecification (Haining 1987). Other methods of detrending have been suggested including filter mapping (Tobler 1969; Cressie and Read 1989) and median polish (Cressie 1984; Cressie and Read 1989) which are robust to data problems (such as outliers and data errors), but the properties of these methods with respect to the required decomposition of the series require further e~amination.~ The Clifford-Richardson approach is based on the estimation of spatial correlations and assumes second-order stationarity. This assumption may be particularly difficult to justify when data values refer to aggregates of differing sizes. In such cases variances may not be constant over the set of areas. Where size differences are substantial it may be necessary to transform the data in order to obtain constant variances (see also Cressie and Read 1989). It may also be necessary to allow for this in estimating the trend surface coefficients (Haining 1990). An alternative line of development might be to explore the methods of geostatistics. Correlation analysis raises other interpretation problems: the sensitivity of findings to different areal aggregations (particularly at equivalent scales of aggregation); the relationship between findings identified at one scale of sampling to the underlying continuous fields (arising in, for example, geostatistical applications). There are other problems that merit consideration. To what extent are descriptive techniques that are based on sets of correlation coefficients (such as factor analysis and principal components analysis) subject to these problems (see Griffith 1988)? Each variable may have a distinctive pattern of spatial variation that contaminates the measure of bivariate correlation. This problem may further undermine comparative studies where different areas display diflerent patterns of spatial variation. LITERATURE CITED Agterberg, F. P. (1970). “Autocorrelation Functions in Geology.” In Geostatistics, edited by D. F. Merriam, pp. 113-41. New York: Plenum. Bartlett, M. S. (1935). “Some Aspects of the Time Correlation Problem in Regard to Tests of Significance.” Jouml, Royal Statistical Society 98, 536-43. Bivand, R.(1980). “A Monte Car10 Study of Correlation Coefficient Estimation with Spatially Correlated Observations.” Quuesliones Geographical 6, 5-10. Burgess, T. M., and R. Webster (1980). “Optimal Interpolation and Isarithmic Mapping of Soil Properties.” Jd of Soil Science 31, 315-31. Cliff, A. D., and J. K. Ord (1981). Sparial Processes. London: Pion. Clifford, P., and S. Richardson (1985). “Testing the Association between Two Spatial Processes.” Statistics and Decisions, Supplement 2, 155-60. Clifford, P., S. Richardson, and D. Hemon (1989). “Assessing the Significance of the Correlation between Two Spatial Processes. Biometries 45, 123-34. Cressie, N. (1984). “Towards Resistant Geostatistics.” In Geostatistics for Natural Resources Churacterisotion, edited by G. Verly, M. David, A. G. Journel, and A. Marechal, pp. 21-44. Dordrecht Reidel. Cressie, N.,and T. R. Read (1989). “Spatial Data Analysis of Regional Counts.” Biomehical Journal 6, 699-719.

4A further research issue is to examine whether the Clifford and Richardson procedure for significance testing might still work on data with trend without removing that trend (P. Diggle, personal communication).

Robert Haining / 227 Davis, B. M., and L. E. Borgman (1979). “Some Exact Sampling Distributions for Variogram Estimators.” Mathemutical Geology 11, 643-53. (1982). “A Note on the Asymptotic Distribution of the Sample Variogram.” Mathematical Geology 14, 189-93. Grifith, D. A. (1988). Aduanced Spatial Statistics. Dordrecht: Kluwer. Haining, R. P. (1980). “Spatial Autocorrelation Problems.” In Geography and the Urban Enuironment, edited by D. T. Herbert and R. J. Johnston, pp. 1-44. New York: Wiley. (1987). “Trend Surface Analysis with Regional and Local Scales of Variation with an Application to Aerial Survey Data.” Technomettics 29, 461-69. (1988). “Estimating S atial Means with an Application to Remotely Sensed Data.” Communications in Statistics: TLory and Methods 17,573-97. (1990). Spatial Data Analysis in the Social and Enuironmental Sciences. Cambridge: Cambridge University Press. Haining, R. P., D. A. Grifith, and R. J. Bennett (1983). “Simulating Two-dimensional Autocorrelated Surfaces.” Geographical Analysis 15,247-55. Hubert, L. J.. R. G. Golledge, C. M. Costanzo, and N. Gale (1985). “Measuring Association between Spatially Defined Variables: An Alternative Procedure.” Geographical Analysis 17, 36-46. Lazar, P. (1981). “Geographical Correlations between Disease and Environmental Exposures.” In Perspectiues in Medical Statistics, edited by J. F. Bithell and R. Coppi. London: Academic Press. Mardia, K. V., and R. J. Marshall (1984). “Maximum Likelihood Estimation of Models for Residual Co-variance in Spatial Regression.” Biometrika 71, 135-46. McBratney, A. B., and R. Webster (1983). “Optimal Interpolation and Isarithmic Mapping of Soil Properties: Co-regionalization and Multiple Sampling Strategy.” Journal of Soil Science 34, 137-62. Myers, D. E., C. L. Begovich, T. R. Butz, and V. E. Kane (1982). “Variogram Models for Regional Ground Water Geochemical Data.” Mathematical Geology 14, 629-44. Richardson, S., and D. Hemon (1981). “On the Variance of the Sample Correlation between Two Independent Lattice Processes.” Journal of Applied Probability 18, 943-48. (1982). “Autocorrelation Spatiale: Ses Consequences sur la Correlation Empirique de Deux Processus Spatiaux.” Reuue de Statistique Appliquee 300, 41-51. Ripley, B. D. (1981). Spatial Statistics. New York: Wiley. (1988). Statistical Znfmence for Spatial Processes. Cambridge: Cambridge University Press. Siegel, S. (1956). Non Parametric Statistics for the Behauioural Sciences. New York: McCraw-Hill. Tobler, W. (1969). “Geographical Filters and Their Inverses.” Geographical Analysis 1, 234-53. Webster, R. (1985). “Quantitative Spatial Analysis of Soil in the Field.” Aduances in Soil Science 3, 1-70. Weisberg, S. (1985). Applied Linear Regression. New York: Wiley. Wetherill, G. B. (1986). Regression Analysis with Applications. London: Chapman and Hall. Yule, G. U. (1926). “Why Do We Sometimes Get Nonsense Correlations between Two Time Series?” J O W M ~ of the Royal Statistical Society 89, 1-69.