Statistical aspects of analysing effects of rainfall on water quality

S. Penev (1), D. Leonte (2), Z. Lazarov (3)

(1) The University of New South Wales, Sydney, Australia
(2) National Industrial Chemicals Notification and Assessment Scheme, Sydney
(3) Boronia Capital Pty Ltd, Sydney

Dublin, 58th Congress of ISI, August 2011

Outline
1. Introduction: Background; Problem formulation
2. The approach: Least squares; Full Maximum Likelihood (FML); Rainfall influence
3. Implementation and results
4. Acknowledgements

Introduction Background

The University of New South Wales (UNSW) was awarded a collaborative research grant by the Sydney Catchment Authority (SCA) with the objective of statistically analysing the SCA water monitoring databases (water quality, flow and rainfall) to improve the understanding and reporting of water quality variations in SCA's waterways. The authors of this paper were part of the team.

Main objectives were:
- accounting for rainfall and flow as additional regressors in the long-term statistical trend model for water quality variables;
- predicting concentrations of water quality variables during extreme rainfall conditions;
- performing statistical analysis of lake/reservoir water quality data by taking into account the depth intervals at lake sites;
- modelling pathogen data in lakes while accounting for autocorrelation.

New non-standard statistical approaches were tried, aiming at finer modelling and taking into account the difficulties arising from the structure of the database.

Main contributions:
- A non-parametric (model-free) procedure for outlier detection in time series, based on the wavelet transform. Focus: analysing the magnitude of the highest-level coefficients in the wavelet decomposition of the time series.
- Modelling water quality trends in count data (e.g. pathogens) using a GLM, with autocorrelation accounted for via a latent process as an additional regressor.
- Using Mixed Data Sampling (MIDAS) regression to model the impact of rainfall on trends in water quality.

In this presentation, we discuss the third contribution only.

Introduction Problem formulation

Water quality can be substantially affected on a rainfall day. Persistent rainfall may also affect long-term water quality trends. A statistically significant trend in a water quality variable might disappear once rainfall is accounted for, or the reverse could occur.

The standard approach in the water quality literature ([4]) to account for the impact of rainfall is LOESS (locally weighted scatterplot smoothing). However:
- Rainfall is used as an input to derive a new time series from the original water quality series.
- The original time series is assumed to be free from rainfall influence → unrealistic.
- It does not allow the influence of rainfall, if present, to be quantified explicitly. In particular, how does one test for a rainfall effect?

Other simplistic alternatives: take the averages of past rainfall values and use them as an exogenous regressor in a trend model.
- Pros: testing for the presence of a rainfall effect is easy.
- Cons: the dynamic effect of rainfall cannot be observed with this alternative, although ideally it should be taken into account.

One reasonable assumption: rainfall on the current day has the highest influence on water quality, and the influence of earlier days decreases monotonically with the lag. However, more complex patterns of influence could occur. Model them!

The approach

Approach: we investigate the potentially complex patterns of rainfall influence through Mixed Data Sampling (MIDAS) regression ([1]).

Why "mixed"? Water quality variables are typically collected at fortnightly or even monthly intervals, whereas rainfall data are collected daily. Even without any missing data, five years of data collection yields just about 110 fortnightly or 60 monthly observations: a very small fraction of the number of rainfall measurements!

The small number of observations necessitates parsimonious time series models → MIDAS regression. With only three parameters it can model a variety of shapes, including humped shapes, for the weight coefficients of the lags in the rain impact.

The approach Least squares

Water quality variables of interest: Aluminium Total, Manganese Total, Iron Total, Nitrogen Oxidised, etc.

In practice missing data appear routinely → not all fortnightly records will be available. Substituting missing values in short time series is too risky and should be avoided. We have two procedures.

First procedure. Let $X_1, X_2, \dots, X_M$ be the time series of all monitoring records (including missing ones) of a water quality variable. When not missing, the records are positive → take logarithms. This:
- helps towards normality;
- helps towards stationarity;
- avoids the need to put restrictions on the residuals to guarantee positivity of the dependent variable.

Let $d_1, d_2, \dots, d_M$ be the sequence of days on which the records are taken.

Denote by $N$ the number of actual observations, given by the sub-sequence $X_{t_1}, X_{t_2}, \dots, X_{t_N}$ for the set of indices $t_1 < t_2 < \dots < t_N$. From these indices we take the subset $f_1 < f_2 < \dots < f_P$ ($P \le N$) of those $f_i = t_j$ for which $X_{t_j} > 0$ and $t_j = t_{j-1} + 1$, and set $Y_{f_i} = \log(X_{t_j})$. In other words, we choose all recorded observations for which there exists another recorded observation taken a fortnight earlier, and use $Y_{f_i}$, $i = 1, 2, \dots, P$, as the output data.
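The selection rule above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the records are held in a list indexed by fortnightly slot, a missing record is marked with `None`, and the function name `select_paired` is hypothetical.

```python
import math

def select_paired(records):
    """Return (f, y_f, y_prev) triples for records that have a recorded predecessor.

    records: values for the scheduled fortnightly slots 1..M; None marks a missing record.
    """
    pairs = []
    for idx in range(1, len(records)):
        cur, prev = records[idx], records[idx - 1]
        # keep a slot only if it and the slot one fortnight earlier are both recorded
        if cur is not None and prev is not None and cur > 0 and prev > 0:
            pairs.append((idx + 1, math.log(cur), math.log(prev)))  # 1-based slot index
    return pairs

# Hypothetical series with gaps at slots 3, 6 and 7
x = [2.0, 1.5, None, 3.0, 2.5, None, None, 1.0]
pairs = select_paired(x)
selected_slots = [p[0] for p in pairs]
```

In this toy series only slots 2 and 5 qualify, so six recorded values shrink to two usable cases, which illustrates the loss of efficiency of the first procedure.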

A typical feature of water quality data is short-term autocorrelation (see [5], p. 95) → we capture it via ARMA(p, q) with small p, q. Below is the simplest (basic) model, for when the number of available observations is small and parsimony must be observed:

\[
Y_{f_i} = \alpha + \beta_1 f_i + \beta_2 \sin(2\pi d_{f_i}/365) + \beta_3 \cos(2\pi d_{f_i}/365) + \beta_4 Y_{f_i - 1} + \varepsilon_{f_i}, \qquad i = 1, 2, \dots, P. \tag{1}
\]

Here $\alpha$ and $\beta_i$, $i = 1, 2, 3, 4$, are parameters, and $\varepsilon_{f_i}$, $i = 1, 2, \dots, P$, are i.i.d. zero-mean errors.
- Significance and sign of $\beta_1$ indicate the presence and direction of a trend.
- The regressors $\sin(2\pi d_{f_i}/365)$ and $\cos(2\pi d_{f_i}/365)$ account for intra-yearly seasonality.
- The lagged $Y_{f_i - 1}$ controls for autocorrelation.
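As a rough illustration of fitting model (1) by least squares, the sketch below builds the five-column design matrix and solves the normal equations with NumPy. The function names, the synthetic inputs and the "true" coefficients are assumptions made for the example; they are not the authors' data or code, and the lagged column is random here only to check the algebra.

```python
import numpy as np

def design_matrix(f, d, y_prev):
    """Columns: intercept, trend, sin and cos seasonality, lagged response (model (1))."""
    f = np.asarray(f, dtype=float)
    d = np.asarray(d, dtype=float)
    return np.column_stack([
        np.ones_like(f),
        f,
        np.sin(2 * np.pi * d / 365),
        np.cos(2 * np.pi * d / 365),
        np.asarray(y_prev, dtype=float),
    ])

def fit_ols(f, d, y, y_prev):
    """Least squares estimate of (alpha, beta1, beta2, beta3, beta4)."""
    X = design_matrix(f, d, y_prev)
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic, noise-free check: the fit should return the coefficients used to build y
rng = np.random.default_rng(0)
f = np.arange(2, 62)            # hypothetical fortnightly slot indices
d = 14 * f                      # hypothetical sampling days
y_prev = rng.normal(size=f.size)
true = np.array([1.0, -0.003, 0.4, 0.2, 0.3])
y = design_matrix(f, d, y_prev) @ true
est = fit_ols(f, d, y, y_prev)
```

With real data, standard errors for the trend coefficient would also be needed to judge its significance; they are omitted here to keep the sketch short.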

The model is estimated by ordinary least squares (OLS) → advantages: robustness and simplicity. However, a case is used only when a neighbouring observation recorded a fortnight apart exists → loss of efficiency. Apply it when the proportion of missing values is very small, or to generate initial estimates for the (iterative) Full MLE procedure, which uses all available non-missing data points by taking into account the time gaps between them.

The approach Full Maximum Likelihood (FML)

Suppose for a moment that instead of the time series of monitoring records $X_1, X_2, \dots, X_M$ we had the series $X_1^*, X_2^*, \dots, X_M^*$ in which no observation was missing. Let $Y_1^*, Y_2^*, \dots, Y_M^*$ be the corresponding log-transformed values. Again we assume that the dynamics has the form

\[
Y_i^* = \alpha + \beta_1 i + \beta_2 \sin(2\pi d_i/365) + \beta_3 \cos(2\pi d_i/365) + \beta_4 Y_{i-1}^* + \eta_i, \qquad i = 2, 3, \dots, M, \tag{2}
\]

with uncorrelated normal zero-mean errors $\eta_i$, $i = 2, 3, \dots, M$, with variance $\sigma^2$. Missing data means that only a subset $Y_1, Y_2, \dots, Y_N$ is observed. Relations between adjacent recorded variables are obtained by recursively applying (2).

Pick a pair of neighbouring observations $Y_{j-1} = \log(X_{t_{j-1}})$ and $Y_j = \log(X_{t_j})$, $j = 2, 3, \dots, N$. Set $s_j = t_j - t_{j-1}$. Let $\mu_i = \alpha + \beta_1 i + \beta_2 \sin(2\pi d_i/365) + \beta_3 \cos(2\pi d_i/365)$ for $i = 2, 3, \dots, M$. We have

\[
Y_j = \log(X_{t_j}) = \mu_{t_j} + \beta_4 \log(X^*_{t_j - 1}) + \eta_{t_j}. \tag{3}
\]

Applying (2) to the lagged term $\log(X^*_{t_j - 1})$:

\[
Y_j = \mu_{t_j} + \beta_4 \mu_{t_j - 1} + \beta_4^2 \log(X^*_{t_j - 2}) + \beta_4 \eta_{t_j - 1} + \eta_{t_j}.
\]

Continuing recursively, we end up with

\[
Y_j = \sum_{k=0}^{s_j - 1} \beta_4^{k} \mu_{t_j - k} + \beta_4^{s_j} Y_{j-1} + \zeta_{t_j}, \qquad j = 2, 3, \dots, N, \tag{4}
\]

where $\zeta_{t_j} \sim N\!\left(0,\ \sigma^2 \dfrac{1 - \beta_4^{2 s_j}}{1 - \beta_4^{2}}\right)$.
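The variance of $\zeta_{t_j}$ in (4) follows from collecting the innovations accumulated over the gap. A short worked step, assuming uncorrelated $\eta$'s as in (2) and $\beta_4^2 \neq 1$:

```latex
\[
\zeta_{t_j} = \sum_{k=0}^{s_j - 1} \beta_4^{k}\, \eta_{t_j - k},
\qquad
\operatorname{Var}(\zeta_{t_j})
  = \sigma^2 \sum_{k=0}^{s_j - 1} \beta_4^{2k}
  = \sigma^2\, \frac{1 - \beta_4^{2 s_j}}{1 - \beta_4^{2}} .
\]
```

For $s_j = 1$ (no gap) the sum has a single term and the variance reduces to $\sigma^2$, so (4) coincides with (2).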

Working in a similar fashion, we derive such relationships for all pairs $Y_j, Y_{j-1}$, $j = 2, 3, \dots, N$ → we can write down the likelihood of the observed data → we can estimate the parameters $(\alpha, \beta_1, \beta_2, \beta_3, \beta_4, \sigma^2)'$ by ML. The variance-covariance matrix can be evaluated by the inverse of the Fisher information matrix. An iterative procedure was implemented in SAS. Initial guesses are provided by the LS method above.
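To make the likelihood construction concrete, here is a minimal Python sketch of the Gaussian negative log-likelihood implied by relation (4), conditional on the first recorded observation. It is not the authors' SAS procedure; the function name, argument layout and the toy numbers are assumptions made for the example.

```python
import math

def neg_log_likelihood(params, t, d_all, y):
    """Gaussian negative log-likelihood built from relation (4).

    params: (alpha, b1, b2, b3, b4, sigma2)
    t: increasing 1-based slot indices t_1 < ... < t_N of the recorded observations
    d_all: d_all[i - 1] is the calendar day of slot i, for i = 1..M
    y: log-transformed recorded values, y[j] observed at slot t[j]
    """
    alpha, b1, b2, b3, b4, sigma2 = params

    def mu(i):
        d = d_all[i - 1]
        return (alpha + b1 * i
                + b2 * math.sin(2 * math.pi * d / 365)
                + b3 * math.cos(2 * math.pi * d / 365))

    nll = 0.0
    for j in range(1, len(t)):
        s = t[j] - t[j - 1]                      # gap s_j, in fortnightly slots
        mean = sum(b4 ** k * mu(t[j] - k) for k in range(s)) + b4 ** s * y[j - 1]
        # sum of b4^(2k) equals (1 - b4^(2s)) / (1 - b4^2) when b4^2 != 1
        var = sigma2 * sum(b4 ** (2 * k) for k in range(s))
        nll += 0.5 * (math.log(2 * math.pi * var) + (y[j] - mean) ** 2 / var)
    return nll

# Toy check: two records separated by one missing slot (s = 2), no trend or seasonality
params = (0.0, 0.0, 0.0, 0.0, 0.5, 1.0)
value = neg_log_likelihood(params, t=[1, 3], d_all=[0, 14, 28], y=[2.0, 0.5])
```

A function of this shape could be handed to a general-purpose optimiser such as `scipy.optimize.minimize`, started from the least squares estimates, with the inverse Hessian at the optimum serving as an approximate variance-covariance matrix.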

The approach Rainfall influence

How to incorporate the impact of rainfall on water quality? Assume first that there are no missing observations. Denote by $V_{i,k}$, $k = 1, 2, \dots, l$, the rainfall on the current day ($k = 1$) and on the previous days, up to $l - 1$ days back. Note: rainfall is often zero, and its positive values vary widely, ranging from 0.5 to 200-300 mm. Hence the scaling: $r_{i,k} = \log(1 + V_{i,k})$ is used instead of $V_{i,k}$. Introduce $r_{i,k}$ as exogenous regressors in (2) via

\[
Y_i = \alpha + \beta_1 i + \beta_2 \sin(2\pi d_i/365) + \beta_3 \cos(2\pi d_i/365) + \beta_4 Y_{i-1} + \gamma_1 r_{i,1} + \gamma_2 r_{i,2} + \dots + \gamma_l r_{i,l} + \varepsilon_i, \qquad i = 1, 2, \dots, M, \tag{5}
\]

where $d_i$ denotes the day of the record. We refer to $f(k) = \gamma_k$ as the rainfall impact function. But (5) can be impractical for inference when $l$ is large, especially since $M$ is usually small. How to accommodate complex patterns?
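The regressors $r_{i,k}$ in (5) can be assembled from a daily rainfall series as sketched below. The array layout (a zero-based daily series indexed by day number) and the name `rainfall_regressors` are assumptions for illustration only.

```python
import numpy as np

def rainfall_regressors(daily_rain, sample_days, n_lags):
    """Return an array R with R[i, k - 1] = log(1 + V_{i,k}).

    daily_rain: daily rainfall in mm, daily_rain[day] for day = 0, 1, 2, ...
    sample_days: day index d_i of each water quality record
    n_lags: l, the number of days used (sampling day plus l - 1 previous days)
    """
    rain = np.asarray(daily_rain, dtype=float)
    R = np.empty((len(sample_days), n_lags))
    for i, day in enumerate(sample_days):
        if day - (n_lags - 1) < 0:
            raise ValueError("not enough rainfall history before this record")
        # k = 1 is the sampling day itself, k = 2 the day before, and so on
        window = rain[day - n_lags + 1 : day + 1][::-1]
        R[i] = np.log1p(window)
    return R

# Hypothetical five days of rainfall, two records on days 3 and 4, l = 3
R = rainfall_regressors([0.0, 5.0, 0.0, 20.0, 1.0], sample_days=[3, 4], n_lags=3)
```

Each row of `R` lines up with one water quality record, so the daily series is matched to the much sparser fortnightly series without any averaging.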

Many parameterizations from the distributed lag models literature are surveyed in [3], but they are restrictive in allowing decreasing weights only, whereas the full impact of heavy rainfall might be felt a few days later. We propose MIDAS: parsimonious, yet it allows for flexible shapes of the weights:

\[
\gamma_k = \delta\, B\!\left(\frac{k}{l};\ \theta_1, \theta_2\right),
\qquad
B(x;\ \theta_1, \theta_2) = \frac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\,\Gamma(\theta_2)}\, x^{\theta_1 - 1} (1 - x)^{\theta_2 - 1}. \tag{6}
\]

The parameters $\theta_1$ and $\theta_2$ are always positive, so $B$ is non-negative → a statistically significant impact of rainfall on water quality is determined by the significance of $\delta$. High significance of $\theta_1$ and $\theta_2$ is nevertheless helpful for assessing model adequacy. We implemented an iterative SAS procedure.
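The flexibility of the three-parameter weights in (6) is easy to see numerically. The sketch below evaluates $\gamma_k$ for a few parameter choices; the function names and the parameter values are illustrative assumptions, not estimates from the study. In this simple version $\theta_2 \ge 1$ is assumed so that the weight at $k = l$ (where $x = 1$) stays finite.

```python
import math

def beta_density(x, theta1, theta2):
    """B(x; theta1, theta2) from (6), for 0 < x <= 1 (theta2 >= 1 assumed at x = 1)."""
    norm = math.gamma(theta1 + theta2) / (math.gamma(theta1) * math.gamma(theta2))
    return norm * x ** (theta1 - 1) * (1 - x) ** (theta2 - 1)

def rainfall_impact(delta, theta1, theta2, n_lags):
    """gamma_k = delta * B(k / l; theta1, theta2) for k = 1, ..., l."""
    return [delta * beta_density(k / n_lags, theta1, theta2)
            for k in range(1, n_lags + 1)]

flat = rainfall_impact(1.0, 1, 1, 10)        # equal weights on all lags
decreasing = rainfall_impact(1.0, 1, 5, 10)  # largest weight on the sampling day
hump = rainfall_impact(1.0, 3, 4, 10)        # weight peaks a few days back
peak_lag = hump.index(max(hump)) + 1
```

The flat case ($\theta_1 = \theta_2 = 1$) gives every lag the same weight, which matches, up to scale, the simple averaging alternative; the other two cases show the decreasing and humped shapes that averaging cannot represent.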

Implementation and results

The approach was tested for water quality variables at catchment sites from the Shoalhaven supply system and catchments in the Sydney region. Below is a brief summary of the empirical analysis for the catchment site coded E822 and the water quality variables Aluminium Total, Turbidity Field and Manganese Total.
- Three different lag sizes were used: 10, 15 and 20 days.
- Main interest: whether the trend coefficient changes its value and statistical significance when rainfall is accounted for; also, the actual sign and shape of the rainfall impact function.
- FML was used, without and with a rainfall impact function.

For Aluminium Total: δ is highly significant and positive (0.0873, s.e. = 0.0132, P ≈ 0 when l = 10; 0.0629, s.e. = 0.0073, P ≈ 0 when l = 15; 0.0524, s.e. = 0.0074, P ≈ 0 when l = 20), indicating a significant rain impact. The trend coefficient in the FML model with rain impact excluded is also highly significant and negative (−0.00318, s.e. = 0.00136, P ≈ 0.019). In this case, including the rain impact did not significantly change the trend (−0.00374, s.e. = 0.0103, P ≈ 0.00026 when l = 10, and similar values when l = 15 and l = 20). The graph for lags l = 10, 15, 20 suggests a hump-shaped rainfall impact function → time is needed for certain chemical reactions to take place during a rainfall.

Figure: Rainfall Impact Coefficients for Aluminium Total.

For Turbidity Field: the trend coefficient in the model with rainfall excluded was non-significant (−0.0019, s.e. = 0.00125, P = 0.129) but became slightly significant when the influence of rainfall was included (−0.0025, s.e. = 0.00106, P ≈ 0.0018 for l = 10, with similar values for l = 15, 20). The coefficient δ is highly significant and positive (0.0726, s.e. = 0.0018, P ≈ 0 for l = 10, with similar values for l = 15, 20). This time the rainfall impact is at its maximum on the rainfall day and then gradually decreases.

Figure: Rainfall Impact for Turbidity Field.

For Manganese Total: again, there is no substantial change in the size and significance of the trend coefficient whether rainfall is included or excluded. This time δ is negative, and it is again highly significant. The rain impact function is pronouncedly hump-shaped, with its maximum about 6-8 days after rainfall, approaching zero after about 14 days.

Figure: Rainfall Impact Coefficients for Manganese Total.

Conclusion: rainfall may have little or no significant effect on the general trend; however, in all cases it has a more complicated impact on water quality. In particular, the hump-shaped or decreasing rainfall impact function indicates that models using arithmetic averages of the rainfall could be grossly misspecified.

Acknowledgements

The authors would like to thank Sydney Catchment Authority (SCA), who provided funding for the work reported here through a collaborative research grant. The work of Dr Zdravetz Lazarov was sponsored by the grant. We thank Dr Rob Mann (SCA) who managed the grant on behalf of SCA.

References

[1] Ghysels, E., Sinko, A. and Valkanov, R. (2007) MIDAS Regressions: Further Results and New Directions. Econometric Reviews 26 (1), 53–90.
[2] Greene, W. (2000) Econometric Analysis, 4th Edition. Prentice Hall.
[3] Gujarati, D. (2003) Basic Econometrics, 4th Edition. McGraw Hill.
[4] Helsel, D.R. and Hirsch, R.M. (2000) Statistical Methods in Water Resources. Studies in Environmental Science, 49. Elsevier, New York.
[5] Ward, R., Loftis, J. and McBride, G. (1990) Design of Water Quality Monitoring Systems. Van Nostrand Reinhold, New York.
