Dependence in Hydrological Extreme Value Analysis

Arenberg Doctoral School of Science, Engineering & Technology Faculty of Science Department of Mathematics Dependence in Hydrological Extreme Value A...
Author: Blanche Dawson
4 downloads 3 Views 4MB Size
Arenberg Doctoral School of Science, Engineering & Technology Faculty of Science Department of Mathematics

Dependence in Hydrological Extreme Value Analysis

Edwin Rutalebwa BONIPHACE

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Science

March 2012

Dependence in Hydrological Extreme Value Analysis

Edwin Rutalebwa BONIPHACE

Jury: Prof. dr. Prof. dr. Prof. dr. Prof. dr. Prof. dr.

Mia Hubert, chair Jan Beirlant, supervisor Patrick Willems, co-supervisor Jozef Van Dyck Tim Verdonck

Prof. dr. Goedele Dierckx (Hogeschool Universiteit Brussel) Prof. dr. Maria Ivette Gomes (University of Lisbon)

March 2012

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Science

© Katholieke Universiteit Leuven – Faculty of Science Celestijnenlaan 200B box 2400, B-3001 Heverlee (Belgium) Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden door middel van druk, fotocopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande schriftelijke toestemming van de uitgever. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the publisher. D/2012/10.705/29 ISBN 978-90-8649-512-2

Dedication To the memory of my Parents. To my wife Coletha. To my children: Rosemary, Chrisant, and Godwin.

i

Acknowledgments Completing a research project and writing a dissertation is obviously not possible without support of numerous people. First of all, I would like to express my gratitude to Prof. Jan Beirlant for giving me the chance to work with him on the extreme value analysis and for his valuable consultation and support throughout the research work. I am also deeply grateful to Prof. Patrick Willems for his never ending patience, enthusiasm and encouragement in guiding me in the application of extreme value analysis to hydrological extremes, in particular to river flow series. Dr. Christian B. Alphonce has played a great role not only guiding me on research work but also on administrative tasks during my stays in Tanzania. Their extensive comments, many discussions, and the interactions with them had a direct impact on the final form and quality of the thesis. They have got all the best qualities that any graduate student could expect from his/her research supervisor. I am very lucky to have them as my advisor and learned a lot from them. I am one of recipients of the Belgium Technical Cooperation (BTC) PhD scholarship and it is gratefully acknowledged. I must acknowledge as well my employer Dar es Salaam Institute of Technology (DIT) for the support given to me during my study period. The members of my PhD committee (Prof. dr. Jan Beirlant, Prof. dr. Patrick Willems, Prof. dr. Mia Hubert, Prof. dr. Jozef Van Dyck, Prof. dr. Maria Ivette Gomes, Prof. dr. Goedele Dierckx and Prof. dr. Tim Verdonck) have generously given their time and expertise to make my work better. I thank them for their contribution, support, careful reading the entire manuscript and providing valuable comments. I acknowledge the professors, students, librarians and others at K.U.Leuven who assisted and advised me during my study period. In general, K.U.Leuven is a wonderful atmosphere for intellectual growth and accomplishment.

iii

iv

ACKNOWLEDGMENTS

Many other people have contributed to my success directly or indirectly by their advice, moral and material support. In particular, I thank Prof. dr. Goedele Dierckx and others colleagues in the Mathematics Department at K.U.Leuven. I would also like to thank my wife and my children for their unconditional love and continuous support of my academic career. Finally, I owe whatever I have accomplished to God without Whom I could accomplish nothing.

Contents Dedication

i

Acknowledgments

iii

Contents

v

List of Figures

vii

1 Introduction

1

1.1

Extreme value theory

. . . . . . . . . . . . . . . . . . . . . . . .

1.2

Sum plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4

1.3

Quantile-quantile plots . . . . . . . . . . . . . . . . . . . . . . .

5

1.4

Impact of dependence in data on qq-plots . . . . . . . . . . . .

6

1.5

Scope of this thesis . . . . . . . . . . . . . . . . . . . . . . . . .

10

2 Generalized Sum Plots

1

11

2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2

Sum plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

2.2.1

The Hill sum plot . . . . . . . . . . . . . . . . . . . . .

13

2.2.2

The generalized Hill sum plot . . . . . . . . . . . . . . .

15

2.2.3

The moment sum plot . . . . . . . . . . . . . . . . . . .

17

v

vi

CONTENTS

2.2.4 2.3

Simulation results . . . . . . . . . . . . . . . . . . . . .

18

Regression estimators of the extreme value index . . . . . . . .

20

2.3.1

Hill sum plot estimators . . . . . . . . . . . . . . . . . .

20

2.3.2

Generalized Hill sum plot estimators . . . . . . . . . . . . 21

2.3.3

Simulation results . . . . . . . . . . . . . . . . . . . . . . 21

2.3.4

Some practical examples . . . . . . . . . . . . . . . . . .

3 Impact of dependence on flood frequency analysis

22 29

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

3.2

Effect of dependence on the qq-plots . . . . . . . . . . . . . . .

34

3.3

Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . .

40

3.4

Case studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

45

3.5

Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . .

50

Bibliography

53

Publications

59

List of Figures 1.1

Illustration of bending behavior due to dependence . . . . . . .

9

2.1

Sum plots: Hill, generalized Hill, moment sum plot . . . . . . .

19

2.2

Fréchet distribution: weighted LS, trimmed and non trimmed estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

Burr distribution: weighted LS, trimmed and non trimmed estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

Gamma distribution: weighted LS, trimmed and non trimmed estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

Reversed Burr distribution: weighted LS, trimmed and non trimmed estimator . . . . . . . . . . . . . . . . . . . . . . . . .

26

Zaventem daily maximum wind speed: weighted LS, trimmed and non trimmed estimator . . . . . . . . . . . . . . . . . . . .

27

AoN claim size data: weighted LS, trimmed and non trimmed estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

3.1

Variation of ρ with j for k = 100 and θ = 0.2 . . . . . . . . . .

40

3.2

Pareto qq-plot of a sample from a max-autoregressive process .

42

3.3

Pareto qq-plot of mean behavior of order statistics from a maxautoregressive process . . . . . . . . . . . . . . . . . . . . . . .

43

Exponential qq-plot of a sample from a Markov chain process .

43

2.3 2.4 2.5 2.6 2.7

3.4

vii

viii

LIST OF FIGURES

3.5

Exponential qq-plot of a mean behavior of order statistics of a Markov chain process . . . . . . . . . . . . . . . . . . . . . . .

44

3.6

Return level plot with dependence effect adjusted . . . . . . . .

45

3.7

Exponential qq-plot of Molenbeek river discharge exceedances .

47

3.8

Return level plot of Molenbeek with GPD fitted to cluster maxima 47

3.9

Return level plot with GPD fitted to all exceedances . . . . . .

49

Chapter 1

Introduction This thesis focuses on sum plots as well as on the impact of dependence in data sequences on quantile-quantile plots (qq-plots) in extreme value analysis. The qq-plot is a statistical tool which is useful in extreme value analysis. It is used in the tail behavior investigation as well as in the parameter estimation. The sum plot is a recent developed diagnostic tool in extreme value analysis. It is used to determine an appropriate number of extremes (sample fraction) to be used in parameter estimation. In this chapter we will provide some background information on extreme value theory. We will also provide an introduction on the sum plot as well as on the impact of dependence on qq-plots. In the last section we will briefly describe the contents of the subsequent chapters of this thesis.

1.1

Extreme value theory

Extreme value theory (EVT) is basically dedicated to characterizing the behavior of extreme observations. Extreme value theory can be traced back to the pioneering work of Fisher and Tippett in 1928. At that time the possible limits for the distributions of maxima of samples of independent and identically distributed random variables were derived. In 1958, Gumbel exploited this result in his famous book ’Statistics of Extremes’. From that time on major developments have been made in theory and applications. EVT has applications in several fields such as hydrology, finance and engineering. Some examples are (1) 100 year flood level estimation in hydrology, (2) valueat-risk determination in finance, (3) premium calculation for excess of loss over

1

2

INTRODUCTION

high retention levels reinsurance treaties, (4) studies on fatigue, corrosion, and structural failure in engineering. There are many excellent books on the subject including Embrechts et al. (1997), which gives a comprehensive mathematical background of the theory, Beirlant et al. (2004), and Coles (2001) which focuses on applications and data analysis. Much of the theory underlying EVT techniques arises from studying block maxima. Suppose one has a sample X1 , X2 , · · · , Xn of independent and identically distributed (iid) random variables from an unknown distribution function F . The main objective of EVT is to try to model the tail of F . Let Mn be a sample maximum i.e. Mn = max(X1 , · · · , Xn ). If there exist normalizing constants an > 0 and bn ∈ < such that   M n − bn lim P ≤ x = G(x) n→∞ an a non degenerate limiting distribution function, then it is known that G must be one of the three types of extremal distributions (Fisher and Tippet, 1928; Gnedenko, 1943). The three types of extremal distributions are Gumbel, Fréchet and Weibull and their distribution functions are: Gumbel :

Λ(x)

Fréchet :

Φ(x)

Weibull :

Ψ(x)

exp(− exp(−x)) − ∞ < x < ∞  0, x≤0 = exp(−(x)−α ), x > 0

=

 =

exp(−(−x)α ), x < 0 1, x≥0

where α > 0. The three extremal distributions describe very different limiting behavior in the tail of the distribution. Moreover, the three types of distributions can be combined into one parametric family: the generalized extreme value (GEV) family of distributions (the von Mises-Jenkinson representation of extreme value distributions). The cumulative distribution function of the GEV distribution is given by Gγ (x) = exp(− (1 + γx)

−1/γ

)

(1.1)

for all x such that 1 + γx > 0 where γ is a shape parameter. The shape parameter γ, often called the extreme value index, controls the tail behavior: if γ is positive then the underlying distribution is heavy tailed; if γ is negative then the distribution has a finite right bound. The intermediate case γ = 0 can correspond to exponentially decreasing tails or to distributions with finite right

EXTREME VALUE THEORY

3

endpoints. To fit the appropriate function to the tail of the distribution, one has then to decide on the shape, and moreover on the appropriate location and scaling constants. The GEV distribution provides a starting point of extreme value analysis. In terms of statistical practice however, the GEV distribution has an important shortcoming. For example, assume that one has 40 years of river runoff flow data, then to use a GEV model, the data must be divided into blocks large enough such that the model is appropriate for block maxima. Common practice is to use year-long blocks. Hence for this example, the GEV distribution would be fitted to only 40 annual maxima. One can imagine that among the discarded data there would be runoff events that could yield additional information on the tail of the distribution. To overcome this shortcoming, a Generalized Pareto distribution (GPD) is usually used. Like the GEV, the GPD has an asymptotic justification. Assume that a normalized Mn converges to the GEV distribution with a parameter vector (µ, σ, γ). Pickands (1975) proved that as u → x∗ , where x∗ denotes the right endpoint of the distribution function F, P (X − u ≤ x|X > u) can be approximated by H(x) where H is the GP distribution function given by  γx − γ1 (1.2) H(x) = 1 − 1 + σ ˜ with σ ˜ = σ + γ(u − µ) . Given data, then one fits the GPD to the excesses (some times called exceedances) over the threshold u. This approach is also known as the ’peak over threshold’ (POT) method. Let X1,n ≤ X2,n ≤ · · · ≤ Xn−j+1,n ≤ · · · ≤ Xn,n , j = 1, 2, · · · , n, denote the order statistics of a random sample (X1 , X2 , · · · Xn ) from a distribution function F . Most estimators of the model parameters are constructed on the basis of k upper extremes (i.e. Xn−k+1,n , · · · , Xn,n ). The use of k upper extremes is an attempt to use as much information as possible in the estimation of parameters. Most of the estimators that use k upper extremes are motivated by the following asymptotic condition (the extreme value condition): there exists a positive function a such that, for all s > 0,  log s γ = 0, U (sx) − U (x) lim = (1.3) sγ −1 γ 6= 0 x→∞ a(x) γ with U (x) = Q(1 − 1/x) and Q is the quantile function. Condition (1.3) is required on F for the normalized Mn to have the limiting distribution G

4

INTRODUCTION

(Beirlant et al., 2004, 1996a). The list of estimators of γ that are based on condition (1.3) (also known as semi-parametric estimators) includes among others the estimator by Pickands (1975), the estimator by Hill (1975), the moment estimator by Dekkers et al. (1989) and the estimators based on the generalized quantile plot proposed in Beirlant et al. (1996b). One of the challenging problems in using GPD models or methods based on k upper extremes is how to select the optimal threshold (or the optimal k). By choosing high thresholds (or small k) usually one meets the asymptotic assumptions. However, in this case, few observations will be used in estimation procedures. As a result large variances are associated with the resulting estimates. On the other hand low thresholds (or large k values) reduce variances, but they lead to biased estimates. There exist a variety of diagnostic plots and adaptive estimation methods that assist in threshold selection or choosing an optimal number k of upper extremes. Adaptive selection procedures for instance minimizing the AMSE 1 , such as bootstrap, sequential methods among many other procedures are listed for instance in Beirlant et al. (2004, 2005); Neves and Fraga Alves (2004). The last reference includes a discussion on the performance of Reiss and Thomas’ heuristic method of choosing k where the optimal value of k minimizes the mean distance encapsulating penalty term as defined by Reiss and Thomas (Neves and Fraga Alves, 2004). The list of plots includes Zipf, Hill, empirical mean-excess and sum plots. In the following subsection we briefly introduce the sum plot as it was defined in case of heavy tailed distributions. Generalized sum plots for all other cases are proposed in chapter 2.

1.2

Sum plots

The sum plot introduced in de Sousa and Michailidis (2004) has been developed for heavy tails. Let X1,n ≤ X2,n ≤ · · · ≤ Xn,n be the order statistics of a random sample (X1 , X2 , . . . , Xn ) drawn from a distribution function F that satisfies 1 − F (x) = x−1/γ `F (x), (1.4) where γ > 0 and `F (x) is a slowly varying function at infinity satisfying `F (λx)/`F (x) → 1 when x → ∞ for all λ > 0. A distribution function F that satisfies property (1.4) is said be to a heavy-tailed distribution. Sousa and Michailidis (2004) proposed the sum plot defined through the random variables 1 AMSE

means asymptotic mean squared error

QUANTILE-QUANTILE PLOTS

5

Sk for k = 1, . . . , n: Sk :=

k X

j log

j=1

Xn−j+1,n . Xn−j,n

The plot of Sk against k then is the sum plot. The theoretical properties as well as the performance in practical situations of sum plots are examined in that paper. In the range of k values for which relationship (1.4) holds, the sum plot is approximately linear. The slope of the linear part can be used as an estimate of γ. Moreover, one can use the sum plot to choose a good number of order statistics to be used in estimators that are based on relationship (1.4), e.g. the Hill estimator. The Hill estimator is given by Hk,n :=

k Xn−j+1,n 1X j log . k j=1 Xn−j,n

Observe that the random variable Sk defined above can be written as Sk = kHk,n and hence the points in the sum plot are given by (k, kHk,n ). In the range of k values for which Hk,n stays constant, the sum plot is approximately linear. In chapter 2, we extend this idea defining generalized sum plots for all types of tails based on estimators of γ ∈ < that use k upper order statistics.

1.3

Quantile-quantile plots

The popularity of qq-plots as a diagnostic tool has emerged from the observation that, for important classes of distributions, the quantiles Q(p) are linearly related to the corresponding quantiles of a standard example from this class. For example, let Fλ be the exponential distribution family indexed by λ, i.e. Fλ (x) = 1 − exp(−λx), x > 0, λ > 0. Then the quantile function is given by 1 Qλ (p) = − log(1 − p), λ

for p ∈ (0, 1).

Hence

1 Q1 (p), for p ∈ (0, 1) λ with Q1 the quantile function of the standard exponential distribution. The plot of Qλ (p) against Q1 (p) is linear with slope equal to 1/λ. In practice the population quantile Q is not known. Starting with a given data set (x1 , x2 , · · · , xn ), the unknown Q is replaced by the empirical approximation Qn defined by Qλ (p) =

Qn (p) = xn−j+1,n ,

for

n−j n−j+1 1). The order statistics of a sample of size n drawn from this distribution can be represented as follows (Beirlant et al., 1999): d

−1 −1 −1 ln Xn−j+1,n = ln U (Uj,n ) = γ ln Uj,n + ln l(Uj,n ),

(1.7)

where U1,n ≤ U2,n ≤ · · · ≤ Un,n denote the order statistics of uniform (0, 1) d

sample of size n and = denotes equality in distribution. For iid sequences, j replacing Uj,n by its expected value n+1 , a Pareto quantile plot with coordinates  − ln(

 j ), ln xn−j+1,n , j = 1, 2, . . . , n n+1

is approximately linear with slope γ ultimately for smaller j.

IMPACT OF DEPENDENCE IN DATA ON QQ-PLOTS

9

8

6

4

2 2

4

6

8

Figure 1.1: Illustration : points on the scatter plot shifting gradually form a line with higher intercept to a line with small intercept hence causing bending behavior of the locus of the points.

From (1.6) there exists a constant 0 < τ ≤ 1 which depends on the correlation of observations such that ˜n−j+1,n ). E(Un−j+1,n ) = τ E(U j In equation (1.7), replacing Uj,n with its expectation τ n+1 the Pareto quantile plot will have the same slope as in the independent case but the intercept will differ by γ ln(τ ). If this intercept is constant (i.e. constant correlation), the slope in the qq-plot will not be affected by correlation. However, in most cases for time series, the pairwise correlation depends on the separation in time between observations. One would expect the pairwise correlation to decrease as you move away from the intermediate observations to the extreme observations and the quantity −γ ln(τ ) decreases as a result. This causes the points on the scatter plot to shift gradually from a line with large intercept to a line with small intercept. Hence the locus of points on the scatter plot shows downwards bending behavior as it is illustrated on figure 1.1.

The difficulty of using the correlation coefficient in modelling the effect of dependence on the qq-plot, is the unequal pairwise correlation between observations in many time series. The pairwise correlation in stationary time series is a function of the separation in time between observations involved. In this thesis, the effect of serial dependence on the qq-plot is modelled through an extremal index. Based on this, a new plotting position is proposed. The

10

INTRODUCTION

proposed plotting position is given by n−j+1 n+1

n−j+1 n+1/θ .

This can be seen as adjusting

n−j+1 n+1/θ

from to in order to account for dependence in data. When it is applied to real data sets as well as to synthetic dependent data, the bending behavior in qq-plots at higher quantiles is almost removed or minimized.

1.5

Scope of this thesis

de Sousa and Michailidis (2004) developed the sum plot as a diagnostic tool for selecting the optimal number k of upper order statistics when the distribution is heavy tailed and in particular when an estimator Tk,n is the Hill estimator. We generalize their method to any consistent estimator for any tail type (heavy, Gumbel and light tailed). This generalization of the sum plots is developed in chapter 2. The last chapter of this thesis mainly focuses on the impact of dependence in sequences, particulary in river flows, on the qq-plots. In this chapter, a method of correcting the effect of dependence on the qq-plots is developed. As result, a new plotting position is proposed which accounts for the amount of dependence in sequences through the extremal index θ, which is to be estimated on the basis of the available sample.

Chapter 2

Generalized Sum Plots Co-authors: Beirlant, J. and Dierckx, G. REVSTAT-Statistical Journal, Volume 7, Number 2, June 2011, 181-198

Abstract de Sousa and Michailidis (2004) developed the sum plot based on the Hill (1975) estimator as a diagnostic tool for selecting the optimal k when the distribution is heavy tailed. We generalize their method to any consistent estimator with any tail type (heavy, normal and light tail). We illustrate the method associated to the generalized Hill estimator and the moment estimator. As an attempt to reduce the bias of the generalized Hill estimator, we propose new estimators of the extreme value index based on the regression model which are based on the estimates of the generalized Hill estimator. Also here weighted least squares and weighted trimmed least squares is proposed. The bias and the mean squared error (MSE) of the estimators is studied using a simulation study. A few practical examples are proposed. Keywords: Sum plot, generalized sum plot, extreme value analysis, generalized quantile plot, regression model

2.1

Introduction

In order to estimate a tail index using k upper order statistics, one needs to determine an appropriate value of k. There exist a variety of diagnostic plots and adaptive estimation methods that assist in threshold selection. The list of plots includes Zipf, Hill, empirical mean-excess and sum plots. Adaptive

11

12

GENERALIZED SUM PLOTS

selection procedures are listed for instance in Beirlant et al. (2005). The aim of this paper is to generalize the graphical tool developed by Sousa and Michailidis (2004) assisting in choosing a sensible estimate or a value of k. Their sum plot is based on the assumption that the distribution is heavy tailed. We extend the approach to all estimators which use a set of extreme order statistics in the estimation of a real valued extreme value index. Here we illustrate the approach using the generalized Hill estimator introduced in Beirlant et al. (1996b) and the moment estimator proposed by Dekkers et al. (1989). In this paper we also propose new estimators of the extreme value index based on the regression associated to the estimates of the (generalized) Hill estimator for various k = 1, . . . , K for some K. The article is organized as follows. In section 2.2, we first specify the original sum plot in subsection 2.2.1. Then we generalize it using the generalized Hill estimator in subsection 2.2.2 and using the moment estimator in subsection 2.2.3. In subsection 2.2.4 we illustrate the method with some simulation results. The new estimators based on regression models are introduced in section 2.3, first for the original Hill sum plot in subsection 2.3.1 and then for the generalized Hill sum plot in 2.3.2. Finally, some simulations and practical examples are presented in subsections 2.3.3 and 2.3.4.

2.2

Sum plots

The sum plot by de Sousa and Michailidis (2004) and Henry III (2009), are examples of the following principle. Let γˆk,n (which uses k upper order statistics from the total sample of size n) be a consistent estimator of γ as k, n → ∞ and k/n → 0. Assume first that γˆk,n is an unbiased estimator i.e. Eˆ γk,n = γ. Define the random variables Sk , for k = 1, 2, · · · , n by Sk := kˆ γk,n

(2.1)

then ESk = kγ. Therefore the plot (k, Sk ) is approximately linear for the range of k where γˆk,n ≈ γ, i.e. γˆk,n is constant in k. The slope of the linear part of the graph (k, Sk ) can then be used as an estimator of γ. Assume now that γˆk,n is a consistent estimator but biased, that is Eˆ γk,n = γ + (bias), then ESk = kγ + k(bias). If the bias is constant in k then (k, Sk ) is again linear with the slope equal to γ + (bias). Typically though the bias is not constant in k and hence the path of (k, Sk ) will depend on the non constant function in k defining the bias. The sum plot introduced in de Sousa and Michailidis (2004) is based on the Hill estimator (Hill, 1975). The sum plot by Henry III (2009) is based on a harmonic

SUM PLOTS

13

moment estimator. Both proposals were limited to the family of Pareto-type distributions. So given γˆk,n , any consistent estimator of γ based on k top order statistics, we propose a sum plot (k, Sk ) based on γˆk,n with Sk defined in (2.1). The only strong assumption on γˆk,n is consistency which is a natural requirement on any estimator. This plot could be helpful in identifying an appropriate region of k, the number of order statistics to be used in γˆk,n . While the plots (k, Sk ) and (k, γˆk,n ) are statistically equivalent, the sum plot naturally leads to the estimation of the slope whereas (k, γˆk,n ) entails horizontal plots and hence estimation of an average. Here we consider the case of a real-valued γ and hence increasing or decreasing sum plots allow to assess the sign of γ. Since each estimator will have its own sum plot, we hereafter name the associated sum plot along the name of the estimator. For example the sum plot based on the Hill estimator is named the Hill sum plot. In the following subsections we illustrate the proposed sum plot principle using the Hill, the generalized Hill and the moment estimator. We also illustrate the performance of these sum plots on simulated data and on some real data sets.

2.2.1

The Hill sum plot

Let X1,n < X2,n < · · · < Xn,n denote the order statistics of a random sample (X1 , X2 , · · · Xn ) from a heavy tailed distribution F with 1 − F (x) = x−1/γ `F (x), x > 0,

(2.2)

where `F is a slowly varying function at infinity satisfying `F (λx)/`F (x) → 1 when x → ∞ for all λ > 0. H Let the random variables Sk,n (k = 1, · · · , n) be defined as

H Sk,n =

k X j=1

Zj :=

k X j=1

j log

Xn−j+1,n . Xn−j,n

(2.3)

H de Sousa and Michailidis (2004) introduced the diagnostic plot (k, Sk,n ), the sum plot for estimating the tail index γ. This plot is called the Hill sum plot since the Hill (1975) estimator Hk,n satisfies

Hk,n =

k 1X 1 H log Xn−j+1,n − log Xn−k,n = Sk,n . k j=1 k

(2.4)

14

GENERALIZED SUM PLOTS

To understand the behavior of the Hill sum plot we rely on a representation of the variables Zj from (2.3) (j = 1, . . . , n) provided in Beirlant et al. (2001). We remind that the model (2.2) is well-known to be equivalent to U (x) = xγ `U (x),

(2.5)

where U (x) = inf{y : F (y) ≥ 1 − 1/x}(x > 1) and with lU again a slowly varying function. Often, the following second order condition on lU is assumed tρ − 1 `U (tx) = 1 + b(x) (1 + o(1)), `U (x) ρ where b is a rate function satisfying b(x) → 0 as x → ∞ and ρ < 0. Under this second order condition, Beirlant et al. (2001) have shown that  −ρ ! j Ej + βj = oP (bn,k ) , (2.6) Zj − γ + bn,k k+1 uniformly in j ∈ {1, . . . , k}, as k, n → ∞ with k/n → 0, where (E1 , · · · , Ek ) is a vector of independent and standard exponentiallyPdistributed random variables, k bn,k := b((n + 1)/(k + 1)), 2 ≤ k ≤ n − 1 and k1 j=1 βj = oP (bn,k ). Pk Hence for SkH = j=1 Zj ,   k X H Sk − kγ + kbn,k /(1 − ρ) + γ  (E − 1) + o (k b ) j P n,k j=1 = oP (kbn,k ) since

1 k

Pk



j=1

j k+1

−ρ



1 1−ρ

as k, n → ∞ with k/n → 0, where as usual,

an ∼ bn is equivalent to an /bn → 1 as n → ∞. In the specific case where b(x) = Cxρ (1 + o(1)) for some real constant C (Hall, 1982), then we obtain   k X H Sk − kγ + Cnρ k 1−ρ + γ  (Ej − 1) + oP (k bn,k ) (2.7) j=1 = oP (kbn,k ) . de Sousa and Michailidis (2004) only considered the case C = 0.

SUM PLOTS

15

The Hill sum plot is a graphical tool in which one is searching for a range of k where the sum plot is linear, or equivalently where Hk,n is constant in k, if such a behavior becomes apparent. Whereas the Hill estimator can be seen as an estimator of the slope in a Pareto quantile plot (see for instance Beirlant et al. (1996a) and Kratz and Resnick (1996)), the sum plot now can be viewed as a regression plot from which new estimators can be constructed by regression of H Sk,n on k, as suggested by (2.7). As the regression error will turn out smaller on the sums of noise variables γ(Ej − 1) rather than on extreme log-data, regression on the sum plot appears to be an interesting alternative approach. In practice we put ρ = −1 so that in that case we will fit a quadratic regression model as discussed in Section 2.3. The second order parameter could be replaced by estimators such as discussed in Fraga Alves et al. (2003). In the simulation study the case of the Burr distribution with ρ = −0.5 gives an idea of the loss of accuracy by setting ρ = −1.

2.2.2

The generalized Hill sum plot

Using a similar approach we derive the generalized sum plot for γ ∈ < based on the generalized Hill estimator by Beirlant et al. (1996b). Here the underlying model is that the distribution belongs to a maximum domain of attraction: there exist sequences of constants (an ; an > 0) and (bn ) such that   Xn,n − bn lim P ≤ x = exp(−(1 + γx)−1/γ ), 1 + γx > 0, γ ∈ U (x)) .

(2.8)

This function possesses the regular variation property for the full range of γ. The empirical counterpart of U H at x = n/k is given by U Hk,n := Xn−k,n

 P k 1 k

i=1 log Xn−i+1,n − log Xn−k,n

= Xn−k,n Hk,n .

 (2.9)

Using the property of regular variation of U H, Beirlant et al. (1996b) proposed an estimator of γ ∈ < by fitting a constrained least-squares line to the points with coordinates (− log(j/n), log U Hj,n ) (j = 1, . . . , k) to obtain the generalized

16

GENERALIZED SUM PLOTS

∗ ∗ Hill estimator Hk,n . Similarly as in (2.4), Hk,n is given by

∗ Hk,n =

 k  1X U Hi,n i+1 i+1 (i + 1) log + − (i + 1) log . k i=1 U Hi+1,n i i

(2.10)

Define random variables SkU H , for k = 1, · · · , n − 2 as SkU H :=

k X

    Xn−i,n Hi,n i+1 1 (i + 1) log . + − log Xn−i−1,n Hi+1,n i i i=1

(2.11)

∗ Since SkU H = kHk,n , we obtain the generalized Hill sum plot (k, SkU H ), and we ∗ expect that for the range of k where Hk,n is constant (or stable) the plot will be linear. Note that the range of k where the Hill estimator is constant, the term Hj,n /Hj+1,n → 1, and (2.11) is almost reduced to SkH , except that the term including the largest observation is deleted.

Under general second order regular variation conditions, in Dierckx (2000) it is shown that for 1 ≤ j ≤ k, 2 ≤ k ≤ n − 2 it holds for   1 j+1 ∗ Zj := (j + 1) (log U Hj,n − log U Hj+1,n ) + − log j j that ∗ Zj −

γ + ˜bn,k



j+1 k+1

−ρ ! Ej+1 + γ (Ej+1 − 1)

 ¯j E j+1 1 ˜ + + βj − log +(j + 1) log ¯ j j Ej+1  = oP ˜bn,k 

(2.12)

as k, n → ∞ with k/n → 0, where (E1 , · · · , Ek ) is a vector of independent and ¯j denotes the sample standard exponentially distributed random variables, E ˜ mean of (E1 , · · · , Ej ), bn,k is some generic notation for a function decreasing Pk to zero, ρ < 0 and k1 j=1 β˜j = oP (bn,k ). Note also that for γ < 0, the above expression only holds for j → ∞.

SUM PLOTS

17

  ¯ E 1 Let us denote ej := γ (Ej+1 − 1) + (j + 1) log E¯ j − log j+1 j + j . In Dierckx j+1

(2000) it is shown that Eei = 0 Cov(ei , ej )

=

γ ;i < j j

V ar(ei )

=

(γ − 1)2 +

1 + 2i . i2

(2.13)

Model (2.12) is a direct generalization of the regression model (6) used in the Hill sum plot, leading to a generalized Hill sum plot regression approach. In practice we fit the regression model SkU H = kγ + Cρ k 1−ρ +

k X

ej .

(2.14)

j=1

In the simulations below we will replace ρ by the canonical choice −1 so that later on we fit a quadratic regression model to the responses SkU H , k = 1, . . . , K for some K > 0.

2.2.3

The moment sum plot

(2)

Let Hk,n be defined as follows (2)

Hk,n :=

k 1X 2 (log Xn−i+1,n − log Xn−k,n ) . k i=1

The moment estimator Mk,n (Dekkers et al. (1989)) is given by Mk,n

1 := Hk,n + 1 − 2

where Hk,n is the Hill estimator from (2.4).

1−

2 Hk,n (2)

Hk,n

!−1 ,

(2.15)

18

GENERALIZED SUM PLOTS

(m)

Let the random variables Sk (m) Sk

:=

k X

, for k = 1, · · · , n be defined as !

log Xn−i+1,n − log Xn−k,n

i=1

k +k − 2

1−

2 Hk,n (2)

!−1 .

(2.16)

Hk,n (m)

(m)

Definition (2.16) is equivalent to writing Sk = kMk,n , hence (k, Sk ) is the moment sum plot. However here a regression model has not been established to the best of our knowledge.

2.2.4

Simulation results

The different sum plots have been applied to some simulated data sets. Six distributions are considered: • The strict Pareto distribution given by F (x) = 1 − x−1/γ , x > 1, γ > 0. We have chosen γ = 1. Here b(x) = 0. • The standard Fréchet distribution given by F (x) = exp(−x−1/γ ), x > 0, γ > 0. We have chosen γ = 1. Here ρ = −1.  λ • The Burr distribution F (x) = 1 − η+xη −τ , x > 0, η, τ , λ > 0. We have chosen η = 1, τ = 0.5, λ = 2, such that γ = 1. Here ρ = −1/λ. R x a−1 1 • The gamma distribution F (x) = ba Γ(a) x exp(−x/b)dx, x > 0, a, 0 b > 0. Here we have chosen a = 2, b = 1. Always, γ = 0. • The uniform distribution F (x) = x, (0 < x < 1). Here γ = −1.  λ • The reversed Burr distribution F (x) = 1 − β+(x+β−x)−τ , x > 0, η, τ , λ > 0, x+ denotes the right endpoint of the distribution. We have chosen η = 1, τ = 0.5, λ = 2, x+ = 2, such that γ = −1. The Hill sum plots, the generalized Hill sum plots and the moment sum plots of these distributions are shown in Figure 2.1. For γ > 0, the three sum plots are comparable for the linear parts of the plots. For γ = 0 and γ < 0, the generalized Hill sum plot and the moment sum plot

SUM PLOTS

19

1200

(b)

0

0

200

100

400

200

600

300

800

400

1000

500

(a)

0

100

200

300

400

500

0

100

200

400

500

(d)

0

−100

200

−50

400

0

600

50

800

100

1000

(c)

300

0

100

200

300

400

500

0

100

200

400

500

300

400

500

(f)

−600

−400

−500

−300

−400

−200

−300

−100

−200

0

−100

0

100

(e)

300

0

100

200

300

400

500

0

100

200

Figure 2.1: The Hill sum plot (full line); generalized Hill sum plot (dashed line) and the moment sum plot (dotted line) are plotted for simulated data sets of size n=500 from the (a) strict Pareto; (b) Fréchet; (c) Burr; (d) gamma; (e) uniform; (f) reversed Burr distribution. A line with the correct slope is added.

are comparable on these particular data sets. However the generalized Hill sum plot seems to be less volatile. The sum plots can be used to identify the sign of γ for a given data set: increase in k indicates γ > 0, a horizontal pattern

20

GENERALIZED SUM PLOTS

indicates γ = 0 and decrease in k indicates γ < 0. In future work this could be used to test the domain of attraction condition. For an overview of this problem we can refer to Neves and Fraga Alves (2008). Moreover, the linear part of these generalized sum plots can be used to estimate the value of extreme value index.

2.3

Regression estimators of the extreme value index

As indicated before, we propose regression estimators for the extreme value index γ based on the slope of sum plots.

2.3.1

Hill sum plot estimators

Huisman et al. (2001) introduced a new estimator for γ > 0 based on the Hill sum plot which can be understood from (6). It indeed follows from (6) that for some constant D  Hk,n − γ + Dk −ρ + k = oP (bn,k ) , (2.17) where k = γ/k

Pk

j=1 (Ej

− 1), leading to the regression model

Hk,n = γ + Dk −ρ + k , k = 1, . . . , K;

(1 ≤ K ≤ n).

(2.18)

Since the variance of the error term Var(k ) = γ 2 /k is not constant, a weighted least √ regression is applied with a KxK diagonal weight matrix √ squares W = diag 1, . . . , K. Note that in this way, Huisman et al. (2001), did not take into account that the error terms are not independent. In practice, we put ρ = −1. Huisman et al. (2001) assumed that ρ = −1/γ which is the case for an extreme value distribution. Remark that when deleting the second order term Dk −ρ in the regression model, one obtains a simple average of K Hill estimators. Due to the volatile behavior of Hill estimators Hk,n as a function of k it is known that a robust average of Hill estimators provides better estimators. This will be discussed in more detail in case of the generalized Hill sum plot where we apply weighted trimmed least squares regression.

REGRESSION ESTIMATORS OF THE EXTREME VALUE INDEX

2.3.2

21

Generalized Hill sum plot estimators

In a similar way, a new estimator can be introduced for real valued γ based on the generalized Hill sum plot. Indeed from (2.14) ∗ Hk,n = γ + Dk −ρ + ˜k , k = 1, . . . , K,

(2.19)

Pk with ˜k = j=1 ej /k. The variance of ˜k is asymptotically equal to the ∗ asymptotical variance AVar(Hk,n ) which, according to Beirlant et al. (2005) is equal to ∗ AVar(Hk,n )

=

1 + γ2 ;γ > 0 k

=

(1 − γ)(1 + γ + 2γ 2 ) ; γ < 0. (1 − 2γ)k

Since the variance of the error term Var(k ) = Cγ /k is not constant, a weighted least square regression is applied with the same KxK weight matrix W as in case of the Hill sum plot. Here again we ignore the fact that the error terms are not independent. We also put ρ = −1. We also apply weighted trimmed least squares regression minimizing the sum of the bn/2c + 1 smallest squared residuals 1 . For more information we refer to Rousseeuw and Leroy (1987).

2.3.3

Simulation results

In Figures 2 till 5 we show the simulation results we obtained concerning weighted least squares regression estimators, trimmed and non trimmed, for some of the distributions considered in Section 2.2.4. For each distribution 100 repetitions of samples of size n = 500 were performed. 1 That

is bn/2c+1

min

X

γ ˆ

(r2 )i:n ,

i=1

where (r2 )1:n ≤ · · · ≤ (r2 )n:n are the ordered squared residuals. The notation bac stands for the largest integer less than or equal to a.

22

GENERALIZED SUM PLOTS

Weighted trimmed least squares yields less bias but somewhat higher mean squared error compared with the non robust regression algorithm. In case γ > 0 we also show the results for the weighted least squares estimators based on the Hill sum plot. Hill sum plots then yield better results than the generalized Hill sum plot. Also the trimmed regression algorithm is typically better than the non-robust version in case of the generalized Hill sum plot.

2.3.4

Some practical examples

We end this paper showing the proposed methods into action. We apply the methods to two data sets proposed earlier in Beirlant et al. (2004). The data sets themselves can be found on http://lstat.kuleuven.be/Wiley. The first data set contains daily maximal wind speeds at Brussels airport (Zaventem) from 1985 till 1992. In Example 1.1 in Beirlant et al. (2004) the authors come to the conclusion that the data follow a simple exponential tail beyond 80 km/hr, and hence γ equals 0. The weighted (trimmed) least squares estimates based on the generalized Hill sum plot indeed indicates a zero valued extreme value index. Also the moment and generalized Hill sum plots indeed exhibit an overall horizontal behavior for K = 1, . . . , 80. The generalized Hill sum plot is less volatile however. Finally we consider the AoN Re Belgium fire portfolio data introduced in section 1.3.3 in Beirlant et al. (2004). Here we omit the covariate information concerning sum insured and type of building. Here the estimate γˆ = 1 follows from the weighted trimmed least squares regression analysis. These estimates are indeed quite stable over K-values compared to the non-robust version. The sum plots in Figure 7(d) are quite comparable for K = 1, . . . , 80.

REGRESSION ESTIMATORS OF THE EXTREME VALUE INDEX

(b)

0.0

0.00

0.01

0.5

0.02

1.0

0.03

1.5

0.04

2.0

0.05

(a)

23

0

100

200

300

400

500

100

200

400

500

300

400

500

0.0

0.00

0.01

0.5

0.02

1.0

0.03

1.5

0.04

2.0

300

(d)

0.05

(c)

0

0

100

200

300

400

500

100

200

(f)

0.0

0.00

0.01

0.5

0.02

1.0

0.03

1.5

0.04

2.0

0.05

(e)

0

0

100

200

300

400

500

0

100

200

300

400

500

∗ Figure 2.2: Fréchet distribution: (a) means of Hk,n (full line); Hk,n (dashed line); Mk,n (dotted line) as a function of k. (b) MSE of the estimators in (a). (c) means of weighted least squares estimators based on the regression model of the Hill sum plot (full line); generalized Hill sum plot (dashed line). (d) MSE of the estimators in (c). (e) same as in (c), but now weighted trimmed least squares is used. (f) MSE of the estimators in (e).

24

GENERALIZED SUM PLOTS

(b)

0.0

0.00

0.02

0.5

0.04

1.0

0.06

1.5

0.08

2.0

0.10

(a)

0

100

200

300

400

500

100

200

400

500

300

400

500

300

400

500

0.0

0.00

0.02

0.5

0.04

1.0

0.06

1.5

0.08

2.0

300

(d)

0.10

(c)

0

0

100

200

300

400

500

100

200

(f)

0.0

0.00

0.02

0.5

0.04

1.0

0.06

1.5

0.08

2.0

0.10

(e)

0

0

100

200

300

400

500

0

100

200

∗ Figure 2.3: Burr distribution with ρ = −0.5: (a) means of Hk,n (full line); Hk,n (dashed line); Mk,n (dotted line) as a function of k. (b) MSE of the estimators in (a). (c) means of weighted least squares estimators based on the regression model of the Hill sum plot (full line); generalized Hill sum plot (dashed line). (d) MSE of the estimators in (c). (e) same as in (c), but now weighted trimmed least squares is used. (f) MSE of the estimators in (e).

REGRESSION ESTIMATORS OF THE EXTREME VALUE INDEX

(b)

0.00

−1.0

0.01

−0.5

0.02

0.0

0.03

0.5

0.04

1.0

0.05

(a)

25

0

100

200

300

400

500

100

200

400

500

0.00

−1.0

0.01

−0.5

0.02

0.0

0.03

0.5

0.04

1.0

300

(d)

0.05

(c)

0

0

100

200

300

400

500

100

200

400

500

0.00

−1.0

0.01

−0.5

0.02

0.0

0.03

0.5

0.04

1.0

300

(f)

0.05

(e)

0

0

100

200

300

400

500

0

100

200

300

400

500

∗ Figure 2.4: Gamma distribution: (a) means of Hk,n (dashed line); Mk,n (dotted line) as a function of k. (b) MSE of the estimators in (a). (c) means of weighted least squares estimators based on generalized Hill sum plot (dashed line). (d) MSE of the estimators in (c). (e) same as in (c), but now weighted trimmed least squares is used. (f) MSE of the estimators in (e).

26

GENERALIZED SUM PLOTS

(b)

0.0

−2.0

0.1

−1.5

0.2

−1.0

0.3

−0.5

0.4

0.5

0.0

(a)

0

100

200

300

400

500

0

100

200

400

500

300

400

500

(d)

0.0

−2.0

0.1

−1.5

0.2

−1.0

0.3

−0.5

0.4

0.5

0.0

(c)

300

0

100

200

300

400

500

0

100

200

(f)

0.0

−2.0

0.1

−1.5

0.2

−1.0

0.3

−0.5

0.4

0.5

0.0

(e)

0

100

200

300

400

500

0

100

200

300

400

500

∗ Figure 2.5: Reversed Burr: (a) means of Hk,n (dashed line); Mk,n (dotted line) as a function of k. (b) MSE of the estimators in (a). (c) weighted least squares estimators based on the generalized Hill sum plot (dashed line). (d) MSE of the estimators in (c). (e) same as in (c), but now weighted trimmed least squares is used. (f) MSE of the estimators in (e).

REGRESSION ESTIMATORS OF THE EXTREME VALUE INDEX

27

0.5 0.0 −0.5 −1.0

−1.0

−0.5

0.0

0.5

1.0

(b)

1.0

(a)

0

20

40

60

80

0

20

40

80

60

80

(d)

−30

−1.0

−20

−0.5

−10

0

0.0

10

0.5

20

30

1.0

(c)

60

0

20

40

60

80

0

20

40

∗ Figure 2.6: Zaventem daily maximum wind speed data: (a) Hk,n (full line); Hk,n (dashed line); Mk,n (dotted) as a function of k. (b) weighted least squares estimators based on the regression model of the Hill sum plot (full line); generalized Hill sum plot (dashed line). (c) same as in (b), but now weighted trimmed least squares is used. (d) sum plots

28

GENERALIZED SUM PLOTS

1.5 1.0 0.5 0.0

0.0

0.5

1.0

1.5

2.0

(b)

2.0

(a)

0

50

100

150

200

250

300

0

50

100

200

250

300

200

250

300

(d)

0

0.0

100

0.5

200

1.0

300

1.5

400

500

2.0

(c)

150

0

50

100

150

200

250

300

0

50

100

150

∗ Figure 2.7: AoN claim size data: (a) Hk,n (full line); Hk,n (dashed line); Mk,n (dotted line) as a function of k. (b) weighted least squares estimators based on the regression model of the Hill sum plot (full line); generalized Hill sum plot (dashed line). (c) same as in (b), but now weighted trimmed least squares is used. (d) sum plots

Chapter 3

Impact of dependence on flood frequency analysis based on regression in quantile plots Co-author: Patrick Willems. Water Resources Research, 47, W12537, doi:10.1029/2010WR10160, December 2011

Abstract Quantile-quantile plots (qq-plots) are often applied to investigate tail behavior in extreme value analysis. When used for flood frequency analysis, based on river flow data, dependence in these data affects the extreme value distribution’s tail behavior in the qq-plot. This problem is investigated based on both synthetic and observed river flow series. The synthetic series are generated based on a stochastic simulation procedure, which allows the real statistical flow properties to be known. Through simulation study, we demonstrate that serial dependence causes bending behavior in the qq-plot in the region of higher quantiles. This leads to the estimating error when the slope of the qq-plot is used as an estimator in the extreme value analysis. A correction method is proposed by modifying the plotting position using the extremal index and/or estimates of the extremal index.

29

30

3.1

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

Introduction

Extreme value theory (EVT) is the branch of probability and statistics dedicated to characterizing the behavior of the extreme observations. Given a (independent identically distributed (iid)) sample X1 , · · · , Xn from a probability distribution F , let Mn be the sample maximum i.e. Mn = max(X1 , · · · , Xn ). If there exist normalizing constants an and bn > 0 such that   Mn − bn lim P ≤ x → G(x) n→∞ an then it is known that G must be one of the extremal distributions (Fisher and Tippet (1928), Gnedenko (1943)). The extremal distributions are usually combined to a parametric family: The generalized extreme value (GEV) family of distributions (The Jenkinson-Von Mises representation of extreme value distributions). The GEV has cumulative distribution function (  − γ1 ) x−µ Gγ (x) = exp − 1 + γ σ for all x such that 1 + γ x−µ σ > 0 where µ is the location parameter, σ > 0 is the scale parameter and γ is the shape parameter. The parameter γ controls the tail behavior; if γ is positive, then the tail is heavy. If γ = 0 (taken as the limit), then the tail is Exponentially decaying with G0 (x) = exp(− exp(x)). If γ is negative, then the tail is bounded or light. In practice, GEV is fitted to block maxima in particular annual maxima. However, using only annual maxima is considered as a wastage of useful information. Instead, a generalized Pareto distribution (GPD) is fitted to excesses over a high threshold. This has theoretical justification: For large threshold u, the distribution function of X − u, conditional on X > u, is approximately  γy −1/γ H(y) = 1 − 1 + σ defined on {y : y > 0 and (1 + γy ) > 0}. This is a GPD family. For γ = 0, σ H(y) is taken as a limit γ → 0, leading to   −y H(y) = 1 − exp , y > 0, σ corresponding to an Exponential distribution with rate parameter 1/σ. In this paper, data analysis is based on the GPD. In what follows we assume stationary sequences, that is (X1 , X2 , . . . Xm ) and (X1+h , X2+h , . . . Xm+h ) have the same distribution for all m and h.

INTRODUCTION

31

Return Levels Suppose that the GPD with parameters σ and γ is suitable model for exceedances of a threshold u by a variable X. Then for x > u P (X > x|X > u) = 1 − H(x − u). If follows that P (X > x) = λu {1 − H(x − u)}, where λu = P (X > u). Hence the level xm that is exceeded on average once every m observations is given by solution of λu {1 − H(x − u)} =

1 , m

leading to  xm =

u + σγ [(mλu )γ − 1], u + σ ln(mλu ),

γ= 6 0 γ=0

(3.1)

for large m. Standard errors for xm can be derived by delta method (see Coles (2001) for example). If there are ny independent observations per year then, an N-year return level corresponds to N ny −observations return level (i.e. m = N ny ). Hence, the N-year return level is defined by  u + σγ [(N ny λu )γ − 1], γ 6= 0 xN = . u + σ ln(N ny λu ), γ=0

Extremal Index At this point let the limiting distribution of maxima of the iid sequence be G(x), then the associated limiting distribution of maxima of a dependent sequence is given by Gθ (x), where θ is called the extremal index of the dependent sequence (Leadbetter et al., 1983). The extremal index θ ∈ [0, 1] is one measure of the clustering of extreme values. If θ = 1 the exceedances of an increasing threshold tend to occur singly, if θ < 1 the exceedances tend to form clusters. Various characterization of θ exist leading to different estimators. Common estimators found in literature are block, runs and the recent one-intervals estimator. The intervals estimator is reported to outperform the runs estimator at lower thresholds (Ferro and Segers, 2003). Since the aim of this paper is to study the effect of dependence, which is high at lower quantiles, we opt

32

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

for the intervals estimator in this paper. All estimators are usually threshold dependent. In this paper, where θ is estimated, the least threshold is taken to be the 90% quantile of a given data set. The choice is arbitrary, but we believe that this quantile is sufficiently extreme. This threshold gives us enough data to test our method. The development of an intervals estimator rests on the limiting distribution of the times between exceedances of a threshold u by theP process {Xn }n≥1 . Let n Ti be the observed inter-exceedance times and let N = i=1 I(Xi > u) be the observed number of exceedances, where I is the indicator function, then the basic intervals estimator θn (u) is given by θn (u) =

2n2 (N − 1) . PN −1 N 2 i=1 Ti2

The variance of this estimator can be computed using a bootstrap scheme as described, for instance, in Ferro and Segers (2003). In this paper all computations were run on R, version 2.4.1 (R Development Core Team, 2006)

QQ-plots In extreme value analysis the qq-plot is one of the popular tools used to identify the type of a distribution’s tail of a given data set of extremes (Beirlant et al., 1996a; Willems et al., 2007; Kratz and Resnick, 1996). It is also used to visually, assess goodness of fit and estimate location, shape, and scale parameters. Consider the iid sample X1 , X2 , . . . , Xn with a common distribution function F and the corresponding order statistics denoted by X1,n ≤ X2,n ≤ · · · ≤ Xn−j+1,n ≤ · · · ≤ Xn,n . Note here that Xn−j+1,n is the j-th largest order statistic. Assume F is continuous, the variables {Ui = F (Xi ), 1 ≤ i ≤ n} are iid uniformly distributed on (0, 1) and D

(F (Xn−j+1,n ))j=1,...,n = (Un−j+1,n )j=1,...,n . Also, E(F (Xn−j+1,n )) =

n−j+1 , n+1

in particular E(F (Xn,n )) = where E(Y ) denotes the expectation of Y .

j = 1, . . . , n

n , n+1

(3.2)

(3.3)

INTRODUCTION

33

Approximating F (Xn−j+1,n ) by its mean (n − j + 1)/(n + 1), the plot    n−j+1 , F (Xn−j+1,n ) , 1 ≤ j ≤ n n+1 should be approximately linear. Equivalently, the plot    n−j+1 F ←( ), Xn−j+1,n , 1 ≤ j ≤ n n+1 should be approximately linear. F ← denotes the inverse of F . Notice F ← ((n − j + 1)/(n + 1)) is a theoretical quantile and Xn−j+1,n is the corresponding quantile of the empirical distribution function and hence the name qq-plot. Equation (3.2) is one of the estimators of plotting position. The plotting position can be defined as a value pj which estimates F evaluated at xn−j+1,n , i.e. pj ≈ F (xn−j+1,n ). For iid sequences, Un−j+1,n is a Beta random variable with parameters n−j+1 and j, hence E(Un−j+1,n ) = n−j+1 n+1 = E(F (Xn−j+1,n )). Note that this result can be obtained through the distribution of Xn−j+1,n if it is known, which is the case for iid sequences. Other estimators of pj are found in literature (e.g. Gringorten (1963); Cunnane (1978)) for the iid case. For the independent but not identically distributed case, Jones (1997) proposes an estimator of pj via maximum-likelihood. Estimates in this case are slightly different compared to the iid case. For the dependent but identically distributed case one would also expect the plotting positions to be slightly different from that of the iid case. For example, Dales and Reed (1989) show that the distribution of the network maxima series from N iid GEV lies exactly ln N to the left of the population growth curve on a variate versus Gumbel reduced-variate plot. However because of inter-site (spatial) dependence in annual maxima, the netmax growth curve is found to lie a shorter distance to the left. Dale and Reed named this distance ln N e, terming N e the effective number of independent gauges. Reed et al. (1999) corrected the effect by shifting the netmax points to the right by ln N e instead of ln N . In this paper our focus is on the effect of serial dependence in the series on the qq-plot. A simulation study in this paper reveals that there is a bending behavior at higher quantiles in the qq-plots of extremes when the observations are dependent, a behavior which is not observed for iid sequences of the same marginal distribution. The aim of this paper is to find the connection between this behavior and the amount of dependence in the sequence. This is achieved by determining (approximately) the distribution of order statistics of the dependent sequence. This in turn helps to estimate pj and hence arrive at the proposed

34

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

correction method (adjusting plotting positions in the qq-plot). To be able to assess the effectiveness of the proposed method, synthetic data based on the max-autoregressive model as well as a first-order Markov chain with an extreme value transition distribution is generated. For real data, an iid subsequence from a dependent sequence is filtered by means of a declustering procedure. The qq-plot of the resulting sequence is compared to that of the full series. The paper is organized as follows: Section 3.2 gives a theoretical description of the effect of dependence on the qq-plots and its correction. Section 3.3 provides a simulation study. Here the synthetic series whose marginal distribution and extremal index are known is generated. Section 3.4 provides case studies in which the proposed method is applied to real river flows. Section 3.5 provides discussion and conclusions.

3.2

Effect of dependence on the qq-plots

Let X1 , X2 , . . . , Xn be a sample of iid random variables from a distribution F . For 0 < p < 1 let Q(p) = inf{x : F (x) ≥ p}. Q(p) is called the quantile function. For a continuous function F , Q(p) = F ← (p). Given an empirical and independent sequence x1 , x2 , . . . , xn , in general, the empirical quantile function Qn (p) is given by Qn (p) = xk,n ,

for

k + c1 k − 1 + c1 x} < j

i=1

and noting that Bn :=

n X

I{Xi >x}

i=1

is a Binomial random variable with parameters n and F¯ (x), where I is indicator function. For a dependent sequence, {I{Xi >x} , i = 1, 2, · · · , n} are dependent and hence affect the distribution of Bn . If nF¯ (x) → τ , using the Poisson approximation to Binomial distribution, and with τ = τ (x) = − ln G(x) we have the following limiting distribution of upper order statistics lim P (Xn−j+1,n ≤ un ) =

n→∞

j−1 X r=0

r

G(x)

(− ln G(x)) , r!

where un = cn x + dn with cn > 0 and dn ∈ < being normalizing constants. These results can be found in a number of articles, see Embrechts et al. (1997), Cohen (1989) and Hsing (1988), for example. For a dependent sequence, if P (Xn,n ≤ un ) → Gθ (x) as n → ∞, then the limiting distribution of order statistics is given by r j−1 X − ln Gθ (x) θ lim P (Xn−j+1,n ≤ un ) = G (x) Πj (r) (3.8) n→∞ r! r=0 where Πj (r) = P

r X

! Kl ≤ j − 1

l=1

with 0 < θ ≤ 1 being the extremal index of a stationary sequence and K1 , K2 , · · · are independent identically distributed random variables from π, the limiting distribution of cluster sizes. When r = 0, Πj (r) is interpreted as Πj (0) = 1. It is clear that the presence of dependence in the sequence affects the limiting distribution of order statistics. r

(− ln Gθ (x)) The term Gθ (x) in equation (3.8) corresponds to the Poisson process r! in (0, 1] with intensity θτ . Since the Poisson random variable happens to be

38

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

the limiting variable of the Binomial distribution we can consider this term as a limiting variable of the Binomial distribution with parameters bnθc and F¯ (x) provided that F¯ (x) → 0, where byc denotes the largest integer less than r (− ln Gθ (x)) or equal to y. That is, we can consider Gθ (x) as the limiting of r!   bnθc ¯ r lim F (x)F bnθc−r (x) n→∞ r for [nθ]F¯ (x) → θτ as n → ∞ and τ = − ln G. Therefore, we can approximate the distribution of order statistics of the stationary sequence by  j−1  X bnθc ¯ r Fj,n (x) = F (x)F bnθc−r (x)Πj (r). (3.9) r r=0 Hence we have nθ −

Pj−1

r=1 Πj (r) . (3.10) nθ + 1 To be able to use equation (3.10) we need to estimate Πj (r). This probability could be expressed in terms of π, the limiting probability of cluster sizes, but it not known in practice. In the following we determine the range in which Pis j−1 r=1 Πj (r) can be found.

E(F (Xn−j+1,n )) =

Since the random variables K1 , K2 , · · · are iid from π the limiting Pr distribution of cluster sizes, the distribution π has mean θ−1 . Hence E( l=1 Kl ) = rθ−1 . Applying the Markov inequality we have ! Pr r X E( l=1 Kl ) r 1 − Πj (r) = P Kl > j − 1 ≤ = . (3.11) j−1 θ(j − 1) l=1

In equation (3.11) the right hand side of the inequality can be bigger than 1. When r ≤ θ(j − 1) we have bθ(j − 1)c − 1 ≤ 2

bθ(j−1)c

X

Πj (r) ≤ θ(j − 1).

r=1

Moreover, bθ(j−1)c

X r=1

Πj (r) ≤

j−1 X

Πj (r) ≤ j − 1.

r=1

Pj−1 When no further information is known about π we approximate r=1 Πj (r) by Pθ(j−1) the upper limit of r=1 Πj (r), i.e θ(j − 1). Substituting this in (3.10) we have n−j+1 . (3.12) E(F (Xn−j+1,n )) ≈ n + 1/θ

EFFECT OF DEPENDENCE ON THE QQ-PLOTS

39

Intuitively one can understand this by assuming that there are, on average, nθ independent elements in n realizations. Then apply formula (3.7) based on an independent sequence using nθ and making approximations assuming n is large enough: P (Xn−k+1,n ≤ x) =

bnθc(bnθc − 1)! (k − 1)!(bnθc − k)! Z x F nθ−k (u)[1 − F (u)]k−1 dF (u) + o(1). ×

(3.13)

−∞

Note that in equation (3.13), k − 1 ≤ bnθc. This leads to E(F (Xn−k+1,n )) = n− k−1 θ n+ θ1

with k−1 ≤ n. Changing index such that j − 1 = θ equation (3.12).

k−1 θ

≤ n leads to

Therefore, the Pareto qq-plot respectively and the Exponential qq-plot with plotting positions adjusted becomes    n−j+1 − ln(1 − ), ln(Xn−j+1,n ) : j = 1, . . . , n n + 1/θ and

 − ln(1 −

n−j+1 ), Xn−j+1,n n + 1/θ



 : j = 1, . . . , n .

Suppose that a least square line is fitted to the k + 1 upper quantiles (the upper part of the plot) and forcing the line to pass through the most left point n−k (− ln(1 − n+1/θ ), ln(Xn−k,n )) then the line fitted to Pareto qq-plot has the form  ln(Xn−j+1,n ) = ln(Xn−k,n ) + γ ln

(k + 1) + (1/θ − 1) j + (1/θ − 1)

 .

    k+1 Note that ln (k+1)+(1/θ−1) ≤ ln . Therefore we can define the function j+(1/θ−1) j 0 < ρ ≤ 1 depending on j, k and θ by     (k + 1) + (1/θ − 1) k+1 ln = ρj,k,θ ln . j + (1/θ − 1) j In terms of ρ, the line has a form of  ln(Xn−j+1,n ) = ln(Xn−k,n ) + γρj,k,θ ln

k+1 j

 .

(3.14)

For θ = 1, ρ = 1. For fixed j and k, ρ increases (decreases) with increase (decrease) in θ. For fixed k and θ, ρ → 1 as j increases towards k. For fixed

40

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

j and θ, ρ increases with an increase in k, however, for this case ρ has values near to 1. These are numerical properties obtained from   ln (k+1)+(1/θ−1) j+(1/θ−1)   ρ= ln k+1 j

rho

0.65

0.70

0.75

0.80

0.85

0.90

0.95

for variation combination of j, k and θ. Figure 3.1 shows variation of ρ with j for fixed k = 100 and θ = 0.2. It follows that if the plotting position is not adjusted when θ < 1, the fitted line given by equation (3.14) has small slope i.e. γρj,k,θ . Furthermore, since ρj,k,θ is not constant, the plot will have a varying slope, and the effect will be large at small j (hence large quantiles). This could be the reason for the bending effect in the qq-plot at higher quantiles for dependent sequences.

0

20

40

60

80

100

j

Figure 3.1: Variation of ρ with j for k = 100 and θ = 0.2. ρ is defined in equation in 3.14.

3.3

Simulation study

In this section, we investigate the effect of dependence on a qq-plot. We also investigate the effectiveness of the proposed method in correcting this effect. Since our method depends on the extremal index θ, we generate synthetic data whose θ is known in advance. Two stationary processes are used: a maxautoregressive process and a first-order Markov chain with an extreme value

SIMULATION STUDY

41

transition distribution. Let Wn for n ≥ 1 be an independent unit Fréchet random variable and define random variables {Xn }n≥1 as follows X1 = W1 /θ, Xn = max{(1 − θ)Xn−1 , Wn }

for

n≥2

and θ ∈ (0, 1].

Then {Xn }n≥1 is a max-autoregressive processes with extremal index θ and with unit Fréchet margins. For the Markov chain, let α ∈ (0, 1] and the distribution function F (y1 , y2 of (X1 , X2 ) be given by −1/α

P (X1 ≤ y1 , X2 ≤ y2 ) = exp{−(y1

−1/α α

+ y2

) }.

Then F (y1 , y2 ) is a bivariate extreme value distribution with symmetric logistic dependence structure and with unit Fréchet margins. Other margins can easily be obtained by the integral transform technique. For instance, if FY (y) is a distribution function of any random variable Y , then − ln(FY (Y )) has an Exponential distribution. The extremal index θ of Markov chain process {Xn }n≥1 for specific values of α may be found by simulation (Smith et al., 1997; Ferro and Segers, 2003). In this simulation study, the standard Exponential distribution and the unit Fréchet distribution each is used for both max-autoregressive and Markov chain processes as the marginal distribution. For each process with fixed θ (or α in the case of the Markov chain ), 1000 samples each of 5000 sample size are generated. From the generated sequences, 2 pairs of qq-plot are produced: One pair is based on the individual sample with and without dependence effect corrected. Another pair is based on the mean behavior of order statistics of 1000 samples. The mean behavior is computed as follows. For exponential (s) qq-plots, if xn−j+1,n corresponds to the order statistics from sample s then the Pz (s) mean behavior x∗n−j+1,n is calculated as x∗n−j+1,n = z1 s=1 xn−j+1,n where z is the number of samples (z = 1000 in our case). For the Pareto qq-plot the Pz (s) mean behavior is calculated as x∗n−j+1,n = z1 s=1 ln(xn−j+1,n ). For the max-autoregressive process, extremal indices θ ∈ (0.2, 0.5, 0.8) are used. For Markov chain extremal indices θ ∈ (0.25, 0.5, 0.75), corresponding to α ∈ (0.43, 0.64, .82) reported in Ferro and Segers (2003) are used. Because of limited space, only results for θ = 0.2 in case of the max-autoregressive process and θ = .25 (or α = 0.43) in case of the Markov chain are shown. Furthermore,

42

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

only the max-autoregressive with unit Fréchet margins and Markov chain with standard Exponential margins are shown.

6 4

log(x)

0

2

4 0

2

log(x)

6

8

(b)

8

(a)

0

2

4

6

Quantiles of Standard Exponential

8

0

1

2

3

4

5

6

7

Quantiles of Standard Exponential

Figure 3.2: Pareto qq-plot of a sample generated from a max-autoregressive process with θ = 0.2 and a unit Fréchet margin (a) with dependence effect; (b) with effect corrected using θ = 0.2. The superimposed straight line is the theoretical regression line.

In Figure 3.2 we show a pair of Pareto qq-plots of an individual sample with and without dependence effect corrected. The sample is drawn from the generated 1000 samples of the max-autoregressive process with θ = 0.2 and unit Fréchet margins. To minimize sample variability, a pair of Pareto qq-plots based on mean behavior (defined above) is shown in Figure 3.3. Similar plots (Exponential qq-plot) are shown in Figures 3.4 and 3.5 respectively for the Markov chain with θ = 0.25 (α = 0.43) and standard Exponential margins. From these plots, downward bending behavior at higher quantiles is noticed for dependent sequences. For the iid sequence bending behavior is usually noticed at lower quantiles in the case of Fréchet margins (and other heavy tails except a strict Pareto distribution). This is usually explained to be caused by a slow variation effect as far as extreme value theory is concerned. Correction for this effect can be found in literature, e.g. Willems et al. (2007). At higher quantiles bending behavior could be caused by other factors such as a poor fit of the distribution and unmodelled covariates (e.g trend, seasonal effects). However for these qq-plots, the sequences are stationary and the margins are known. When the proposed method is applied the effect is almost removed. The bending effect is minimal as θ → 1, and is high when θ → 0.

SIMULATION STUDY

43

6 0

2

4

mean of log(rank of x)

6 4 0

2

mean of log(rank of x)

8

(b)

8

(a)

0

2

4

6

8

0

2

Quantiles of Standard Exponential

4

6

Quantiles of Standard Exponential

Figure 3.3: Pareto qq-plot of mean behavior of order statistics of 1000 samples generated from a max-autoregressive process with θ = 0.2 and a unit Fréchet margin (a) with dependence effect; (b) with effect corrected using θ = 0.2. The superimposed straight line is the theoretical regression line.

4 x 2 0

0

2

x

4

6

(b)

6

(a)

0

2

4

6

Quantiles of Standard Exponential

8

0

1

2

3

4

5

6

7

Quantiles of Standard Exponential

Figure 3.4: Exponential qq-plot of a sample generated from a Markov chain process with θ = 0.25 (α = 0.43) and Exponential margin (a) with dependence effect; (b) with effect corrected. The superimposed straight line is the theoretical regression line.

Usually a return level plot (m, xm ), derived from equation (3.1), is replaced by (m, x∗m ) where x∗m is given by  u + σγ [(mλu θ)γ − 1], γ 6= 0 x∗m = (3.15) u + σ ln(mλu θ), γ=0

44

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

4

mean of rank of x

0

2

4 2 0

mean of rank of x

6

(b)

6

(a)

0

2

4

6

Quantiles of Standard Exponential

8

0

2

4

6

8

Quantiles of Standard Exponential

Figure 3.5: Exponential qq-plot of mean behavior of order statistics of 1000 samples generated from a Markov chain process with θ = 0.25 (α = 0.43) and standard Exponential margin (a) with dependence effect; (b) with effect corrected. The superimposed straight line is the theoretical regression line.

when the GPD is fitted to cluster maxima, θ being the extremal index of the original series. Here again λu = P (X > u) (Coles, 2001). To assess, visually, the best fit of return levels, normally all (ordered) exceedances over the threshold are superimposed to the return level plot. For the best fit, one would expect the superimposed points to be close to the point estimates of return levels and to be within the confidence interval limits. In order to superimpose exceedances on the return level plot, one needs to estimate the plotting positions. Given the empirical quantile xn+j+1 > u, and since H is not known in practice, the plotting positions pj = H(xn−j+1,n ), j = 1, 2, . . . , n need to be estimated so as to have plotting positions (m, ˆ xn−j+1,n ) where m ˆ = (1 − H(xn−j+1,n ))−1 . For an iid sequence one can estimate pj by (n − j + 1)/(n + 1). For dependent sequences the proposed method leads to an estimate (n−j +1)/(n+θ−1 ). A pair of the return level plots with uncorrected plotting positions and with corrected plotting positions is shown in Figure 3.6. From these plots, certainly, observed quantiles are within the confidence intervals of the estimated quantiles. However, for the uncorrected plotting positions, one can conclude that x∗m over estimates the return level. With the corrected plotting positions, the conclusion obvious

CASE STUDIES

45

would be different, as can be seen from Figure 3.6 (b), observed quantiles are close to the estimated return level. (a)

(b)

8000 6000

Return level

0

2000

4000

6000 4000 0

2000

Return level

8000

10000

Return Level Plot

10000

Return Level Plot

0.1

0.5

1.0

5.0 10.0

Return period (years)

50.0

0.1

0.5

1.0

5.0 10.0

50.0

Return period (years)

Figure 3.6: Return level plot with plotting position (a) with dependence effect; (b) with effect adjusted. Sequence was generated from max-autoregressive process with θ = .2 and unit Fréchet margins. Upper and lower curves are 95% confidence interval.

3.4

Case studies

In this section we apply the proposed method to real data. To be able to assess the effectiveness of the method, a subsequence with independent observations is filtered from the series by means of a declustering techniques. The qq-plots of the independent and corresponding dependent series are shown. The return level plot for these series is also shown. Furthermore, the estimates of the slope of resulting qq-plots are obtained. All these plots and estimates are obtained when the proposed method is applied. Real river flow data considered for this study involves hourly discharge observations along the Molenbeek river in Belgium (Erpe-Mere station; catchment area of 40 km2 , years 1986-1996); and daily flow observations of Nadderv and Witham rivers in the UK (Clay Mill and Wilton stations; catchment areas of 221 km2 and 298 km2 ; years 1966-2003,1959-2002). These are the same data as considered in Willems et al. (2007).

46

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

Declustering and parameter estimation Various procedures exist in which approximately independent exceedances are filtered. The most widely-adopted method is declustering. This works by using an empirical rule (e.g. blocks, runs, and intervals estimators) to define independent clusters of exceedances. The statistical model is then fitted to the cluster maxima, some times called peak over threshold (POT). The method is simple but it has its limitations. The results can be sensitive to the arbitrary choices made in the cluster identification. Alternative methods exist where by all exceedances are modeled. The list includes multivariate techniques (assuming dependence structure) as well as fitting GPD to all exceedances and adjusting estimates of uncertainty (Smith et al., 1997; Fawcett and Walshaw, 2007). Since our aim is to study the effect of dependence on qq-plots, we look for a procedure of extracting independent sequences from time series, using a declustering technique. Furthermore, since our method relies on θ (hence to be estimated), we employ an autodeclustering technique proposed by Ferro and Segers (2003) which is based on the intervals estimator. Choosing a threshold equal to 0.99 (> 90% quantile) on the Molenbeek time series produces 8987 exceedances out of 96029 realizations. Applying an autodeclustering procedure using the estimated θ = 0.0145(0.010, 0.222) at this threshold, results in 130 clusters. Based on all exceedances and cluster maxima, qq-plots and return level plots (based on fitted GPD) with and without dependence effect corrected are produced. Since cluster maxima are considered independent, the slope fitted to the respective qq-plot represents the slope of the associated independent sequence. To assess the effect of dependence on the qq-plot, a line with this slope is superimposed onto the qq-plots of other cases. A pair of the Exponential qq-plots of all exceedances with and without dependence effect corrected is shown in Figure 3.7. The superimposed line on these plots has a slope equal the slope of respective cluster maxima. As can be seen, the points (Figure 3.7 (a) ) on higher quantiles deviate from the superimposed line but close (Figure 3.7 (b) ) to the line when the dependence effect is corrected. The effect of the dependence on the return level plot is shown in Figure 3.8 (a) and with dependence effect corrected Figure 3.8 (b). When plotting position is not adjusted, there is a tendency of observed higher quantiles (Figure 3.8 (a)) to deviate from point estimate while close when the effect is corrected. Note here that the effect of dependence on the return level estimates is corrected via equation (3.15) while the effect of dependence on the superimposed observed return levels is corrected via plotting position adjustment using the proposed method.

CASE STUDIES

47

10 8 6 0

2

4

Ordered Exceedances

8 6 4 0

2

Ordered Exceedances

10

12

(b)

12

(a)

0

2

4

6

8

10

0

Quantiles of Standard Exponential

2

4

6

8

Quantiles of Standard Exponential

Figure 3.7: Exponential qq-plot of Molenbeek river discharge exceedances (a) with dependence effect; (b) with effect adjusted. The superimposed line has the slope 2.379. Unit of x is m3 /s

(a)

(b)

80 60

Return level

0

20

40

60 40 0

20

Return level

80

100

Return Level Plot

100

Return Level Plot

1e−01

1e+00

1e+01

1e+02

Return period (years)

1e+03

1e−01

1e+00

1e+01

1e+02

1e+03

Return period (years)

Figure 3.8: Return level plot of Molenbeek river discharge with GPD fitted to cluster maxima (a) with dependence effect; (b) with effect adjusted. Full line is the point estimate and upper and lower dotted lines are 95% confidence interval limits. Unit of x is m3 /s

48

IMPACT OF DEPENDENCE ON FLOOD FREQUENCY ANALYSIS

The mean excess plot (not shown) indicates linear increase behavior between thresholds 0.99 and 4; horizontal behavior between 4 and 9.4 and linear decrease behavior above 9.4. However, only 11 observations are above 9.4, too few to make meaningful inference. So the distribution seems to be in between heavy and Exponential decaying tail. The maximum-likelihood estimates (with standard error given in brackets) of parameters (γ, σ) of the GPD fitted to all exceedances and cluster maxima are {0.144 (0.129), 0.859 (0.014); 0.104 (0.120), 1.88 (0.280)}. Small values of the shape (γ) parameter estimates with the confidence interval containing a zero suggests Exponential decaying tail. A similar conclusion on tail type is found in Willems et al. (2007) where a similar data is analyzed. The least square estimates of the slope of the Exponential and the Pareto qq-plot for all exceedances, cluster maxima and corrected dependence effect are {1.403(0.006), 2.379 (0.018), 1.670 (0.011); 0.201 (0.0018), 0.282(0.0196), 0.243(0.0007) } respectively with standard error given in brackets. The least square estimates are fitted to 301 higher quantiles of the qq-plot. The choice is based on mean excess plot as well as visual inspection on qq-plot so that the majority of higher quantiles are close to the fitted line. For cluster maxima, the line is fitted to all (130) quantiles. A slope of the fitted line on Exponential qq-plot estimates a scale parameter σ for γ = 0 and estimates shape parameter γ for γ 6= 0 in Pareto quantile plot. With reference to the fitted line (with slope equal to 2.379) for cluster maxima, there is an improvement in slope from 1.403 to 1.670 on the Exponential qq-plot of all exceedances when the dependence effect is corrected . This improvement could be compared to the bias corrected asymptotic slope (1.825) on the Exponential qq-plot of the same data set reported in the paper by Willems et al. (2007). In that paper the independent sequence was obtained using physical criteria whereby two consecutive peaks were considered independent if some defined physical conditions were fulfilled. We refer to that paper for more details for reader interested in the technique. A similar conclusion is reached for Pareto qq-plot. There is an improvement in slope from 0.201 to 0.243 in comparison to 0.282 (a slope based on cluster maxima) when the proposed method is applied to the Pareto qq-plot. For this case, the line was fitted to the 301 highest quantiles except for cluster maxima in which a line was fitted to the 31 highest quantiles. The choice of these quantiles was based on the visual inspection of the linear trend of the qq-plot. The optimal choice on the number of higher quantiles to be used in estimation is a research question. The work on this topic can be found in literature. A summary on some work on the topic can be found in Beirlant et al. (2004) section 4.7.

CASE STUDIES

49

(a)

(b)

50 40 30

Return level

0

10

20

30 0

10

20

Return level

40

50

60

Return Level Plot

60

Return Level Plot

1e−01

1e+00

1e+01

1e+02

Return period (years)

1e+03

1e−01

1e+00

1e+01

1e+02

1e+03

Return period (years)

Figure 3.9: Return level plot of GPD fitted to all exceedances of threshold 0.99 of Molenbeek river flow series (a) with dependence effect, (b) with effect adjusted. Unit of x is m3 /s.

From this analysis, it appears that serial dependence affects the qq-plots. All real river flow series we have analyzed in this paper show bending behavior at higher quantiles except the Nadderv river flow series. However, the Nadderv data set has 3 values (27.73, 24.56, 25.82 ) which are quite large relative to the others (the fourth highest value is 20.60). The second and third highest values are in the same year with time lag of 52 days. The first and third highest have time lag of 20 years. When these values are excluded in the analysis, the bending behavior is observed. The bending behavior leads to under estimation of the fitted slope in comparison to the slope fitted to respective cluster maxima. A fair improvement in estimation is noticed when adjustment of plotting position is done using the proposed method to account serial dependence in the series. The serial dependence also appears to affect the quality of return level plot as far as the empirical quantile is concerned. The superimposed empirical data tends to deviate downward from the point estimate of return levels, and sometimes deviates away from the confidence interval limits as can be noticed in Figure 3.9 (a). Again, fair improvement is noticed when plotting positions are adjusted. This is illustrated in Figure 3.9 (b) where the effect in Figure 3.9 (a) is adjusted. In these plots, the GPD is fitted to all exceedances above threshold 0.99 of the Molenbeek river flow series. The effect decreases with an increase in threshold (decrease in serial dependence). Similar results were obtained to other data sets.

3.5

Discussion and Conclusion

We have demonstrated that the bending (downwards) behavior at extreme values in qq-plots could be due to the effect of dependence in the series. This may mislead the interpretation of qq-plots when used as tool for discriminating between heavy and Exponential decaying tails. Moreover, the bending behavior in qq-plots at extreme values may mislead the actual gradient when used as an estimate of the parameter of interest (eg. scale, shape, location, return levels etc.). In fact the bending behavior leads to the under estimation of the real gradient. When the extremal index is known (or estimated), the developed method can be used to minimize (or remove) the effect of dependence (bending behavior) in the qq-plots. For real data, this effect could be caused by other factors, such as unmodeled covariates which need to be investigated. The method has been tested on synthetic series: autoregressive processes and first order Markov chain processes with known extremal index and marginal distributions. From these, the bending effect appears to increase with an increase in serial dependence. The method was also applied to real data: Molenbeek, Nadderv and Witham river flows. It was found that when the effect of dependence is corrected, the qq-plots based on POT values and all exceedances (mostly dependent) values have similar results.

50

General Conclusions This thesis has focused on two statistical tools: sum plots and quantile-quantile plots (qq-plots). In chapter 2 we extended the sum plot developed for heavy tails estimator, in particular Hill estimator, to all tail types i.e. heavy, normal and light tails. The sum plot shows linear trend at higher quantiles. Regression estimator for the extreme value index γ can be obtained based on the slope of the linear part of the sum plot. We have illustrated this using Hill estimator, generalized Hill estimator and moment estimator. The sum plot can be used to test sign of γ for a given data set: increase in k indicates γ > 0, a horizontal pattern indicates γ = 0 and decrease in k indicates γ < 0. In future work this could be used to test the domain of attraction condition. For an overview of this problem we can refer to Neves and Fraga Alves (2008). We also studied the effect of dependence in the qq-plots. Simulation study here has shown that the effect of dependence in sequence causes bending behavior at higher quantiles. We have shown that this effect can be traced in the plotting position estimation. From approximation of the distribution of order statistics of dependent sequences, a new plotting position was proposed which accounts for the dependence in sequence through extremal index θ. The new plotting position has shown to reduce the effect of dependence in qq-plot. This also reduces the bias of extreme value index γ estimates based on regression estimators on the qq-plot.

51

Bibliography Beirlant, J., Dierckx, G., Goegebeur, Y., and Matthys, G. (1999). “Tail Index Estimation and an Exponential Regression Model.” Extremes, vol. 2(2), 177–200. Beirlant, J., Dierckx, G., and Guillou, A. (2005). “Estimation of the extremevalue index and generalized quantile plots.” Bernoulli, vol. 5(6), 949–970. Beirlant, J., Dierckx, G., Guillou, A., and Stărică, C. (2001). “On Exponential Representations of Log-Spacings of Extreme Order Statistics.” Extremes, vol. 5, 157–180. Beirlant, J., Segers, J., Goegebeur, Y., and Teugels, J. (2004). Statistics of Extremes. John Wiley & Sons, Ltd. Beirlant, J., Teugels, J., and Vynckier, P. (1996a). Practical Analysis of Extreme Values. Leuven University Press. Beirlant, J., Vynckier, P., and Teugels, J. (1996b). “Excess functions and estimation of the extreme value index.” Bernoulli, vol. 2, 293–318. Cohen, J. (1989). “On the compound poisson limit for higher level exceedances.” Journal of Applied Probability, vol. 26, 458–465. Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag London Limited. Cook, N. (2011). “Comments on ”Plotting Positions in Extreme Value Analysis”.” Journal of Applied Meteorology and Climatology, vol. 50, 255–266. Cunnane, C. (1978). “Unbiased plotting positions-a review.” Journal of Hydrology, vol. 37(3/4), 205–222. Dales, M. Y. and Reed, R. W. (1989). “Regional flood and storm hazard assessment.” Report No. 102, Institute of Hydrology, Wallingford, UK , 159pp.

53

de Haan, L. (2007). “Comments on ”Plotting Positions in Extreme Value Analysis”.” Journal of Applied Meteorology and Climatology, vol. 46, 396. de Sousa, B. and Michailidis, G. (2004). “A Diagnostic Plot for Estimating the Tail Index of a Distribution.” Journal of Computational and Graphical Statistics, vol. 13(4), 974–1001. Dekkers, A., Einmahl, J. H. J., and De Haan, L. (1989). “A Moment Estimator for the Index of an Extreme-Value Distribution.” The Annals of Statistics, vol. 17(4), 1833–1855. Dierckx, G. (2000). Estimation of Extreme Value Index. Ph.D. thesis, Katholieke Universiteit Leuven. Drees, H. (2002). “Tail Empirical Processes Under Mixing Conditions.” In H. Dehling, T. Mikosch, and M. Sorenses (editors), “EMPIRICAL PROCESS TECHNIQUES FOR DEPENDENT DATA,” Birkhauser, Boston, 325–342. Embrechts, P., Klüppelberg, C., and Mikosch, T. (1997). Modelling Extremal Events. Springer-Verlag Berlin Heidelberg. Fawcett, L. and Walshaw, D. (2007). “Improved estimation for temporally clustered extremes.” Environmetrics, vol. 18, 173–188. Ferro, C. A. and Segers, J. (2003). “Inference for clusters of extreme values.” Journal of Rolay of Statistics Society B, vol. 65(Part 2), 545–556. Fisher, R. A. and Tippet, L. (1928). “Limiting forms of the frequency distribution of the largest or smallest member of a sample.” In “Proceedings of the Cambrige Philosophical Society,” vol. 24. 180–190. Fraga Alves, M., Gomes, M. I., and de Haan, L. (2003). “A new class of semiparametric estimator of the second order parameters.” Portugalie Mathematica, vol. 60, 193–213. Gnedenko, B. (1943). “Sur la distribution limite du terme maximum d’une serie aléatoire.” Annals of Mathematics, vol. 44, 423–453. Gringorten, I. (1963). “A plotting rule for extreme probability paper.” J. Geophys. Res., vol. 68, 813–814. Hall, P. (1982). “On some simple estimate of an exponential of regular variation.” Journal of the Royal Statistical Society, vol. B 44, 37–42. Hazen, A. (1914). “Storage to be provided in impounding reserviors for municipal water supply.” Trans. Amer. Soc. Civ. Eng., vol. 77(1308), 1547–1550.

54

Henry III, J. B. (2009). “A Harmonic Moment Tail Index Estimator.” Journal of Statistical Theory and Applications, vol. 8(2), 141–162. Hill, B. (1975). “A simple general approach to inference about the tail of a distribution.” Ann. Statist., vol. 3, 1163–1174. Hill, W. G. (1976). “Order Statistics of Correlated variables and implications in genetic selection programmes.” Biometrics, vol. 32, 889–902. Hill, W. G. (1977). “Order Statistics of Correlated variables and implications in genetic selection programme. II. response to selection.” Biometrics, vol. 33, 703–712. Hsing, T. (1988). “On the extreme of order statistics for a stationary sequence.” Stoch. Proc. Appl., vol. 29, 155–169. Hsing, T. (1991). “On tail index estimation using dependent data.” The Annals of Statistics, vol. 19(3), 1547–1567. Hsing, T. (1993). “Extremal index estimation for weakly dependent stationary sequence.” The Annals of Statistics, vol. 21(4), 2043–2071. Huisman, R., Koedijk, K. G., Kool, C. J. M., and Palm, F. (2001). “Tail-Index Estimates in Small Samples.” Journal of Business & Economic Statistics, vol. 19(2), 208–216. Jenkinson, A. F. (1955). “The frequency distribution of the annual maximamum (or minimum) values of meteorological elements.” Quart. J. Roy. Meteo. Soc., vol. 81, 158–171. Jones, D. A. (1997). “Plotting positions via maximum-likelihood for a nonstandard situation.” Hydrology and Earth Sythem Sciences, vol. 1(3), 357–366. Katz, W., Richard, Parlange, B., Marc, and Naveau, P. (2002). “Statistics of extremes in Hydrology.” Advances in Water Resources, vol. 25, 1287–1304. Kenitzer, S. (2006). “Natural hazards are more common than statistics indicate.” American Meteorological Society, News release. URL http://www.ametsoc.org/amsnews/newsreleases.html Kratz, M. and Resnick, S. (1996). “The qq-estimator and heavy tails.” Communications in Statistics: Stochastic Models, vol. 12(3), 699–724. Leadbetter, M. (1983). “Extremes and Local Dependence in Stationary Sequences.” Wahrscheinlichkeitstheorie verw. Gebiete, vol. 65, 291–306. Leadbetter, M., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Series. Springer, New York Inc. 55

Markkonen, L. (2006). “Notes and Ccorrespondence: Plotting Positions in Extreme Value Analysis.” Journal of Applied Meteorology and Climatology, vol. 45, 334–340. Markkonen, L. (2008). “Extreme Value Analysis and Order Statistics: Bringing Closure to the Plotting Position Controversy.” Communications in Statistics Theory and Methods, vol. 37, 460–467. Neves, C. and Fraga Alves, M. (2004). “Reiss and Thomas’ automatic selection of the number of extremes.” Computational Statistics and Data Analysis, vol. 47, 689–704. Neves, C. and Fraga Alves, M. (2008). “Testing extreme value conditions: an overview and recent approaches.” REVSTAT-Statistical Journal, vol. 6, 83–100. Novak, S. Y. (2002). “Inference on heavy tails from dependent data.” Siberian Adv. Math, vol. 12(2), 73–96. Owen, D. B. and Steck, G. P. (1962). “Moments of order statistics from the equicorrelated multivariate normal distribution.” Annals of Mathematical Statistics, vol. 33, 1286–91. Pickands, J. (1975). “Statistical inference using extreme order statistics.” The Annals of Statistics, vol. 3(1), 119–131. R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org Rawlings, J. O. (1976). “Order Statistics for a special class of unequally correlated multinormal variates.” Biometrics, vol. 32, 875–887. Reed, R. W., Faulkner, D. S., and Stewart, E. J. (1999). “The foregex method of rainfall growth estimation ii: Description.” Hydrology and Earth Sythem Sciences, vol. 3(2), 197–203. Rosbjerg, D. (1985). “Estimation in partial duration series with independent and dependent peak values.” Journal of Hydrology, vol. 76, 183–195. Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley. Smith, R., Tawn, J. A., and Coles, S. G. (1997). “Markov chain models for threshold exceedances.” Biometrika., vol. 84(2), 249–268. Printed in Great Britain. 56

Von Mises, R. (1936). “The frequency distribution of the annual maximamum (or minimum) values of meteorological elements.” Reprinted in Selected Papers Volumen II, American Mathematical Society Providence, R.I, 1954 , 271–294. Weibull, W. (1939). “A statistical theory of strength of materials.” Ing. Vetensk. Akad. Handl., vol. 151, 1–45. Willems, P., Guillou, A., and Beirlant, J. (2007). “Bias correction in hydrologic GPD based extreme value analysis by means of a slowly varying function.” Journal of Hydrology, vol. 338, 221–236.

57

Publications 1. Beirlant J., Boniphace E., and Dierckx G. "Generalized Sum plots". REVSTAT-Statistical Journal, Volume 9, Number 2, June 2011, 181198. 2. Boniphace E. and Willems P. "Impact of dependence in river flow data on flood frequency analysis based on regression in quantile plots: analysis and solutions". Water Resources Research, 47, W12537, doi:10.1029/2010WR10160, December 2011.

59

Arenberg Doctoral School of Science, Engineering & Technology Faculty of Science Department of Mathematics Research Group: Statistics Celestijnenlaan 200B box 2400 B-3001 Heverlee

Suggest Documents