Bootstrap confidence intervals for large-scale multivariate monotonic regression problems

Oleg Sysoev, Anders Grimvall, Oleg Burdakov

February 22, 2010

Abstract

Monotonic regression (MR) results in functions that are monotonic with respect to a set of input variables or predictors. Recently, the methods used to estimate such regression models were substantially improved, and several algorithms can now produce monotonic fits to large, multivariate datasets. Here, we show how the bootstrap can be used to estimate the uncertainty of such fits. In particular, we show that a segmentation of the MR problem can greatly reduce the cost of computing confidence intervals for the expected response at arbitrary levels of the predictors. Also, we show that the processing of binary response variables requires an algorithm that utilizes results obtained for different topological orders of the given data. The performance of our algorithms is illustrated using data on death in coronary heart disease (CHD) for a large population that several decades ago took part in a health screening involving blood analyses. This dataset also illustrates how MR of a binary response can be a valuable complement to logistic regression.

Keywords: Confidence intervals, monotonic regression, bootstrap, percentile, BCa, pool-adjacent-violators algorithm, large-scale problems.

1 Introduction

Monotonic regression (MR) is widely used in different areas, such as economics, image and signal processing, physics, and medicine (see, for instance, [1, 2, 20]). It is sometimes reasonable to assume that a response variable y ∈ R^1 is a monotonic function of explanatory variables (inputs, predictors) x ∈ R^p, i.e. increasing or decreasing with respect to x. For instance, one may treat the risk of lung cancer as an increasing function of exposure and age. However, the empirical data obtained are often not consistent with monotonicity, for reasons such as measurement errors or missing important predictors. The task of MR is to obtain a monotonic fit from a non-monotonic data set D. Moreover, this fit should be optimal, i.e. as close as possible, in a certain sense, to the observed response values in D.

The general MR framework was developed in [3, 17]. During the past decades, exact algorithms, i.e. algorithms able to obtain an optimal fit, were suggested; see, for instance, [4, 5, 14, 16]. These algorithms vary with respect to their computational complexity, but even the best of the exact algorithms can fit only medium-scale problems. Recently, the GPAV and GPAVR algorithms [6, 7, 8] and later the SB algorithm [9] were developed. Although the fit obtained by these algorithms is not optimal, it is very close to optimal. An important property of these algorithms is their low computational complexity, which makes it possible to fit very large-scale multivariate data sets. For such sets, it is interesting to estimate the uncertainty of the expected response. In this paper, we derive bootstrap confidence interval (CI) algorithms which obtain CIs for the expected response values corresponding to the input levels present in the data set. The set of CIs obtained for all input levels forms a confidence band.

There exist papers devoted to CIs in MR. Analytical CI methods are developed in [13, 15, 18, 19]. Except for [19], these methods are suggested for the univariate case where the data have a dose-response structure, i.e. each input value corresponds to a fixed number of response values. In fact, they do not even use the optimal MR fit in their computations; only the monotonicity assumption is employed. In [19], the optimal MR fit is used for a general univariate data structure, not necessarily dose-response, but it is unclear how to use this method in a multivariate setting. Another limitation is that analytical methods are developed for an observed response which can be presented as a sum of the expected response and normally distributed error terms. A criticism of analytical CI methods for MR was given in [11], where it was claimed that the fitted values may sometimes lie outside the confidence band and that this band is often too wide.
Bootstrap CI methods are an alternative to the analytical methods. Their benefits are applicability to any regression technique and the absence of model restrictions such as normality of the error terms. Another benefit is that the computation becomes mostly a computer routine. A foundation of these methods is given by [12]. A modern review of the commonly accepted bootstrap CI methods can be found in [10]. One of the bootstrap CI methods, known as the percentile method, was examined in [11] in the MR context. However, those studies were limited to univariate sets having a dose-response structure. Here, we consider how the percentile method and its extension, the BCa (bias corrected and accelerated) method, can be adapted to the multivariate large-scale MR setting. We focus especially on deriving CIs for the expected risk of coronary heart disease (CHD). Our analysis is based on the predictors cholesterol and age and a Bernoulli distributed response variable P. When a data set has a Bernoulli distributed response, it appears that the fit obtained by the GPAV and SB algorithms may be far from optimal, which forces us to develop a new algorithm for such sets, see Section 4. For medical research data with a Bernoulli distributed response, it is common to use a logistic regression model to estimate the expected response. In the case


of the CHD data, it can be presented as

P(cholesterol, age) = 1 / (1 + exp(−(β0 + β1·cholesterol + β2·age)))    (1)

Here, we investigate whether logistic regression is a good choice for the CHD data, by comparing the logistic regression fit with the derived CIs.

2 Monotonic regression problem

In MR, it is assumed that the expected response φ(x) is a monotonic function of the predictors x ∈ R^p. The expected response is normally unknown. Instead, the set of observations

D = {(Xi, Yi), Xi ∈ R^p, Yi ∈ R^1, i = 1, . . . , n}

is available, where Xi is called a predictors' level, Yi is a response component or a response value, and Y = (Y1, . . . , Yn) is an observed response vector or observed response. Sometimes, the word 'observed' is omitted for the sake of brevity. The observed model is typically assumed to be

Yi = φ(Xi) + εi    (2)

where εi are independent identically distributed error terms. In general it is impossible to find the expected response function φ. The task of MR is instead to find a vector, called a fitted response, Ŷ = (Ŷi, i = 1, . . . , n), which is monotonic and as close as possible to the observed response. This is done by solving the following optimization problem:

min_f  Σ_{i=1}^n wi (fi − Yi)²    (3)
s.t.  fi ≤ fj  if  Xi ≺ Xj,  i, j = 1, . . . , n

This problem also admits the following graph formulation, as discussed in [9]. Given a directed acyclic graph G = G⟨N, E⟩ with a set of nodes N and a set of edges E, each node i ∈ N = {1, . . . , n} is associated with observation i, and each edge (i, j) ∈ E is associated with the relation Xi ≺ Xj. Problem (3) is equivalent to

min_f  Σ_{i=1}^n wi (fi − Yi)²    (4)
s.t.  fi ≤ fj  for all (i, j) ∈ E

In medical data, it often happens that each value of Xi corresponds to several observed values Yij, j = 1, . . . , ni. Moreover, when the observed response values Yij are distributed as Bernoulli(φ(Xi)), model (2) is no longer valid. This makes MR not directly applicable to the data set of our interest, the CHD data. However, as shown in [17], p. 32, when the observed response values are binomially distributed (Bernoulli distributed in particular), the maximum likelihood estimator of φ(Xi) is Ŷi, a component of the solution to problem (3) with weights wi = ni and response values Yi = Σ_{j=1}^{ni} Yij / ni.

Following [17], an optimal solution to (4) can be represented as follows. The nodes of the graph G are partitioned into disjoint sets called blocks. Each block Bj is characterized by the weight of the block

W(Bj) = Σ_{k∈Bj} wk    (5)

and the associated weighted average

av(Bj) = Σ_{k∈Bj} wk Yk / Σ_{k∈Bj} wk    (6)

If node i ∈ Bj, the corresponding component of the optimal solution is φ̂i = av(Bj). The solutions obtained by the algorithms GPAV, GPAVR, SB and SBR also have a block structure, i.e. the solution components are partitioned into a set of blocks with identical fitted response within each block.
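In the univariate case, the block representation can be illustrated with a small weighted pool-adjacent-violators sketch (a minimal illustration of the block structure and of formula (6) for a totally ordered chain, not the GPAV/SBR algorithms used later; the function name and the toy data are our own):

```python
def pav(y, w):
    """Weighted pool-adjacent-violators for a totally ordered chain.

    Returns the fitted values and the block partition; within each
    block the fit equals the weighted average av(B) of formula (6).
    """
    # each block: [sum of w*y, sum of w, list of member indices]
    blocks = []
    for i, (yi, wi) in enumerate(zip(y, w)):
        blocks.append([wi * yi, wi, [i]])
        # merge while the previous block violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, t = blocks.pop(), blocks.pop()
            blocks.append([s[0] + t[0], s[1] + t[1], t[2] + s[2]])
    fit = [0.0] * len(y)
    for swy, sw, members in blocks:
        for i in members:
            fit[i] = swy / sw  # av(B), identical within the block
    return fit, [b[2] for b in blocks]

fit, part = pav([1.0, 3.0, 2.0, 4.0], [1, 1, 1, 1])
# blocks {0}, {1, 2}, {3}; fitted response (1.0, 2.5, 2.5, 4.0)
```

The merge loop is exactly the mechanism that produces the block partition discussed above: violating adjacent observations are pooled until the sequence of block averages is monotonic.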

3 Bootstrap confidence intervals for multidimensional monotonic regression

In Section 4, we construct the SBR algorithm based on our previous works [6, 7, 9] and motivate the need for this special algorithm. In this section, it suffices to know that the SBR algorithm is specially designed to compute a monotonic fitted response for multivariate large-scale data with Bernoulli distributed error terms.

In Section 1, the drawbacks of analytical CI methods were discussed. Here, we apply methods based on the bootstrap. The most widely known techniques for bootstrap CI estimation are bootstrap-t, percentile, and its extension called the BCa algorithm [12]. An important difference between them is the order of the coverage error [10]: for BCa and bootstrap-t it is known to be O(n^{-1}), while for the percentile approach it is O(n^{-1/2}). Although the percentile method has worse coverage, it is quite widely used due to its simple implementation and lower computational burden compared to bootstrap-t and BCa. The bootstrap CI algorithms are suitable for any type of estimator. Nevertheless, some specific issues need to be considered when these algorithms are used in MR. Unfortunately, the bootstrap-t method is too computationally heavy for MR, as will be discussed below. Here, we concentrate on the remaining methods, percentile and BCa. The general framework for these two methods can be described as follows.

Having a sample D = {Z1, . . . , Zn} drawn from some population and an estimator θ̂(D) of the population parameter θ, a 100(1 − α)% CI needs to be constructed for θ. At the first step, B bootstrap sets D*_1 = (Z*_11, . . . , Z*_n1), . . . , D*_B = (Z*_1B, . . . , Z*_nB) are obtained by drawing from the set D with replacement. At the second step, the value θ̂*_j = θ̂(D*_j) is computed for each j. At the third step, the values θ̂*_1, . . . , θ̂*_B are reordered ascending; after this step it is assumed that θ̂*_i increases with i. Finally, the confidence interval is computed as [θ̂*_L, θ̂*_U], where L and U are the indices defined by the chosen bootstrap CI method.

When each response value is Bernoulli distributed, we consider each fitted value Ŷi as an estimator of the expected response value φ(Xi). The set of confidence intervals obtained for all possible Xi can be treated as a confidence band for the response surface. In the regression context, we reformulate the bootstrap CI framework as Algorithm 1.

Algorithm 1. Bootstrap confidence intervals for regression models
Given D = {(Xi, Yi), Xi ∈ R^p, Yi ∈ R^1, i = 1, . . . , n}.
1. Obtain B bootstrap sets D*_1, . . . , D*_B using one of the bootstrap techniques.
2. For each j = 1, . . . , B, fit a MR model to D*_j and obtain the fitted response model ŷ* = Mj(x).
3. For each j = 1, . . . , B and i = 1, . . . , n, evaluate the fitted model Mj at the level Xi and obtain Ŷ*_ij = Mj(Xi).
4. For each i = 1, . . . , n:
   4.1 Reorder Ŷ*_i1, . . . , Ŷ*_iB ascending.
   4.2 Compute the confidence interval for φ(Xi) as [Ŷ*_iλi, Ŷ*_iυi], where λi and υi are defined by the chosen bootstrap CI algorithm.

At Step 2, we use the SBR algorithm for fitting MR models. The structure of a particular bootstrap CI algorithm is defined by the choice of bootstrap technique at Step 1 and by the strategy for computing λi and υi at Step 4. Depending on the chosen bootstrap technique, it may be problematic to evaluate the models Mj at Step 3, as will be shown below. We begin with a discussion of the commonly known bootstrap techniques [12] and how they should be used in MR.
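The general framework of Algorithm 1 can be sketched with the MR fitting routine left pluggable (a sketch only: `fit_and_eval` stands in for fitting with SBR and evaluating at the original levels, and `index_rule` for the percentile or BCa index choice; all names here are ours, not from the paper):

```python
import random

def bootstrap_cis(data, fit_and_eval, B=1000, index_rule=None):
    """Sketch of Algorithm 1. data is a list of (x, y) pairs.
    fit_and_eval(dstar, X) fits a model to the bootstrap set dstar and
    returns its values at the original levels X (Steps 2-3 combined)."""
    X = [x for x, _ in data]
    n = len(data)
    # Step 1: draw B bootstrap sets; Steps 2-3: evaluate each model at every Xi
    Ystar = [[0.0] * B for _ in range(n)]
    for j in range(B):
        dstar = [random.choice(data) for _ in range(n)]
        yhat = fit_and_eval(dstar, X)
        for i in range(n):
            Ystar[i][j] = yhat[i]
    # Step 4: sort and pick the indices given by the chosen CI method
    cis = []
    for i in range(n):
        s = sorted(Ystar[i])
        lo, hi = index_rule(B) if index_rule else (0, B - 1)
        cis.append((s[lo], s[hi]))
    return cis
```

With `index_rule` unset, the sketch returns the full bootstrap range; the percentile and BCa rules of Sections 3.5 and 3.6 slot in through that argument.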

3.1 Parametric bootstrap

In parametric bootstrap, it is assumed that the distribution type F(θ) of the response variable is known. In this case, an estimate θ̂(D) is used to generate bootstrap samples from F(θ̂). When the observed responses are Bernoulli distributed, the fitted values Ŷi estimate the expected values φ(Xi). Then, bootstrap values Y*_ij can be obtained by sampling from Bernoulli(Ŷi).


Algorithm 2. Parametric bootstrap, Bernoulli distributed response
Given D = {(Xi, Yi), Xi ∈ R^p, Yi ∈ R^1, i = 1, . . . , n}.
1. Compute Ŷ1, . . . , Ŷn by applying the SBR algorithm.
2. For each j = 1, . . . , B and i = 1, . . . , n, generate Y*_ij from Bernoulli(Ŷi) and set X*_ij = Xi.
3. For each j = 1, . . . , B, set D*_j = {(X*_1j, Y*_1j), . . . , (X*_nj, Y*_nj)}.

Note that only the response values are resampled, while the X*_ij are fixed at the levels of Xi given by D. Algorithm 2 can be reformulated by taking into account the block structure of the fitted response obtained by MR. Recall that when D is fitted by a MR algorithm, the observations i = 1, . . . , n are grouped into blocks {Bk} such that the respective fitted values Ŷi are identical within a block, see (6). When the observed response values are binary and the weights wi = 1, formula (6) means that Ŷi, i ∈ Bk, is the fraction of indices j ∈ Bk such that Yj = 1. Using this fact, it is easy to show that sampling from Bernoulli(Ŷi) is equivalent to sampling with replacement from the set Y_Bk = {Yj, j ∈ Bk}.

Algorithm 3. Parametric bootstrap, Bernoulli distributed response
Given D = {(Xi, Yi), Xi ∈ R^p, Yi ∈ R^1, i = 1, . . . , n}.
1. Compute the fitted response values Ŷ1, . . . , Ŷn and the block partitioning B1, . . . , BK by applying SBR.
2. For each j = 1, . . . , B and i = 1, . . . , n:
   Find k such that i ∈ Bk.
   Compute l by sampling from Bk.
   Set Y*_ij = Yl.
   Set X*_ij = Xi.
3. For each j = 1, . . . , B, set D*_j = {(X*_1j, Y*_1j), . . . , (X*_nj, Y*_nj)}.

In [11], bootstrapping within a block (Algorithm 3), called there a monotonic bootstrap, was used to estimate percentile confidence intervals of a univariate MR fit. We have just shown that for a Bernoulli distributed response it is equivalent to parametric bootstrap.
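The claimed equivalence of the two sampling schemes is easy to check numerically: with unit weights and binary responses, drawing from Bernoulli(Ŷi) and resampling with replacement within the block containing i give the same success probability (a small simulation on made-up block data):

```python
import random

random.seed(1)
# a block with binary responses; the fitted value is the block mean
block_y = [1, 0, 1, 1, 0]
y_hat = sum(block_y) / len(block_y)  # 0.6

B = 100_000
# Algorithm 2: draw from Bernoulli(y_hat)
param = sum(random.random() < y_hat for _ in range(B)) / B
# Algorithm 3: resample with replacement within the block
within = sum(random.choice(block_y) for _ in range(B)) / B

# both empirical frequencies are close to y_hat = 0.6
assert abs(param - y_hat) < 0.01 and abs(within - y_hat) < 0.01
```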

3.2 Semi-parametric resampling

Semi-parametric resampling can be used when a CI for a regression fit needs to be computed. In such regression models, it is assumed that

Yi = φ(Xi, θ) + εi    (7)

where φ(Xi, θ) is the unknown expected response and εi are independent and identically distributed error terms. If assumption (7) holds, bootstrap sets D*_i can be computed as follows. Fitting the data set D allows one to estimate the regression parameters θ as θ̂. At the next step, the residuals ε̂i = Yi − Ŷ(Xi, θ̂) are computed.

The bootstrap responses Y*_ij are computed as Y*_ij = Ŷ(Xi, θ̂) + ε*_ij, where ε*_1j, . . . , ε*_nj are obtained by sampling with replacement from Eps = {ε̂1, . . . , ε̂n} for each j = 1, . . . , B. If necessary, Eps is first centered around 0.

For a Bernoulli distributed response, assumption (7) is not correct. Even if one could claim that the response can be presented in the form (7), it is easy to show that the bootstrap values Y*_i will not be binary in general. This contradicts the bootstrap principle that bootstrap samples and the original sample should have the same nature. We conclude that semi-parametric bootstrap should not be used for data with a Bernoulli distributed response.
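For models where assumption (7) does hold, the residual-resampling step can be sketched as follows (a sketch under our own naming; `fit_values` is any fitted response for D):

```python
import random

def semiparametric_sets(y, fit_values, B, center=True):
    """Residual bootstrap: Y*_ij = Yhat(Xi) + eps*_ij, with eps*
    drawn with replacement from the (optionally centered) residuals."""
    eps = [yi - fi for yi, fi in zip(y, fit_values)]
    if center:
        m = sum(eps) / len(eps)
        eps = [e - m for e in eps]
    # note: applied to binary y this produces non-binary Y*, which is
    # exactly why the method is ruled out for Bernoulli responses
    return [[fi + random.choice(eps) for fi in fit_values] for _ in range(B)]
```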

3.3 Nonparametric resampling

In nonparametric resampling, pairs (Xi, Yi) are drawn from a multidimensional distribution. Bootstrap in this case means sampling pairs from D with replacement, i.e. (X*_ij, Y*_ij) = (Xk, Yk) for some k, 1 ≤ k ≤ n.

It was mentioned that it may be problematic to perform the evaluations at Step 3 of Algorithm 1 when MR is applied and certain bootstrap techniques are used. Most fitted parametric models are global, i.e. they can be evaluated at any given level of x ∈ R^p. However, the MR fitted model Mj(x) is defined only at the levels of x used for fitting, X*_ij ∈ R^p : (X*_ij, Y*_ij) ∈ D*_j. Since in general the levels of the predictors in D*_j are not identical to those in D, model evaluation at levels of the predictors missing in D*_j must be done by interpolation. The only requirement on such interpolation is that the predicted response value should not violate monotonicity. For a given Xi missing in D*_j, there are many possible values satisfying this requirement. In Section 4, it is shown that the predicted value for Xi can be any value from the interval [L(i, j), U(i, j)], where L and U are functions of i, j, the model Mj and the data set D*_j. It will also be shown that when i is in the inner area of a block, i.e. it has predecessors and successors in this block, L(i, j) = U(i, j) and, consequently, the interpolation value is unique.

The described estimation problem does not arise in parametric resampling. Since X*_ij = Xi, i = 1, . . . , n, any model Mj has the same set of input levels as D. This means that the fitted values at Step 3 of Algorithm 1 are defined uniquely. When nonparametric bootstrap is used, some values Xi present in D may be missing among {X*_1j, . . . , X*_nj} due to resampling, while others may occur more often than in D. To obtain unique interpolation values, additional model assumptions are needed. In [11], a linear interpolation was used since the model was assumed to be locally linear.
One of our proposals is to use (L(i, j) + U(i, j))/2 as the prediction value. This estimator retains a block structure, which seems more natural for MR. We shall call it mid-point modeling. Another alternative is to take extreme cases into account. For each bootstrap set D*_j, two model estimates are computed, Ŷ*_ij = L(i, j) and Ŷ*_i(j+B) = U(i, j). The 2·B estimates corresponding to component i can be thought of as the set of bootstrap samples. In this case, the number B in Step 4 should be replaced with 2·B. This strategy will be referred to as worst-case modeling.

Both modeling strategies have their benefits and drawbacks. The BCa and percentile CI for each component i is defined as a percentile of the empirical distribution given by the bootstrap estimates corresponding to i. It is easy to show that if mid-point modeling is used, the CI is in general narrower than with worst-case modeling. In some cases, mid-point modeling may give too narrow an interval which does not contain the true response value. Worst-case modeling may, on the contrary, provide too wide intervals which will not allow one to reject inadequate models in the model selection process. We now consider in detail the structure of each of the mentioned bootstrap CI algorithms.

3.4 Bootstrap-t algorithm

The algorithm is based on the analysis of (Ŷ*_ij − Ŷi)/σ̂_ij, where σ̂_ij is an estimated standard error of Ŷ*_ij. Computation of σ̂_ij involves a second-level bootstrap, i.e. M bootstrap sets are sampled from each D*_j and fitted with a MR algorithm. This means that in total B·M bootstrap sets with n observations must be fitted. The recommended values are M = 25, B = 2000, see [10], which implies that the total number of sets to be generated and fitted is 50,000. Even the SBR algorithm, which requires only around 10 seconds of computer time for n = 50,000 on a modern PC, would take around 6 days to produce all bootstrap fits. This is why we do not use bootstrap-t in our simulations.

3.5 Percentile algorithm

The percentile approach is based on Algorithm 1, in which the lower and upper indices are defined independently of i as λi = ⌈Bα/2⌉ and υi = ⌊B(1 − α/2)⌋, where α is the given confidence level. The limits obtained will be close to the α/2 and 1 − α/2 percentiles of the empirical distribution given by Ŷ*_i1, . . . , Ŷ*_iB. The percentile approach was applied to compute CIs in [11]. However, as pointed out in [12], this approach may have a substantial coverage error. Because of that, its extension, called BCa (bias corrected and accelerated), was developed.
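In code, the percentile limits amount to the following (0-based list indexing shifts the 1-based indices by one; a minimal sketch with illustrative names):

```python
import math

def percentile_indices(B, alpha):
    """1-based indices: lambda = ceil(B*alpha/2), upsilon = floor(B*(1-alpha/2))."""
    lam = math.ceil(B * alpha / 2)
    ups = math.floor(B * (1 - alpha / 2))
    return lam, ups

def percentile_ci(sorted_boot, alpha=0.05):
    """Percentile CI from an ascending list of bootstrap estimates."""
    lam, ups = percentile_indices(len(sorted_boot), alpha)
    return sorted_boot[lam - 1], sorted_boot[ups - 1]  # convert to 0-based

lam, ups = percentile_indices(1000, 0.05)  # (25, 975)
```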

3.6 BCa algorithm

The approach was designed to eliminate the drawbacks of the percentile approach. The BCa algorithm for computing the indices λi and υi in the MR context can be described as follows.

First, for each i = 1, . . . , n the values of the bias, bi, and the acceleration, ai, should be computed. Let Φ be the cumulative distribution function of the standard normal distribution. Then,

bi = Φ^{-1}( #j{Ŷ*_ij < Ŷi} / B )    (8)

Here #j{Zj} denotes the number of indices j such that Zj is true. Using [12], the following procedure can be derived to obtain CIs for the expected response. Let D(k) be the data set obtained by removing observation k from D (jack-knife), and let M(k)(x) be the model obtained by applying SBR to D(k). We shall assume that M(k)(x) is uniquely defined for any x via mid-point modeling. Component Yi(k) of the vector Y(k) = (Y1(k), . . . , Yn(k)) is computed by evaluating the model M(k)(x) at the level Xi, Yi(k) = M(k)(Xi). Let Ȳi(.) = Σ_{k=1}^n Yi(k)/n. Then,

ai = Σ_{k=1}^n (Ȳi(.) − Yi(k))³ / ( 6 [ Σ_{k=1}^n (Ȳi(.) − Yi(k))² ]^{3/2} )    (9)

Having computed ai and bi, i = 1, . . . , n, the lower and upper indices can be found as

λi = ⌈Bα1i⌉    (10)
υi = ⌊Bα2i⌋    (11)

where

α1i = Φ( bi + (bi + z^(α/2)) / (1 − ai(bi + z^(α/2))) )    (12)
α2i = Φ( bi + (bi + z^(1−α/2)) / (1 − ai(bi + z^(1−α/2))) )    (13)

and z^(α) is the α-percentile of the standard normal distribution.

The following issues arise when the BCa algorithm is applied in the context of large-scale MR. Formula (8) was suggested to correct the bias in CI estimation, and it works well when all Ŷ*_ij are unique for a given i. However, when Ŷ*_ij = Ŷi for all j (practice shows that such cases occur when the response is Bernoulli distributed), bi = −∞ and the expressions for λi and υi become meaningless, although it is clear that the CI in this case should be [Ŷi, Ŷi]. The same problem happens if Ŷ*_ij ≥ Ŷi for all j. Our suggestion is to modify (8):

bi = Φ^{-1}( ( #j{Ŷ*_ij < Ŷi} + #j{Ŷ*_ij = Ŷi}/2 ) / B )    (14)
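The tie-corrected bias (14) together with the adjustments (12)-(13) can be sketched as follows, using the standard-normal quantile from Python's `statistics.NormalDist` (a sketch; the function and variable names are ours):

```python
from statistics import NormalDist

ndist = NormalDist()

def bca_alphas(boot_vals, y_hat, a_i, alpha=0.05):
    """Bias b_i via the tie-corrected formula (14), then alpha_1i and
    alpha_2i via (12)-(13)."""
    B = len(boot_vals)
    below = sum(v < y_hat for v in boot_vals)
    ties = sum(v == y_hat for v in boot_vals)
    b_i = ndist.inv_cdf((below + ties / 2) / B)  # finite when all values tie
    def adjust(z):
        return ndist.cdf(b_i + (b_i + z) / (1 - a_i * (b_i + z)))
    z_lo, z_hi = ndist.inv_cdf(alpha / 2), ndist.inv_cdf(1 - alpha / 2)
    return adjust(z_lo), adjust(z_hi)
```

When every bootstrap estimate equals Ŷi, the argument of Φ^{-1} is 1/2, so b_i = 0 rather than −∞; with a_i = 0 as well, the limits reduce to the plain percentile levels α/2 and 1 − α/2.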

Another complication is how to compute ai. For each ai, it is required to produce n MR fits to sets D(k) having n − 1 observations, which is computationally cumbersome. In Section 4, we suggest a modification of the SBR algorithm designed for computing the jack-knife. This modification is much faster than the SBR algorithm. However, even if the modified SBR is used, the brute-force computation of all ai is very time and memory consuming for large n. We suggest a procedure for computing all ai in parallel. By expanding the brackets in (9), the following equivalent procedure can be obtained:

Algorithm 4. Acceleration
1. Set Ai := 0, Bi := 0, Ci := 0, i = 1, . . . , n.
2. For each k = 1, . . . , n:
   2.1 Compute the fitted response vector Ŷ(k) for the set D(k).
   2.2 For each i = 1, . . . , n:
       2.2.1 Update Ai := Ai + Yi(k), Bi := Bi + (Yi(k))², Ci := Ci + (Yi(k))³.
3. For each i = 1, . . . , n:
   3.1 Update Ai := Ai/n, Bi := Bi/n, Ci := Ci/n.
   3.2 Set ai = (3AiBi − 2Ai³ − Ci) / ( 6 n^{1/2} (Bi − Ai²)^{3/2} ).

Note that Algorithm 4 does not require storing all the vectors Ŷ(k) at the same time: after the scalars Ai, Bi, Ci have been updated at step k, Ŷ(k) can be removed from memory.
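Algorithm 4 transcribes directly into code in which only the running moment sums are kept (a sketch; `jackknife_fit(k)`, standing for the modified SBR fit to D(k), is a hypothetical callable):

```python
def accelerations(jackknife_fit, n):
    """Streaming computation of all a_i (Algorithm 4): only the
    per-component moment sums A, B, C are stored, never the full
    n-by-n matrix of jack-knife fits."""
    A = [0.0] * n
    Bm = [0.0] * n
    C = [0.0] * n
    for k in range(n):
        yk = jackknife_fit(k)  # fitted response vector for D(k)
        for i in range(n):
            A[i] += yk[i]
            Bm[i] += yk[i] ** 2
            C[i] += yk[i] ** 3
    a = []
    for i in range(n):
        Ai, Bi, Ci = A[i] / n, Bm[i] / n, C[i] / n
        denom = 6 * n ** 0.5 * (Bi - Ai ** 2) ** 1.5
        a.append((3 * Ai * Bi - 2 * Ai ** 3 - Ci) / denom if denom else 0.0)
    return a
```

Expanding the cubes in (9) shows the numerator equals n(3AiBi − 2Ai³ − Ci) and the bracketed sum equals n(Bi − Ai²), which is where the n^{1/2} factor in Step 3.2 comes from.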

4 MR algorithms for large-scale data with Bernoulli distributed response

In Section 1, it was mentioned that the GPAV, GPAVR and SB algorithms were developed for solving large-scale MR problems. While GPAV and GPAVR can handle sets with up to n = 10^5 observations, the SB algorithm can process even 10^6 observations and in general requires less time and memory. The price for this is a decreased accuracy of the obtained solution compared to the GPAV accuracy. Constructing CIs requires multiple runs of the fitting algorithm, which is why we took the SB algorithm as the basis for the MR algorithm used in our simulations.

The definitions and notations of Section 2 and also the following definitions will be used here. A node i of the graph G is called a predecessor of node j if there is a directed path from i to j in G. In terms of observations, i is a predecessor of j if Xi ≺ Xj; in its turn, j is a successor of i. A topological order of the graph G is an enumeration (ordered list) of the graph nodes T = {t1, . . . , tn} such that all predecessors of tj are enumerated before tj. This implies that if the nodes of the graph are treated in

the order given by T, each node will be treated only after all its predecessors have been treated. In general, one may construct many topological orders for a given G. We outline GPAV, GPAVR and SB below; full descriptions can be found in the respective papers.

4.1 GPAV algorithm

The nodes of the graph G are treated sequentially in the order given by T. A value Ai = Yi is associated with each node i. Each step of the algorithm involves merging node i with those of its predecessors that violate monotonicity into one block Bi. A vertex j violates monotonicity with respect to i if j is a predecessor of i and Aj > Ai. For the nodes merged into this block, the values Ak, k ∈ Bi, are all updated to av(Bi), see (6). By the end of the algorithm, the set of nodes N is split into blocks such that the nodes within a block Bi have identical associated values Ai and the set of all Ai satisfies the monotonicity constraints. The fitted response is computed as Ŷk = Ai, where k ∈ Bi. The algorithm is thereby characterized by

Input: graph G with n nodes, observed response vector Y = (Y1, . . . , Yn), vector of weights W = (w1, . . . , wn), topological order T.
Output: set of blocks R = {B1, . . . , Bm}, fitted values Ŷ1, . . . , Ŷn.

4.2 GPAVR algorithm

The GPAVR combines in a special way the GPAV solution obtained for a topological order T and data set D with the GPAV solution for another MR problem. The solution of the latter can be regarded as if the GPAV were applied to the data set D, but with the nodes of the graph treated in the order reverse to a topological order T′, and with the current node merged with violating successors instead of violating predecessors. The resulting fit is monotonic and proves to be much more precise than the original GPAV solution.

4.3 SB algorithm

A topological order T is partitioned into s parts T1, . . . , Ts of equal length; in other words, T = T1 ∪ . . . ∪ Ts. If n is not a multiple of s, Ts is made longer or shorter. The set D is divided into sets, called segments, D1, . . . , Ds such that each Di contains the observations indexed by Ti only. Given Di, the order Ti, and the weights and responses corresponding to the observations from Di, a MR problem of the type (4) is formulated and solved by GPAV for each Di. As a result, the observations in each segment Di are grouped into blocks collected in Ri. The response obtained by uniting all local fits is not globally monotonic; an additional monotonizing step is required.

The blocks from R1 ∪ . . . ∪ Rs are treated as new observations. The new observed responses are av(B) and the new weights are W(B) for each block B ∈ R. In [9], it is shown how a new graph GB should be constructed. For the new observations and GB, a problem of the type (4) is formulated and solved by GPAV. Since the new observations are blocks of the original observations, the final GPAV call merges some of these blocks into larger blocks contained in R′. The final fitted response is Ŷi = av(Bj), where i ∈ Bj, Bj ∈ R′.

In our preliminary experiments with the CHD set, we applied the GPAV algorithm with different topological orders. Although it took considerable time and memory resources, it was possible to make a single run for several topological orders. The fitted response is presented in Fig. 1. It can be seen that the fitted response depends very much on the selected topological order. Moreover, the blocks in each figure are aligned along the direction given by the respective topological order. For instance, when the topological order is (1stComp), the blocks look as if they consist of observations belonging to a set satisfying C1 < cholest < C2, where C1, C2 are some constants. For (SumComp), the blocks resemble C1 < cholesterol_s + age_s < C2, where cholesterol_s, age_s denote standardized values of cholesterol and age, respectively. The same remark can be made for (SumPred). The residual sum of squares appears to be quite different in the different cases. The same problem appears if the SB algorithm is applied instead of GPAV. It can be concluded that neither GPAV nor SB is able to obtain a reasonable fit for the CHD data.

A reason can be found by analyzing the GPAV iterations for a Bernoulli distributed response. Our explanation is as follows. This response distribution implies that the observed response values form a binary vector, i.e. each observed response value is either zero or one.
For simplicity of explanation, assume that (SumComp) was selected. Consider the CHD data set zoomed to a certain area, see Fig. 2. The inclined lines bound the area C1 < age_s + cholesterol_s < C2. Crosses denote observations having zero response values; circles stand for unit response values. Note that in almost any neighborhood of each input Xi, observations with both zero and unit observed response can be found. This means that GPAV will encounter plenty of monotonicity violations, which in turn implies that the produced blocks will be large.

There is an observation with zero response which is a successor of all observations found in area 1, bounded by an inclined line and dashed lines; this observation can be found in the upper right corner of this area. GPAV will merge this observation with a large number of its nearest predecessors, possibly all of them. As a result, a big block in area 1 will be created; we shall call it block 1. The next treated observation is the one that is a successor to all observations from area 2. It will merge many of its predecessors from area 2, and there is a high chance that it will violate monotonicity with respect to observations from block 1. In this case, block 1 will be joined into a larger block, block 2, which will reside in areas 1 and 2. Proceeding in the same manner, the resulting large block has a large chance of residing in all areas from 1 to 5.

4.4 SBR algorithm

To avoid dependence on the choice of the topological order, a modification of the SB algorithm is suggested, called the segmentation-based algorithm with refinement (referred to as SBR). The SB algorithm employs GPAV for fitting the response in each segment Di and for producing a globally monotonic response; the SBR algorithm instead applies the GPAVR algorithm for these purposes. Because the GPAVR fit does not depend much on the choice of topological order, it is reasonable to expect that the new algorithm will give a better result for the CHD set. We compared the solution obtained by SBR with that of an exact algorithm, called IBCR [4], applied to a discretized CHD set. The discretization was done by suitably rounding off the cholesterol and age values; it decreased the number of unique input values, which made it possible to apply an exact MR method. Figure 3 demonstrates that SBR provides a response which is quite close to the optimal fitted response obtained from the discretized data. The analysis of residuals confirms that the SBR solution is close to the IBCR solution. To perform multiple runs of the SBR algorithm efficiently for bootstrap or jack-knife sets, two strategies are suggested.

4.5 SBR modifications for bootstrap and jack-knife

Computing bootstrap CIs involves fitting sets D*_i obtained by sampling from D. Depending on the bootstrap technique selected, D*_i contains either the same set of levels of x as D, or some levels are missing and some others are present more often than in D. One may formulate and solve a problem of the type (3) for each D*_i from scratch. However, it is much more efficient to use the results of the computations made for D.

For each D*_i, we create a new set D̃*_i, initially identical to D*_i. If an input level Xj is present in D but absent in D*_i, we add a new observation (Xj, 0) and associate the weight wj = 0 with this observation. It is easy to show that D̃*_i has the same levels of inputs as D and that the MR problem (3) formulated for D̃*_i is equivalent to (3) formulated for D*_i. This implies that the topological orders used for D can be applied to D̃*_i, and that the graphs G computed by the GPAVR algorithm in each segment of D can be employed in the GPAVR runs for D̃*_i. Our preliminary numerical experiments show that the running time of SBR reduces dramatically if this is done.

Another, similar strategy can be suggested for fitting the set D(i), obtained from D by removing observation i (jack-knife).

Table 1: Bootstrap CI algorithms used in our numerical experiments. Each algorithm is defined by the selected combination of a CI approach and a bootstrap method.

                                      CI approach
  Bootstrap method              Percentile    BCa
  Parametric                    Method 1      Method 2
  Non-parametric, mid-point     Method 3      Method 4
  Non-parametric, worst-case    Method 5      Method 6

obtained by adding the observation (X_i, 0), with associated weight W_i = 0, to D(i). In order to obtain the fitted response for D(i), one may find the segment containing observation i and recompute the fit for this segment only; the local fitted responses obtained by GPAVR for the remaining segments of D(i) are the same as those for D. Finally, the local fitted responses are merged into a global monotonic one by the SBR algorithm.

Finally, consider the prediction problem in MR. Most regression models are global, i.e. the fitted model M(x) can be evaluated at any x ∈ R^p. In MR, by contrast, once the set D is fitted, the fitted model is a set of fitted values corresponding to the input levels present in D. Recall that the fitted response can be represented as a set of blocks within which the fitted values are identical. If a prediction is needed for X' = X_i present in D, the predicted value is Y' = Ŷ_i. If X' ≠ X_i for all i, the predicted value can be any value that does not violate the monotonicity of the fitted response. This means that Y' must be no less than the fitted value of any predecessor of X' in D and no greater than the fitted value of any successor of X' in D. Let l be the index of the predecessor with the maximum fitted value and u the index of the successor with the minimum fitted value. Then Y' must satisfy Y' ∈ [Ŷ_l, Ŷ_u]. If l and u belong to the same block, then Ŷ_l = Ŷ_u and the predicted value is unique; since the predicted observation has a predecessor and a successor in the same block, it lies in the inner area of that block. If l and u belong to different blocks, the predicted observation cannot be assigned to any block and lies between blocks.
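The prediction rule above can be sketched directly: scan the fitted observations, take the maximum fitted value over the predecessors of the new input and the minimum over its successors. This is a minimal sketch assuming a componentwise partial order on the inputs; the function names are illustrative, not from the paper.

```python
def dominates(a, b):
    """Componentwise partial order: a precedes b if a <= b in every coordinate."""
    return all(ai <= bi for ai, bi in zip(a, b))

def prediction_interval(X, y_fit, x_new):
    """Admissible predictions [y_lo, y_hi] at x_new for a monotone (non-decreasing) fit.

    X     : input vectors present in the fitted data set
    y_fit : fitted responses, monotone with respect to the partial order
    Returns -inf/+inf bounds when x_new has no predecessor/successor.
    """
    lo, hi = float("-inf"), float("inf")
    for xi, yi in zip(X, y_fit):
        if dominates(xi, x_new):   # predecessor (or x_new itself)
            lo = max(lo, yi)
        if dominates(x_new, xi):   # successor (or x_new itself)
            hi = min(hi, yi)
    return lo, hi

# A monotone fit on four points; the new input lies strictly between blocks.
X = [(1, 1), (1, 2), (2, 1), (2, 2)]
y_fit = [0.1, 0.3, 0.3, 0.7]
print(prediction_interval(X, y_fit, (1.5, 1.5)))  # (0.1, 0.7)
```

When x_new coincides with an observed input, both conditions hold for that point and the interval collapses to the single fitted value, matching the unique-prediction case described above.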

5 Results

The bootstrap confidence bounds for the expected risk of CHD were computed using the combinations presented in Table 1. Our analysis of the confidence bounds obtained by Methods 3, 4, 5 and 6 shows a negligible difference between the results of the mid-point and worst-case modelings for the CHD set. The reason is that whenever the data set D or any of the D*_i is fitted, the blocks corresponding to the fitted response are large.

Consequently, the predicted value for any input is likely to be situated in the inner area of some block and, hence, is unique. Here, we provide surface plots of the lower and upper limits of the confidence bands obtained by Methods 3 and 4; see Figures 4 and 5. We omit the plots for Methods 1 and 2 since they look similar to those for Methods 3 and 4; specific features of the plots for Methods 1 and 2 are discussed below. From Figures 4 and 5, it can be concluded that the percentile confidence bands are smoother than the BCa bands. The reason is that the BCa approach makes a bias correction by shifting the percentile surface toward the fitted values presented in Fig. 6. In Fig. 6, a group of values is marked with crosses. These observations have unusually high values of the fitted risk, and most of them are concentrated in the area with high age and cholesterol levels. It can be questioned whether this shift in the fitted response values is significant, i.e. whether one can claim that old people with a high cholesterol level have much higher chances of CHD. To draw a sound conclusion, we analyze the lower limits of the confidence bands. Fig. 7 shows that the shift in the fitted response is significant if Method 1 is applied. However, the lower bounds produced by Method 2 are not as high for these combinations of age and cholesterol. Consequently, according to Method 2, it is not possible to either confirm or reject the hypothesis about the presence of the shift in the fitted response values. Methods 3 and 4 lead to the same conclusion as Method 2. Since Methods 2 and 4 correct the bias, we believe that their results should be trusted more than the result of Method 1. Our analysis also shows that the width of the CIs depends on the input level: the CIs are wider on the boundary of the input domain, where the cholesterol and age levels are high. Moreover, the width of the CIs is almost identical for the BCa and percentile approaches, except on the boundary. There, the BCa CIs are on average narrower than the percentile CIs. It was also concluded that the upper and lower limits of the CIs on the boundary are higher for the percentile approach than for the BCa.

Finally, a logistic regression model (1) was fitted to the CHD set. The surface plot of the fitted values is presented in Fig. 6. The graph is obviously monotonic and smooth. To see whether such a model approximates the expected risk of CHD well, we count how often its fitted values fall within the confidence bounds. It appears that 41%, 52%, 37% and 46% of the fitted values lie outside the confidence limits given by Methods 1, 2, 3 and 4, respectively. Consequently, the simple logistic regression model is not a good option for estimating the CHD risk.
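The comparison with logistic regression reduces to counting fitted values that fall outside the per-level confidence limits. A minimal sketch of this coverage count (array names and values here are hypothetical, not the CHD data):

```python
def fraction_outside(fitted, lower, upper):
    """Fraction of fitted values lying outside the interval [lower_i, upper_i]."""
    outside = sum(1 for f, lo, hi in zip(fitted, lower, upper) if f < lo or f > hi)
    return outside / len(fitted)

# Hypothetical per-level confidence limits and logistic fitted values.
lower = [0.10, 0.20, 0.30, 0.40]
upper = [0.30, 0.40, 0.50, 0.60]
logit_fit = [0.05, 0.25, 0.55, 0.50]
print(fraction_outside(logit_fit, lower, upper))  # 0.5
```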

6 Conclusions

To construct CIs for MR, we investigated how existing bootstrap CI algorithms can be adapted to large-scale MR. It appeared that the bootstrap-t is impossible to run due to its high computational burden. The remaining methods, percentile and BCa, were implemented for MR. We discussed that these methods require specifying how the bootstrap is performed

and how the fitted response is evaluated at levels of the predictors missing in the fitted response. We investigated which of the bootstrap principles can be used for the data set of our interest, the CHD set. It appeared that the parametric bootstrap becomes equivalent to the monotonic bootstrap, and that the semi-parametric bootstrap is not applicable. For the evaluation of the fitted response, we suggested the mid-point and worst-case modeling strategies. To adapt BCa to MR, we discussed how the acceleration and bias, a_i and b_i, should be computed. We have shown that GPAV and SB may fit the risk of CHD with poor accuracy, the quality of the fit depending on the chosen topological order. As an improvement, we modified SB by using GPAVR for fitting the response in each segment and for the final merging step. The resulting algorithm, called SBR, was shown to obtain a fitted response that is much closer to the optimal one. As another improvement, we suggested modifications that significantly speed up the SBR for bootstrapped samples by reusing the information obtained when fitting the original set. In our numerical experiments with the CHD set, we found that BCa indeed corrects the bias of the CI estimation. We also discovered that the MR fit suggests a much higher risk of CHD for older people with a high cholesterol level. The percentile method using the parametric bootstrap confirmed this hypothesis, while the other methods involved did not support it. Our analysis of the confidence bounds demonstrated that a simple logistic regression has around 40% of its fitted values outside the confidence limits. This shows that the logistic model is not a good approximation for estimating the expected risk of CHD and demonstrates the usefulness of MR.

References

[1] Acton, S.T., Bovik, A.C.: Nonlinear image estimation using piecewise and local image models. IEEE Transactions on Image Processing, 7, 979-991 (1998)
[2] Aït-Sahalia, Y., Duarte, J.: Nonparametric option pricing under shape restrictions. Journal of Econometrics, 116, 9-47 (2003)
[3] Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D.: Statistical Inference under Order Restrictions. Wiley, New York (1972)
[4] Block, H., Qian, S., Sampson, A.: Structure algorithms for partially ordered isotonic regression. Journal of Computational and Graphical Statistics, 3, 285-300 (1994)
[5] Brunk, H.D.: Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26, 607-616 (1955)
[6] Burdakov, O., Grimvall, A., Sysoev, O.: Data preordering in generalized PAV algorithm for monotonic regression. Journal of Computational Mathematics, 4, 771-790 (2006)
[7] Burdakov, O., Sysoev, O., Grimvall, A., Hussian, M.: An O(n^2) algorithm for isotonic regression problems. In: Di Pillo, G., Roma, M. (eds.) Series: Nonconvex Optimization and Its Applications, vol. 83, pp. 25-33. Springer-Verlag (2006)
[8] Burdakov, O., Grimvall, A., Sysoev, O.: Generalized PAV algorithm with block refinement for partially ordered monotonic regression. Proceedings of the Workshop on Learning Monotone Models from Data, 23-37 (2009)
[9] Sysoev, O., Grimvall, A., Burdakov, O.: A segmentation-based algorithm for large-scale monotonic regression problems. Submitted to X (2010)
[10] Carpenter, J., Bithell, J.: Bootstrap confidence intervals: when, which, and what? A practical guide for medical statisticians. Stat. Med., 19, 1141-1164 (2000)
[11] Dilleen, M., Heimann, G., Hirsch, I.: Non-parametric estimators of a monotonic dose-response curve and bootstrap confidence intervals. Statistics in Medicine, 22, 869-882 (2003)
[12] Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall, New York (1993)
[13] Korn, E.L.: Confidence bands for isotonic dose-response curves. Appl. Statist., 31, 59-63 (1982)
[14] Lee, C.I.C.: The min-max algorithm and isotonic regression. The Annals of Statistics, 11, 467-477 (1983)
[15] Lee, C.I.C.: On estimation for monotone dose-response curves. J. Amer. Statist. Assoc., 91, 1110-1119 (1996)
[16] Maxwell, W.L., Muckstadt, J.A.: Establishing consistent and realistic reorder intervals in production-distribution systems. Oper. Res., 33, 1316-1341 (1985)
[17] Robertson, T., Wright, F.T., Dykstra, R.L.: Order Restricted Statistical Inference. Wiley, New York (1988)
[18] Sampson, A.R., Singh, H., Whitaker, L.R.: Simultaneous confidence bands for isotonic functions. Journal of Statistical Planning and Inference, 139, 828-842 (2009)
[19] Schoenfeld, D.: Confidence intervals for normal means under restrictions, with application to dose-response curves, toxicology experiments, and low-dose extrapolation. J. Amer. Statist. Assoc., 81, 186-195 (1986)
[20] Ulm, K., Salanti, G.: Estimation of the general threshold limit values for dust. International Archives of Occupational and Environmental Health, 76, 233-240 (2003)


[Figure 1: The fitted value of P obtained by the GPAV, shown as surface plots of P against Age and Cholesterol. The topological order is chosen by increase of: a) cholesterol, then age (1stComp); b) number of predecessors (NumPred); c) sum of the standardized age and cholesterol (SumComp).]

[Figure 2: The observed response in a small area of the CHD input domain, plotted against cholesterol and age. Crosses represent zero response, circles represent unit response.]

[Figure 3: The fitted value of P, shown as surface plots against Age and Cholesterol, obtained by the SBR (left) and by the IBCR applied to a discretized CHD set with 2546 different levels of x (right).]

[Figure 4: The lower (NonP-Perc-λ) and upper (NonP-Perc-υ) confidence bounds obtained by the percentile method with nonparametric bootstrap, shown as surface plots against Age and Cholesterol.]

[Figure 5: The lower (NonP-BCa-λ) and upper (NonP-BCa-υ) confidence bounds obtained by the BCa method with nonparametric bootstrap, shown as surface plots against Age and Cholesterol.]

[Figure 6: Fitted risk for MR and logistic regression, shown as surface plots against Age and Cholesterol. Panels: fitted risk by MR; fitted risk surface plot by MR; fitted risk surface plot by logistic regression.]

[Figure 7: Fitted risk (crosses) and lower confidence limits (circles) for inputs corresponding to high levels of fitted risk. Panels: Method 1; Method 2.]
