Making Good Prediction: A Theoretical Framework

Adeline Lo∗, Herman Chernoff†, Tian Zheng‡ and Shaw-Hwa Lo§

Current draft: October 2015. First draft: November 2014.

This is a draft only. Please do not circulate or distribute.

Abstract

We propose approaching prediction from a framework grounded in maximizing correct prediction rates. While this is intuitively obvious, not nearly enough attention has been paid to creating a clear theoretical framework for prediction; motivated by the needs of current genome-wide association studies (GWAS), we provide a discussion of such a framework. We lay out an objective function for correct prediction rates that we seek to maximize, then consider and ultimately reject an estimator based on the sample analog of the solution to the maximization problem, due to (1) the estimator's inability to distinguish between noisy and predictive variables, which directly leads to (2) an inability to estimate the sample analog. In response, we offer an alternative solution to the maximization problem. We demonstrate that the Partition Retention method's I-score is a measure that asymptotically approaches this alternative solution. Based on the I-score for a given variable set, we derive an asymptotic lower bound for the correct prediction rate. The I-score can differentiate between noisy and predictive variables as well. We offer simulations and applications of the I-score on real data to demonstrate its predictive performance on sample-constrained data. These show that the I-score can capture highly predictive variable sets and serves as a lower bound for the out-of-sample correct prediction rate. Future research on alternative solutions as sample-based measures of predictivity is much desired.

∗ Department of Political Science, University of California, San Diego
† Department of Statistics, Harvard University
‡ Department of Statistics, Columbia University
§ Department of Statistics, Columbia University

Introduction and Motivation

Prediction is a highly important goal for many scientists and has become increasingly important, and increasingly difficult, as the quantity and complexity of available data have grown. Complex and high dimensional data particularly demand attention. An early 2013 Nature Genetics article, "Predicting the influence of common variants" [1], identified prediction as an additional important goal for current genome-wide association studies (GWAS). However, prediction does not have a clear theoretical framework such as we have for statistical significance. Variables can be selected on the basis of measures of influence that are statistically significant, without requiring cross-validation efforts to support the findings; the same cannot be said of variables selected for prediction. That is, we do not have a theoretical framework from which to consider the problem of variable selection (VS) for variables that have high theoretical prediction rates, or high "predictivity" as we refer to it here.

Rather, VS for variable sets with high predictivity is currently conducted in two common ways. The first is VS through identification of statistically significant variables, measured through tests of statistical significance such as the chi-square test. The second is VS of variables that appear to do well on out-of-sample test data, as measured through testing-sample error rates. The first approach is still very much in use in predicting health outcomes (see [7] among others) but has been disappointing, particularly in GWAS, at predicting well. A recent New England Journal of Medicine article illustrates one example of such an approach: researchers constructed a model based on five genetic variants from GWAS results on prostate cancer, and they report that the variants do not increase predictive power [27]. Likewise, Gransbo et al. show that chromosome 9p21, while significantly associated with cardiovascular disease, does not improve risk prediction [7]. Using variants discovered to be statistically significant in association with disease outcomes to build predictive models does not seem to imply an automatic increase in the ability to predict. We point out in [16] that this first approach suffers from the problem that significant variables are not necessarily predictive and vice versa, so targeting significant variables might miss the goal of selecting variables with high predictivity. Indeed, this problem is prevalent in simple as well as complex data.

The second way of doing VS for prediction might do better, but it still measures sample-specific error rates. It requires setting aside testing data to determine how well selected predictors might do on "new data". However, as is the case in big data like GWAS, researchers often lack large enough sample sizes for this approach to use the data efficiently, and sometimes it is impossible to set aside enough testing data to evaluate all the possible variable sets. Understanding the theory behind prediction, that is, defining theoretical prediction rates and maximizing over them, may help us design better methods for selecting variable sets with high predictivity, particularly when sample sizes are moderate to small compared to the number of explanatory variables. It may also aid in the creation of measures designed to identify variables' theoretical correct prediction rates, rather than relying on ad hoc modeling and cross validation.
The hope is that such a logically driven approach, designing measures that identify a variable set's correct prediction rate, may be both fruitful and efficient (in the use of sample data) for guiding VS.

In this paper, we consider the endeavor of creating a prediction-based theoretical framework. Grounded in statistical theory, we highlight a new avenue of research towards creating sensible measures that target highly predictive variable sets (we refer mostly to variable sets here, which of course include variable sets of size one). We emphasize in particular genetic data, though we will show that the methods proposed can be easily tailored to other high-dimensional data in the sciences and the social sciences.

Our paper proceeds as follows. The first section considers the related literature on variable (or feature) selection in machine learning and computer science. While the literature is rich in techniques designed to improve prediction, most techniques are evaluated through out-of-sample or cross-validated error rates, which require setting aside some part of valuable sample data for the purpose of cross validation. Additionally, statistical significance is still a popular method of selecting variables. We highlight the need for a theoretically motivated measure of predictivity. The second section creates the set-up for considering maximally predictive variable sets in a theoretical world without sample constraints. We articulate the objective function and consider the solution to the maximization problem. We then consider a sample analog of the solution and reject it as a practical measure. We next consider an alternative solution to the objective function and demonstrate that the I-score asymptotically approaches this alternative solution. The third section demonstrates the effectiveness of the I-score as a measure of predictivity in simulations and real data applications. Finally, we offer some concluding remarks.

1 Variable selection literature

Variable selection (VS) or feature selection refers to the approach of selecting a subset of an original group of variables to construct a model. Often VS is used on data of large dimensionality with modest sample sizes [?]. In the context of high dimensional data, such as GWAS, with perhaps redundant or irrelevant information, this dimensionality reduction can be a very important step. Unlike projection or compression based approaches (such as principal component analysis or usage of information theory), VS methods do not change the variables themselves. The types of approaches and tools developed for feature selection are both diverse and varying in degrees of complexity; however, there is general agreement that three broad categories of feature selection methods exist: filter, wrapper and embedded methods. Filter approaches tend to select variables by ranking them by various measures (correlation coefficients, entropy, information gains, chi-square, etc.). Wrapper methods use "black box" learning machines to ascertain the predictivity of groups of variables; since wrapper methods train similar predictive models for each subset of variables, they can be computationally intensive. Embedded techniques search for optimal sets of variables via a built-in classifier construction; a popular example is the least absolute shrinkage and selection operator (LASSO) method for constructing a linear model, which penalizes the regression coefficients, shrinking many to zero (see the sketch below). Often cross-validation is used to evaluate the prediction rates. No measure currently exists, however, that directly attempts to evaluate a variable set's theoretical level of predictivity.
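To make the embedded category concrete, here is a minimal sketch of LASSO-based selection using scikit-learn on synthetic data. Everything in it (the data-generating model, the penalty level alpha, the variable indices) is our own illustration, not drawn from any study discussed here.

```python
# Minimal sketch of an embedded variable-selection method (LASSO).
# Synthetic data: only the first 3 of 50 variables truly influence y.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.1).fit(X, y)      # L1 penalty shrinks coefficients
selected = np.flatnonzero(model.coef_)  # variables kept with nonzero weight
print("selected variables:", selected)  # typically close to {0, 1, 2}
```

In practice the penalty level is itself chosen by cross validation, which is exactly the sample-splitting cost discussed in this section.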

For a more comprehensive survey of the feature selection literature see, among others, [?], [?], [?], and [?].

Although a spectrum of feature selection approaches exists, many scientists have taken the approach of tackling prediction through the usage of important and hard-to-discover influential variables found to be statistically significant in previous studies. In the context of high dimensional data, and in the spirit of further investigating variables known to be influential, it is reasonable to hope that these same variables can prove useful for predictive purposes as well. This approach is in some ways most similar to a univariate filter method, as it is independent of the classifier and has no cross-validation or prediction step for VS. We show in our related work [16] how and why the popular filter approach of variable selection through statistical significance differs from variable selection through predictivity. For an intuitive illustration of the relationship between predictive and significant sets of variables, see Figure 1. Under significance-test based searching for variable sets, the set of variables found to be significant expands as the sample size grows (see the widening orange dotted ovals). However, the set of predictive variables (blue circle) is not susceptible to sample-size changes in the same way, as predictivity is a population parameter, and it overlaps with, but is not one-to-one with, the significant sets. It is easy to see that in this scenario, targeting significant sets may miss the point of prediction entirely. Instead, we suggest that emphasis must be placed on creating measures designed to evaluate variable sets' predictivity.

Many methods also use out-of-sample testing error rates or cross validation (CV) to ascertain whether prediction is done well. This approach was not designed to find a theoretical "true" correct prediction rate for a given variable set; rather, it simply evaluates how well the variable sets selected on the training set predict in future data. Sometimes the variable sets in the training set are selected through statistics such as the adjusted R², the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). When p ≫ n (or even in instances where p > n), a standard situation in big data, however, these statistics can fail to be useful. (As [12] notes: "Unfortunately, the $C_p$, AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating $\hat\sigma^2$ [the variance] is problematic. (For instance, the formula for $\hat\sigma^2$ from Chapter 3 yields an estimate $\hat\sigma^2 = 0$ in this setting.) Similarly, problems arise in the application of the adjusted R² in the high-dimensional setting, since one can easily obtain a model with an adjusted R² value of 1.") Using out-of-sample testing and/or cross validation techniques additionally requires setting aside valuable sample data to make sure the variable sets selected on the training set are indeed highly predictive and are not just overfitting the data. It becomes important, then, that we have a good screening mechanism for removing noisy variables (and thus finding influential ones) when conducting VS. We show in our simulations how poorly VS for prediction can do when judged by training-set prediction rates compared to out-of-sample testing prediction rates (with "infinite" future testing data). An ideal measure of predictivity would guide the VS stage by screening out noisy variables and should correlate well with the out-of-sample correct prediction rate. We will demonstrate a potential candidate measure, the I-score, for evaluating the predictivity of a given variable set. We also compare against the bootstrap-cv and note that while the I-score's performance is quite similar, it is computationally faster and more efficient in the usage of sample data.

2 Theoretical prediction rates

If the goal is to find highly predictive variables and variable sets for some outcome, a natural approach is to find methods that maximize theoretical correct prediction rates and minimize error rates, that is, the rates of incorrect prediction. We show in this note that the Partition Retention (PR) method's I-score approaches the minimization of the prediction error rate when dealing with high dimensional data like GWAS, and is thus a prime candidate for a method that tackles the big-data prediction goal. An additional benefit of the I-score, as we will show, is that it easily lends itself to evaluation against the theoretical "best predictor".

2.1 The set-up

To target variable sets with high predictivity, we must design measures that maximize prediction rates. We first ask: what does the theoretical maximization of correct prediction rates mean? Consider GWAS data of the usual type, with cases and controls. Assume that there are n_d cases and n_u controls. (We consider prediction of a binary variable, i.e., classification, here. Our framework can be extended to prediction of a continuous variable; the I-score still plays a role in measuring predictivity in that case, though this is beyond the scope of this paper.) Prediction accuracy is evaluated either by cross-validation (CV) procedures or by an independent testing set with roughly equal sizes of cases and controls as in the training set. In the former case, the original sample of size n_d + n_u serves as a training set.

Using the traditional Bayesian binary classification setting, we ideally have a prior probability π(w = d) that the state of the next individual, w, is a disease case, d, and a prior probability π(w = u) = 1 − π(w = d) that the next individual is a control, u. In the following we shall assume that both d and u are equally likely and that the cost of an incorrect classification is the same for both possibilities; later, we generalize to different cost functions and priors for d and u. In assessing the predictivity of a variable set, ideally we know the joint distribution of the disease status w and the feature value x, denoted p(w, x), either through knowledge or through estimation via a reliable source. The joint distribution can be expressed as

\[ p(w, x) = \pi(w \mid x) \cdot p(x) = p(x \mid w) \cdot \pi(w), \]

where π(w | x) is the posterior distribution and π(w) is the prior. It is easy to see that the best classification rule is given by Bayes' decision rule for minimizing the posterior probability of error: predict d if π(d | x) > π(u | x), and u otherwise. Here x = (x_1, x_2, ..., x_m), with each x_j taking one of the values in {0, 1, 2}, corresponding to the three possible genotypes of each single-nucleotide polymorphism (SNP). In this way, x forms a partition, denoted by Π_x, with 3^m = m_1 elements (Π_x = {x = (x_1, x_2, ..., x_m) : x_j ∈ {0, 1, 2}, 1 ≤ j ≤ m}).

A problem that emerges when dealing with case-control data like GWAS is that prior information on observing the next person as a disease case is unavailable or unknown and cannot easily be estimated from empirical data. Priors are defined by the circumstances and contexts within which the case-control data are sampled; each dataset requires its own unique and unknown prior at that point in time. To surmount this, we focus first on the classification problem based on the class-conditional probabilities p(x | w = d) versus p(x | w = u). This is implicitly equivalent to the full Bayes decision rule when an equal prior is adopted, π(d) = π(u) = 1/2. General cases for arbitrary priors and unbalanced loss functions are discussed later.

Let X be a discrete random vector defined on the space Π_x, with density denoted by p_X(x). Suppose n_d cases and n_u controls are independently selected from two discrete probability distributions, p_{X^d}(x) and p_{X^u}(x), equivalent to p(x | w = d) and p(x | w = u) respectively. Note that {X^d, X^u} always come as a pair when the m SNPs are fixed (fixing Π_x); that is, X^d and X^u are defined on a common partition space, Π_x. If the newly arrived subject has a 50% chance of being either case or control, the expected error of adopting this rule is

\[ e[p_{X^d}, p_{X^u}] = \frac{1}{2} \sum_{x \in \Pi_x} \min\{p_{X^d}(x),\, p_{X^u}(x)\}. \]

The correct prediction rate c becomes

\[ c = c[p_{X^d}, p_{X^u}] = 1 - e[p_{X^d}, p_{X^u}] = \frac{1}{2} \sum_{x \in \Pi_x} \max\{p_{X^d}(x),\, p_{X^u}(x)\}. \]
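These two quantities are simple to compute once the pair {p_{X^d}, p_{X^u}} is known. The following minimal sketch uses toy class-conditional distributions of our own construction over the nine cells of a two-SNP partition, and it numerically checks both c = 1 − e and the identity derived as Equation (2) in the next subsection.

```python
# Sketch: expected error e and correct prediction rate c of the Bayes rule
# over a partition, under equal priors pi(d) = pi(u) = 1/2.
import numpy as np

cells = 3 ** 2                          # two SNPs -> 9 partition cells
rng = np.random.default_rng(1)
p_d = rng.dirichlet(np.ones(cells))     # toy p(x | d) over the cells
p_u = rng.dirichlet(np.ones(cells))     # toy p(x | u)

e = 0.5 * np.minimum(p_d, p_u).sum()    # expected error of the Bayes rule
c = 0.5 * np.maximum(p_d, p_u).sum()    # correct prediction rate
assert np.isclose(c, 1 - e)
assert np.isclose(c, 0.5 + 0.25 * np.abs(p_d - p_u).sum())  # Equation (2)
print(f"e = {e:.3f}, c = {c:.3f}")
```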

2.2 Maximizing the objective function

It is clear that one can achieve a better prediction rate by seeking the probability pair {p_{X^d}, p_{X^u}} that minimizes the expected predictive error e[p_{X^d}, p_{X^u}], or equivalently, maximizes the predictive rate c[p_{X^d}, p_{X^u}]. Equivalently,

\[ \frac{1}{2}\left\{ c[p_{X^d}, p_{X^u}] - e[p_{X^d}, p_{X^u}] \right\} = c[p_{X^d}, p_{X^u}] - \frac{1}{2} = \frac{1}{4} \sum_{x \in \Pi_x} \left| p_{X^d}(x) - p_{X^u}(x) \right|. \tag{1} \]

Therefore,

\[ c[p_{X^d}, p_{X^u}] = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_x} \left| p_{X^d}(x) - p_{X^u}(x) \right|. \tag{2} \]

This suggests that we can achieve better prediction rates by choosing variable sets corresponding to the probability pairs that maximize Equation (2), which we also refer to as Solution (1). In this theoretical setting, it is easy to show that c increases or stays the same when another variable is added to the current variable set (a sketch demonstrating this appears below). We can consider the solution of the maximization of the objective function as simply finding the set of X variables that maximizes c. In this setting, where sample size is no constraint, this characteristic is handy: we are never hurt in our search for highly predictive variables by simply adding explanatory variables to our current set. However, in the realistic world of sample-size constraints, a direct search for a variable set with a larger value of Equation (2) will fail; we give a heuristic explanation why in the following section.
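The monotonicity claim can be checked directly on a toy joint distribution: the sketch below compares c on the coarse partition generated by X1 alone against the finer partition generated by (X1, X2). The distributions are our own illustration.

```python
# Sketch: the theoretical rate c never decreases under a finer partition.
import numpy as np

rng = np.random.default_rng(2)
p_d = rng.dirichlet(np.ones(9)).reshape(3, 3)  # toy p(x1, x2 | d)
p_u = rng.dirichlet(np.ones(9)).reshape(3, 3)  # toy p(x1, x2 | u)

def c_rate(pd, pu):
    return 0.5 * np.maximum(pd, pu).sum()

c_fine = c_rate(p_d, p_u)                            # partition by (X1, X2)
c_coarse = c_rate(p_d.sum(axis=1), p_u.sum(axis=1))  # marginalize out X2
assert c_fine >= c_coarse - 1e-12
print(f"c(X1) = {c_coarse:.3f} <= c(X1, X2) = {c_fine:.3f}")
```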

2.3 Solution (1) and problems with the sample analog

First, as noted earlier, Solution (1), as seen in Equation (2), always increases (or stays the same) when more variables are added to the current variable set. Therefore, in principle, a unique variable set that maximizes the equation is difficult to identify. Second, in reality, the actual theoretical values of the equation are unknown and must be estimated, so we may naturally turn to an estimate of its true theoretical values. Due to a problem similar to that of the true equation, the estimated value under the sample analog (the empirical predictive training rate) is monotonically increasing with the addition of more variables into the existing variable set. To make this point clearly, assume all variables X_j can take only two values, 0 or 1. Suppose X_k = {X_1, ..., X_k} and X_{k+1} = {X_1, ..., X_k, X_{k+1}}. The partition formed by X_k is Π(X_k) = {A_1, ..., A_M}, while the partition formed by X_{k+1} is Π(X_{k+1}) = {A_1 ∩ B, ..., A_M ∩ B, A_1 ∩ B^c, ..., A_M ∩ B^c} = {Π(X_k) ∩ B, Π(X_k) ∩ B^c}, where B = {X_{k+1} = 1}. We then have Π(X_k)_1 = Π(X_k) ∩ {X_{k+1} = 1} and Π(X_k)_0 = Π(X_k) ∩ {X_{k+1} = 0}, where Π(X_k)_1 and Π(X_k)_0 form two finer partitions with Π(X_{k+1}) = Π(X_k)_0 ∪ Π(X_k)_1. Then

\[ \left| \hat{p}_{\Pi(X_k)}(d) - \hat{p}_{\Pi(X_k)}(u) \right| \;\le\; \left| \hat{p}_{\Pi(X_k)_0}(d) - \hat{p}_{\Pi(X_k)_0}(u) \right| + \left| \hat{p}_{\Pi(X_k)_1}(d) - \hat{p}_{\Pi(X_k)_1}(u) \right|. \]

It is thus easy to see that the sample analog inherently favors an increase in the number of partition elements. As the partition becomes increasingly finer, a point is reached where there is at most a single observation within each partition element, and a 100% correct sample prediction rate is attained, regardless of the true prediction rate. The final estimated prediction rate is then 100%, rendering the sample analog useless as a method for searching for highly predictive variable sets and screening out noisy ones. This is a direct result of a sparsity problem that does not occur in the theoretical world but certainly plagues the sample-size constrained real world. In our sample-constrained setting, the natural sample analog of Equation (2) favors ever growing the variable set with truly influential as well as noisy, uninfluential variables; we continue accepting both types of variables until our partition experiences complete sparsity. What is needed is a "sample-based measure" that can discern adding noisy versus influential variables and identify the variable set(s) that maximize prediction rates under a given sample size. A small simulation below illustrates the inflation of the training rate.
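In the sketch below, a training-sample majority-vote rule is applied to pure-noise binary variables; the empirical (training) correct prediction rate climbs toward 100% as more noise variables refine the partition, even though the true predictivity is exactly 1/2. All settings are our own illustration.

```python
# Sketch: the sample analog of (2) is driven toward a 100% "correct
# prediction" rate by pure-noise variables as the partition grows finer.
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = rng.integers(0, 2, size=n)           # 0 = control, 1 = case, pure chance
X = rng.integers(0, 2, size=(n, 12))     # 12 binary noise variables

def training_rate(X_sub, y):
    # majority vote within each occupied cell of the empirical partition
    cells = {}
    for xi, yi in zip(map(tuple, X_sub), y):
        cells.setdefault(xi, []).append(yi)
    correct = sum(max(np.bincount(v)) for v in cells.values())
    return correct / len(y)

for k in (1, 2, 4, 8, 12):
    print(k, "noise variables -> training rate", training_rate(X[:, :k], y))
# The printed rate climbs toward 1.0; the true prediction rate is 0.5.
```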

2.4 Alternative Solution (2)

We consider this obstacle and suggest the I-score of the PR method as a possible approach to countervailing the inability to maximize the sample analog of the prediction rate. We suggest an alternative solution, Solution (2), for which we can provide a measure in the sample-constrained world. We first state a simple lemma.

Lemma 1. Let a_1, a_2, ..., a_k be k nonnegative numbers. Then

\[ \sum_{j=1}^{k} a_j \;\ge\; \sqrt{\sum_{j=1}^{k} a_j^2}. \]

If we replace a_j by |p(x_j|d) − p(x_j|u)| for 1 ≤ j ≤ k, it is clear that maximizing \(\sum_{j=1}^{k} (p(x_j|d) - p(x_j|u))^2\) over the possible probability pairs will have the parallel effect of encouraging the selection of probability pairs that satisfy the maximization in Equation (1), yielding a better predictor. In the I-score asymptotics section that follows, we show that the I-score can be seen asymptotically as precisely the maximization, up to a constant, of the term

\[ A(\Pi_x) = \sum_{j=1}^{k} \big( p(x_j|d) - p(x_j|u) \big)^2. \]

Since \(\sum_{j=1}^{k} |p(x_j|d) - p(x_j|u)| \ge \sqrt{\sum_{j=1}^{k} (p(x_j|d) - p(x_j|u))^2}\), a strategy that seeks a variable set with a larger value of A(Π_x) (asymptotically equivalent to a variable set with a larger I-score) will automatically have the effect of searching for a variable set with a better prediction rate. The rationale is that alternative Solution (2) also encourages satisfying the maximization problem of Equation (1), but the measure we suggest for alternative Solution (2), the I-score, can importantly discern noisy variables from influential ones (for theoretical work supporting this characteristic of the I-score, see [4]).
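Lemma 1 is the familiar fact that, for nonnegative numbers, the sum dominates the root-sum-of-squares; a one-line numerical check:

```python
# Quick check of Lemma 1: sum(a_j) >= sqrt(sum(a_j ** 2)) for a_j >= 0.
import numpy as np

a = np.random.default_rng(4).uniform(size=10)
assert a.sum() >= np.sqrt((a ** 2).sum())
print(f"{a.sum():.3f} >= {np.sqrt((a ** 2).sum()):.3f}")
```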

3 The I-score

We can show that, as sample sizes increase, identifying a cluster of variables with a larger influential score, or I-score, will simultaneously yield a cluster with high predictivity. The key insight is that the squared right-hand side of Equation (1) (ignoring the constant 1/4) is directly related to the asymptotic I-score through Lemma 1.

The Influential score (I-score) is derived from the Partition Retention (PR) method. Several forms and variations were associated with the PR method before it was finally coined with this name in 2009 [4]. We introduce the PR method and the I-score briefly here. Consider a set of n observations of a disease phenotype Y (dichotomous or continuous) and a large number S of SNPs, X_1, X_2, ..., X_S. Randomly select a small group of m SNPs; for simplicity of presentation, call them X_k, k = 1, ..., m, each taking values 0, 1, and 2 (corresponding to the three genotypes at a SNP locus: A/A, A/B and B/B). There are then m_1 = 3^m possible values for each set of X's. Partition the n observations into m_1 cells according to the values of the m SNPs and refer to this partition as Π. (The partition Π here is equivalent to the earlier discrete random variable X's finite set of values. We use GWAS data to motivate our presentation of the I-score and PR method, but the approach works for any discrete data.) The proposed I-score is designed to place greater weight on cells that garner more observations:

\[ I_\Pi = \sum_{j=1}^{m_1} \left( \frac{n_j}{n} \right)^2 \frac{(\bar{Y}_j - \bar{Y})^2}{s_n^2 / n} = \frac{\sum_{j=1}^{m_1} n_j^2 (\bar{Y}_j - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}, \tag{3} \]

where n_j is the number of observations in cell j, \(\bar{Y}_j\) is the average of Y within cell j, and \(s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2\).

3.1 Asymptotics of the I-score

The I-score is able to serve as a measure for alternative Solution (2) because, as we demonstrate here, the I-score approaches the desired squared-term measure in Solution (2) asymptotically. It is also formally proven in [4] to be able to screen out noisy variables; we demonstrate this screening ability in the simulation section of this paper. Under the null hypothesis of no association between {X_j, j = 1, ..., m} and Y, I_Π can be asymptotically expressed as \(\sum_{j=1}^{m_1} p_j \chi_j^2\) (a weighted average), where p_j is the probability of the jth cell and \(\{\chi_j^2\}\) are m_1 chi-square variables, each with one degree of freedom [4]. The above formulation and properties of I_Π apply to the specified Y model with a case-control study (where Y = 1 designates a case and Y = 0 a control as a special example), as demonstrated in [4]. More specifically, in a case-control study with n_d cases and n_u controls, n s_n^2 I_Π can be expressed as follows:

\[ n s_n^2 I_\Pi = \sum_{x_j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2 \]
\[ = \sum_{j=1}^{3^m} \big( n^m_{d,j} + n^m_{u,j} \big)^2 \left( \frac{n^m_{d,j}}{n^m_{d,j} + n^m_{u,j}} - \frac{n_d}{n_d + n_u} \right)^2 \]
\[ = \sum_{j=1}^{3^m} \left( \frac{n_d\, n_u}{n_d + n_u} \right)^2 \left( \frac{n^m_{d,j}}{n_d} - \frac{n^m_{u,j}}{n_u} \right)^2 = \left( \frac{n_d\, n_u}{n_d + n_u} \right)^2 \sum_{x_j \in \Pi} \left( \frac{n^m_{d,j}}{n_d} - \frac{n^m_{u,j}}{n_u} \right)^2, \]

where n^m_{d,j} and n^m_{u,j} denote the numbers of cases and controls falling in the jth cell, and Π stands for the partition

formed by the m variables. Since the PR method seeks the partition that yields the largest I-score (the PR method encompasses a backward dropping algorithm (BDA) introduced in [20]; we directly cite and present the BDA in Appendix III), one can decompose the following:

\[ n s_n^2 I_\Pi = \sum_{x_j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2 = A_n + B_n + C_n, \]

where

\[ A_n = \sum_{x_j \in \Pi} n_j^2 (\bar{Y}_j - \mu_j)^2, \qquad B_n = \sum_{x_j \in \Pi} n_j^2 (\bar{Y} - \mu_j)^2, \qquad C_n = \sum_{x_j \in \Pi} -2 n_j^2 (\bar{Y}_j - \mu_j)(\bar{Y} - \mu_j). \]

Here, μ_j and μ are the local and grand means of Y; that is, \(E(\bar{Y}_j) = \mu_j\) and \(\bar{Y} = \mu = \frac{n_d}{n_d + n_u}\) for fixed n.

It is easy to see that both terms A_n and C_n, when divided by n² (the square of the sample size), converge to 0 in probability as n → ∞. We turn to the final term, B_n. Note that

\[ \lim_n \frac{B_n}{n^2} = \lim_n \sum_{x_j \in \Pi} \frac{n_j^2}{n^2} (\mu_j - \mu)^2. \]

In a case-control study, we have

\[ \mu_j = \frac{n_d\, p(x_j|d)}{n_d\, p(x_j|d) + n_u\, p(x_j|u)} \qquad \text{and} \qquad \mu = \frac{n_d}{n_d + n_u}. \]

Since for every j, n_j/n converges in probability to \(p_j = \lambda\, p(x_j|d) + (1-\lambda)\, p(x_j|u)\) as n → ∞, where \(\lim_n n_d/n = \lambda\) is a fixed constant between 0 and 1, it follows that, as n → ∞,

\[ \frac{B_n}{n^2} = \sum_{x_j \in \Pi} \frac{n_j^2}{n^2} (\mu_j - \mu)^2 \;\xrightarrow{p}\; \sum_{x_j \in \Pi} p_j^2 \left( \frac{\lambda\, p(x_j|d)}{\lambda\, p(x_j|d) + (1-\lambda)\, p(x_j|u)} - \lambda \right)^2 \]
\[ = \sum_{x_j \in \Pi} \Big( \lambda\, p(x_j|d) - \lambda \big[ \lambda\, p(x_j|d) + (1-\lambda)\, p(x_j|u) \big] \Big)^2 \]
\[ = \sum_{x_j \in \Pi} \Big( \lambda(1-\lambda)\, p(x_j|d) - \lambda(1-\lambda)\, p(x_j|u) \Big)^2 = \lambda^2 (1-\lambda)^2 \sum_{x_j \in \Pi} \big[ p(x_j|d) - p(x_j|u) \big]^2. \]

Thus, ignoring the constant term in the limit above, the I-score can guide a search for partitions Π which will lead to maximizing the summation term \(\sum_{x_j \in \Pi} [p(x_j|d) - p(x_j|u)]^2\). Recall c in Equation (2):

\[ c[p_{X^d}, p_{X^u}] = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_x} \left| p_{X^d}(x) - p_{X^u}(x) \right| = \frac{1}{2} + \frac{1}{4} \sum_{x_j \in \Pi} \left| p(x_j|d) - p(x_j|u) \right|. \]

We have proved the following Theorem 1.

THEOREM 1. Under the assumption that n_d/n → λ, a value strictly between 0 and 1, and the adoption of an equal prior π(d) = π(u) = 1/2,

\[ \lim_{n \to \infty} \frac{s_n^2 I_\Pi}{n} = \lambda^2 (1-\lambda)^2 \sum_{x_j \in \Pi} \big[ p(x_j|d) - p(x_j|u) \big]^2. \tag{4} \]

COROLLARY 1. Under the same assumptions as in Theorem 1, the following is an asymptotic lower bound for the correct prediction rate:

\[ c[p_{X^d}, p_{X^u}] \;\ge\; \lim_{n \to \infty} \left( \frac{1}{2} + \frac{1}{4} \sqrt{\frac{I_\Pi}{n \lambda (1-\lambda)}} \right) = \frac{1}{2} + \frac{1}{4} \sqrt{\lim_{n \to \infty} \frac{I_\Pi}{n \lambda (1-\lambda)}}. \tag{5} \]

Proof of Equation (5): The asymptotic lower bound of Equation (5) is a simple consequence of Lemma 1 and Theorem 1. In theory, the above corollary allows us to apply a useful lower bound for identifying good variable sets by evaluating their I-scores. In practice, however, once the variable sets are found, the correct prediction rates may be larger than the identified lower bounds. Theorem 1 provides a simple asymptotic behavior of the I-score under some strict assumptions; we offer similar derivations below following two levels of relaxation of the constraints.
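Theorem 1 and the Corollary 1 bound can be checked numerically. The sketch below reuses the i_score function from the earlier sketch, draws a large balanced case-control sample from toy cell probabilities of our own choosing, and compares s_n² I_Π / n against the limit in Equation (4), and the Equation (5) bound against the true rate from Equation (2).

```python
# Sketch: numerical check of Theorem 1 and the Corollary 1 lower bound.
# Reuses i_score from the earlier sketch; all probabilities are toy values.
import numpy as np

rng = np.random.default_rng(6)
cells = 9                                    # a 3**2-cell partition
p_d = rng.dirichlet(np.ones(cells))          # p(x_j | d)
p_u = rng.dirichlet(np.ones(cells))          # p(x_j | u)
lam = 0.5                                    # n_d / n -> lambda

n = 200_000
n_d = int(lam * n)
x = np.r_[rng.choice(cells, size=n_d, p=p_d),        # cases
          rng.choice(cells, size=n - n_d, p=p_u)]    # controls
y = np.r_[np.ones(n_d), np.zeros(n - n_d)]

I = i_score(x.reshape(-1, 1), y)
theory = lam**2 * (1 - lam)**2 * ((p_d - p_u) ** 2).sum()
print("s_n^2 I / n =", y.var() * I / n, " theory:", theory)   # Theorem 1

c_true = 0.5 + 0.25 * np.abs(p_d - p_u).sum()                 # Equation (2)
bound = 0.5 + 0.25 * np.sqrt(I / (n * lam * (1 - lam)))       # Corollary 1
print("lower bound", round(bound, 3), "<= c =", round(c_true, 3))
```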

COROLLARY 2. Under the assumptions of an arbitrary prior π(d) and n_d/n → λ as n → ∞, the correct prediction rate can easily be seen to be

\[ c^*[p_{X^d}, p_{X^u}] = \frac{1}{2} + \frac{1}{2} \sum_{x_j \in \Pi} \left| p(x_j|d)\,\pi(d) - p(x_j|u)\,\pi(u) \right|. \tag{6} \]

Let the modified score \(I^*_\Pi\) be defined by

\[ n s_n^2 I^*_\Pi = \frac{1}{4} \sum_{x_j \in \Pi} n_j^2 \left[ \bar{y}_j \frac{\pi(d)}{\lambda} - (1 - \bar{y}_j) \frac{\pi(u)}{1-\lambda} \right]^2. \tag{7} \]

Then we have

\[ \frac{s_n^2 I^*_\Pi}{n} \;\xrightarrow{p}\; \frac{1}{4} \sum_{x_j \in \Pi} \big[ p(x_j|d)\,\pi(d) - p(x_j|u)\,\pi(u) \big]^2 \quad \text{as } n \to \infty. \tag{8} \]

A lower bound similar to Corollary 1 can then be derived:

\[ c^*[p_{X^d}, p_{X^u}] = \frac{1}{2} + \frac{1}{2} \sum_{x_j \in \Pi} \left| p(x_j|d)\,\pi(d) - p(x_j|u)\,\pi(u) \right| \;\ge\; \frac{1}{2} + \sqrt{\lim_{n \to \infty} \frac{\lambda(1-\lambda)\, I^*_\Pi}{n}}. \tag{9} \]

Sketched proof of (8) and (9): Equation (7) can be expressed as

\[ n s_n^2 I^*_\Pi = \frac{1}{4} \sum_{x_j \in \Pi} n_j^2 \left[ \bar{y}_j \frac{\pi(d)}{\lambda} - (1-\bar{y}_j) \frac{\pi(u)}{1-\lambda} \right]^2 = \frac{1}{4} \sum_{x_j \in \Pi} n_j^2 \left[ \frac{n_{d,j}}{n_j} \frac{\pi(d)}{\lambda} - \frac{n_{u,j}}{n_j} \frac{\pi(u)}{1-\lambda} \right]^2 \]
\[ \approx \frac{n^2}{4} \sum_{x_j \in \Pi} \left[ \frac{n_{d,j}}{n} \cdot \frac{n\,\pi(d)}{n_d} - \frac{n_{u,j}}{n} \cdot \frac{n\,\pi(u)}{n_u} \right]^2 \approx \frac{n^2}{4} \sum_{x_j \in \Pi} \big[ p(x_j|d)\,\pi(d) - p(x_j|u)\,\pi(u) \big]^2, \tag{10} \]

where n_{d,j} and n_{u,j} are the numbers of cases and controls in the jth cell. Dividing both sides of Equation (10) by n², we obtain Equation (8) as n → ∞. Similar to Corollary 1, Equation (9) is a direct consequence of Equation (8) and Lemma 1 (with a_j replaced by |p(x_j|d)π(d) − p(x_j|u)π(u)|).

The last generalization of the proposed framework involves the possibility of incurring different costs (or losses) when making incorrect predictions. Thus far we have used a 0-1 loss on the binary classification problem, which treats the two incorrect decisions equally. In real applications, the scientist may wish to weigh the costs of different incorrect predictions differently. For instance, failing to detect a cancer patient may be deemed a more costly mistake than misclassifying a healthy patient, because ameliorating the former mistake later on can be more difficult. Another example is the differing costs in making a loan decision: the cost of lending to a defaulter may be seen as greater than the loss-of-business cost of declining a loan to a non-defaulter, due to some positive level of risk aversion. Let the loss function L be defined as

\[ L(d, u) = l_1, \qquad L(u, d) = l_2 \tag{11} \]

and

\[ L(d, d) = L(u, u) = 0, \tag{12} \]

where l_1 and l_2 are the prices paid (or losses incurred) for misclassifying a diseased individual to the healthy class or a healthy person to the diseased class, respectively. It is easy to derive the optimum Bayes solution by minimizing the expected predictive loss, that is, to assign future observations to the class with the smaller expected loss, given the x_j value. We simply assign a test sample with partition (predictor) x_j to d if

\[ p(x_j|d)\,\pi(d)\,L(u, d) < p(x_j|u)\,\pi(u)\,L(d, u), \]

and otherwise assign it to u. Equivalently, choose d if

\[ p(x_j|d)\,\pi(d)\,l_2 < p(x_j|u)\,\pi(u)\,l_1, \]

and u otherwise. The expected loss of adopting this rule is thus

\[ e_l = \frac{1}{2} \sum_{x_j \in \Pi_x} \min\{a_{x_j}, b_{x_j}\}, \]

where \(a_{x_j} = p(x_j|d)\,\pi(d)\,l_2\) and \(b_{x_j} = p(x_j|u)\,\pi(u)\,l_1\). The random rule of choosing d or u by flipping a coin has an expected loss of

\[ \gamma = \frac{1}{2} \sum_{x_j \in \Pi_x} (a_{x_j} + b_{x_j}) = \frac{1}{2} \big( \pi(d)\, l_2 + \pi(u)\, l_1 \big), \]

a constant independent of the partition Π_x. So the "correct prediction gain" c_l (interpreted as the improvement of correct prediction with respect to γ) can be defined as

\[ c_l = \frac{1}{2} \sum_{x_j \in \Pi_x} \max\{a_{x_j}, b_{x_j}\} = \frac{1}{2} \sum_{x_j \in \Pi_x} (a_{x_j} + b_{x_j}) - e_l = \gamma - e_l. \]

Again we have

\[ c_l = \frac{\gamma}{2} + \frac{c_l - e_l}{2} = \frac{\gamma}{2} + \frac{1}{2} \sum_{x_j \in \Pi_x} \left| a_{x_j} - b_{x_j} \right|. \]

After standardizing by γ, we obtain the correct prediction rate as

\[ c = \frac{c_l}{\gamma} = \frac{1}{2} + \frac{1}{2\gamma} \sum_{x_j \in \Pi_x} \left| a_{x_j} - b_{x_j} \right|. \]

Collecting the above discussion together, let the cost-based I-score \(I^c_\Pi\) be defined by

\[ n s_n^2 I^c_\Pi = \frac{1}{4\gamma^2} \sum_{x_j \in \Pi_x} n_j^2 \left[ \bar{y}_j \frac{\pi(d)}{\lambda} l_2 - (1-\bar{y}_j) \frac{\pi(u)}{1-\lambda} l_1 \right]^2 \approx \frac{n^2}{4\gamma^2} \sum_{x_j \in \Pi_x} \big[ p(x_j|d)\,\pi(d)\,l_2 - p(x_j|u)\,\pi(u)\,l_1 \big]^2. \tag{13} \]

We present the following lower bound in Corollary 3.

COROLLARY 3. Under the assumptions of Corollary 2 and using the loss function L described in Equation (11),

\[ \lim_{n \to \infty} \frac{s_n^2 I^c_\Pi}{n} = \frac{1}{4\gamma^2} \sum_{x_j \in \Pi_x} \big[ p(x_j|d)\,\pi(d)\,l_2 - p(x_j|u)\,\pi(u)\,l_1 \big]^2. \tag{14} \]

Furthermore, one can derive a similar lower bound for the correct prediction rate c:

\[ c = \frac{1}{2} + \frac{1}{2\gamma} \sum_{x_j \in \Pi_x} \left| a_{x_j} - b_{x_j} \right| \;\ge\; \lim_{n \to \infty} \left( \frac{1}{2} + \frac{1}{2\gamma} \sqrt{\frac{\lambda(1-\lambda)\, I^c_\Pi}{n}} \right) = \frac{1}{2} + \frac{1}{2\gamma} \sqrt{\lim_{n \to \infty} \frac{\lambda(1-\lambda)\, I^c_\Pi}{n}}. \tag{15} \]

The proofs of Equations (14) and (15) are quite similar to that of Corollary 2 given above; we omit them. Note that searching for X with larger I-scores is asymptotically equivalent to maximizing a summation term which is the square of the right-hand side of Equation (1); in this way, the I-score approximates the maximization of predictivity. The theoretical implication is that the I-score not only approaches prediction by actually seeking to maximize a term that is the square of the exact measure of maximal predictivity, but can also subsequently be held to a theoretical benchmark and evaluated as a prediction method. For example, if a variable set X has a large I-score (substantially larger than 1, see [4]), it is a strong indication that X itself must be a variable set with good predictivity. This stands in contrast to many current approaches to prediction (e.g., Random Forest, LASSO), which are evaluated for predictivity in an ad hoc manner, such as via cross validation.

3.2 Desirable properties of the I-score

We note that the I-score is one possible approach to approximating the maximization of the prediction rate in sample-analog form, and that the search for other potential scores is desirable and needed. Nevertheless, several additional properties of I are particularly appealing. First, I requires no specification of a model for the joint effect of {X_1, X_2, ..., X_m} on Y; it is designed to capture the discrepancy between the conditional means of Y on {X_1, X_2, ..., X_m} and the mean of Y. Second, when a variable independent of the outcome is added to a group of variables which may have an influence (adding a noisy variable to an influential variable set), the I-score tends to decrease: I can screen out noisy variables. The Pearson chi-square statistic does not have this property; rather, it tends to increase, leading to a tendency to adjoin useless variables to those considered (a sketch contrasting the two appears below). As demonstrated earlier, the sample analog of Solution (1) also suffers from this problem. Third, as mentioned earlier, the I-score does not monotonically increase with the addition of any and all variables, as the sample analog of Equation (1) would. Rather, given a variable set of size m with m − 1 truly influential variables, the I-score is higher under the influential m − 1 variables than under all m variables, while further dropping to m − 2 variables leads to a decrease in the I-score. This natural tendency of the I-score to "peak" at the variable set(s) that lead to the maximum predictive rate in the face of noisy variables, under the current sample size, is crucial. Most importantly, we showed that the I-score approximates maximizing Equation (1) by identifying the cluster of variables that maximizes the term \(\sum_{j=1}^{k} (p(x_j|d) - p(x_j|u))^2\), which is directly related (less than or equal) to the correct prediction rate minus the error rate. The quadratic nature of the term the I-score seeks to maximize relates directly to the I-score's ability to commence partition-seeking for the most predictive group of X variables from a very large number of variables: there is an inherent maximum I-score. Note that a score or measure that exactly resembles either side of Equation (1) does not offer this assurance of maxima; we have discussed how such a measure (the sample analog of Solution (1)) would always favor absolute increases in the number of variables and monotonically decrease with decreases in the number of variables. Finally, the I-score can be directly evaluated against the theoretical benchmark of maximal predictivity through a simple algebraic translation of the score.
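The second property, the screening of noisy variables, is where the I-score and the chi-square statistic part ways, and it can be seen in a few lines. The sketch below reuses the i_score function from the earlier sketch together with scipy's chi2_contingency; the data-generating model is our own illustration, and the direction of the changes is a tendency, not a guarantee for every sample.

```python
# Sketch: adding a pure-noise variable tends to *increase* the Pearson
# chi-square statistic but *decrease* the I-score.
import numpy as np
from scipy.stats import chi2_contingency

def chisq(X, y):
    # build the (partition cell) x (control, case) contingency table
    keys = list(map(tuple, X))
    uniq = sorted(set(keys))
    table = np.zeros((len(uniq), 2))
    for k, yi in zip(keys, y):
        table[uniq.index(k), int(yi)] += 1
    return chi2_contingency(table)[0]

rng = np.random.default_rng(7)
n = 1000
x1 = rng.integers(0, 3, size=n)                  # influential variable
noise = rng.integers(0, 3, size=(n, 1))          # independent of y
y = (rng.uniform(size=n) < np.where(x1 == 2, 0.7, 0.4)).astype(int)

print("I   :", i_score(np.c_[x1], y), "->", i_score(np.c_[x1, noise], y))
print("chi2:", chisq(np.c_[x1], y), "->", chisq(np.c_[x1, noise], y))
```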

4 Using the I-score in sample-constrained settings

We have shown that the I-score asymptotically approaches Solution (2) and has several desirable properties. We take this opportunity to explore and illustrate the application of the I-score to data. To provide additional evidence of the I-score's ability to measure true predictivity, we consider a set of simulations for which we know the "true" levels of predictivity for all variable sets. We also provide a real data application on breast cancer for which the I-score approach has done very well in predicting.

We take a moment to comment that evaluating a variable (or variable set) for predictivity, which is the variable selection (VS) stage, is different from evaluating a given classifier, which is the prediction stage. The latter considers evaluation of predictivity for a function f(x) of a set of explanatory variables for some outcome variable y, while the former considers the underlying predictivity of the set of explanatory variables x for that outcome y. Often we see the two stages conducted simultaneously, as is common in approaches such as LASSO or other regression-based approaches. Our work here focuses simply on the VS stage. Variables selected as highly predictive in our framework can then be flexibly used in various models for prediction purposes as pleases the researcher.

We are now in an odd situation where we have identified variable sets that could not have been found using current traditional approaches, and yet we would like to evaluate the predictivity of our identified variable module against traditional approaches. Nevertheless, we endeavor to do so by comparing our measure of predictivity of a given variable set, which is motivated through a theoretical foundation, with current sample-based evaluations of prediction errors. Several options arise: training, testing, as well as bootstrap-corrected error rates. We utilize these current options as a basis of comparison against our theory-based approach towards measuring underlying levels of predictivity of variable sets. We will show that the measure used in our approach provides a useful approximation of a lower bound for the "true" level of out-of-sample testing predictivity, and correlates well with it, while the training rate does not correlate well with the out-of-sample prediction rate. Bootstrap-cv correlates with the out-of-sample prediction rate in a fashion similar to the I-score lower bound, but is tremendously more computationally taxing (if one uses B bootstrap replicates to estimate the cv rate, it takes B-fold more time compared to computing the I-score). As such, our approach has an important benefit for prediction research: compared with methods such as cross validation of error rates, the I-score is efficient in the usage of sample data, in the sense that it uses all observations instead of separating data into testing and training, and is also less computationally cumbersome, as it is a single measure calculated per variable set.
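The sketch below illustrates this comparison on simulated data of our own design: for an influential pair of SNPs and for a pure-noise pair, it computes the Corollary 1 lower bound from the training data alone (reusing the i_score function from the earlier sketch) and compares it to the correct prediction rate of a majority-vote rule on effectively infinite test data. Since the bound is asymptotic, small finite-sample excursions above the test rate can occur in the pure-noise case, where both numbers hover near 1/2.

```python
# Sketch: I-score lower bound (from training data only) vs. the
# out-of-sample correct prediction rate on a huge test set.
import numpy as np

rng = np.random.default_rng(8)

def simulate(n, informative):
    x = rng.integers(0, 3, size=(n, 2))
    p = 0.25 + 0.5 * (x.sum(axis=1) >= 3) if informative else np.full(n, 0.5)
    return x, (rng.uniform(size=n) < p).astype(int)

def majority_rule(x_tr, y_tr):
    cells = {}
    for xi, yi in zip(map(tuple, x_tr), y_tr):
        cells.setdefault(xi, []).append(yi)
    return {k: int(np.mean(v) > 0.5) for k, v in cells.items()}

for informative in (True, False):
    x_tr, y_tr = simulate(400, informative)
    x_te, y_te = simulate(100_000, informative)   # "infinite" test data
    I, lam = i_score(x_tr, y_tr), y_tr.mean()
    bound = 0.5 + 0.25 * np.sqrt(I / (len(y_tr) * lam * (1 - lam)))
    rule = majority_rule(x_tr, y_tr)
    test = np.mean([rule.get(tuple(xi), 0) == yi
                    for xi, yi in zip(x_te, y_te)])
    print(f"informative={informative}: bound {bound:.3f}, test {test:.3f}")
```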

4.1 Simulations

(section to be updated)

4.2 Real data applications

4.2.1 Application to the van 't Veer breast cancer data

To reinforce the previous sections, we turn to a brief analysis of real disease data. As noted before, part of this research team has discovered that applying the PR approach to real disease data has not only been quite successful in finding variable sets (thus encompassing higher order interactions, traditionally rather tricky in big data), but has also resulted in variable sets that are very predictive yet do not necessarily show up as significant through traditional significance testing. (Here "predictive" refers both to a high I-score and to high correct prediction rates in k-fold cross-validation testing.) We present examples of some discovered variable sets, found in the analysis and data used in [20], that are highly predictive for a real data set on breast cancer yet are not highly significant. (We note an inherent difficulty in presenting the reverse situation, that of finding the most significant variable sets in the breast cancer data and determining their predictivity rates. This is precisely because the PR approach allows for higher order interaction searches, which is more difficult using current common approaches. While it is possible to use common approaches to discover marginally significant variables, or possibly two-way interactions, and then determine their predictivity rates, capturing up to 5-way interactions, as shown in our presentation here using the PR approach, is not yet feasible as of the date of this writing.)

    Systematic name    Gene name    Marginal p-value
 1  Contig45347_RC     KIAA1683     0.008
 2  NM_005145          GNG7         0.54
 3  Z34893             ICAP-1A      0.15
 4  NM_006121          KRT1         0.9
 5  NM_004701          CCNB2        0.003

 Joint I-score: 2.89    Joint p-value: 0.005    Family-wise threshold: 6.98×10⁻⁵

Table 1: Real data example, van 't Veer breast cancer data.

In Table 1 we investigate the top 5-variable module found to be predictive through both a top I-score and performance in prediction in cross-validation and an independent testing set in [20]. To find how significant these variables are, we calculate the individual, marginal association of each variable as the marginal p-value. Given the family-wise p-value threshold of 6.98×10⁻⁵, none of these variables shows up as statistically significant. Measuring the joint influence of all five variables does not yield a significant p-value either. Using the variable sets that appeared to have the highest I-scores to predict on this dataset resulted in an out-of-sample testing error rate of 8%, in direct comparison with the literature's best error rates of 30%.

5 Concluding remarks

Prediction has become more important in recent decades and, with it, the need for tools appropriate for good prediction. A first step in prediction is to find variable sets that are highly predictive, what we have called the VS stage in this paper. We have shown in other work that approaching VS through selection of variables by a statistical significance criterion (for instance, using the chi-square test statistic) is not ideal [16]. The current alternative is to conduct VS via sample-based, out-of-sample testing error rates. This approach is ad hoc in nature, sample-based, and does not measure a theoretical underlying level of predictivity for a given variable set. Often, validation of selected candidate variable sets requires setting aside valuable sample data for out-of-sample testing or cross validation; sometimes the sample size may not suffice for validating variable set sizes larger than one or two variables, as is often the case in big data like GWAS. As such, prediction research would benefit from a theoretical framework for finding very predictive variables.

We believe our work here is a first effort in that direction, by considering what theoretically maximally predictive variable sets are, and how we might try to find them. We identify the equation for the theoretical maximally predictive variable sets in Equation (2) and then demonstrate that, unfortunately, the sample analog of Solution (1) is quite useless in a sample-constrained world. As such, we offer an alternative solution, whereby we approximate the solution to the equation we wish to maximize; importantly, we show that the I-score has a natural tendency to discard noisy variables, keep influential ones, and asymptotically approach this alternative solution to Equation (2) for predictive variables. The I-score does well in both our complex simulations as well as real data applications vis-à-vis correct prediction rates. We note that other measures with such desirable properties may also exist, and we encourage rigorous research in this direction. As a new field of inquiry, the search for measures that maximize predictivity may do much to advance the prediction of outcomes of interest, such as disease status.

Figures

[Figure 1: image not recoverable from the extraction. Recovered labels: "Set of variable modules with predictive power above certain threshold"; "Significant set by a test using a small sample"; "Significant set by a test using a large sample"; "Significant set by a test using a huge sample".]

Figure 1: Illustration of the relationship between predictive and significant sets of variable sets. The rectangular space denotes all candidate variable sets. Significant sets are identified through traditional significance tests.

6 Appendix I: Simulation details

(to be updated)

7 Appendix III

The backward dropping algorithm (BDA) is a greedy algorithm that searches for the variable subset maximizing the I-score through stepwise elimination of variables from an initial subset, sampled in some way from the variable space. (The presentation of the backward dropping algorithm is taken directly from section 2.2 of [20]; for further details, see [20].) The details are as follows; a sketch implementation follows the steps.

1. Training set: Consider a training set {(y_1, x_1), ..., (y_n, x_n)} of n observations, where x_i = (x_{1i}, ..., x_{pi}) is a p-dimensional vector of explanatory variables. Typically p is very large. All explanatory variables are discrete.

2. Sampling from variable space: Select an initial subset of k explanatory variables S_b = {X_{b1}, ..., X_{bk}}, b = 1, ..., B.

3. Compute I-score: \( I(S_b) = \sum_{j \in \Pi_k} n_j^2 (\bar{Y}_j - \bar{Y})^2 \).

4. Drop variables: Tentatively drop each variable in S_b and recalculate the I-score with one variable less. Then drop the one that gives the highest I-score. Call this new subset S_b', which has one variable less than S_b.

5. Return set: Continue the next round of dropping on S_b' until only one variable is left. Keep the subset that yields the highest I-score in the whole dropping process. Refer to this subset as the return set R_b. Keep it for future use.

If no variable in the initial subset has influence on Y, then the values of I will not change much in the dropping process [...]. On the other hand, when influential variables are included in the subset, the I-score will increase (decrease) rapidly before (after) reaching the maximum [...].
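Below is a sketch implementation of the BDA as described in the steps above, using the unnormalized I-score of step 3. The toy data-generating model and all names are our own illustration; in real use the initial subsets S_b would be drawn repeatedly (b = 1, ..., B) from the full variable space.

```python
# Sketch of the backward dropping algorithm (BDA) described above.
import numpy as np

def raw_i(X, y, cols):
    # unnormalized I-score of step 3: sum over cells of n_j^2 (Ybar_j - Ybar)^2
    y_bar = y.mean()
    cells = {}
    for xi, yi in zip(map(tuple, X[:, cols]), y):
        cells.setdefault(xi, []).append(yi)
    return sum(len(v) ** 2 * (np.mean(v) - y_bar) ** 2 for v in cells.values())

def bda(X, y, cols):
    """Backward-drop from the initial subset `cols`; return the subset that
    achieved the highest I-score anywhere along the dropping path."""
    cols = list(cols)
    best_cols, best_score = list(cols), raw_i(X, y, cols)
    while len(cols) > 1:
        # tentatively drop each variable; commit the drop with the best score
        scores = [raw_i(X, y, [c for c in cols if c != d]) for d in cols]
        cols.remove(cols[int(np.argmax(scores))])
        if max(scores) > best_score:
            best_score, best_cols = max(scores), list(cols)
    return best_cols, best_score

rng = np.random.default_rng(9)
n = 300
X = rng.integers(0, 3, size=(n, 6))        # variables 0..5; only 0, 1 matter
y = (rng.uniform(size=n) <
     np.where(X[:, 0] + X[:, 1] >= 3, 0.8, 0.2)).astype(int)
print(bda(X, y, cols=[0, 1, 2, 3, 4, 5]))  # return set: ideally [0, 1]
```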

References

[1] Predicting the influence of common variants. Nat. Genet. 45, 339 (2013).
[2] Cantor, R. M., Lange, K. & Sinsheimer, J. S. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet. 86, 6–22 (2010).
[3] Clayton, D. G. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 5, e1000540 (2009).
[4] Chernoff, H., Lo, S.-H. & Zheng, T. Discovering influential variables: a method of partitions. Ann. Appl. Stat. 3(4), 1335–1369 (2009).
[5] de los Campos, G., Gianola, D. & Allison, D. B. Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat. Rev. Genet. 11, 880–886 (2010).
[6] Fisher, R. A. Statistical Methods for Research Workers, 14th ed. (Hafner, Darien, Conn., 1970).
[7] Gransbo, K. et al. Chromosome 9p21 genetic variation explains 13% of cardiovascular disease incidence but does not improve risk prediction. J. Intern. Med. 274, 233–240 (2013).
[8] Gyenesei, A. et al. High-throughput analysis of epistasis in genome-wide association studies with BiForce. Bioinformatics 28(15), 1957–1964 (2012).
[9] Hemani, G. et al. EpiGPU: exhaustive pairwise epistasis scans parallelized on consumer level graphics cards. Bioinformatics 27(11), 1462–1465 (2011).
[10] Hu, X. et al. SHEsisEpi, a GPU-enhanced genome-wide SNP-SNP interaction scanning algorithm, efficiently reveals the risk genetic epistasis in bipolar disorder. Cell Res. 20(7), 854–857 (2010).
[11] Jakobsdottir, J., Gorin, M. B., Conley, Y. P., Ferrell, R. E. & Weeks, D. E. Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet. 5, e1000337 (2009).
[12] James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning (Springer, New York, 2013), p. 241.
[13] Janssens, A. C. & van Duijn, C. M. Genome-based prediction of common diseases: advances and prospects. Hum. Mol. Genet. 17(R2), R166–R173 (2008).
[14] Kwon, M.-S. et al. cuGWAM: genome-wide association multifactor dimensionality reduction using CUDA-enabled high-performance graphics processing unit. Int. J. Data Min. Bioinform. 6(5), 471–481 (2012).
[15] Liu, Y. et al. Genome-wide interaction-based association analysis identified multiple new susceptibility loci for common diseases. PLoS Genet. 7(3), e1001338 (2011).
[16] Lo, A., Chernoff, H., Zheng, T. & Lo, S.-H. Why are significant variables not automatically good predictors? Proc. Natl. Acad. Sci. USA (forthcoming).
[17] Schupbach, T. et al. FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics 26(11), 1468–1469 (2010).
[18] Wan, X. et al. BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet. 87(3), 325–340 (2010).
[19] Wan, X. et al. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics 26(1), 30–37 (2010).
[20] Wang, H., Lo, S.-H., Zheng, T. & Hu, I. Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21), 2834–2842 (2012). doi:10.1093/bioinformatics/bts531.
[21] Wang, Y. et al. AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm. BMC Res. Notes 3, 117 (2010).
[22] Ware, J. H. The limitations of risk factors as prognostic tools. N. Engl. J. Med. 355(25), 2615–2617 (2006).
[23] Witte, J. S. Genome-wide association studies and beyond. Annu. Rev. Public Health 31, 9–20 (2010).
[24] Yang, C. et al. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics 25(4), 504–511 (2009).
[25] Yoshida, M. & Koike, A. SNPInterForest: a new method for detecting epistatic interactions. BMC Bioinformatics 12, 469 (2011).
[26] Yung, L. S. et al. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case-control studies. Bioinformatics 27(9), 1309–1310 (2011).
[27] Zheng, S. L. et al. Cumulative association of five genetic variants with prostate cancer. Eur. Urol. 54, 460–461 (2008).
[28] Zhang, Y. & Liu, J. Bayesian inference of epistatic interactions in case-control studies. Nat. Genet. 39(9), 1167–1173 (2007).
[29] Zhu, Z. et al. Development of GMDR-GPU for gene-gene interaction analysis and its application to WTCCC GWAS data for type 2 diabetes. PLoS ONE 8(4), e61943 (2013).