Robust Variable and Interaction Selection for High-Dimensional

arXiv:1611.08649v1 [stat.ME] 26 Nov 2016

Classification via Logistic Regression Yang Li, Jun S. Liu Department of Statistics, Harvard University November 29, 2016

Abstract Discriminant analysis and logistic regression are fundamental tools for classification problems. Quadratic discriminant analysis has the ability to exploit interaction effects of predictors, but the selection of interaction terms is non-trivial and the Gaussian assumption is often too restrictive for many real problems. Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms, where in the forward stage, a stepwise procedure is conducted to screen for important predictors with both main and interaction effects, and in the backward stage SODA remove insignificant terms so as to optimize the extended BIC (EBIC) criterion. Compared with existing methods on quadratic discriminant analysis variable selection (e.g., [Murphy et al., 2010], [Zhang and Wang, 2011] and [Maugis et al., 2011]), SODA can deal with high-dimensional data with the number of predictors much larger than the sample size and does not require the joint normality assumption on predictors, leading to much enhanced robustness. Theoretical analysis establishes the consistency of SODA under high-dimensional setting. Empirical performance of SODA is assessed on both simulated and real data and is found to be superior to all existing methods we have tested. For all the three real datasets we have studied, SODA selected more parsimonious models achieving higher classification accuracies compared to other tested methods.

1

Introduction

Nowadays the amount of data generated everyday has been dramatically increasing. Machine learning algorithms are routinely used to extract information and make predictions from the data. Classification, also known as "supervised learning", is a foundational building block of statistical machine learning, of which the goal is to train a model from data to predict future observations to one of the known classes. Applications of statistical classification methods include but are not limited to cancer diagnosis [Tibshirani et al., 2002], text categorization [Joachims, 1998], computer vision [Phillips, 1998] and bioinformatics problems such as protein-protein interaction prediction [Chowdhary et al., 2009]. Many classification methods have been developed, including logistic re-

1

gression, naive Bayes, K-nearest-neighbors, support vector machines [Boser et al., 1992], and random forests [Breiman, 2001]. As important players in classification methods, linear and quadratic discriminant analysis (LDA and QDA) [Anderson, 1958] are widely used. One advantage of QDA is its ability to exploit interaction effects of predictors. In recent years, there has been a significant surge of interest in selecting interaction effects for regression or classification problems (e.g. [Simon and Tibshirani, 2012, Bien et al., 2013, Jiang et al., 2014, Fan et al., 2015]), which both improves the classification accuracy and is of scientific interest. With rapid technical advances in data collection, it has been more common that the number of predictors is in a higher order of the number of observations, which is also known as the “large p small n” problem. For example, in gene expression microarray analysis, usually n is in hundreds of samples, and p is in thousands of genes [Efron, 2010], and in a typical genome-wide association study, n is in the order of a few thousands of subjects, and p is from some thousands to several millions SNP markers [Waldmann et al., 2013]. Large p small n is a serious problem to quadratic discriminant analysis, as the number of interaction terms is of order O p2 . Vanilla LDA or QDA 

are infeasible when p > n since the sample covariance matrices are consequently singular. Even in low-dimensional scenario, including unrelated predictors could significantly impair the classification accuracy, as they bring in extra noise and increase the number of parameters to be estimated. A number of variable selection methods have been developed for high-dimensional classification problems. Many previous work focused on imposing regularization on the LDA model for both sparsity of model and stability of parameter estimation. For example, [Shao et al., 2011] proposed a sparse LDA method that takes sparse estimates of the covariance matrix of mean vectors. [Witten and Tibshirani, 2011] proposed to use fused Lasso to penalize discriminant vectors in Fisher’s discriminant problem. [Cai and Liu, 2011] proposed to estimate the product of precision matrix and the difference between two mean vectors directly through a constrained L1 minimization. [Han et al., 2013] relaxed the normal assumption of LDA to the larger Gaussian Copula models. For more examples of recent development of high-dimensional LDA, see [Guo et al., 2007] [Fan and Fan, 2008], [Clemmensen et al., 2011], [Mai et al., 2012] and [Fan et al., 2013]. Aforementioned methods all work for LDA models with only linear effects. In many real data sets, however, interaction effects may be significant and scientifically interesting. Ignoring interaction effects in these problems may lead to inferior prediction accuracy. To illustrate, consider a two-class Gaussian classification problem with both linear and interaction effects. For simplicity, throughout this article we use “interactions” to denote all second-order effects, including both twoway interactions Xi Xj with i 6= j and quadratic terms such as Xi2 . Let X = (X1 , X2 , X3 ) denote the set of predictors, and the true Bayes rule is to classify an observation to class 1 if Q (X) > 0, and to class 0 otherwise, where Q (X) = 1.577 + X1 − 0.6X12 − 0.6X32 − 0.7X1 X2 − 0.7X2 X3

(1)

We simulated 100 independent datasets with 100 samples in each class. Figure 1 shows the scatter2

plot of (X1 , X2 ) for one simulated dataset. For each simulated dataset, we applied LDA, logistic regression and QDA to train classifiers, and the classification accuracy is calculated by 1000 extra samples from the true model. The results are shown in Table 1. As expected, both linear methods, LDA and logistic regression, have poor prediction power, but QDA with ability to exploit interaction effects improved the classification accuracy dramatically. We further test the classification accuracy of QDA with not only true predictors but also k noise (unrelated) predictors (with k ranging from 1 to 50) drawn independently from N (0, 1). Figure 1 shows the classification error rate of QDA increases dramatically as the number of noise predictors

40

Class 1 Class 2

30



25

2 0







● ●●

●●

● ●●

●●



● ● ● ● ● ● ●





10

−4

−2

X2

●●

20

● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ●●●●● ●● ● ●●●● ●● ● ● ● ●●● ● ●● ● ●● ●● ● ● ● ●● ●● ● ● ●●● ●● ●● ●● ● ●● ●● ●●● ● ● ●● ● ● ● ●●● ●●● ● ● ● ● ● ● ●

35



15



Classification error rate (%)

4

increases.

−6

−4

−2

0

2

4

6

0

X1

10

20

30

40

50

Num of redundant predictors

Figure 1: Simulated two-class Gaussian classification problem. Let µ1 , µ2 , Ω1 , Ω2 denote mean vectors and precision matrices of the two classes. We set µ1 = (0.5, 0, 0), µ2 = (−0.5, 0, 0), Ω1 = I3 − Ω, and Ω2 = I3 + Ω, where I3 is a 3 × 3 identify matrix, Ω1,1 = Ω3,3 = −0.60 and Ω1,2 = Ω2,1 = Ω2,3 = Ω3,2 = −0.35. (Left) Scatterplot on (X1 , X2 ) for one simulated dataset. Contours illustrate the shape of multivariate Gaussian distributions. (Right) QDA classification error rate versus number of noise predictors added.

Method

LDA

Logistic regression

QDA

QDA with 50 noise predictors

Test error %

34.81 (1.47)

34.88 (1.38)

15.65 (0.84)

37.33 (1.78)

Table 1: Means and standard deviations (in parentheses) of test error rates for different classification methods over 100 replications. QDA classification error rate is as high as 37.33% when additional 50 noise predictors are included. As shown above, methods with ability to conduct simultaneous main effects and interaction effects selection are of great interest. However, a direct application of Lasso on logistic regression with all second-order terms is not feasible for moderately large p (e.g. p ≥ 1000). To cope with this difficulty, [Fan et al., 2015] proposed innovated interaction screening (IIS) based on transform3

ing the original p-dimensional predictor vector by multiplying the estimated precision matrix for each class. IIS first reduces the number of predictors to a smaller order of p, and then identifies both important main effects and interactions using elastic net penalty [Zou and Hastie, 2005]. The performance of the method, IIS-SQDA, heavily depends on the estimation of the p × p dimensional precision matrix, which is usually a hard problem under high-dimensional setting. Recently, [Murphy et al., 2010], [Zhang and Wang, 2011] and [Maugis et al., 2011] proposed stepwise procedures for QDA variable selection that are similar to each other. Their methods assume that related and unrelated predictors jointly follow a multivariate normal distribution. These methods were shown to be consistent under the normality assumption, however in practice the performance can be much compromised when this assumption is violated. Many false positive predictors can be selected when predictors follow heavier-tailed distributions or they are correlated in non-linear manners (see Section 4.2). In this article, we consider the logistic regression instead of QDA model to avoid the joint normality assumption on predictors. In particular, we study the logistic regression model with secondorder terms, which covers quadratic discriminant analysis as a special case. Since predictors are conditioned on, our method is robust to misspecified generative distributions on predictors. To deal with high-dimensional predictors, we penalize the model likelihood with the extended Bayesian information criterion (EBIC) introduced in [Chen and Chen, 2008]. We propose the method SODA (Stepwise cOnditional likelihood variable selection for Discriminant Analysis), which starts with a forward stepwise screening procedure to add in predictors with main or interaction effects so as to reduce the number of candidate predictors to a smaller order of n, and finishes with a backward stepwise elimination procedure for further narrowing down individual main and interaction effects. Under some regularity conditions, we establish the screening consistency of the forward procedure and establish the individual term selection consistency of the backward procedure under high-dimensional setting. This article is organized as follows. In Section 2, we briefly review QDA and propose our procedure SODA for variable and interaction selection. Theoretical properties of SODA are studied in Section 3. In Section 4, we evaluate the performance of SODA and other methods by simulation studies. We further apply SODA to a couple of real datasets to evaluate its empirical performance in Section 5. A discussion section ends the paper in Section 6. Proofs of theorems are provided in the Appendix.

2 2.1

SODA for variable and interaction selection QDA and existing variable selection methods

We consider the K-class classification problem with quadratic discriminant analysis. Let Y ∈ {1, . . . , K} denote the class index and X = (X1 , X2 , . . . , Xp )T be a vector of p predictors. Let {(xi , yi ) : i = 1, . . . , n} denote n independent observations on (X, Y ). We want to build a model ˜ . QDA asfor predicting a new observation’s class membership y˜ based on its predictor values x 4

sumes true predictors in each class follow multivariate normal distribution, X | Y = k, µk , Σk ∼ N (µk , Σk ) ,

k = 1, . . . , K.

(2)

where µ = (µ1 , . . . , µK ) and Σ = (Σ1 , . . . , ΣK ) denote the mean vectors and covariance matrices for the K classes and N (·, ·) denotes the multivariate normal distribution. Incorporating the prior probability of Y being in each class, denoted as π = (π1 , . . . , πK ), we can calculate the posterior distribution of Y , P (Y = k | X, π, µ, Σ) =

πk φ (X | µk , Σk ) , l=1 πl φ (X | µl , Σl )

PK

˜ , the Bayes rule where φ denotes the multivariate normal density function. For a new observation x based on 0-1 loss is to predict the corresponding Y as the class with the highest posterior probability, ˜ , π, µ, Σ) . Yˆ = arg max P (Y = k | x

(3)

k

In general, the classification boundaries are quadratic functions of x ˜. When covariance matrices are assumed to be equal, i.e. Σ1 = · · · = ΣK , classification boundaries are linear in x ˜, and QDA reduces to LDA. In real applications, parameters π, µ and Σs are estimated from training samples by the maximum likelihood estimation (MLE). When p is large, usually only a small proportion of predictors have predictive power on Y . Recently [Murphy et al., 2010], [Zhang and Wang, 2011] and [Maugis et al., 2011] proposed three similar variable selection procedure based on the BIC for QDA model. Let S index a subset of predictors, S c = {1, . . . , p} \S and, let XS denote predictors in S. Let P denote the set of truly relevant predictors. All of the three methods make following two assumptions: (A1) Given predictors in P, unrelated predictors in P c do not contribute to the prediction of Y . In other words, P (Y | XP , XP c ) = P (Y | XP ) .

(4)

(A2) All of the related and unrelated predictors follow a jointly multivariate normal distribution in each class. Combining assumptions (A1) and (A2) leads to the following generative model, XP | Y = k ∼ N (µk , Σk ) , 



XP c | XP , Y = k ∼ N a + B T XP , Σ0 ,

(5)

for some vector a and matrix B. Let ˆlG (S) denote the maximum log-likelihood for the model (5). The BIC proposed in [Zhang and Wang, 2011, Murphy et al., 2010, Maugis et al., 2011], denoted as BICG , is defined as BICG (S) = −2 ˆlG (S) + df (S) log (n) ,

5

(6)

where df (S) denotes the number of parameters for the model with selected predictors in S. It has been shown that the set selected to minimize the BIC is a consistent estimator of P. Assumption (A2) appears rather strong and may lead to false positive selections if certain noise variables’ distributions deviate from the posited Gaussian structure. Indeed, examples in Sections 4 and 5 show that if unrelated predictors do not nicely follow model (5), there will easily be many false positive variable selections and the accuracy of the trained classifier will be compromised. Moreover, these methods do not scale well to high- dimensional cases when p is much larger than n.

2.2

Conditional likelihood model and its extended BIC

We consider a variable selection method without an explicit distributional assumption on X. We make the model assumption that the conditional distribution of Y given X is in the following parametric form, p (Y = k | X, θ) =

exp [δk (X | θ)] , 1 + l=1 exp [δl (X | θ)] PK−1

k = 1, . . . , K,

(7)

where δk (X | θ) is the discriminant function for class k and θ denotes the vector of parameters. Choose class K as the baseline class, then δK (X | θ) = 0 and δk (X | θ) = αk + β Tk X + XT Ak X,

for k = 1, . . . , K − 1.

Since X is conditioned on, we do not need to model the distribution of XP or XP c , which much improves the robustness of the variable selection procedure. Special cases of this model include but are not limited to: • (Multinomial) logistic regression (with Ak = 0 for all k) • Linear/quadratic discriminant analysis, where p (XP | Y ) is multivariate normal distribution • Discriminant analyses where p (XP | Y ) is in the multivariate exponential family, 



p (XP = x | Y = k, η) = h (x) g (η k ) exp η Tk x . To see the connection between QDA and model (7), it is noted that for QDA models,  1 −1 T log |Σk | − log |ΣK | + µTk Σ−1 µ − µ Σ µ , k K K K k 2 −1 T = µTk Σ−1 k − µK ΣK ,  1 −1 = − Σ−1 for k = 1, . . . , K − 1. k − ΣK , 2

αk = log (πk /πK ) − β Tk Ak

6

Let M and I denote subsets of main effect and interaction pairs, respectively, and let M0 and I0 denote corresponding true sets, defined as M0 = {j : ∃ k s.t. βk,j 6= 0}

and I0 = {(i, j) : ∃k s.t. Ak,i,j 6= 0} ,

with k indicating the class label. Let S = M ∪ I and A = M0 ∪ I0 denote the true set. The true set of related predictors, P, can be derived from A as P = M0 ∪ {j : ∃ i s.t. (i, j) ∈ I0 } . Inference of P is implied by the inference of A. The main objective is to infer individual terms in A, and we are especially interested in interaction terms in I0 . Let θ S denote a parameter vector where coefficients are 0 for terms not in S, and θ k,S denote the corresponding coefficients for class k. For a dataset {(xi , yi ) : i = 1, . . . , n}, the log-likelihood for θ S is denoted as ln (θ S ). Let Z be the augmented version of X, containing intercept 1, main effect and interaction terms of X, and let zi be the i-th observation of Z. Then ln (θ S ) takes the form of a logistic regression model in Z: ln (θ S ) =

n X

(

θ Tyi ,S zi − log 1 +

i=1

K−1 X



exp θ Tl,S zi



!)

.

l=1

˜ S denote the MLE of θ S . By Lemma 2 in the appendix, with high probability ln (θ S ) is Let θ ˜ S can be obtained by Newton-Raphson algorithm. Let θ 0 denote the true parameter convex and θ ˜ S for S with S ⊃ A. vector. Theorem 1 illustrates the consistency of θ Theorem 1. Under conditions C1 ~ C4 in section 3, as n → ∞, max

S⊃A, |S|≤Q

 

˜

θ S − θ 0 = Op n−1/2+ξ , 2

(8)

where 0 < ξ < 1/2 and Q ≥ |A| are any positive constants independent of n. In high-dimensional setting, classical BIC is too liberal and tends to select many false positives [Broman and Speed, 2002]. [Chen and Chen, 2008] proposed extended BIC (EBIC) and showed it to be consistent for linear regression models under high-dimensional setting. The EBIC for set S is specified as 



˜ S + |S| log n + 2γ |S| log p, EBICγ (S) = −2 ln θ

(9)

where |S| is the size of set S, and γ is a tunning parameter. Selection of γ depends on the relative order of n and p, and heuristics of determining γ is discussed in section 2.4. Let S˜EBIC be the selected set of predictors minimizing the EBIC, let Q be any positive constant greater than constant p0 in condition (C1) in section 3, then S˜EBIC = arg min EBICγ (S) , S: |S| ≤ Q

7

(10)

where |S| denotes the size of set S. The asymptotic property of S˜EBIC is shown by the following theorem. Theorem 2. (EBIC criterion consistency) Under conditions C1 ~ C4 in section 3, S˜EBIC is a consistent estimator of A, 



Pr S˜EBIC = A → 1, as n → ∞, for any γ > 2 − 1/ (2κ). By treating our model as a logistic regression on (Z, Y ), Theorem 2 follows directly from the asymptotic consistency of EBIC for generalized linear models (GLM), which was proved in [Chen and Chen, 2012] and [Foygel and Drton, 2011] in both fixed and random design contexts. We thus omit its proof. Different from [Chen and Chen, 2012] and [Foygel and Drton, 2011], here we require γ > 2 − 1/ (2κ) instead of γ > 1 − 1/ (2κ) to penalize additional model flexibility caused by the inclusion of interaction terms.

2.3

SODA: a stepwise variable and interaction selection procedure

In practice it is infeasible to enumerate all possible S to find the one that minimizes the EBIC. For a closely related generalized linear model variable selection problem, [Chen and Chen, 2012] and [Foygel and Drton, 2011] used Lasso [Tibshirani, 1996] to obtain a solution path of predictor sets, and chose the optimal set with the lowest EBIC. However, this method also fails under the highdimensional setting for QDA, in which there are O p2 candidate interaction terms. Furthermore, 

Lasso’s consistency for logistic regression requires the incoherence condition [Ravikumar et al., 2010] that can be easily violated due to the correlation between interaction terms and their corresponding main effect terms. Thus, it is highly likely that the true set A is not even on the solution path of Lasso. The IIS procedure proposed in [Fan et al., 2015] requires the estimation of p × p precision matrix, which is by itself a computationally intensive challenging problem. If the related and unrelated predictors are moderately correlated, IIS’s marginal screening strategy has difficulties in filtering out noise predictors. To overcome limitations of existing methods, we propose here the stepwise procedure SODA, consisting of three stages: (1) forward main effect screening; (2) forward interaction screening and (3) backward elimination. 1. Forward main effect screening: Let Mt denote the selected set of main effects at step t. SODA starts with M1 = ∅ and iterate the operations below until termination. (a) For each predictor j ∈ / Mt , create a new candidate set Mt,j = Mt ∪ {j}. (b) Find the predictor j with lowest EBICγ (Mt,j ). If EBICγ (Mt,j ) < EBICγ (Mt ), ˜ F and go to 2. continue with Mt+1 = Mt,j , otherwise terminate with M 2. Forward interaction screening: Let Ct denote the selected set of predictors at step t, and ˜ F ∪ Ct ∪ (Ct × Ct ) denote the set of terms induced by Ct . SODA starts with C1 = ∅ St = M and iterate the operations below until termination. 8

˜ F ∪ Ct,j ∪ (a) For each j ∈ / Ct , create a candidate set Ct,j = Ct ∪ {j} and let St,j = M (Ct,j × Ct,j ). (b) Find the predictor j with lowest EBICγ (St,j ). If EBICγ (St,j ) < EBICγ (St ), continue with Ct+1 = Ct,j , otherwise terminate with C˜F and go to 3. 3. Backward elimination: Let St denote the selected set of individual terms at step t of back  ˜ F ∪ C˜F ∪ C˜F × C˜F and iterate the operations below ward stage. SODA starts with S1 = M until termination. (a) For each main or interaction term j ∈ St (e.g. j = 1 or j = (1, 2)), create a candidate set St,j = St \ {j}. (b) Find term j with lowest EBICγ (St,j ). If EBICγ (St,j ) < EBICγ (St ), remove term j, otherwise terminate and retain set S˜ = St . SODA first runs the forward main effect screening and forward interaction effect screening until no new predictor can be added to reduce the EBIC score, then the backward elimination stage follows until no term should be eliminated. Stepwise forward procedure is widely used on variable screening for linear regressions (e.g. [Wang, 2009] and [Wasserman and Roeder, 2009]). The goal for forward screening stage is to add all related terms into the model at the expense of adding a small number of unrelated ones. The backward stage then removes those false positives. The forward screening does not assume the joint normality among predictors and thus is more robust than IIS. It also does not need to estimate the p × p precision matrix thus can be much faster than IIS. An important feature of SODA’s forward screening is that in each step we only need to evaluate the EBIC for adding each of the p predictors, therefore its time complexity is O (p), which is much less than the number of interaction terms O p2 . Forward screening stage usually reduces the 

number of predictors to a much smaller number than p. One feature of SODA’s backward stage is the exclusion of individual terms instead of all terms relate to one predictor. For example in Eq (1), true terms include X1 X2 and X2 X3 therefore X2 is a true predictor, but X22 is not a related term in A. Including X22 in the model increases the variance of estimation thus is sub-optimal. In other words, SODA selects individual main and interaction effect terms. Therefore SODA is adaptive in the sense that it automatically chooses between linear model with only main effect terms or more complicated quadratic classification model. We note that only the forward interaction screening stage is required for achieving the screening consistency. However, the number of parameters and the EBIC penalization in the forward interaction screening stage increases quadratically with the cardinality of Ct , therefore it can be hard to add predictors with only weak main effects. To optimize the empirical performance, we add to SODA the forward main effect screening stage to identify the predictors with only weak main effects. Asymptotic properties of SODA will be studied in Section 3 and effectiveness of SODA will be shown empirically in Sections 4 and 5.

9

2.4

Implementation issues: tuning parameter and screening depth

In the empirical analysis, we need to choose the tuning parameter γ of EBIC. Sections 3 characterizes asymptotic properties of the EBIC and SODA, and provides some guidelines for choosing γ. However, these theoretical results are not directly usable because the consistency result is only valid in an asymptotic sense. In practice, we propose to use a 10-fold cross-validation (CV) procedure for selecting γ from {0, 0.5, 1.0}. For simulation study and real data analysis in Section 4 and 5, to make SODA more easily comparable with Lasso with EBIC studied in [Chen and Chen, 2012], we fix γ = 0.5 which is also used in [Chen and Chen, 2012]. The forward screening stage terminates if the EBIC of all candidate models are larger than the EBIC of the current model. Therefore, the screening depth of the forward stage is determined by the EBIC. In Theorem 3, we show that this procedure is asymptotically screening consistent, namely the true related terms will be all included by the end of the forward stage. Nevertheless, SODA is not sensitive to adding more terms in the forward stage, since those unrelated terms will be eventually eliminated in the backward stage. Moreover, missing one related term is usually more harmful than including one unrelated term. Therefore, to optimize the empirical performance, we let SODA continue the forward interaction screening stage for pf steps after the step that fails to decrease EBIC (default pf = 3).

3

Theoretical properties of SODA

To study theoretical properties of SODA procedure, we assume following regularity conditions: (C1) The divergence speed of p is bounded above by p ≤ nκ for some κ > 0, and the size of the true predictor set P is bounded as |P| ≤ p0 for a fixed integer p0 . (C2) Magnitudes of true coefficients in θ A are bounded above and below by constants, namely there exist positive constants θmax > θmin > 0 such that θmin ≤ min {|θj | : j ∈ A} ≤ max {|θj | : j ∈ A} ≤ θmax . (C3) Let Z be the augmented version of X, containing intercept 1, as well as all first-order and second-order terms of X. Each Zj ∈ Z is sub-exponential, i.e. there are positive constants C1 and C2 such that, Pr (|Zj − E [Zj ]| > t) ≤ C1 exp (−C2 t)

for all t > 0.

(C4) Let Cov (Z) denote the covariance matrix of Z. There exist constants 0 < τ1 < τ2 < ∞ such that τ1 ≤ λmin (Cov (Z)) < λmax (Cov (Z)) ≤ τ2 , where λmin (·) and λmax (·) denote the smallest and largest eigenvalues of a matrix. We first establish the screening consistency of the forward screening stage. We can show that the forward interaction screening step is already screening consistent. To proceed, we need to define

10

the following concept to study the stepwise detectability of true predictors in P. Let θ ∗S denote the population version of the risk minimizer, θ ∗S = arg min E [− log p (Y | X, θ S )] , θS

∗ where the expectation is over the joint distribution of (Y, X). Let vector θ j∗ S be parameters in θ S

associated with predictor Xj . The stepwise detectable condition is necessary for forward screening consistent. Definition 1. (Stepwise detectable condition) A set of predictors C1 is stepwise detectable given C2 if C1 ∩ C2 = ∅, and for any set C satisfying C ⊃ C2 and C 6⊃ C1 , there exist constants θmax > θmin > 0, such that



j∗ θmin ≤ max

θ SC∪{j} c



j∈C ∩C1

≤ θmax ,

where SC∪{j} = n Mj ∪ I j with Mj = C ∪ {j} and I j = Mj × Mj , and k·k∞ denotes the L∞ o m−1 norm. Let Tm = j : predictor j is stepwise detectable given ∪i=0 Ti−1 and T0 = ∅. The set of true predictors P is said to be stepwise detectable if j ∈ ∪∞ i=1 Ti for all j ∈ P. In other words, if the current selection C contains C2 , then there always exist detectable predictors conditioning on currently selected variables until we include all the predictors indexed by C1 . A true predictor j ∈ P is not stepwise detectable either because it perfectly correlates with some other terms, or its effects can only be detected conditioning on some other stepwise undetectable terms. We give an example to illustrate the scenarios when stepwise detectable condition may or may not hold. Suppose there are two true jointly normal relevant predictors X1 and X2 with means µ1 and µ2 , and there is only one interaction term X1 X2 in model (7), i.e. A = {(1, 2)}. P is not stepwise detectable if both µ1 = 0 and µ2 = 0, since Cov (X1 X2 , X1 ) = 0 and Cov (X1 X2 , X2 ) = 0. Starting from empty set ∅, the forward procedure will not add any of the two into the model, because there is no main effect for X1 and X2 and the interaction term X1 X2 does not correlate with marginal terms X1 and X2 . However, if either µ1 6= 0 or µ2 6= 0, P = {1, 2} is stepwise detectable.   ˜ F ∪C˜F ∪ C˜F × C˜F denote the selected set of terms at the end of forward screening Let S˜F = M stage. It is unrealistic to require S˜F = A. However, it should be demanded that S˜F ⊇ A, i.e. S˜F 



contains all related terms. We define the forward stage to be screening consistent if p S˜F ⊇ A → 1. We also don’t want the size of S˜F to be too large, otherwise forward screening loses its purpose. The screening consistency of forward stage is established by the following theorem. Theorem 3. (Forward stage screening consistency) If conditions C1 ~ C4 hold, and all predictors in P are stepwise detectable, then the forward interaction screening stage finishes in finite number of steps and is screening consistent. In particular, as n → ∞, 









Pr C˜F ≤ Q → 1, and Pr C˜F ⊇ P → 1,

11

l

m

−2 where Q = 8λ−1 1 θmin log K , λ1 is a positive constant defined in Lemma 2 in appendix, K is the

number of classes, and θmin is a positive constant defined in condition C2. In other words, asymptotically C˜F contains all predictors in P, which implies S˜F ⊇ A, and the forward stage stops in finite number of steps. We show in the following theorem two uniform bounds guaranteeing that all unrelated terms will be eliminated and all related terms will be kept in the backward stage. Once again, symbol j in the theorem indexes a main effect or interaction term. Theorem 4. (Uniform bound of EBIC in backward stage) Fix any positive constant Q > 0. Under conditions C1 ~ C4, as n → ∞, !

Pr

max

min {EBICγ (S\ {j}) − EBICγ (S)} < 0

S)A:|S|≤Q j∈S\A

and

→ 1,

(11)

!

Pr

min

min {EBICγ (S\ {j}) − EBICγ (S)} < 0

S⊃A:|S|≤Q j∈A

→ 0,

(12)

for any constant γ > Q − |A| − (2κ)−1 . Eq (11) implies that if S ) A and |S| ≤ Q, there will be at least one unrelated term j ∈ S ∩ Ac such that removing j from S leads to lower EBIC. Eq (12) implies that if S ⊃ A and |S| ≤ Q, there is no related term j ∈ A such that removing j from S leads to lower EBIC. As a summary, as n → ∞, with probability tending to 1, no related term will be eliminated, and all unrelated terms will be eliminated in the backward stage until S˜ = A. Theorem 4 requires candidate sets have finite size (|S| ≤ Q), which is proved by Theorem 3 to hold asymptotically for the starting set of the backward stage S˜F . Hence, combining Theorem 3 and 4 establishes the model selection consistency of SODA.

4 4.1

Simulation studies Logistic regression with only main effects

We first study the performance of SODA under scenarios when there are only main effects but no interactions. In this section, we compare SODA and Lasso (denoted as Lasso-Logistic) on logistic regression variable selection. We consider two simulation settings in Examples 1.1 and 1.2 respectively. In both examples, we simulated predictors X from the multivariate normal distribution with covariance matrix C. In Example 1.1 we set C to have power decay correlations between variables, and in Example 1.2 we obtained C from a real dataset. Let Q denote the Fisher information matrix of the form, 

n

o

Q ≡ E −∇2 log p (Y | X, θ 0 ) = 

12



exp θ T0 X 1 + exp



θ T0 X

T 2 XX .

(13)

Let sub-matrices Q21 = QP c ,P and Q11 = QP,P . The consistency of Lasso-Logistic requires the incoherence condition [Ravikumar et al., 2010] that there exists an α ∈ (0, 1] such that



Q21 Q−1 11



≤ 1 − α.

Let c = Q21 Q−1 11 , then c has the length as the number of unrelated predictors, and the incoherence condition requires each element of c to be smaller than 1. For each setting, we randomly generated 100 datasets from logistic regression model with n = 100, 150, . . . , 2000 observations, and applied SODA and Lasso-Logistic to each data set. We used EBICγ as criterion for both SODA and Lasso-Logistic. Lasso-Logistic fitted a solution path of selected predictors, and chose the optimal set of predictors with the lowest EBICγ . In simulation studies, we set γ = 0.5 for SODA and Lasso-Logistic. We calculated the average number of false negatives (FN) and false positives (FP), and the percentage of correct fits (PCF), which is the percentage of times that the selected set is the true set A. For SODA, any selected interaction term would also be considered as a false positive. Example 1.1. Let p = 1000, and we randomly selected 5 true predictors with coefficients β0,j ∼ Unif [0.5, 2], j ∈ A. The covariance matrix is set to have power decay correlation such that Ci,j = (0.5)|i−j| . Following a similar argument as Corollary 3 of [Zhao and Yu, 2006], it is easy to show that X satisfies the incoherence condition. The histogram of elements of c in log-scale for one simulation run is plotted in Figure 2, and it is shown that no cj ≥ 1 in c. As shown in Figure 3, SODA and Lasso-Logistic had very similar performances under this setting. Example 1.2. In this example, we also randomly selected 5 true predictors with coefficients β0,j ∼ Unif [0.5, 2], j ∈ A. The covariance matrix C was set to be the sample covariance matrix of the Michigan lung cancer dataset [Beer et al., 2002] analyzed in Section 5.2 for p = 5, 217 genes. So Example 1.2 had a much higher dimension than Example 1.1. The histogram of elements of c in log-scale for one simulation run is plotted in Figure 2. As illustrated by Figure 2, in real dataset predictors are usually highly correlated with each other, and the incoherence condition is strongly violated. As shown in Figure 3, in this case LassoLogistic had a very poor performance whereas SODA performed robustly. As n increases, the Lasso-Logistic’s total number of FPs and FNs increased, and PCF stayed at zero. In contrast, SODA had increasing probability of selecting the correct model A as n increases. In both two examples, SODA did not select no any interaction term for all simulations.

4.2

Quadratic discriminant analysis with interactions

In this section, we compare performances of several methods on variable and interaction selection for quadratic relationship models. Besides SODA, we also consider the backward procedure proposed in [Zhang and Wang, 2011], denoted as ZW, and the forward-backward procedure proposed in [Murphy et al., 2010], denoted MDR. Both methods assume joint normality between XP and XP c as specified in Eq (5). We also consider the IIS-SQDA in [Fan et al., 2015]. The screening

13

1000

500

1110 elements in c > 1

0

0 200

600

Frequency

300 100

Frequency

0 element in c > 1

0

1

2

3

4

0

log (1 + c)

1

2

3

4

log (1 + c)

Figure 2: Histogram of elements of c in one simulation run for Example 1.1 (Left) and 1.2 (Right). stage of IISQ-SQDA assumes the joint normality between XP and XP c , while the variable and interaction selection stage does not depend on the assumption. SODA does not assume the joint normality throughout the procedure. We consider five simulation settings respectively in Examples 2.1 - 2.5. The first setting follows the multivariate normal model as assumed in Eq (5) while the other settings do not. There are two classes (K = 2) and p predictors, where the first 3 predictors, X1 , X2 and X3 , are related predictors and other p − 3 predictors are unrelated but correlated with the 3 related predictors. Example 2.1 - 2.4 are low-dimensional with p = 50, and Example 2.5 studies the high-dimensional scenario with p = 1000. Mean vectors and covariance matrices for the two classes are µ1,A = (0.5, 0, 0), µ2,A = (−0.5, 0, 0), Ω1,A = I3 − Ω, and Ω2,A = I3 + Ω, where Ω1,1 = Ω3,3 = −0.60 and Ω1,2 = Ω2,1 = Ω2,3 = Ω3,2 = −0.35. The true classification rule is classifying an observation to class 1 if Q (x) > 0, where Q (x) = 1.577 + X1 − 0.6X12 − 0.6X32 − 0.7X1 X2 − 0.7X2 X3 . There are one linear effect term on X1 and four interaction effect terms on X1 , X2 and X3 , and A = {1, (1, 1) , (3, 3) , (1, 2) , (2, 3)}. X1 , X2 and X3 are related predictors, but not all of their associated terms are in A. For each simulation setting, we generated 100 datasets with 10 different sample sizes for each class, ranging linearly in log-scale from 100 to 1000: n = 100, 100 × 101/9 , 100 × 102/9 . . . , 1000. For SODA and IIS-SQDA, the set of selected predictor variables is defined as the union of all predictors appear in selected linear and interaction terms. We calculated the average number of false negatives and false positives for variable selection (VFN and VFP), main effect term selection (MFN and MFP), and interaction term selection (IFN and IFP). To benchmark the classification accuracy, we also include the full model of LDA and QDA with all predictors, and the Oracle model that contains exactly the five true terms. The average

14

SODA Lasso−Logistic

0.8

2

PCF

3

4

SODA Lasso−Logistic

0

0.0

1

FP + FN

Example 1.1 (PCF)

0.4

5

Example 1.1 (FP + FN)

200

500

1000

2000

100

1000

Sample size per class

Sample size per class

Example 1.2 (FP + FN)

Example 1.2 (PCF)

PCF

8 6 4

2000

SODA Lasso−Logistic

0.8

SODA Lasso−Logistic

10

500

0

0.0

2

FP + FN

200

0.4

100

100

200

500

1000

2000

100

Sample size per class

200

500

1000

2000

Sample size per class

Figure 3: Simulation study results for Example 1.1 (top) and 1.2 (bottom). FP: average number of false positives. FN: average number of false negatives. PCF: percentage of times of selecting correct model A. classification test error rate (TE) of models is estimated by testing the trained model on 10,000 extra observations simulated from true models. Results for these five examples are shown in Figure 4 and 5. For SODA and IIS-SQDA, we also counted the number of FNs and FPs for main effect and interaction terms that are shown in Table 2. Example 2.1. (Multivariate normal predictors) Unrelated predictors were simulated as linear combinations of Xk and Xl , with Gaussian noise. For j ∈ {3, . . . , 50}, Xj = bj,0 + bj,1 Xl + bj,2 Xk + εj , where Xk and Xl are randomly selected from X1 , X2 and X3 , coefficients bj,0 , bj,1 and bj,2 are drawn from uniform distribution U [−1, 1], and Gaussian noise term εj ∼ N (0, 2). Since the linear-Gaussian assumption in Eq (5) hold for this example, three methods ZW, MDR and SODA were able to detect all relevant predictors in P = {1, 2, 3} as n increases, and both VFN and VFP were low. However, it seems that IIS-SQDA selected too false positives, and the number VFP+VFN increased with n. Estimated classification errors for different methods are also shown in

15

Figure 4. As expected, LDA and QDA without variable selection performed the worst, and methods consistent in this scenario (ZW, MDR and SODA) achieved almost Oracle classification accuracy. IIS-SQDA performed sub-optimal due to its selection of too many irrelevant terms. The individual term selection performance of IIS-SQDA and SODA are shown in Table (2). It shows that SODA selected individual terms correctly. The phenomenon that IIS-SQDA tends to select too many false positive terms was also observed in [Fan et al., 2015], and the reason is that the variable selection consistency of elastic net procedure in IIS-SQDA requires the so-called Elastic Irrepresentable Condition [Jia and Yu, 2010] that might not hold in this case. Example 2.2. (Cauchy unrelated predictors) We consider the scenario when the distribution of unrelated predictors have heavy tails and contain outliers. In particular, each unrelated predictor Xj , j = 3, . . . , 50, was simulated as Xj ∼ Cauchy (mj ) ,

mj ∼ Unif [0, 1] .

(14)

As shown in Figure 4, ZW and MDR on average selected about 20 and about 45 false positive and negative predictors, and classification accuracies were close to 50%, which demonstrates the sensitivity of two methods to the joint normality assumption in (5). In contrast, SODA is robust to the heavy tail and outliers in unrelated predictors. IIS-SQDA worked almost as well as SODA. Example 2.3. (Quadratic unrelated predictors) We simulate scenarios where unrelated predictors are non-linearly dependent on related predictors. In particular, each unrelated predictor was simulated as a quadratic function of X1 and X2 with Gaussian noise. For j ∈ {3, . . . , 50}, Xj = bj,0 + bj,1 Xk + bj,2 Xl + bj,3 Xk2 + bj,4 Xl2 + εj ,

(15)

where Xk and Xl were randomly selected from X1 , X2 and X3 , coefficients bj,0 , ..., bj,4 are drawn from U [−1, 1], and εj ∼ N (0, 5). As shown in Figure 4, ZW and MDR methods selected on average 4 to 10 false positive and negative predictors. IIS-SQDA selected a large number of false positive main and interaction terms, as shown in Table 2, due to the correlation between related and unrelated predictors as well as correlation between main and interaction terms. Example 2.4. (Heteroskedastic model) We simulated the unrelated predictors from the following model. For j ∈ {3, . . . , 50}, Xj = bj,1 Xk + bj,2 Xl + |Xk | εj ,

(16)

where Xk and Xl were randomly selected from X1 , X2 and X3 , coefficients bj,1 and bj,3 are simulated from U [−1, 1], and εj ∼ N (0, 1). It violates the constant variance assumption of ZW and MDR as specified in Eq (5). Same as previous two examples, ZW, MDR and IIS-SQDA had sub-optimal performance. SODA selected almost no VFP and VFN, and achieved almost Oracle prediction accuracy when n ≥ 200. Example 2.5. (High dimensional scenario with p = 1000) Unrelated predictors were simulated

16

as follows. For j ∈ {4, . . . , 100}, we first drew all predictors from N (mj , 1), mj ∼ U [0, 1], and then randomly picked 40% of them and re-simulated their non-linear correlation between Xj and (Xk , Xl ) similarly as (14), (15) or (16), where k and l were randomly drawn from {1, 2, 3}. For j ∈ {101, . . . , 1000}, we first drew all predictors from N (mj , 1), and then randomly picked 40% of them and re-simulated their non-linear correlation between Xj and (Xk , Xl ) similarly as (14), (15) or (16), where k and l are indexes uniformly drawn from [101, 1000]. We changed ZW to a forward procedure since the backward procedure is not feasible when p > n. Results are shown in Figure 5 and Table 2. MDR results are not shown because it is unstable for highly correlated X matrix and usually keeps on adding new predictors until the estimation of covariance matrices become singular. Overall, SODA achieved much better performance compared to ZW and IIS-SQDA, and the classification accuracy of SODA was almost the same as the Oracle model for n > 100. To compare the computational efficiency of IIS-SQDA, ZW and SODA, we show their running times in log-scale versus n in Figure 6. IIS procedure requires the estimation of p × p precision matrix, which can be slow in high-dimensional setting. On average, IIS-SQDA took 800 minutes, ZW took 22 minutes and SODA took 4 minutes to analyze one simulated dataset with p = 1000 and n = 1000. SODA was more than 200 times faster than IIS-SQDA in this setting. SODA Example

2.1

2.2

2.3

2.4

2.5

IIS-SQDA

n

MFN

MFP

IFN

IFP

TE (%)

MFN

MFP

IFN

IFP

TE (%)

100

0.11

0.10

1.01

0.25

17.6

0.16

7.92

0.02

6.73

23.7

215

0

0

0.10

0.02

15.7

0.01

10.4

0.01

17.2

21.4

1000

0

0

0

0

15.3

0

14.2

0

90.8

19.6

100

0.06

0.05

0.83

0.03

16.3

0.08

0.29

2.17

2.20

23.7

215

0

0

0.05

0

15.6

0.05

0

0.01

0.22

15.8

1000

0

0

0

0

15.4

0

1.29

0

14.9

16.3

100

0.19

0.65

2.09

0.70

22.3

0.12

18.7

0.21

3.35

25.4

215

0.04

0.28

0.53

0.24

17.8

0.01

23.8

0.02

5.67

20.8

1000

0

0

0

0

15.4

0

30.9

0

27.3

17.7

100

0.16

0.15

1.63

0.75

19.8

0.01

9.9

0.73

38.7

35.9

215

0

0

0.19

0

16.0

0

10.3

0

65.4

32.8

1000

0

0

0

0

15.3

0

15.4

0

108

22.5

100

0.20

0.22

1.58

0.30

20.5

0.68

1.58

0.42

6.08

21.5

215

0

0

0.14

0

15.5

0.20

1.74

0.10

8.84

18.2

1000

0

0

0

0

15.4

0

3.68

0

40.7

17.3

Table 2: Results for Example 2.1 - 2.5 and n = 100, 215, 1000. n: number of observations for each class. MFP / MFN: average number of main effect false positives and negatives. IFP / IFN: average number of interaction effect false positives and negatives. TE: test prediction error rate estimated from 10,000 simulated samples.

17

LDA

QDA

ZW

MDR

IIS−SQDA

200

500

1000

0.5 0.4 0.3 0.2 0.1

60 40 20 0

VFP + VFN

100

SODA

Example 2.5 (Test error rate) Estimated test error rate

Example 2.5 (VFP + VFN)

Oracle

100

Sample size per class

200

500

1000

Sample size per class

Figure 5: Results for Example 2.5. VFP: average number of variable selection false positives. VFN: average number of variable selection false negatives.

105

Example 2.5 (Time in seconds)

103 101

102

Time (s)

104

IIS−SQDA ZW SODA

100

200

500

1000

Sample size per class

Figure 6: Mean running time in seconds for ZW, IIS-SQDA and SODA for Example 2.5.

5

Real data analysis

We applied SODA, Lasso-Logistic, MDR and IIS-SQDA on real datasets to compare their performances. We did not include ZW method due to its similarity with MDR. The classification accuracy of selected models were evaluated by 10-fold cross-validation after the variable selection. For Lasso-Logistic and SODA, we used EBIC0.5 as model selection criterion. We consider two cancer biomedical datasets analyzed in [Efron, 2009]: prostate cancer dataset and Michigan lung cancer 18

dataset. Both of two datasets are in high dimension (p > 5000). We also consider a dataset with a smaller dimension (p = 33), Ionosphere, which is publicly available in the UCI Machine Learning Repository1 .

5.1

Prostate cancer dataset

The microarray technology is widely used for measuring expression abundance of genes. There have been tremendous efforts on building classification methods to diagnose cancer patients from microarray data. In [Singh et al., 2002], researchers measured the gene expressions of 52 prostate cancer patients and 50 controls on p = 6, 033 genes. The goal is to predict whether a person has prostate cancer from the expression of those genes. [Efron, 2009] proposed an empirical Bayes approach for large-scale classification, and compared the performance of it with the shrunken centroids method proposed in [Tibshirani et al., 2002]. With different thresholds, the shrunken centroids method and the empirical Bayes report selected set of predictors truncated at different sizes. The common way of applying two methods is to obtain selected predictors with different thresholds, and pick the best set by cross-validation. We compare the performance of methods by calculating the cross-validation prediction error on selected gene sets with different sizes. The number of selected genes and the 10-fold cross-validation error (CVE) of two methods on different thresholds are shown in Table 3. The shrunken centroids method selected 377 genes at the threshold λ = 2.16 that achieved the lowest CV error, and empirical Bayes method selected the best set with 51 genes. In the solution path of Lasso-Logistic, the lowest EBIC0.5 was achieved at 133.3 with 3 genes, and the corresponding cross-validation error rate was 17%. SODA selected 6 main effects and 0 interaction with EBIC0.5 score at 93.4 and cross-validation error rate 6%. MDR method failed to converge on this dataset. In particular, MDR selects as many genes as possible until the number of selected genes was the same as the number of samples in the smaller class (50). Subsequently the estimated covariance matrix for the smaller class became singular and the procedures could not proceed. ∆BICG , defined as the difference of BICG in Eq (6) between two adjacent steps, is shown for each step in Table 3. MDR proceeds if ∆BICG < 0 and eventually selects 49 genes with cross-validation error rate 52%. We applied IIS-SQDA method by running R code provided by its authors, but for this dataset the analysis did not finish in 48 hours. The reason is as noted in [Fan et al., 2015] that IIS needs to estimate the precision matrices, which can be very slow when the number of predictors p is large.

5.2

Michigan lung cancer dataset

We next consider the Michigan lung cancer dataset that was also analyzed in [Efron, 2009]. This dataset was published in [Beer et al., 2002], in which researchers measured the expression for p = 1

http://archive.ics.uci.edu/ml/datasets/Ionosphere

19

Shrunken centroid

Empirical Bayes

MDR

Lasso-Logistic

SODA

#P

CVE

#P

CVE

∆BIC

#P

CVE

EBIC0.5

#P

CVE

EBIC0.5

#M/#I

CVE

0

0.52

1

0.34

-68

1

0.28

140.3

1

0.28

140.3

1/0

0.28

1

0.48

5

0.30

-74

2

0.28

134.2

2

0.22

133.0

2/0

0.21

4

0.41

10

0.27

-68

5

0.17

133.3

3

0.17

124.1

3/0

0.17

35

0.30

15

0.26

-57

10

0.11

141.4

4

0.16

107.8

4/0

0.15

80

0.16

20

0.21

-64

15

0.11

144.3

5

0.16

100.9

5/0

0.09

172

0.10

25

0.20

-74

20

0.14

151.3

6

0.13

93.4

6/0

0.06

377

0.09

30

0.15

-90

25

0.20

156.7

7

0.13

866

0.12

35

0.11

-108

30

0.28

151.9

8

0.11

1,931

0.23

40

0.12

-122

35

0.31

158.4

9

0.11

3,763

0.33

45

0.09

-141

40

0.41

167.9

10

0.12

6,033

0.34

51

0.09

-166

49

0.52

177.4

11

0.13

Table 3: The summary of results on the prostate cancer dataset by the five methods. The results of shrunken centroids and empirical Bayes methods are copied from Table 1 of [Efron, 2009]. For Lasso-Logistic, MDR and SODA, the selected set with lowest BIC score is highlighted in bold font. ∆BIC: For MDR method, the difference of BICG between two adjacent steps. CVE: prediction error estimated by 10-fold cross-validation. #P: number of selected predictors. #M / #I: number of selected main effect and interaction terms by SODA. 5, 217 genes for 86 lung cancer patients. Among the 86 patients, 62 are labeled as in “good status”, and 26 in “bad status”. The goal is to classify new patients into one of two statuses. Results on this dataset are summarized in Table 4. IIS-SQDA procedure also did not finish in 48 hours for this dataset. In the solution path of Lasso-Logistic, the lowest EBIC0.5 was achieved at 112.2 with 1 gene, and the corresponding cross-validation error rate was 29%. SODA selected 2 main effect and 2 interaction effect terms with EBIC0.5 score at 69.8 and cross-validation error rate 11%. Same as observed from prostate cancer dataset, SODA worked much better than LassoLogistic for finding the minimum of EBIC0.5 (69.8 vs 112.2). Comparing results of Lasso-Logistic and SODA selected models, it is obvious that interaction effects selected by SODA contribute substantially to the classification accuracy. MDR also failed to converge on this dataset. MDR selected as many genes as possible until the number of selected genes was same as the number of samples in the smaller class (26) and achieves cross-validation error rate 28%. The observation that MDR failed to converge for both of these two large p datasets illustrates the fact that the QDA variable selection methods with joint normality assumption work poorly for high-dimensional real datasets.

5.3

Ionosphere dataset

This dataset is a two-class classification problem with 351 samples and 32 predictors. Targets are “Good” and “Bad” radar returns from the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere, while “Bad” returns do not. We applied Lasso-Logistic, MDR, IIS-SQDA and SODA on this dataset. Since this dataset

20

Shrunken centroid

Empirical Bayes

MDR

Lasso-Logistic

SODA

#P

CVE

#P

CVE

∆BIC

#P

CVE

EBIC0.5

#P

CVE

EBIC0.5

#M/#I

CVE

0

0.28

5

0.41

-353

1

0.27

112.2

1

0.29

111.2

1/0

0.33

5

0.28

20

0.43

-177

2

0.26

113.9

2

0.25

104.1

2/0

0.25

11

0.29

40

0.39

-178

3

0.25

122.2

3

0.25

98.2

3/0

0.21

21

0.28

60

0.41

-165

4

0.24

121.0

4

0.20

89.6

4/0

0.17

55

0.35

80

0.40

-156

5

0.25

131.5

5

0.22

79.2

5/0

0.12

109

0.35

100

0.39

-134

8

0.24

144.5

6

0.23

260

0.37

120

0.40

-132

11

0.29

150.5

7

0.25

94.4 .. .

5/1 .. .

0.12 .. .

567

0.38

140

0.40

-131

14

0.28

158.4

8

0.22

69.8

2/2

0.11

1,173

0.40

160

0.42

-143

17

0.30

171.1

9

0.23

2,532

0.38

180

0.38

-146

20

0.27

177.6

10

0.23

5,217

0.38

200

0.40

-151

25

0.28

188.6

11

0.23

Table 4: The summary of results on the Michigan lung cancer dataset by five methods. For LassoLogistic, MDR and SODA, the selected set with lowest BIC score is highlighted in bold font. ∆BIC: For MDR method, the difference of BICG between two adjacent steps. CVE: prediction error estimated by 10-fold cross-validation. #P: number of selected predictors. #M / #I: number of selected main effect and interaction terms by SODA. only contains 32 predictors, it is feasible to run Lasso with all interaction terms. Therefore we also run Lasso-Logistic with all main effect terms and 32 × (32 + 1) /2 = 528 interaction terms, which is referred as Lasso-Logistic-2. Results are summarized in Table 5. In the solution path of Lasso-Logistic, the lowest EBIC0.5 was achieved at 302.9 with 6 predictors, and the corresponding cross-validation error rate was 14%. Lasso-Logistic-2 selected 2 main effect terms and 5 interaction terms with EBIC0.5 248.7 and CV error rate 8%. SODA selected 4 main effect and 4 interaction effect terms with EBIC0.5 score at 204.2 and cross-validation error rate 6%. Same as observed above, SODA worked better than Lasso-Logistic for minimizing EBIC0.5 (204.2 v.s. 302.9 and 248.7). MDR method selected all 32 predictors and achieves cross-validation error rate 28%. IIS-SQDA selected 10 main effect and 96 interaction terms and achieved cross-validation error rate 16%. Since MDR selected all 32 predictors, by definition MDR selected model is the full QDA model. Comparing the CVE of this full QDA model and SODA selected model, it is obvious that variable selection substantially reduces the classification error rate and is essential to real dataset analysis.

6

Concluding remarks

We study the variable and interaction selection for logistic regression with second-order terms, which covers QDA as a special case. We propose a two-stage stepwise procedure SODA that is computationally efficient and enjoys nice theoretical property. Asymptotic properties of SODA are studied under high-dimensional setting. Contrast to [Murphy et al., 2010, Zhang and Wang, 2011, Maugis et al., 2011], the consistency of SODA does not require joint normality assumption of re-

21

MDR

Lasso-Logistic

Lasso-Logistic-2

∆BIC

#P

CVE EBIC0.5

#P

CVE EBIC0.5

-326

1

0.20

343.5

2

0.19

-221

3

0.26

329.9

4

-338

5

0.25

302.9

6

-298

7

0.24

313.5

8

SODA

#M / #I

CVE

EBIC0.5

#M / #I

CVE

279.0

1/2

0.13

371.2

1/0

0.21

0.16

253.8

2/3

0.10

343.5

2/0

0.19

0.14

252.7

2/4

0.9

319.6

3/0

0.19

0.16

248.7

2/5

0.8

224.1

5/3

0.07

.. .

.. .

4/4

0.06

-242

9

0.25

312.5

10

0.15

254.4

2/6

0.8

.. .

-200

11

0.24

321.7

12

0.15

258.6

2/7

0.8

204.2

-278

15

0.29

345.3

15

0.15

267.3

2/8

0.8

-361

20

0.28

363.8

18

0.15

286.3

2 / 10

0.8

IIS-SQDA

-434

25

0.30

383.1

22

0.15

290.3

2 / 11

0.8

#M / #I

CVE

-130

32

0.28

445.4

30

0.16

312.0

2 / 13

0.8

10 / 96

0.16

Table 5: The summary of results on the Ionosphere dataset by the five methods. ∆BIC: For MDR method, the difference of BICG between two adjacent steps. CVE: prediction error estimated by 10-fold cross-validation. #P: number of selected predictors. #M / #I: number of selected main effect and interaction terms by SODA. lated and unrelated predictors. Compared to IIS in [Fan et al., 2015], SODA’s forward screening does not need to make normal assumption and does not need to estimate large precision matrices. By empirical studies, we show that SODA not only has better selection accuracy compared to other methods but is also much faster. SODA is applicable for model selection for logistic regression, linear/quadratic discriminant analysis and other discriminant analyses with generative model p (X | Y ) in multivariate exponential family. LDA and QDA complement each other in terms of the bias-variance trade-off. Given finite observations, LDA is simpler and more robust when the response Y is mostly correlated with linear effects of X. QDA has the ability to exploit interaction effects, which may contribute dramatically to classification accuracy, but has much more parameters to estimate and is more vulnerable to selected noise predictors. SODA is designed to be adaptive in the sense that it automatically chooses between LDA or QDA model and take advantages from both sides. Instead of selecting predictors, SODA selects individual main and interaction effect terms, which enables SODA to simultaneously utilize interaction terms and get rid of unrelated terms to avoid large number of parameters to estimate. The main limitation of SODA is that the stepwise detectable condition might not hold when marginal effects are very weak. However, in empirical studies we found the SODA works well and has better performance compared to other methods, which suggests this problem may not be severe in real data. Indeed, even for QDA it takes some efforts to construct mean vectors and covariance matrices that result in only interaction but no main effect. The implementation of SODA procedure is available in R package sodavis on CRAN (http://cran.us.r-project.org).

22

References [Anderson, 1958] Anderson, T. W. (1958). An introduction to multivariate statistical analysis. [Beer et al., 2002] Beer, D. G., Kardia, S. L., Huang, C.-C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin, L., Chen, G., Gharib, T. G., Thomas, D. G., et al. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature medicine, 8(8):816–824. [Bien et al., 2013] Bien, J., Taylor, J., Tibshirani, R., et al. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141. [Boser et al., 1992] Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM. [Breiman, 2001] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32. [Broman and Speed, 2002] Broman, K. W. and Speed, T. P. (2002). A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):641–656. [Cai and Liu, 2011] Cai, T. and Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496). [Chen and Chen, 2008] Chen, J. and Chen, Z. (2008). Extended bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771. [Chen and Chen, 2012] Chen, J. and Chen, Z. (2012). Extended bic for small-n-large-p sparse glm. Statistica Sinica, 22(2):555. [Chowdhary et al., 2009] Chowdhary, R., Zhang, J., and Liu, J. S. (2009). Bayesian inference of protein–protein interactions from biological literature. Bioinformatics, 25(12):1536–1542. [Clemmensen et al., 2011] Clemmensen, L., Hastie, T., Witten, D., and Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53(4). [Efron, 2009] Efron, B. (2009). Empirical bayes estimates for large-scale prediction problems. Journal of the American Statistical Association, 104(487):1015–1028. [Efron, 2010] Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press. [Fan and Fan, 2008] Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence rules. Annals of statistics, 36(6):2605. [Fan et al., 2013] Fan, Y., Jin, J., Yao, Z., et al. (2013). Optimal classification in sparse gaussian graphic model. The Annals of Statistics, 41(5):2537–2571. 23

[Fan et al., 2015] Fan, Y., Kong, Y., Li, D., Zheng, Z., et al. (2015). Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243–1272. [Foygel and Drton, 2011] Foygel, R. and Drton, M. (2011). Bayesian model choice and information criteria in sparse generalized linear models. arXiv preprint arXiv:1112.5635. [Guo et al., 2007] Guo, Y., Hastie, T., and Tibshirani, R. (2007). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8(1):86–100. [Han et al., 2013] Han, F., Zhao, T., and Liu, H. (2013). Coda: High dimensional copula discriminant analysis. The Journal of Machine Learning Research, 14(1):629–671. [Jia and Yu, 2010] Jia, J. and Yu, B. (2010). On model selection consistency of the elastic net when p » n. Technical Report 2. [Jiang et al., 2014] Jiang, B., Liu, J. S., et al. (2014). Variable selection for general index models via sliced inverse regression. The Annals of Statistics, 42(5):1751–1786. [Joachims, 1998] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Springer. [Mai et al., 2012] Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, page asr066. [Maugis et al., 2011] Maugis, C., Celeux, G., and Martin-Magniette, M.-L. (2011). Variable selection in model-based discriminant analysis. Journal of Multivariate Analysis, 102(10):1374–1387. [Murphy et al., 2010] Murphy, T. B., Dean, N., and Raftery, A. E. (2010). Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. The Annals of Applied Statistics, 4(1):396. [Phillips, 1998] Phillips, P. J. (1998). Support vector machines applied to face recognition, volume 285. Citeseer. [Ravikumar et al., 2010] Ravikumar, P., Wainwright, M. J., Lafferty, J. D., et al. (2010). Highdimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319. [Shao et al., 2011] Shao, J., Wang, Y., Deng, X., Wang, S., et al. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of statistics, 39(2):1241–1265. [Simon and Tibshirani, 2012] Simon, N. and Tibshirani, R. (2012). A permutation approach to testing interactions in many dimensions. arXiv preprint arXiv:1206.6519. [Singh et al., 2002] Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2):203–209. 24

[Tibshirani, 1996] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288. [Tibshirani et al., 2002] Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567–6572. [Waldmann et al., 2013] Waldmann, P., Mészáros, G., Gredler, B., Fuerst, C., and Sölkner, J. (2013). Evaluation of the lasso and the elastic net in genome-wide association studies. Frontiers in genetics, 4. [Wang, 2009] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512–1524. [Wasserman and Roeder, 2009] Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of statistics, 37(5A):2178. [Witten and Tibshirani, 2011] Witten, D. M. and Tibshirani, R. (2011). Penalized classification using fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772. [Zhang and Wang, 2011] Zhang, Q. and Wang, H. (2011). On bic’s selection consistency for discriminant analysis. Statistica Sinica, 21(2):731. [Zhang and Liu, 2007] Zhang, Y. and Liu, J. S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature genetics, 39(9):1167–1173. [Zhao and Yu, 2006] Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563. [Zou and Hastie, 2005] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

25

LDA

QDA

ZW

MDR

IIS−SQDA

500

1000

100

1000

50 30

200

500

0.1 0.2 0.3 0.4 0.5

Example 2.2 (Test error rate) Estimated test error rate

Example 2.2 (VFP + VFN)

1000

100

200

500

1000

Example 2.3 (VFP + VFN)

Example 2.3 (Test error rate)

500

0.4 0.3 0.2 0.1

40 30 20 10

200

0.5

Sample size per class

Estimated test error rate

Sample size per class

1000

100

200

500

1000

Example 2.4 (VFP + VFN)

Example 2.4 (Test error rate)

500

1000

0.3 0.2 0.1

Estimated test error rate

30 10

200

0.4

Sample size per class

50

Sample size per class

0

VFP + VFN

100

500

Sample size per class

0

VFP + VFN

100

200

Sample size per class

0 10

VFP + VFN

100

0.1 0.2 0.3 0.4 0.5

Estimated test error rate

30 20 10

VFP + VFN

0

200

SODA

Example 2.1 (Test error rate)

Example 2.1 (VFP + VFN)

100

Oracle

100

Sample size per class

200

500

1000

Sample size per class

Figure 4: Results for Example 2.1 - 2.4. VFP: average number of variable selection false positives. VFN: average number of variable selection false negatives. 26