ORIGINAL ARTICLE

Logistic Regression–Based Trichotomous Classification Tree and Its Application in Medical Diagnosis

Yanke Zhu, SM, Jiqian Fang, PhD

The classification tree is a valuable methodology for predictive modeling and data mining. However, existing classification trees ignore the fact that there may be a subset of individuals who cannot be well classified from the given set of predictor variables and who will therefore be classified with a higher error rate; moreover, most existing classification trees do not use combinations of variables at each step. An algorithm for a logistic regression–based trichotomous classification tree (LRTCT) is proposed that employs a trichotomous tree structure and linear combinations of predictor variables in the recursive partitioning process. Compared with the widely used classification and regression tree through applications to a series of simulated data and 2 real data sets, the LRTCT performed better in several respects and does not require excessively complicated calculations. Key words: data mining; decision tree; classification and regression tree (CART); logistic regression (LR); logistic regression–based trichotomous classification tree (LRTCT). (Med Decis Making 2016;36:973–989)

As an alternative to classical statistical methods, classification trees have been useful in the medical field for disease diagnosis and prognostic prediction. Over the past few decades, new techniques have added capabilities that far surpass those of the early classification trees. Despite this, the existing classification trees still have several problems to be solved. First, most of them, such as the classification and regression tree (CART),1,2 C4.5,3,4 QUEST,5 and GUIDE,6 are based on dichotomous tree structures (i.e., each nonterminal node is divided into 2 son nodes). When the outcome variable is binary, these trees always divide each node into 2 distinct categories, so the individuals located in the overlapping region of the 2 categories cannot be correctly classified and inevitably suffer from misclassification. Second, the current classification trees use a consistent split rule and discriminant rule to handle each node. The partition becomes finer and finer as the layers get deeper, and the nodes at the deep layers are generally very small, making sensible statistical inference on those nodes very difficult. Third, early classification trees, such as CART, CHAID,7 ID3,8 and C4.5, employ univariate splits and ignore the association among the predictor variables. Some modern classification trees, such as QUEST, CRUISE,9 and GUIDE, can partition the nodes with linear splits on subsets of predictor variables and fit linear models in the partitions, but finding such combinations of variables is not easy. To solve the above problems, an algorithm for a logistic regression–based trichotomous classification tree (LRTCT) is proposed in this article. The algorithm rationale and procedures are detailed in the second section. In the third section, a series of simulated data and 2 real data sets are introduced. In the fourth section, to demonstrate the advantages and disadvantages of the LRTCT, CART is selected as a representative among various dichotomous classification trees, and comparisons of the classification results between the LRTCT and CART based on the simulated and real data are presented. In the fifth section, the results are summarized and future research is suggested.

Received 23 January 2015 from the Department of Mathematics, College of Mathematics and Informatics, South China Agricultural University, Guangzhou, Guangdong, China (YZ), and Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-Sen University, Guangzhou, Guangdong, China (JF). Revision accepted for publication 27 October 2015. Supplementary material for this article is available on the Medical Decision Making Web site at http://mdm.sagepub.com/supplemental. Address correspondence to Jiqian Fang, Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-Sen University, #74 Zhongshan 2nd Road, Yuexiu District, Guangzhou, 510080, China; telephone: 86-20-87330671; e-mail: [email protected].

© The Author(s) 2016
Reprints and permission: http://www.sagepub.com/journalsPermissions.nav
DOI: 10.1177/0272989X15618658


METHODS

Rationale of the LRTCT Algorithm

For a classification tree algorithm, each individual in the training sample can be viewed as a point in an n-dimensional feature space of predictor variables. The algorithm grows a tree sequentially by applying splitting rules based on the n predictor variables. Each junction of the tree is a node, and a terminal node is one at the end of a branch. From a geometric perspective, the classification tree splits the n-dimensional feature space into a number of small regions. Most current classification trees are based on dichotomous tree structures, in which a given region of the feature space is split into 2 sections such that the sample points in each section are as "pure" as possible (i.e., in each section, the majority of the points share the same category). Dichotomous tree structures may misjudge the sample points located in the overlapping region of the 2 categories, affecting the overall discrimination accuracy of the classification algorithm. To overcome this problem, a trichotomous classification tree algorithm is proposed based on the ideas of Kendall10 and Fang,11 which uses a trichotomous tree structure to recursively split the data and establish the classification tree model. In other words, at each step a node splits into 3 son nodes (left, right, and middle), of which the first 2 contain "pure" sample points that can be discriminated assertively, and the middle son node contains mixed, ambiguous individuals to be discriminated in the next layer. This process is performed iteratively until the stopping rule for splitting is met. A minimal sketch of one such split is given below.
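For concreteness, the sketch below expresses a single trichotomous split in R, the language used for the computations later in the article. The function name, the data frame node, and the thresholds are illustrative assumptions, not the authors' implementation.

trichotomize <- function(node, x, a, b) {
  # node: data frame for the current node; x: name of the split variable;
  # a, b: lower and upper thresholds delimiting the overlap of the 2 categories
  list(
    left   = node[node[[x]] < a, ],                   # assertive: category 1 side
    right  = node[node[[x]] > b, ],                   # assertive: category 2 side
    middle = node[node[[x]] >= a & node[[x]] <= b, ]  # ambiguous; split again in the next layer
  )
}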


LRTCT Algorithm

The LRTCT algorithm is described below, focusing on 2 core issues: tree growth and discrimination of the middle node in the final layer.

Tree Growth

As in other classification trees, tree growth in the LRTCT is a recursive process of repeatedly splitting the training samples. The core of tree growth is to establish the splitting rule and the stopping rule.

Recursive Splitting Rule

The splitting rule involves 2 issues: first, how to select one optimal split variable from all predictor variables and, second, how to find the optimal split thresholds among the many values of the split variable. Following the idea of sequential discriminant analysis proposed by Fang,11 a single predictor variable is selected as the split variable for the root node; from the second layer on, besides a single predictor variable, a linear combination of predictors may also be considered as the split variable.

The root node is denoted by t^(0). For each variable X_j (j = 1, 2, ..., p), as shown in Figure 1, the values of X_j in the 2 categories of samples overlap; the overlap is delimited by the threshold values

a_j = \max\left( \min_{X_j \in T_1^{(0)}} X_j, \; \min_{X_j \in T_2^{(0)}} X_j \right),    (1)

b_j = \min\left( \max_{X_j \in T_1^{(0)}} X_j, \; \max_{X_j \in T_2^{(0)}} X_j \right),    (2)

where T^{(0)} = \bigcup_{i=1}^{2} T_i^{(0)} is the training sample of t^(0) and i = 1, 2 indexes the 2 categories. T^(0) can be divided into 3 subsets based on the 3 intervals of X_j: (-\infty, a_j), [a_j, b_j], and (b_j, +\infty). Let p_Lj and p_Rj denote the proportions of individuals in T^(0) that satisfy {X_j < a_j | T^(0)} and {X_j > b_j | T^(0)}, respectively, and define the discrimination ability (DA) of X_j by

\mathrm{DA}(X_j) = p_{Lj} + p_{Rj}.    (3)

Then a search is made through all X_j (j = 1, 2, ..., p) to find the optimal split variable X_{j^*}, the one with the greatest discrimination ability, that is,

\mathrm{DA}(X_{j^*}) = \max_{j = 1, 2, \ldots, p} \mathrm{DA}(X_j).    (4)

Figure 1  Diagram of the logistic regression–based trichotomous classification tree based on the optimal split variable.

The optimal split variable is denoted X^(1), and the corresponding split thresholds are denoted a^(1) and b^(1). The node t^(0) is divided into the left son node t_L^(0) = {X^(1) < a^(1) | t^(0)}, the right son node t_R^(0) = {X^(1) > b^(1) | t^(0)}, and the middle son node t_M^(0) = {a^(1) <= X^(1) <= b^(1) | t^(0)}. A sketch of this search is given below.
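A minimal R sketch of the threshold and DA computations, under the assumption that the training data are held in a data frame X of numeric predictors and a 2-level factor y; the function and object names are ours, not the paper's.

split_stats <- function(x, y) {
  x1 <- x[y == levels(y)[1]]; x2 <- x[y == levels(y)[2]]
  a  <- max(min(x1), min(x2))   # lower threshold a_j, equation (1)
  b  <- min(max(x1), max(x2))   # upper threshold b_j, equation (2)
  pL <- mean(x < a)             # proportion falling left of the overlap
  pR <- mean(x > b)             # proportion falling right of the overlap
  c(a = a, b = b, DA = pL + pR) # discrimination ability, equation (3)
}

best_split <- function(X, y) {
  stats <- sapply(X, split_stats, y = y)  # one column of (a, b, DA) per predictor
  j <- which.max(stats["DA", ])           # equation (4): greatest DA wins
  list(variable = names(X)[j], a = stats["a", j], b = stats["b", j],
       DA = stats["DA", j])
}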

In t_L^(0) and t_R^(0), the heterogeneity within each category is as low as possible, achieving the highest "purity"; these 2 nodes serve for assertive discrimination. However, t_M^(0) is the overlapping portion of the 2 categories; if t_M^(0) does not meet the given stopping rule for recursive splitting, it is denoted t^(1), and splitting continues in the next step.

For t^(1), each X_j (j = 1, ..., p, X_j != X^(1)) is paired with X^(1) as candidate independent variables in one of p - 1 runs of stepwise logistic regression, providing p - 1 combination variables denoted logit P_j (j = 1, 2, ..., p, X_j != X^(1)). Taking these p - 1 combination variables together with all single predictors X_k (k = 1, ..., p) as candidate split variables, the one with the greatest discrimination ability is selected as the optimal split variable, denoted X^(2), with corresponding split thresholds a^(2) and b^(2); the middle son node, denoted t^(2), is suspended for further discrimination. If t^(2) does not meet the stopping rule for recursive splitting, a similar recursive partitioning process continues. At each step, not only the single predictor variables but also the linear combinations of the used variables with each of the unused predictor variables (generated by logistic regression) are considered, and the one with the greatest discrimination ability is used to partition the suspended son node of the last layer, until the stopping rule for splitting is met. A sketch of the combination-variable step is given below.
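One way to realize the combination-variable step in R, assuming node is the data frame of suspended individuals (first column y), used holds the already-selected variables, and step() stands in for the stepwise logistic regression; this is a hedged sketch, not the authors' code.

candidates <- setdiff(names(node)[-1], used)        # unused predictors
logits <- lapply(candidates, function(xj) {
  f   <- reformulate(c(used, xj), response = "y")   # y ~ used variables + X_j
  fit <- step(glm(f, family = binomial, data = node), trace = 0)
  predict(fit, newdata = node, type = "link")       # linear predictor: logit P_j
})
# Each element of logits is a candidate split variable whose DA can be
# compared with the DA of the single predictors via best_split() above.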

Stopping Rule of Recursive Splitting

To prevent overfitting, for any node t^(k) (k >= 1), the recursive splitting stops if any one of the following conditions is met (a sketch of this check follows the list):

1. k >= K, where k is the layer of the tree and K is the maximum depth, set in advance.
2. p(t^(k)) <= minsplit, where p(t^(k)) is the proportion of individuals suspended in the middle son node t^(k) and minsplit is a constant proportion set in advance.
3. p(t_L^(k)) + p(t_R^(k)) <= minbucket, where p(t_L^(k)) and p(t_R^(k)) are the proportions of individuals in the left and right son nodes and minbucket is a constant proportion set in advance.
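A compact R expression of this stopping rule; the particular default values below are ours, purely for illustration, since the text only states that K, minsplit, and minbucket are fixed in advance.

stop_splitting <- function(k, p_mid, p_left, p_right,
                           K = 10, minsplit = 0.05, minbucket = 0.01) {
  # k: current layer; p_mid: proportion suspended in the middle son node;
  # p_left, p_right: proportions sent to the assertive son nodes
  k >= K ||                         # condition 1: maximum depth reached
    p_mid <= minsplit ||            # condition 2: too few individuals left to split
    p_left + p_right <= minbucket   # condition 3: the split is hardly assertive
}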


Classification of the Final Layer

In the final layer, a number of individuals may remain suspended in the node t^(k). The numbers of individuals belonging to the 2 categories are denoted by n_1^(k) and n_2^(k), respectively; in general, they are not necessarily equal. A logistic regression is performed on the individuals in t^(k) using the information from all predictor variables, adjusted by the proportion n_1^(k) / (n_1^(k) + n_2^(k)) as the prior probability (see Supplementary Appendix 1). Although we do not encourage making assertive classifications for the individuals in the final layer t^(k), if one insists on doing so, a decision can be made according to the greater probability.

Treatment of Missing Values

If an individual is missing the value of the split variable, the LRTCT assigns the individual to the middle son node, to be classified in the next layer; there the missing value is treated by mean substitution.

Simulated Data and Real Data

In this article, simulated data and real data were used to compare the LRTCT and CART algorithms.

Methods to Generate Simulated Data

The response variable Y was binary, and the number of predictor variables was 20. Five hundred pairs of a training sample and a testing sample were generated under each hypothesized condition. Each sample included 2 sets, with a sample size of 400 for each.

Under the Condition of Multivariate Normal Distributions

It is assumed that the predictor variables in both sets follow multivariate normal distributions with different mean vectors and equal covariance matrices. Let all of the correlation coefficients between any 2 predictor variables be equal to r, with values set at 0.2, 0.5, and 0.8 to represent low, moderate, and high correlations, respectively. The differences in the predictor variables between the 2 sets were measured by

PE_i = \frac{m_{2i} - m_{1i}}{s_i}, \quad i = 1, 2, \ldots, 20,    (5)

where m_{1i} and m_{2i} are the means of predictor variable X_i (i = 1, 2, ..., 20) in set 1 and set 2, and s_i is the standard deviation of X_i in both sets. The values were set at 0, 0.4, 0.8, 1.2, and 1.6 for PE_1–PE_4, PE_5–PE_8, PE_9–PE_12, PE_13–PE_16, and PE_17–PE_20, respectively. As an example, for r = 0.2, the histogram and density curve of one simulated data set for X_4, X_8, X_12, X_16, and X_20 are displayed in Figure A1 of Supplementary Appendix 2. A sketch of this data-generating scheme is given below.
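Under the stated assumptions (standard deviations of 1 in both sets, so the mean shifts equal PE_i), one such training sample can be generated in a few lines of R; MASS::mvrnorm does the multivariate normal draws, and the object names are illustrative.

library(MASS)
p <- 20; n <- 400; r <- 0.2                        # one of the three correlation settings
Sigma <- matrix(r, p, p); diag(Sigma) <- 1         # equal pairwise correlations r
PE <- rep(c(0, 0.4, 0.8, 1.2, 1.6), each = 4)      # PE_1-PE_4 = 0, ..., PE_17-PE_20 = 1.6
set1 <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)  # category 1: means 0, SDs 1
set2 <- mvrnorm(n, mu = PE, Sigma = Sigma)         # category 2: means shifted by PE_i
train <- data.frame(y = factor(rep(1:2, each = n)), rbind(set1, set2))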


Under the Condition of Multivariate Nonnormal Distributions

The multivariate nonnormal distributions were constructed according to Fleishman.12 The skewness and kurtosis of each predictor variable were set at γ1 = 3.5 and γ2 = 20, and the values of r were again set at 0.2, 0.5, and 0.8 to represent low, moderate, and high correlations, respectively. The values 0, 0.2, 0.4, 0.6, and 0.8 were set for PE_1–PE_4, PE_5–PE_8, PE_9–PE_12, PE_13–PE_16, and PE_17–PE_20, respectively. As a further example, for r = 0.2, the histogram and density curve of one simulated data set for X_4, X_8, X_12, X_16, and X_20 are displayed in Figure A2 of Supplementary Appendix 2. A sketch of Fleishman's construction is given below.
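Fleishman's power method maps a standard normal Z to a + bZ + cZ^2 + dZ^3, with coefficients chosen to match the target moments. The base-R sketch below uses the standard power-method moment equations; the starting values are arbitrary, and treating γ2 = 20 as excess kurtosis is our assumption. Matching the pairwise correlation r additionally requires adjusting the correlations of the underlying normals (the Vale-Maurelli extension), which is omitted here.

fleishman_coef <- function(g1, g2) {
  # Solve Fleishman's moment equations by least squares (base R only).
  obj <- function(p) {
    b <- p[1]; c <- p[2]; d <- p[3]
    e1 <- b^2 + 6*b*d + 2*c^2 + 15*d^2 - 1                  # unit variance
    e2 <- 2*c*(b^2 + 24*b*d + 105*d^2 + 2) - g1             # skewness g1
    e3 <- 24*(b*d + c^2*(1 + b^2 + 28*b*d) +
              d^2*(12 + 48*b*d + 141*c^2 + 225*d^2)) - g2   # excess kurtosis g2
    e1^2 + e2^2 + e3^2
  }
  p <- optim(c(1, 0.2, 0.1), obj, control = list(maxit = 5000))$par
  c(a = -p[2], b = p[1], c = p[2], d = p[3])                # a = -c keeps the mean at 0
}
co <- fleishman_coef(3.5, 20)
z  <- rnorm(400)                                            # standard normal input
x  <- co["a"] + co["b"] * z + co["c"] * z^2 + co["d"] * z^3 # one nonnormal predictor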


Introduction to 2 Real Data Sets

We also used the Wisconsin Breast Cancer Data Set and the Pima Indian Data Set for comparison purposes. The former was collected by Dr. William H. Wolberg, University of Wisconsin Hospitals, Madison,13–15 and was obtained from the UCI databases website (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/). It consists of 683 complete observations on 9 predictor variables (denoted X_1, ..., X_9), with integer values ranging between 1 and 10, and a binary response variable. Of the observations, 444 are benign cases and 239 are malignant cases. The second data set was collected by the US National Institute of Diabetes and Digestive and Kidney Diseases,16 and was obtained from the website (http://www.stats.ox.ac.uk/pub/). It consists of 532 observations on 7 predictor variables X_1, ..., X_7, with 355 observations assigned the binary response "No" and 177 assigned "Yes." The descriptive statistics of the 2 real data sets are summarized in Tables A1 through A4 of Supplementary Appendix 3.

Each real data set was randomly divided into a training sample with two thirds of the observations and a testing sample with the remaining third; this randomization was repeated 500 times independently. The training samples were used to develop the classification models, and the testing samples were used to compare the results of the LRTCT and CART. The CART computations were conducted with the "rpart" package of R software; the LRTCT algorithm was implemented by the authors in R. A sketch of one evaluation run is given below.
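The sketch below shows one of the 500 evaluation runs for CART (the LRTCT code is the authors' own and is not reproduced here). It assumes a data frame dat whose column y is the binary response with levels "No" and "Yes", as in the Pima data; the seed and names are illustrative.

library(rpart)
set.seed(1)                                        # illustrative seed
idx   <- sample(nrow(dat), round(2 / 3 * nrow(dat)))
train <- dat[idx, ]                                # two thirds for model building
test  <- dat[-idx, ]                               # remaining third for testing
fit   <- rpart(y ~ ., data = train, method = "class")
pred  <- predict(fit, newdata = test, type = "class")
TPR <- mean(pred[test$y == "Yes"] == "Yes")        # true-positive rate
TNR <- mean(pred[test$y == "No"]  == "No")         # true-negative rate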

RESULTS

The accuracy of the 2 algorithms was evaluated by the mean of the true-positive rate (TPR) and the mean of the true-negative rate (TNR) throughout the 500 runs of classification. In addition, the summary receiver operating characteristic (SROC) method17,18 was used to synthetically compare the 500 pairs of TPR and TNR in an alternative way; a sketch of this summary follows.
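One common way to summarize many (FPR, TPR) pairs into a single SROC curve is the classical D-versus-S regression; whether this is the exact variant of references 17 and 18 is an assumption, and the rates are assumed to lie strictly between 0 and 1.

logit <- function(p) log(p / (1 - p))
# TPR and FPR are assumed to hold the 500 testing-sample rates.
D  <- logit(TPR) - logit(FPR)       # log diagnostic odds ratio
S  <- logit(TPR) + logit(FPR)       # threshold-like axis
ab <- coef(lm(D ~ S))               # fit D = a + b * S
fpr <- seq(0.01, 0.99, 0.01)
sroc_tpr <- plogis((ab[1] + (1 + ab[2]) * logit(fpr)) / (1 - ab[2]))  # SROC curve
# The area under the SROC curve (AUC) can then be approximated by
# numerical integration of sroc_tpr over fpr.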

The cost of a classification algorithm was evaluated by the average number of predictor variables used (C); for the tree models,

C = \sum_{t \in \tilde{T}} p(t) \, C(t),    (6)

where \tilde{T} is the collection of terminal nodes in the tree model, p(t) is the percentage of individuals covered by the terminal node t, and C(t) is the number of predictor variables used to reach the node t. A toy illustration follows.
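As a toy illustration of equation (6), assume a tree whose 3 terminal nodes cover the stated shares of individuals and were reached using 1, 2, and 3 predictor variables; the numbers are invented for the example.

p_t <- c(0.45, 0.30, 0.25)   # p(t): share of individuals in each terminal node
C_t <- c(1, 2, 3)            # C(t): predictors used on the path to each node
C   <- sum(p_t * C_t)        # equation (6): average number of predictors, here 1.8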

LOGISTIC REGRESSION–BASED TRICHOTOMOUS CLASSIFICATION TREE

Results of Simulated Data

Classification Results for Simulated Normally Distributed Data

Under the conditions of r = 0.2, 0.5, and 0.8, the means and standard deviations of TPR, TNR, C, and the area under the SROC curve (AUC) based on the 500 testing samples are summarized in Tables 1 through 3. For a more detailed comparison, in addition to the overall population, the classification results are compared for 2 subpopulations: the individuals in the middle node of the final layer (tFLM) and the individuals in the other terminal nodes (tALL − tFLM).

Table 1  Classification Results Based on the Testing Samples (Normal Distributions, r = 0.2)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (63.60)      0.9497** (0.0302)   0.9526** (0.0305)   0.9933  2.2447 (0.4083)   0.8890 (0.0489)   0.8896 (0.0452)   0.9563  2.9442 (0.5874)
tFLM (36.40)             0.7825** (0.1132)   0.7732** (0.1209)   0.8940  8.6080 (1.1085)   0.6615 (0.1191)   0.6519 (0.1201)   0.7171  3.1508 (0.7311)
tALL (100)               0.8960** (0.0297)   0.8958** (0.0307)   0.9560  4.5108 (0.4939)   0.8467 (0.0378)   0.8526 (0.0380)   0.9211  3.0996 (0.7348)

Note. tALL − tFLM represents all terminal nodes except the middle node of the final layer, tFLM represents the middle node of the final layer, and tALL represents all terminal nodes. Paired t tests were used to compare the true-positive rate (TPR) and true-negative rate (TNR) of the classification and regression tree (CART) and the logistic regression–based trichotomous classification tree (LRTCT). AUC = area under the SROC curve. In tFLM, the mean proportions of the positive and negative individuals were 49.07% and 50.93%, respectively.
**Difference significant at the level of α = 0.01.

Table 2  Classification Results Based on the Testing Samples (Normal Distributions, r = 0.5)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (74.25)      0.9514** (0.0284)   0.9514** (0.0253)   0.9906  2.6941 (0.4473)   0.8536 (0.0463)   0.8495 (0.0464)   0.9246  2.9377 (0.6704)
tFLM (25.75)             0.8175** (0.1030)   0.8015** (0.1156)   0.9018  9.5900 (0.9674)   0.6419 (0.1307)   0.6240 (0.1272)   0.6891  3.3655 (0.9820)
tALL (100)               0.9217** (0.0250)   0.9194** (0.0249)   0.9735  4.4321 (0.4272)   0.8211 (0.0479)   0.8148 (0.0480)   0.8932  3.2005 (0.9688)

Note. In tFLM, the mean proportions of the positive and negative individuals were 49.90% and 50.10%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
**Difference significant at the level of α = 0.01.

Table 3  Classification Results Based on the Testing Samples (Normal Distributions, r = 0.8)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (83.18)      0.9713** (0.0193)   0.9721** (0.0173)   0.9976  2.0448 (0.1712)   0.8797 (0.0460)   0.8798 (0.0476)   0.9488  3.1311 (0.7237)
tFLM (16.82)             0.9072** (0.0816)   0.9051** (0.0831)   0.9845  6.7560 (0.8233)   0.7304 (0.0470)   0.7353 (0.0513)   0.8224  4.1798 (0.8763)
tALL (100)               0.9625** (0.0186)   0.9626** (0.0176)   0.9941  2.8458 (0.3102)   0.8557 (0.0470)   0.8562 (0.0513)   0.9287  3.7128 (0.7753)

Note. In tFLM, the mean proportions of the positive and negative individuals were 50.10% and 49.90%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
**Difference significant at the level of α = 0.01.

In addition, the scatter plots of the 500 pairs of (false-positive rate [FPR], TPR) and the SROC curves based on the testing samples are shown in Figures 2 through 4, where the circles refer to CART and the crosses refer to LRTCT. In these figures, panel (a) shows the scatter plot and SROC curve for the subpopulation tALL − tFLM, panel (b) for the subpopulation tFLM, and panel (c) for the whole population tALL. From Tables 1 through 3 and Figures 2 through 4, the following results can be observed:


Figure 2  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (normal distribution, r = 0.2) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.

1. For the classification results of all terminal nodes except the middle node of the final layer, in terms of TPR, TNR, and AUC, the LRTCT consistently performed better than CART. As r increased, these indicators gradually increased for the LRTCT algorithm, whereas those for CART changed little.
2. For the middle node of the final layer, although the classification results were all poorer than those for the other terminal nodes, the LRTCT still consistently performed better than CART. As r increased, not only did the indicators improve, but the size of this subpopulation also gradually decreased.
3. For the classification results of all terminal nodes as a whole population, in terms of TPR, TNR, and AUC, the LRTCT consistently performed better than CART.
4. For any individual in the middle node of the final layer, the LRTCT had the option of either making a classification decision or not, whereas CART, unaware of such a subpopulation, had no choice but to classify every individual, with a potentially higher risk.

Figure 3  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (normal distribution, r = 0.5) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.


Figure 4  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (normal distribution, r = 0.8) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.

Classification Results for the Simulated Nonnormal Distributions

The results for the nonnormal distributions based on the 500 testing samples are summarized in Tables 4 through 6. In addition, the scatter plots of the 500 pairs of (FPR, TPR) and the SROC curves based on the testing samples are plotted in Figures 5 through 7. As before, panel (a) shows the scatter plot and SROC curve for the subpopulation tALL − tFLM, panel (b) for the subpopulation tFLM, and panel (c) for the whole population tALL. From Tables 4 through 6 and Figures 5 through 7, the following results can be observed:


1. For the terminal nodes except the middle node of the final layer, in terms of AUC, the LRTCT consistently performed better than CART, with a smaller C.
2. For the other subpopulation, the middle node of the final layer, TPR, TNR, AUC, and C were all poorer than the results for the other terminal nodes, and the AUC of the LRTCT was somewhat lower than that of CART.


Table 4  Classification Results Based on the Testing Samples (Nonnormal Distributions, r = 0.2)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (77.21)      0.9354** (0.0614)   0.9409** (0.0251)   0.9849  2.2061 (0.4130)   0.9509 (0.0430)   0.9257 (0.0297)   0.9769  2.1671 (0.3694)
tFLM (22.80)             0.9648* (0.0413)    0.3912** (0.2821)   0.9138  7.7860 (1.3857)   0.9529 (0.1575)   0.5039 (0.3370)   0.9732  2.5474 (0.4699)
tALL (100)               0.9548** (0.0236)   0.9160** (0.0274)   0.9709  3.4680 (0.4572)   0.9641 (0.0190)   0.9099 (0.0286)   0.9763  2.4499 (0.5841)

Note. In tFLM, the mean proportions of the positive and negative individuals were 8.79% and 91.21%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
*Difference significant at the level of α = 0.05. **Difference significant at the level of α = 0.01.

Table 5  Classification Results Based on the Testing Samples (Nonnormal Distributions, r = 0.5)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (76.18)      0.9432 (0.0521)     0.9385** (0.0279)   0.9870  2.2134 (0.3574)   0.9476 (0.0345)   0.9149 (0.0352)   0.9812  2.2339 (0.3325)
tFLM (23.82)             0.9406** (0.0542)   0.2365** (0.1751)   0.8150  8.2600 (1.1744)   0.9964 (0.0077)   0.3531 (0.2366)   0.9147  2.5243 (0.4539)
tALL (100)               0.9481** (0.0231)   0.8766** (0.0339)   0.9525  3.6305 (0.3786)   0.9682 (0.0167)   0.8694 (0.0302)   0.9312  2.3261 (0.4252)

Note. In tFLM, the mean proportions of the positive and negative individuals were 17.60% and 82.40%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
**Difference significant at the level of α = 0.01.

Table 6  Classification Results Based on the Testing Samples (Nonnormal Distributions, r = 0.8)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (76.08)      0.9545** (0.0285)   0.9440** (0.0274)   0.9848  2.0138 (0.3394)   0.9455 (0.0326)   0.9061 (0.0462)   0.9584  2.2978 (0.8012)
tFLM (23.92)             0.9037** (0.0730)   0.3751** (0.1593)   0.7298  8.0700 (1.2491)   0.9393 (0.0728)   0.2786 (0.1976)   0.7365  2.6949 (0.9987)
tALL (100)               0.9414** (0.0242)   0.8624** (0.0353)   0.9496  3.4259 (0.3646)   0.9519 (0.0331)   0.8183 (0.0467)   0.9008  2.4181 (0.9631)

Note. In tFLM, the mean proportions of the positive and negative individuals were 29.47% and 70.53%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
**Difference significant at the level of α = 0.01.

3. For all terminal nodes as a whole population, in terms of TPR, TNR, and AUC, the LRTCT performed better than CART when the predictors were correlated with each other (r = 0.5 and r = 0.8); otherwise, there was almost no difference in performance between the LRTCT and CART.


Classification Results for the Real Data Sets

To display the difference between CART and the LRTCT, the tree structure diagrams of CART and the LRTCT for the breast cancer data set are plotted in Figure 8 as an example.


Figure 5  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (nonnormal distribution, r = 0.2) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.

The means and standard deviations of TPR, TNR, C, and AUC from the 500 runs based on the testing samples are summarized in Tables 7 and 8. In addition, for the 2 real data sets, the 500 pairs of (TPR, TNR) for CART and the LRTCT were compared by the scatter plots and SROC curves shown in Figures 9 and 10.


Figure 6  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (nonnormal distribution, r = 0.5) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.

The tree generated by the LRTCT had 8 terminal nodes (denoted TN1–TN8) with 4 split variables. They were X^(1) = X_1; X^(2) = −0.703 + 0.799X_1 + 0.878X_4 (a linear combination of X_4 and the used variable X_1); X^(3) = 6.44 + 0.339X_1 + 0.432X_4 + 0.932X_3 (a linear combination of X_3 and the 2 used variables X_1 and X_4); and X^(4) = 5.19 + 0.274X_1 + 0.220X_4 + 0.365X_3 + 0.437X_6 (a linear combination of X_6 and the 3 used variables X_1, X_3, and X_4). The logistic regression equation for the middle node of the final layer is

P'(\text{outcome} = \text{Malignant} \mid X_7, X_8) = \frac{1}{1 + \exp\left[ -\left( 2.46 + \ln\frac{62}{28 + 62} + 0.551 X_7 + 0.223 X_8 \right) \right]}.
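As a sketch of evaluating this fitted final-layer rule for a new suspended individual: the example scores below are arbitrary, and 62 and 28 are taken to be the class counts n_1^(k) and n_2^(k) in the final middle node that enter the prior-probability adjustment.

p_malignant <- function(X7, X8) {
  # Linear predictor of the final-layer logistic rule, including the
  # ln(62 / (28 + 62)) prior-probability adjustment shown above.
  eta <- 2.46 + log(62 / (28 + 62)) + 0.551 * X7 + 0.223 * X8
  1 / (1 + exp(-eta))          # posterior probability of malignancy
}
p_malignant(X7 = 5, X8 = 3)    # arbitrary cytology scores, for illustration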

From Tables 7 and 8 and Figures 9 and 10, the following results can be observed:


Figure 7  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (nonnormal distribution, r = 0.8) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.

1. For the subpopulation of terminal nodes except the middle node of the final layer, in terms of AUC, the LRTCT consistently performed better than CART in both data sets.
2. For the other subpopulation, the middle node of the final layer, although TPR, TNR, AUC, and C were all lower than those for the other terminal nodes, the TPR, TNR, and AUC of the LRTCT were superior to those of CART.
3. For all terminal nodes as a whole population, in terms of TPR, TNR, and AUC, the LRTCT performed better than CART, and the LRTCT had a more prominent advantage in the Pima Indian data set than in the breast cancer data set.


Figure 8 Tree structure diagrams of (a) classification and regression tree (CART) and (b) logistic regression–based trichotomous classification tree (LRTCT) for the breast cancer data set.

Table 7  Classification Results Based on the Test Samples (Breast Cancer Data Set)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (82.94)      0.9645 (0.0310)     0.9887** (0.0106)   0.9995  1.9254 (0.3146)   0.9698 (0.0334)   0.9466 (0.0239)   0.9874  1.8722 (0.2494)
tFLM (17.06)             0.8606 (0.1082)     0.6972** (0.2542)   0.9237  5.2380 (0.8826)   0.8595 (0.0895)   0.4148 (0.2504)   0.8227  1.9641 (0.3318)
tALL (100)               0.9358 (0.0335)     0.9701** (0.0140)   0.9923  2.4989 (0.3545)   0.9379 (0.0326)   0.9522 (0.0189)   0.9846  1.9136 (0.3324)

Note. In tFLM, the mean proportions of the positive and negative individuals were 31.34% and 68.66%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
**Difference significant at the level of α = 0.01.

DISCUSSION

In this study, the LRTCT algorithm was proposed based on the ideas of Kendall10 and Fang.11 It is a classification tree algorithm with good performance that combines the advantages of the trichotomous structure and the logistic regression model. The rationale and the detailed procedure have been described, and external validation was carried out on both the simulated data and the real data sets.

Medical decision making faces the particular concern that a mistaken diagnosis may cause the patient either to experience a false alarm or to miss the opportunity for medication, ultimately even resulting in death.


Table 8  Classification Results Based on the Test Samples (Pima Indian Data Set)

Subpopulation            LRTCT                                                             CART
(Proportion, %)          TPR (s)             TNR (s)             AUC     C (s)             TPR (s)           TNR (s)           AUC     C (s)
tALL − tFLM (13.63)      0.6245** (0.4081)   0.9445** (0.0931)   0.9759  1.2418 (0.3265)   0.6732 (0.4515)   0.9068 (0.0862)   0.9101  1.5145 (0.2458)
tFLM (86.37)             0.7138** (0.0654)   0.7698** (0.0445)   0.8154  4.1160 (0.5506)   0.4513 (0.0584)   0.8190 (0.0426)   0.7793  1.8862 (0.3548)
tALL (100)               0.7022** (0.0694)   0.8058** (0.0417)   0.8361  3.7283 (0.4207)   0.5606 (0.1044)   0.8535 (0.0524)   0.7918  1.8240 (0.3265)

Note. In tFLM, the mean proportions of the positive and negative individuals were 61.57% and 38.43%, respectively. LRTCT = logistic regression–based trichotomous classification tree; CART = classification and regression tree; TPR = true-positive rate; TNR = true-negative rate; AUC = area under the SROC curve.
**Difference significant at the level of α = 0.01.

Therefore, based on the "do no harm" principle of bioethics, medical professionals are always very cautious when diagnosing in order to minimize the harm of a possible incorrect diagnosis. Learning from this working style, in each run of the LRTCT algorithm the ambiguous individuals in the middle node are temporarily set aside to be classified in the next run; at the end of the tree-growth process, the individuals in the middle node of the final layer can be regarded as a subset that is difficult to classify with the existing predictor variables.

In general clinical practice, there is a set of individuals who are easily diagnosed from the information provided by a given set of tests, and another set of individuals who are difficult to diagnose even when the same information is available. The LRTCT algorithm is able to differentiate the 2 sets, denoted by the subpopulations tALL − tFLM and tFLM, and to deal with them differently: it makes an assertive decision for the individuals in tALL − tFLM while predicting the relative risk and seeking further information for the individuals in tFLM. In contrast, CART is not able to differentiate the 2 sets; it makes an assertive decision for all individuals with the same rule and hence inevitably misclassifies more of the individuals in tFLM.

In addition to univariate splits, the LRTCT can employ linear combinations of the predictor variables to split a node, so that the potential association between the predictor variables is exploited as much as possible; hence, the classification accuracy of the LRTCT can be higher than that of CART. Meanwhile, because the used variables still have the


opportunity to be combined linearly with the unused variables, the LRTCT can use the information in the predictor variables more fully. This is also consistent with the working style of general clinical practice.

The results from the simulated data and the real data indicated that, in the context of normal distributions, the LRTCT had superior predictive accuracy compared with CART. Therefore, if the continuous predictor variables follow nonnormal distributions, we suggest that the LRTCT be used after suitable transformations in order to increase the accuracy of classification.

Our study has some limitations. First, the response variable in this article is binary, and we are not sure whether the superiority of the trichotomous tree structure is retained if the response variable has multiple categories. Second, because the predictor variables in this article are continuous, we do not know whether the linear combinations still work well if some of the predictor variables are categorical. Third, the simulated normal distributions are limited to equal variance-covariance matrices between the 2 categories, and the simulated nonnormal distributions are limited to positive skew. Fourth, the number of variables is used to measure the cost of classification, ignoring the differing costs of the variables in the real world.

In the future, we plan to apply this algorithm to a wider range of medical diagnosis problems by considering differential diagnosis among multiple categories based on both continuous and categorical variables and by evaluating the cost-effectiveness of classification. Moreover, the execution speed, flexibility, parallelism, output intelligibility, noise resistance, and so forth of the LRTCT algorithm must be addressed in the future in order to continually improve the algorithm and eventually obtain an algorithm with better performance in a range of aspects.


Figure 9  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (breast cancer data set) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.


Figure 10  Scatter plot of (false-positive rate [FPR], true-positive rate [TPR]) and SROC curves (Pima Indian data set) for (a) the subpopulation tALL − tFLM, (b) the subpopulation tFLM, and (c) the whole population tALL.

REFERENCES

1. Breiman L, Friedman J, Olshen R, et al. Classification and Regression Trees. Monterey (CA): Wadsworth & Brooks Cole; 1984.
2. Jin H, Lu Y. Cost-saving tree-structured survival analysis for hip fracture of study of osteoporotic fractures data. Med Decis Making. 2004;24(4):386–98.


3. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo (CA): Morgan Kaufmann; 1993.
4. Quinlan JR. Learning with continuous classes. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. Singapore: World Scientific; 1992. p 343–8.
5. Loh WY, Shih Y-S. Split selection methods for classification trees. Stat Sinica. 1997;7:815–40.


6. Loh WY. Improving the precision of classification trees. Ann Appl Stat. 2009;3:1710–37.
7. Kass GV. An exploratory technique for investigating large quantities of categorical data. Appl Stat. 1980;29:119–27.
8. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.
9. Kim H, Loh W-Y. Classification trees with unbiased multiway splits. J Am Stat Assoc. 2001;96:589–604.
10. Kendall MG. Multivariate Analysis. London: Charles Griffin & Co; 1975.
11. Fang JQ. Sequential discriminant analysis. J Appl Math Beijing. 1979;2(3):287–93.
12. Fleishman AI. A method for simulating non-normal distributions. Psychometrika. 1978;43:521–31.
13. Mangasarian OL, Wolberg WH. Cancer diagnosis via linear programming. SIAM News. 1990;23(5):1–18.


14. Mangasarian OL, Setiono R, Wolberg WH. Pattern Recognition via Linear Programming: Theory and Application to Medical Diagnosis. Coleman TF, Li Y, eds. Philadelphia: SIAM Publications; 1990. p 22–30.
15. Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci USA. 1990;87:9193–6.
16. Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proc Annu Symp Comput Appl Med Care. 1988:261–5.
17. de Vries SO, Hunink MG, Polak JF. Summary receiver operating characteristic curves as a technique for meta-analysis of the diagnostic performance of duplex ultrasonography in peripheral arterial disease. Acad Radiol. 1996;3:361–9.
18. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: John Wiley & Sons; 2002.
