Improving Decision Tree Performance by Exception Handling

International Journal of Automation and Computing 7(3), August 2010, 372-380
DOI: 10.1007/s11633-010-0517-5

Appavu Alias Balamurugan Subramanian¹, S. Pramala¹, B. Rajalakshmi¹, Ramasamy Rajaram²

¹ Department of Information Technology, Thiagarajar College of Engineering, Madurai, India
² Department of Computer Science, Thiagarajar College of Engineering, Madurai, India

Manuscript received September 3, 2008; revised June 26, 2009

Abstract: This paper focuses on improving decision tree induction algorithms when a kind of tie appears during the rule generation procedure for specific training datasets. The tie occurs when there are equal proportions of the target class outcome in the leaf node's records, which leads to a situation where majority voting cannot be applied. To solve the above-mentioned exception, we propose to base the prediction of the result on the naive Bayes (NB) estimate, k-nearest neighbour (k-NN), and association rule mining (ARM). The other features used for splitting the parent nodes are also taken into consideration.

Keywords: Data mining, classification, decision tree, majority voting, naive Bayes (NB), k-nearest neighbour (k-NN), association rule mining (ARM).

1 Introduction

The process of knowledge discovery in databases (KDD) is defined by Fayyad et al.[1] as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Data mining is the core step of the KDD process, which is concerned with a computationally efficient enumeration of the patterns present in a database. Classification is a primary data mining task aimed at learning a function that classifies a database record into one of several predefined classes based on the values of the record attributes[2]. The decision tree is an important classification tool, and various improvements over the original decision tree algorithm, such as ID3[3, 4], ID4[5], ID5[6], ITI[7], C4.5[8], and CART[9], have been proposed. All of them deal with the concept of incrementally building a decision tree in real time. In decision-tree learning, a decision tree is induced from a set of labelled training instances represented by a tuple of attribute values and a class label. Because of the vast search space, decision-tree learning is typically a greedy, top-down recursive process starting with the entire training data and an empty tree. An attribute that best partitions the training data is chosen as the splitting attribute for the root, and the training data are then partitioned into disjoint subsets satisfying the values of the splitting attribute. For each subset, the algorithm proceeds recursively until all instances in a subset belong to the same class. However, prior decision tree algorithms do not handle the exception that arises "when two or more classes have equal probabilities in a tree leaf". This paper investigates exception handling in decision tree construction. We examine exceptions such as how to produce a classification from a leaf node that contains a tie. In such a leaf, each class is represented equally, preventing the tree from using "majority voting" to output a classification prediction. We introduce new techniques for handling these exceptions and use real-world datasets to show that these techniques improve classification accuracy.

This paper is organized as follows. Related work is reviewed in Section 2. The problem statement is illustrated in Section 3. Sections 4 and 5 deal with handling several exceptions occurring during decision tree induction. We compare the proposed algorithm to the most common algorithm of decision tree construction and evaluate the performance of our algorithm on a variety of standard benchmark datasets in Section 6. Section 7 concludes the paper and presents a number of issues for future research.

2 Related work

Decision-tree induction is one of the most successful learning algorithms due to its various attractive features: simplicity, comprehensibility, absence of parameters, and the ability to handle mixed-type data. Decision tree learning is one of the most widely used and practical methods for inductive learning. The ID3 algorithm[3] is a useful concept-learning algorithm because it can efficiently construct a well-generalized decision tree. For non-incremental learning tasks, this algorithm is often an ideal choice for building a classification rule. However, for incremental learning tasks, it would be far preferable to accept instances incrementally, without the need to build a new decision tree each time.

There exist several techniques to construct incremental decision-tree-based models. Some of the earlier efforts include ID4[5], ID5[6], ID5R[10], and ITI[7]. All these systems work by using the ID3-style "information gain" measure to select the attributes. They are all designed to incrementally build a decision tree using one training instance at a time by keeping the necessary statistics (the measures for information gain) at each decision node. Many learning tasks are incremental, as new instances or details become available over time. An incremental decision tree induction algorithm works by building a tree and updating it as new instances become available. The ID3 algorithm[3] can be used to learn incrementally by adding each new instance to the training set as it becomes available and by re-running ID3 against the enlarged training set. This is, however, computationally inefficient. ID5[6] and ID5R[10] are both incremental decision tree builders that overcome the deficiencies of ID4. The essential difference is that when tree restructuring is required, instead of discarding a sub-tree due to its high entropy, the attribute that is to be placed at the node is pulled up and the tree structure below the node is retained. In the case of ID5[6], the sub-trees are not recursively updated, while in ID5R[10] they are updated recursively. Leaving the sub-trees un-restructured is computationally more efficient; however, the resulting sub-tree is not guaranteed to be the same as the one that would be produced by ID3 on the same training instances. The incremental tree inducer (ITI)[7] is a program that constructs decision trees automatically from labelled examples. The most useful aspect of the ITI algorithm is that it provides a mechanism for incremental tree induction: if one has already constructed a tree and then obtains a new labelled example, it is possible to present it to the algorithm and have the algorithm revise the tree as necessary. The alternative would be to build a new tree from scratch, based on the augmented set of labelled examples, which is typically much more expensive. ITI handles symbolic variables, numeric variables, and missing data values. It also includes a virtual pruning mechanism.

The development of decision tree learning leads to and is encouraged by a growing number of commercial systems such as C5.0/See5 (RuleQuest Research), MineSet (SGI), and Intelligent Miner (IBM). Numerous techniques have been developed to speed up decision-tree learning, such as designing a fast tree-growing algorithm, parallelization, and data partitioning. A number of strategies for decision tree improvements have been proposed in the literature [11−19]; they aim at "tweaking" an already robust model despite its main and obvious limitation. A number of ensemble classifiers have also been proposed in the literature [20−24], which seem to offer little improvement in accuracy, especially when the added complexity of the methods is considered.

3 Problem statements and overview

This paper studies one type of tie in the decision tree. The tie occurs when there are the same number of cases of different classes in a leaf node. For this problem, association rule mining (ARM), k-nearest neighbour (k-NN), and naive Bayes (NB) are integrated into the decision tree algorithm. ARM, k-NN, and NB are used to make predictions on the instances in the tied leaf node. WEKA[25, 26] uses a weak kind of tie breaking, such as taking the first value mentioned in the value declaration part of the WEKA data file, while the proposed solutions try to break ties more intelligently by using ARM, k-NN, and NB. We use ARM, k-NN, and NB to select the class value when majority voting does not work.

The set of attributes used to describe an instance is denoted by A, and the individual attributes are indicated as ai, where i is between 1 and the number of attributes, denoted as |A|. For each attribute ai, the set of possible values is denoted as Vi. The individual values are indicated by vij, where j is between 1 and the number of values for attribute ai, denoted as |Vi|. The notations used to represent the features of the training data are Aj, where j = 1, ···, n, for the attributes, C for the class attribute, and Vij for the attribute values, where i = 1, ···, n and j refers to the attribute to which the value belongs. The decision tree induction algorithm may have to handle three different types of training data sets. The general structure of the training data set is shown in Table 1.

Table 1  Structure of training set

A1    A2    A3    A4    A5    ···   Am    C
V11   V12   V13   V14   V15   ···   V1m   C1
V21   V22   V23   V24   V25   ···   V2m   C2
V31   V32   V33   V34   V35   ···   V3m   C3
···   ···   ···   ···   ···   ···   ···   ···
Vn1   Vn2   Vn3   Vn4   Vn5   ···   Vnm   Cn

After analyzing the decision tree algorithms, it has been found that the concept of majority voting has to handle different types of inputs. Consider an attribute Ak among A1 to Am, and let C be the class attribute. The values of Ak are V1k and V2k, and let the values of the class attribute be C1 and C2. Classification can be done if 1) both Ak and C have only one distinct value, or 2) most of the attribute values V1k, V2k of a particular attribute Ak belong to the same class, which is called majority voting. The maximum occurrence of each distinct value and its corresponding class label value is obtained from Table 2, which shows a case where majority voting is successful. Consider the training data set shown in Table 3, where majority voting fails.

Table 2  Input format for majority voting condition

Ak    C
V1k   C2
V2k   C2
V1k   C2
V1k   C1

If Ak = V1k then C = C2.
If Ak = V2k then C = C2.

Table 3  Example training set where majority voting cannot be applied

Ak    C
V1k   C1
V2k   C2
V1k   C1
V1k   C2
V1k   C2

Under this condition, only one rule can be generated: If Ak = V2k then C = C2.


The count of distinct values of Ak is 2 in Table 3. Hence, two rules should be generated from the table, but only one rule is generated with the help of majority voting, and the other rule cannot be generated:

If Ak = V1k then C = ?

The value for Ak can be found by majority voting as Ak = V1k, but its corresponding class value cannot be determined because, among the four records with Ak = V1k in Table 3, there is an equal partition of the class attribute between C = C1 and C = C2. Hence, majority voting cannot be applied in this case. The decision tree induction algorithm does not give any specific solution to handle this problem.
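To make the exception concrete, the following sketch (our illustration, not part of the original algorithm; the helper name and record encoding are ours) counts class labels per attribute value for the data in Tables 2 and 3 and reports where majority voting succeeds or fails.

```python
from collections import Counter

def majority_vote(records):
    """Group records by attribute value and try to emit one rule per value.

    Each record is an (attribute_value, class_label) pair. Returns a dict
    mapping attribute value -> predicted class, or None when the top classes
    are tied and majority voting cannot be applied.
    """
    by_value = {}
    for value, label in records:
        by_value.setdefault(value, []).append(label)
    rules = {}
    for value, labels in by_value.items():
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            rules[value] = None          # tie: majority voting fails
        else:
            rules[value] = counts[0][0]  # clear majority
    return rules

# Table 2: majority voting succeeds for both values of Ak.
table2 = [("V1k", "C2"), ("V2k", "C2"), ("V1k", "C2"), ("V1k", "C1")]
# Table 3: the records with Ak = V1k split 2-2 between C1 and C2.
table3 = [("V1k", "C1"), ("V2k", "C2"), ("V1k", "C1"), ("V1k", "C2"), ("V1k", "C2")]

print(majority_vote(table2))  # {'V1k': 'C2', 'V2k': 'C2'}
print(majority_vote(table3))  # {'V1k': None, 'V2k': 'C2'}
```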

4 Ties between classes

This paper addresses the problem of tree induction for supervised learning. We have proposed a method to handle the situation where the number of instances for different classes is the same at a leaf node. The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner. We have chosen training data from the Hayes-Roth database (the data are adapted from [27]). The class attribute has three distinct values (namely, {1, 2, 3}); therefore, there are three distinct classes (m = 3). Let class C1 correspond to "1", class C2 correspond to "2", and class C3 correspond to "3". There are 21 samples with class label "1", 21 samples with class label "2", and 19 samples with class label "3". The expected information needed to classify a given sample is first derived to compute the information gain of each attribute:

I(s1, s2, s3) = I(21, 21, 19) = −(21/61) log2(21/61) − (21/61) log2(21/61) − (19/61) log2(19/61) = 1.5833.
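The expected information above is the Shannon entropy of the class distribution. A minimal sketch (ours, not part of the paper's implementation) that reproduces the figure from the class counts:

```python
import math

def expected_information(counts):
    """I(s1, ..., sm) = -sum_i (si / s) * log2(si / s), with s = sum of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts of the Hayes-Roth training sample used above: 21, 21 and 19.
print(expected_information([21, 21, 19]))  # ~1.583 bits
```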

Next, we need to compute the entropy of each attribute. Consider the attribute "hobby": the distribution of "1", "2", and "3" samples for each value of "hobby" needs to be looked at, and the expected information for each of these distributions is computed.

The training tuples in the Hayes-Roth dataset are given as input to the decision tree induction algorithm. For the given training data set, the algorithm starts its recursive function of identifying the attribute with the highest information gain. On the first recursion, the attribute "age" gets the highest information gain, and it divides the data set into two halves, which is the core idea behind the algorithm. The recursive function identifies the splitting attributes at each iteration, and the rules are generated. Majority voting is the stopping condition for the recursive partitioning, and all the data at one such iteration are given in Table 4. The possible sets of rules generated are shown in Tables 5 and 6.

Table 4  A sample partitioned dataset

Education level   Hobby   Class
3                 1       2
4                 3       3
1                 1       1
2                 1       2
3                 2       2
1                 2       1
2                 2       2
3                 3       2
3                 1       1
1                 3       1
2                 3       2
3                 2       1
3                 3       1

In Table 4, "education level" gets the highest information gain, and hence it splits the table into four partitions, "Education level = 3", "Education level = 4", "Education level = 2", and "Education level = 1"; the recursive function is then applied to each partition. In Table 7, neither information gain nor majority voting can be applied, and hence the decision tree induction algorithm chooses the class label value arbitrarily. Thus, for the same set of training samples, different rules are generated. This kind of inconsistency occurs when recursive partitioning faces a subset in which the classes have an equal number of records, whereas majority voting works correctly only when the class counts are unequal. Fig. 1 shows the majority voting failure with the Hayes-Roth dataset.
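To illustrate the attribute comparison, the following sketch (ours, not the authors' code) computes the information gain of "Education level" and "Hobby" over the rows of Table 4 as transcribed above; running it confirms that "Education level" scores higher at this step.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="Class"):
    """Information gain of `attribute` with respect to `target` over `rows`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Rows of Table 4 as (Education level, Hobby, Class).
table4 = [
    (3, 1, 2), (4, 3, 3), (1, 1, 1), (2, 1, 2), (3, 2, 2), (1, 2, 1), (2, 2, 2),
    (3, 3, 2), (3, 1, 1), (1, 3, 1), (2, 3, 2), (3, 2, 1), (3, 3, 1),
]
rows = [{"Education level": e, "Hobby": h, "Class": c} for e, h, c in table4]

for attr in ("Education level", "Hobby"):
    print(attr, round(information_gain(rows, attr), 3))
# "Education level" scores higher, so it is chosen as the splitting attribute.
```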

Table 5  Generated rules for Hayes-Roth dataset with "1" as the first class value

Number   Generated rules
1        If Age = 1 and Marital status = 1 and Education level = 2, then Class = 1
2        If Age = 1 and Marital status = 1 and Education level = 4 and Hobby = 3, then Class = 3
3        If Age = 1 and Marital status = 2 and Education level = 1, then Class = 1
4        If Age = 1 and Marital status = 2 and Education level = 2, then Class = 2
5        If Age = 1 and Marital status = 2 and Education level = 4 and Hobby = 3, then Class = 1
6        If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 3, then Class = 1
7        If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 2, then Class = 2
8        If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = 1
9        If Age = 1 and Marital status = 4, then Class = 3
10       If Age = 1 and Marital status = 3, then Class = 1

Table 6  Generated rules for Hayes-Roth dataset with "2" as the first class value

Number   Generated rules
1        If Age = 1 and Marital status = 1 and Education level = 2, then Class = 1
2        If Age = 1 and Marital status = 1 and Education level = 4 and Hobby = 3, then Class = 3
3        If Age = 1 and Marital status = 2 and Education level = 1, then Class = 1
4        If Age = 1 and Marital status = 2 and Education level = 2, then Class = 2
5        If Age = 1 and Marital status = 2 and Education level = 4 and Hobby = 3, then Class = 1
6        If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 3, then Class = 1
7        If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 2, then Class = 2
8        If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = 2
9        If Age = 1 and Marital status = 4, then Class = 3
10       If Age = 1 and Marital status = 3, then Class = 1

Fig. 1  The attribute age has the highest information gain and therefore becomes a test attribute at the root node of the decision tree. Branches are grown for each value of age. The samples are partitioned according to each branch.

Table 7  Example for a situation where majority voting is not possible

Hobby   Class
1       2
2       2
3       2
1       1
2       1
3       1

5 Methods for handling ties between classes

This paper addresses the following research question in the context of decision tree induction: which class should be chosen when, after partitioning with respect to an attribute, the numbers of records having the different class values are equal, i.e., when majority voting fails? The NB probabilistic model, k-NN, and ARM are used to address this problem.

5.1 Naive Bayes (NB)

In this paper, it is proposed to use the NB classifier to select the class value when majority voting does not work. The presented approach deals with the shortcomings of the information-gain-based decision tree induction algorithm. In order to avoid the arbitrary selection of a class label due to failure of majority voting, the usage of NB to assign the class label is proposed. The class label chosen by NB would be better than an arbitrary selection. Thus, decision tree naive Bayes (DT-NB) can be regarded as an extension of the decision tree algorithm that does not give contradictory rules. With the traditional decision tree induction algorithm, a data set would fall under one of these categories: 1) consistent data or 2) inconsistent data. However, with the DT-NB algorithm, any data set considered for classification would fall under one of these categories: 1) consistent data identified by alpha rules, 2) inconsistent data identified by alpha rules, 3) consistent data identified by beta rules, or 4) inconsistent data identified by beta rules. Inconsistent data are those which are not classified in the given training data set; hence, the efficiency of the algorithm decreases due to its inapplicability in classifying the given data. The records which are considered inconsistent are identified by alpha rules. Noisy data may be present in the training data due to human errors, that is, the user records irrelevant values in the data set. The algorithm should be developed in such a way that it is capable of handling noisy data as well. When a rule is formed by a set of records in which one value of the class label accounts for more than 50 % compared with its counterpart, then that rule is acceptable, and those instances that contradict this rule will be recorded as noisy data. The inconsistent records are those which are identified as noise by beta rules.

Consider the attribute "hobby" in Fig. 1, which cannot decide the class label's value. The proposed solution considers the path of the tree as the test data and all the remaining data as the training set for this problem. This dataset is given to the probability-based algorithm (simple naive Bayes), and the path of the tree is considered as the test data, i.e.,

If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = ?

NB is a classification technique[28−31] that is both predictive and descriptive. It analyzes the relationship between each independent variable and the dependent variable and derives a conditional probability for each relationship. When a new case is analyzed, a prediction is made by combining the effects of the independent variables on the dependent variable (the outcome that is predicted). NB requires only one pass through the training set to generate a classification model, which makes it one of the most efficient data mining techniques. Bayesian classification is based on Bayes theorem[29, 32]. Let X be a data sample whose label is unknown, and let H be some hypothesis, such as that the data sample X belongs to a specified class C. According to Bayes theorem, for classification problems, P(H|X) is to be determined, i.e., the probability that the hypothesis H holds true given the observed data sample X. P(X), P(H), and P(X|H) may be estimated from the given data. Bayes theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X), and P(X|H). From Bayes theorem, it follows:

P(H|X) = P(X|H) P(H) / P(X)    (1)

where X = (Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1). NB aims at predicting the class value (Class = 1 or Class = 2), and it does so probabilistically. We use the notation that Pr(A) denotes the probability of an event A, and Pr(A|B) denotes the probability of A conditional on another event B. The hypothesis H is that the class will be "1" or "2", and our aim is to find Pr(H|E), where the evidence E is the particular combination of attribute values Age = 1, Marital status = 2, Education level = 3, and Hobby = 1. P(X|Ci)P(Ci) is maximized for i = 1, 2, where P(Ci), the prior probability of each class, can be computed from the training samples. Therefore, the naive Bayesian classifier predicts Class = 1 for sample X. The procedure for handling ties between classes using NB is given in Algorithm 1.

Algorithm 1. Algorithm for handling ties between classes using NB
1) If the attribute list is empty, then
2)   If the training data rule is null,
3)     Return N as a leaf node labelled with the most common class in the samples // Majority voting
       This rule traversal from the root to the leaf is an alpha rule.
       If the records satisfy an alpha rule:
         i) Correctly classified (as in the normal decision tree induction algorithm)
       Else:
         ii) Incorrectly classified (as in the normal decision tree induction algorithm)
4)   Else
5)     Return N as a leaf node labelled with the class label value obtained from the probability-based algorithm (naive Bayes) // A unique rule
       This traversal of the rule from the root to the leaf is called a beta rule.
       If the records satisfy a beta rule:
         Correctly classified (these records cannot be handled by the decision tree induction algorithm)
       Else:
         Incorrectly classified.

Hence, the class value is assigned, based on the given training data set, even where majority voting is not possible. The rules developed in this manner are called beta rules, as they have the unique feature of handling the exception due to the failure of majority voting.
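A minimal sketch of the NB fallback described above, assuming categorical attributes and a hand-rolled, Laplace-smoothed naive Bayes; the function name, the toy training rows, and the encoding are ours, not the authors' implementation:

```python
from collections import Counter, defaultdict

def nb_tie_break(train_rows, train_labels, query, alpha=1.0):
    """Pick a class for `query` with a naive Bayes estimate (Laplace-smoothed).

    train_rows  : list of dicts mapping attribute name -> categorical value
    train_labels: list of class labels, aligned with train_rows
    query       : dict of attribute values on the path to the tied leaf
    """
    class_counts = Counter(train_labels)
    # value_counts[label][attr][value] = frequency in the training data
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(train_rows, train_labels):
        for attr, value in row.items():
            value_counts[label][attr][value] += 1

    def posterior(label):
        score = class_counts[label] / len(train_labels)    # prior P(Ci)
        for attr, value in query.items():
            seen = len({r[attr] for r in train_rows})       # distinct values of attr
            score *= ((value_counts[label][attr][value] + alpha) /
                      (class_counts[label] + alpha * seen))  # smoothed P(value | Ci)
        return score

    return max(class_counts, key=posterior)

# Tied leaf on the path Age=1, Marital status=2, Education level=3, Hobby=1:
# fall back to NB trained on the remaining data (toy rows for illustration).
train_rows = [
    {"Age": 1, "Marital status": 2, "Education level": 3, "Hobby": 2},
    {"Age": 1, "Marital status": 2, "Education level": 3, "Hobby": 3},
    {"Age": 1, "Marital status": 1, "Education level": 2, "Hobby": 1},
    {"Age": 1, "Marital status": 2, "Education level": 1, "Hobby": 1},
]
train_labels = [1, 1, 2, 2]
query = {"Age": 1, "Marital status": 2, "Education level": 3, "Hobby": 1}
print(nb_tie_break(train_rows, train_labels, query))  # -> 1 for these toy rows
```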

5.2 k-nearest neighbour (k-NN)

In this section, the usage of k-NN is proposed to overcome a few problems that result from the failure of majority voting. This is an enhancement of the decision tree induction algorithm to deal with cases of unpredictability. We explain the needs that drive the change by describing specific problems of the algorithm, suggest solutions, and test them with benchmark data sets. If two or more classes have equal probabilities in a tree leaf, assigning the class label using the k-NN algorithm is proposed to solve the problem. k-NN is suitable for classification models[33−35]. Unlike other predictive algorithms, the training data is not scanned or processed to create the model; instead, the training data is the model itself. When a new case or instance is presented to the model, the algorithm looks at the entire data to find a subset of cases that are most similar to it and uses them to predict the outcome. For the distance measure, we use a simple metric that is computed by summing the scores for each of the independent columns. The procedure for handling ties between classes using instance-based learning is given in Algorithm 2.

Algorithm 2. Algorithm for handling ties between classes using k-NN
1) If the attribute list is empty, then
2)   If the training data rule is null,
3)     Return N as a leaf node labelled with the most common class in the samples // Majority voting
       This rule traversal from the root to the leaf is an alpha rule.
       If the records satisfy an alpha rule:
         i) Correctly classified (as in the normal decision tree induction algorithm)
       Else:
         ii) Incorrectly classified (as in the normal decision tree induction algorithm)
4)   Else
5)     Return N as a leaf node labelled with the class label value obtained from instance-based learning (the k-NN algorithm) // A unique rule
       This traversal of the rule from the root to the leaf is called a beta rule.
       If the records satisfy a beta rule:
         Correctly classified (these records cannot be handled by the decision tree induction algorithm)
       Else:
         Incorrectly classified.


In the traditional decision tree induction algorithm, the data set would fall under one of the categories explained in the previous section. The proposed algorithm considers the path of the tree as the test data and all the remaining data as the training set for this problem. This training data set is given to the instance-based learner (a simple k-NN classifier), and the path of the tree is considered as the test data, i.e., let X = (Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1). The output is y′ = 1, i.e., if Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = 1.
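A sketch of the k-NN fallback with the simple column-matching score described above; k, the similarity function, and the toy rows are our assumptions, not the authors' exact setup:

```python
from collections import Counter

def knn_tie_break(train_rows, train_labels, query, k=3):
    """Break a leaf tie by voting among the k most similar training records.

    Similarity is the number of independent columns whose categorical value
    matches the query (the "sum of scores" metric mentioned above).
    """
    def similarity(row):
        return sum(1 for attr, value in query.items() if row.get(attr) == value)

    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: similarity(pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

query = {"Age": 1, "Marital status": 2, "Education level": 3, "Hobby": 1}
train_rows = [
    {"Age": 1, "Marital status": 2, "Education level": 3, "Hobby": 2},
    {"Age": 1, "Marital status": 2, "Education level": 3, "Hobby": 3},
    {"Age": 1, "Marital status": 1, "Education level": 2, "Hobby": 1},
    {"Age": 2, "Marital status": 2, "Education level": 1, "Hobby": 1},
]
train_labels = [1, 1, 2, 2]
print(knn_tie_break(train_rows, train_labels, query))  # -> 1 for these toy rows
```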

5.3 Association rule mining (ARM)

ARM finds interesting associations and correlation relationships among large sets of data items[36]. If two or more classes have equal probabilities in a tree leaf, the usage of ARM is proposed to assign the class label value. ARM can be used to solve the majority voting failure problem existing in the decision tree induction algorithm. After the rules are found, the class value for the conflicting data is decided as follows (see the sketch after Algorithm 3): 1) find the rule having the highest number of attributes; 2) if two or more rules have the highest number of attributes but different class values, then count the rules with the highest number of attributes for each class value and return the class value supported by the most such rules; otherwise, return the class label of the rule having the highest number of attributes. The procedure for handling ties between classes using ARM is given in Algorithm 3. Using ARM, we can predict that if Hobby = 1 and Age = 1 and Education level = 3 and Marital status = 2, then Class = 1.

Algorithm 3. Algorithm for handling ties between classes using ARM
Input: training data tuples D
Output: decision tree
Procedure:
1) Create a node in the decision tree.
2) If all tuples in the training data belong to the same class C, the node created above is a leaf node labelled with class C.
3) If there are no remaining attributes on which the tuples may be further partitioned:
   i) If any one class has a higher majority count than the others, // Majority voting succeeds
      convert the node created above into a leaf node and label it with the majority class C at that instance. // Apply majority voting
   ii) Else // Majority voting fails
      a) Apply the ARM algorithm to find the frequent patterns in the training data D for the conflicting rule.
      b) Tune the conflicting rule depending on the class value obtained from the frequent patterns.
4) Else apply an attribute selection method to find the "best" splitting criterion.
5) Label the created node with the splitting criterion.
6) Remove the splitting attribute from the attribute list.
7) For each outcome of the splitting attribute, partition the tuples and grow sub-trees for each partition.
8) If the number of tuples satisfying an outcome of the splitting attribute is empty, apply majority voting.
9) Else apply the above algorithm to those tuples that satisfy the outcome of the splitting attribute.
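A sketch of the rule-selection step only (points 1 and 2 above). The mined rules are represented as (antecedent, class) pairs, and the example rules are hypothetical, not mined from the actual Hayes-Roth data:

```python
from collections import Counter

def select_class_from_rules(rules, query):
    """Pick a class for `query` from mined association rules.

    rules: list of (antecedent, class_label), where antecedent is a dict of
           attribute -> value pairs that must all match the query.
    Prefers the matching rule(s) with the most attributes; if several such
    rules disagree, the class backed by more of those rules wins.
    """
    matching = [(len(ant), label) for ant, label in rules
                if all(query.get(a) == v for a, v in ant.items())]
    if not matching:
        return None
    best_size = max(size for size, _ in matching)
    votes = Counter(label for size, label in matching if size == best_size)
    return votes.most_common(1)[0][0]

# Hypothetical mined rules for the tied leaf in Fig. 1.
rules = [
    ({"Age": 1, "Marital status": 2, "Education level": 3}, 1),
    ({"Age": 1, "Hobby": 1}, 1),
    ({"Education level": 3}, 2),
]
query = {"Hobby": 1, "Age": 1, "Education level": 3, "Marital status": 2}
print(select_class_from_rules(rules, query))  # -> 1 (largest matching rule)
```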

6 Experimental results and performance evaluation

The performance of the proposed methods was evaluated on artificial and real-world domains. Artificial domains are useful because they allow variation in parameters, aid us in understanding the specific problems that algorithms exhibit, and allow conjectures to be tested. Real-world domains are useful because they come from real-world problems and are therefore actual problems on which we would like to improve performance. All real-world data sets used are from the University of California Irvine (UCI) repository[27], which contains over 177 data sets, mostly contributed by researchers in the field of machine learning. The data sets that possess the majority voting problem, i.e., two records having the same attribute values but different class values, are purposely chosen for experimentation to prove the efficiency of our algorithm. Experiments were conducted to measure the performance of the proposed methods, and it was observed that these data sets are efficiently classified. Tables 8 and 9 give a description of the datasets used in the experiments. The proposed algorithms' performance is compared with various existing decision tree induction algorithms. Tables 10 and 11 show, for each data set, the estimated predictive accuracy of the decision tree with k-NN, the decision tree with ARM, and the decision tree with NB versus other traditional decision tree methods. As one can see from Tables 10 and 11, the predictive accuracy of DT-ARM, DT-KNN, and DT-NB tends to be better than the accuracy of traditional decision tree induction algorithms.

There are four common approaches for estimating accuracy: using the training data, using test data, cross-validation, and percentage splitting[37]. The evaluation methods used here are percentage splitting and 10-fold cross-validation (a sketch of such an evaluation follows this discussion). In Tables 10 and 11, datasets with the majority voting problem are considered, and their predictive accuracy is compared with that of the traditional decision tree induction algorithm. The proposed algorithm has provided better results than C4.5, besides solving the exceptions caused by the failure of majority voting.

This paper presents algorithms based on the decision tree with k-NN, the decision tree with ARM, and the decision tree with NB, which use the decision tree as the main structure and conditionally apply ARM, k-NN, or NB at some of the leaf nodes in order to handle cases where a node has equal probability for more than one class. The nearest neighbour technique used in the DT-KNN algorithm has a number of drawbacks when compared to the NB technique used in the DT-NB algorithm. These drawbacks include a lack of descriptive output, reliance on an arbitrary distance measure, and the performance implications of scanning the entire data set to predict each new case. The NB method, on the other hand, has a number of qualities: it is very efficient with respect to training, and it can make predictions from partial data.


The biggest theoretical drawback of NB is the need for independence between the columns, which is usually not a problem in practice. From the performance of each classifier, some conclusions can be drawn from this experiment:

1) Using NB, k-NN, or ARM to resolve a tie in which two or more classes have equal probabilities in a tree leaf is better than a random decision at the node.

2) Using NB to resolve such a tie is better than using ARM or k-NN at the node.

The accuracy across the various classifiers is shown in Fig. 2.
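For reference, a minimal sketch of the kind of 10-fold cross-validation comparison reported in Table 11, using scikit-learn baselines on a UCI dataset. It evaluates plain decision tree, NB, and k-NN learners only; it does not reproduce the authors' WEKA setup or the proposed DT-NB, DT-KNN, and DT-ARM hybrids, and the file name and column layout are assumptions:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical CSV with the class label in the last column (e.g. Hayes-Roth).
data = pd.read_csv("hayes_roth.csv")
X, y = data.iloc[:, :-1], data.iloc[:, -1]

baselines = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
}
for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: {scores.mean():.2%} accuracy")
```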

Table 8  The datasets that possess the majority voting problem in the UCI repository

Dataset              Number of instances   Number of attributes   Associated task
Abalone              4177                  9                      Classification
Pima-diabetes        768                   9                      Classification
Flag                 194                   30                     Classification
Glass                214                   11                     Classification
Housing              506                   14                     Classification
Image segmentation   210                   20                     Classification
Ionosphere           351                   35                     Classification
Iris                 150                   5                      Classification
Liver disorder       345                   7                      Classification
Parkinson            195                   24                     Classification
Yeast                1484                  10                     Classification

Table 9  The datasets that possess the majority voting problem in the UCI repository (after discretization)

Dataset                         Number of instances   Number of attributes   Associated task
Blood transfusion               748                   5                      Classification
Contraceptive method choice     1473                  10                     Classification
Concrete                        1030                  9                      Classification
Forest-fires                    517                   13                     Classification
Haberman                        306                   4                      Classification
Hayes-Roth                      132                   6                      Classification
Solar flare 1                   323                   13                     Classification
Solar flare 2                   1066                  13                     Classification
Spect-heart                     267                   23                     Classification
Teaching assistant evaluation   151                   6                      Classification

Table 10  Predictive accuracy of the proposed methods versus C4.5, NB, k-NN with percentage splitting

Data set                        C4.5     NB       k-NN     Decision tree with ARM   Decision tree with NB   Decision tree with k-NN
Blood transfusion               84.61    86.26    83.51    85.16                    90.11                   87.36
Solar flare 1                   67.9     56.79    70.37    85.18                    91.46                   88.89
Glass                           66.04    28.3     69.81    75.47                    86.79                   81.13
Haberman                        59.57    66.67    54.54    87.88                    90.91                   87.88
Hayes-Roth                      67.5     72.5     57.5     85.00                    92.50                   87.50
Liver disorder                  63.95    46.51    66.28    86.04                    90.70                   87.20
Parkinson                       50       50       33.33    86.67                    90.00                   86.67
Teaching assistant evaluation   54       70.27    89.19    86.48                    94.59                   86.48
Mean                            64.20    59.66    65.57    84.735                   90.88                   86.64

Table 11  Predictive accuracy of the proposed methods versus C4.5, NB, k-NN with ten-fold cross-validation

Data set                        C4.5     NB       k-NN     Decision tree with ARM   Decision tree with NB   Decision tree with k-NN
Solar flare 1                   71.21    65.01    69.66    85.39                    89.16                   87.62
Hayes-Roth                      72.73    80.30    57.58    89.23                    93.94                   91.67
Parkinson                       80.51    69.23    89.74    97.89                    98.97                   97.89
Teaching assistant evaluation   52.98    50.33    62.25    86.00                    93.38                   89.40
Mean                            69.36    66.22    69.81    89.63                    93.86                   91.65



Fig. 2  Predictive accuracy on classifiers

7 Conclusions

This paper proposes enhancements to the basic decision tree induction algorithm. The enhancements handle the case where the numbers of instances for different classes are the same in a leaf node. The results on the UCI data sets show that the proposed modification procedure has better predictive accuracy than the existing methods in the literature. Future work includes the use of a more robust classifier instead of NB, k-NN, and ARM. Building a tree for each tied feature at lower cost is also an area that demands attention in the near future.

References

[1] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From data mining to knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Menlo Park, CA, USA: American Association for Artificial Intelligence, pp. 1–34, 1996.
[2] J. W. Han, M. Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.
[3] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[4] J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221–234, 1987.
[5] P. E. Utgoff. Improved training via incremental learning. In Proceedings of the 6th International Workshop on Machine Learning, Morgan Kaufmann Publishers Inc., Ithaca, New York, USA, pp. 362–365, 1989.
[6] P. E. Utgoff. ID5: An incremental ID3. In Proceedings of the 5th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., Ann Arbor, MI, USA, pp. 107–120, 1988.
[7] P. E. Utgoff. An improved algorithm for incremental induction of decision trees. In Proceedings of the 11th International Conference on Machine Learning, pp. 318–325, 1994.
[8] J. R. Quinlan. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[9] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone. Classification and Regression Trees, Wadsworth and Brooks, 1984.
[10] P. E. Utgoff. Incremental induction of decision trees. Machine Learning, vol. 4, no. 2, pp. 161–186, 1989.
[11] S. A. Balamurugan, R. Rajaram. Effective and efficient feature selection for large scale data using Bayes' theorem. International Journal of Automation and Computing, vol. 6, no. 1, pp. 62–71, 2009.
[12] W. Buntine. Learning classification trees. Statistics and Computing, vol. 2, no. 2, pp. 63–73, 1992.
[13] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, C. L. Gerberich. Application of information theory to the construction of efficient decision trees. IEEE Transactions on Information Theory, vol. 28, no. 4, pp. 565–577, 1982.

[14] J. Mickens, M. Szummer, D. Narayanan. Snitch: Interactive decision trees for troubleshooting misconfigurations. In Proceedings of the 2nd USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, USENIX Association, Cambridge, MA, USA, Article No. 8, 2007.
[15] R. Kohavi, C. Kunz. Option decision trees with majority votes. In Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, pp. 161–169, 1997.
[16] R. Caruana, A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, ACM, Pittsburgh, Pennsylvania, USA, pp. 161–168, 2006.
[17] J. C. Schlimmer, D. Fisher. A case study of incremental concept induction. In Proceedings of the 5th National Conference on Artificial Intelligence, Morgan Kaufmann, Philadelphia, USA, pp. 496–501, 1986.
[18] J. C. Schlimmer, R. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the 5th National Conference on Artificial Intelligence, vol. 1, pp. 502–507, 1986.
[19] P. E. Utgoff, N. C. Berkman, J. A. Clouse. Decision tree induction based on efficient tree restructuring. Machine Learning, vol. 29, no. 1, pp. 5–44, 2004.
[20] H. A. Chipman, E. I. George, R. E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, vol. 93, no. 443, pp. 935–948, 1998.
[21] R. Kohavi. Scaling up the accuracy of naive Bayes classifiers: A decision-tree hybrid. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 202–207, 1996.
[22] L. M. Wang, S. M. Yuan, L. Li, H. J. Li. Improving the performance of decision tree: A hybrid approach. Conceptual Modeling, Lecture Notes in Computer Science, Springer, vol. 3288, pp. 327–335, 2004.
[23] Y. Li, K. H. Ang, G. C. Y. Chong, W. Y. Feng, K. C. Tan, H. Kashiwagi. CAutoCSD-evolutionary search and optimisation enabled computer automated control system design. International Journal of Automation and Computing, vol. 1, no. 1, pp. 76–88, 2006.
[24] Z. H. Zhou, Z. Q. Chen. Hybrid decision tree. Knowledge-Based Systems, vol. 15, no. 8, pp. 515–528, 2002.
[25] WEKA. Open source collection of machine learning algorithms.


[26] I. H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd Edition, 2004.
[27] C. L. Blake, C. J. Merz. UCI Repository of Machine Learning Databases, [Online], Available: http://www.ics.uci.edu/~mlearn/mlrepository.html, 2008.
[28] E. Frank, M. Hall, B. Pfahringer. Locally weighted naive Bayes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp. 249–256, 2003.
[29] J. Joyce. Bayes' Theorem, Stanford Encyclopedia of Philosophy, 2003.
[30] L. Jiang, D. Wang, Z. Cai, X. Yan. Survey of improving naive Bayes for classification. In Proceedings of the 3rd International Conference on Advanced Data Mining and Applications, Springer, vol. 4632, pp. 134–145, 2007.
[31] P. Langley, W. Iba, K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence, AAAI Press and MIT Press, pp. 223–228, 1992.
[32] J. M. Bernardo, A. F. Smith. Bayesian Theory, John Wiley & Sons, 1993.
[33] D. W. Aha, D. Kibler, M. K. Albert. Instance-based learning algorithms. Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[34] T. M. Cover, P. E. Hart. Nearest neighbour pattern classification. IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[35] S. M. Weiss. Small sample error rate estimation for k-nearest neighbour classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 285–289, 1991.
[36] T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama. Data mining with optimized two-dimensional association rules. ACM Transactions on Database Systems, vol. 26, no. 2, pp. 179–213, 2001.
[37] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137–1143, 1995.

Appavu Alias Balamurugan Subramanian is a Ph.D. candidate in the Department of Information and Communication Engineering at Anna University, Chennai, India. He is also a faculty member at Thiagarajar College of Engineering, Madurai, India. His research interests include data mining and text mining. E-mail: [email protected] (Corresponding author)

S. Pramala studied in the Department of Information Technology at Thiagarajar College of Engineering, Madurai, India, and received the B.Tech. degree in information technology in May 2010. Her research interests include data mining, where classification and prediction methods dominate her domain area, and the mining of frequent patterns, associations, and correlations existing in test data. E-mail: [email protected]

B. Rajalakshmi studied information technology as an undergraduate at Thiagarajar College of Engineering, Madurai, India, and received the B.Tech. degree in information technology in May 2010. Her research interests include textual data mining, the usage of various classification techniques for efficient retrieval of data, data pruning, and pre-processing. E-mail: [email protected]

Ramasamy Rajaram received the Ph.D. degree from Madurai Kamaraj University, India. He is a professor in the Department of Computer Science and Information Technology at Thiagarajar College of Engineering, Madurai, India. His research interests include data mining and information security. E-mail: [email protected]