DATA MINING DECISION TREE INDUCTION
1
Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
- Linear Models
2
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
3
Another Decision Tree Example

Using the same training data as the previous slide, a different tree also fits:

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

More than one tree may perfectly fit the data.
4
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: the training set is fed to a tree induction algorithm, which learns a model (the decision tree).

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned decision tree is applied to the test set to assign a class to each record.
5
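The slides show the induction/deduction workflow only as a diagram. The following is a minimal sketch of the same flow; the use of pandas and scikit-learn, the one-hot encoding, and the entropy criterion are illustrative assumptions, not part of the lecture.

```python
# Sketch of the induction/deduction workflow on the slide's toy data.
# Library choice and encoding are assumptions for illustration only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training set from the slide (Attrib1, Attrib2, Attrib3 -> Class)
train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Test set with unknown class labels
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# Induction: learn the model from the training set
X_train = pd.get_dummies(train.drop(columns="Class"))
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, train["Class"])

# Deduction: apply the model to the test set
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
print(model.predict(X_test))
```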
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
6
(Slides 7-10 repeat the same tree and test record, highlighting one step of the traversal at a time: Refund = No, then MarSt = Married.)
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Following the tree: Refund = No -> MarSt = Married -> leaf NO.
Assign Cheat to "No".
11
Decision Tree Terminology
12
Decision Tree Induction

Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT

John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms.
13
Decision Tree Classifier

[Figure: insects plotted by Antenna Length vs. Abdomen Length, both axes running 1-10]

Abdomen Length > 7.1?
  yes -> Katydid
  no  -> Antenna Length > 6.0?
           yes -> Katydid
           no  -> Grasshopper
14
Decision trees predate computers.

[Figure: a field-guide identification key that separates Grasshoppers, Crickets, Katydids, and Camel Crickets using tests such as "Antennae shorter than body?", "3 Tarsi?", and "Foretiba has ears?"]
15
Definition

A decision tree is a classifier in the form of a tree structure:
- Decision node: specifies a test on a single attribute
- Leaf node: indicates the value of the target attribute
- Arc/edge: split on one attribute
- Path: a conjunction of tests that leads to the final decision (the tree as a whole represents a disjunction of these paths)

Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
16
Decision Tree Classification

Decision tree generation consists of two phases:
- Tree construction
  - At the start, all the training examples are at the root
  - Partition the examples recursively based on selected attributes
  - This can also be called supervised segmentation, emphasizing that we are segmenting the instance space
- Tree pruning
  - Identify and remove branches that reflect noise or outliers
17
Decision Tree Representation

Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.

outlook?
  sunny    -> humidity?
                high   -> no
                normal -> yes
  overcast -> yes
  rain     -> wind?
                strong -> no
                weak   -> yes
18
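To make the representation concrete, the same tree can be read as nested attribute tests. The function below is a hypothetical transcription of the outlook tree above, written only for illustration.

```python
# Hypothetical function mirroring the tree above: each internal node tests
# one attribute, each branch follows one attribute value, each leaf returns
# a classification.
def classify(outlook: str, humidity: str, wind: str) -> str:
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    elif outlook == "overcast":
        return "yes"
    else:  # rain
        return "no" if wind == "strong" else "yes"

print(classify("sunny", "normal", "weak"))   # yes
print(classify("rain", "high", "strong"))    # no
```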
How do we Construct a Decision Tree?

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., info. gain)

Why do we call this a greedy algorithm? Because it makes locally optimal decisions (at each node).
19
When Do we Stop Partitioning?
- All samples for a node belong to the same class
- No remaining attributes (majority voting is used to assign the class)
- No samples are left
20
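The two previous slides describe the procedure and the stopping rules only in words. Below is a minimal sketch combining them, in the spirit of Hunt's algorithm; the dict-based tree representation and the use of weighted child entropy as the greedy heuristic are illustrative choices, not prescribed by the slides.

```python
# Sketch of greedy top-down decision tree induction. Examples are dicts,
# e.g. {"outlook": "sunny", "humidity": "high", "class": "no"}.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(examples, attributes):
    classes = {e["class"] for e in examples}
    if len(classes) == 1:               # stop: all samples belong to one class
        return classes.pop()
    if not attributes:                  # stop: no attributes left -> majority vote
        return Counter(e["class"] for e in examples).most_common(1)[0][0]

    def weighted_child_entropy(attr):   # heuristic for the greedy choice
        groups = {}
        for e in examples:
            groups.setdefault(e[attr], []).append(e["class"])
        n = len(examples)
        return sum(len(g) / n * entropy(g) for g in groups.values())

    best = min(attributes, key=weighted_child_entropy)   # locally optimal split
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        children[value] = build_tree(subset, remaining)
    return {"test": best, "children": children}
```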
How to Pick a Locally Optimal Split

Hunt's algorithm: recursively partition the training records into successively purer subsets.

How do we measure purity/impurity?
- Entropy and the associated information gain
- Gini
- Classification error rate (never used in practice, but good for understanding and simple exercises)
21
How to Determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Own Car?
  Yes -> C0: 6, C1: 4
  No  -> C0: 4, C1: 6

Car Type?
  Family -> C0: 1, C1: 3
  Sports -> C0: 8, C1: 0
  Luxury -> C0: 1, C1: 7

Student ID?
  c1 ... c10  -> C0: 1, C1: 0 (each)
  c11 ... c20 -> C0: 0, C1: 1 (each)

Which test condition is the best? Why is Student ID a bad feature to use?
22
How to Determine the Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity:

C0: 5, C1: 5 -> non-homogeneous, high degree of impurity
C0: 9, C1: 1 -> homogeneous, low degree of impurity
23
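As a quick illustration, the three impurity measures mentioned earlier (entropy, Gini, classification error rate) can be computed for the two nodes above; the helper function below is a sketch, not taken from the lecture.

```python
# Impurity of a node given its class counts: entropy, Gini, classification error.
from math import log2

def impurities(counts):
    n = sum(counts)
    ps = [c / n for c in counts]
    entropy = -sum(p * log2(p) for p in ps if p > 0)
    gini = 1 - sum(p * p for p in ps)
    error = 1 - max(ps)
    return entropy, gini, error

print(impurities([5, 5]))   # (1.0, 0.5, 0.5)          -> high impurity
print(impurities([9, 1]))   # (about 0.47, 0.18, 0.1)  -> low impurity
```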
Information Theory

Think of playing "20 questions": I am thinking of an integer between 1 and 1,000 -- what is it? What is the first question you would ask? Why?

Entropy measures how much more information you need before you can identify the integer. Initially, there are 1,000 possible values, which we assume are equally likely. What is the maximum number of questions you need to ask?
24
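One way to make the last question concrete: each yes/no answer can at best halve the set of remaining candidates, so 10 questions always suffice (2^10 = 1024 >= 1000) while 9 cannot be guaranteed to (2^9 = 512 < 1000). In other words, the answer is ceil(log2 1000) = 10.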
Entropy

The entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is:

    Entropy(S) = -p1 log2(p1) - p0 log2(p0)

where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.

If all examples are in one category, the entropy is zero (we define 0 log(0) = 0).
If the examples are equally mixed (p1 = p0 = 0.5), the entropy is at its maximum of 1.

For multi-class problems with c categories, entropy generalizes to:

    Entropy(S) = - Σ_{i=1..c} p_i log2(p_i)
25
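A direct transcription of this formula, written here only as an illustration (with 0 log2(0) treated as 0, as defined above):

```python
# Entropy of a class distribution given as a list of probabilities.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0, 0.0]))       # 0.0 -> all examples in one category
print(entropy([0.5, 0.5]))       # 1.0 -> equally mixed binary set
print(entropy([1/3, 1/3, 1/3]))  # multi-class case: log2(3), about 1.58
```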
Entropy for Binary Classification

The entropy is 0 if the outcome is certain. The entropy is maximum if we have no knowledge of the system (any outcome is equally possible).

[Figure: entropy of a 2-class problem plotted against the proportion of one of the two classes -- 0 at proportions 0 and 1, peaking at 1 when the classes are evenly mixed]
26
Information Gain in Decision Tree Induction

- Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.
- Assume that using attribute A, the current set will be partitioned into some number of child sets.
- The encoding information that would be gained by branching on A is:

    Gain(A) = E(current set) - E(all child sets)

The second term is a weighted sum: each child set's entropy is weighted by the fraction of the total examples that fall in that child. The same weighting applies to Gini and the error rate as well.
27
Examples for Computing Entropy

    Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

NOTE: p(j|t) is computed as the relative frequency of class j at node t.

C1: 0, C2: 6
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log2(0) - 1 log2(1) = -0 - 0 = 0

C1: 1, C2: 5
  P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

C1: 2, C2: 4
  P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

C1: 3, C2: 3
  P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
  Entropy = -(1/2) log2(1/2) - (1/2) log2(1/2) = -(1/2)(-1) - (1/2)(-1) = 1/2 + 1/2 = 1
28
How to Calculate log2(x)

Many calculators only have buttons for log10(x) and loge(x) ("log" typically means log10). You can calculate the log for any base b as follows:

    logb(x) = logk(x) / logk(b)

Thus log2(x) = log10(x) / log10(2). Since log10(2) ≈ 0.301, just calculate the log base 10 and divide by 0.301 to get the log base 2. You can use this for the HW if needed.
29
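A quick check of the change-of-base identity (shown only as an illustration; Python's math module happens to provide log2 directly):

```python
# The identity log2(x) = log10(x) / log10(2) gives the same value as log2 itself.
from math import log10, log2

x = 6.0
print(log10(x) / log10(2))   # 2.5849625...
print(log2(x))               # 2.5849625... (identical)
```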
Splitting Based on INFO...

Information Gain:

    GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

where the parent node p is split into k partitions, n_i is the number of records in partition i, and n is the total number of records.

- Uses a weighted average of the child nodes, where the weight is based on the number of examples.
- Used in the ID3 and C4.5 decision tree learners; WEKA's J48 is a Java version of C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
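The sketch below applies GAIN_split to the three candidate splits from the earlier "How to Determine the Best Split" slide (parent node: 10 records of C0 and 10 of C1); the helper functions are illustrative, not taken from the lecture.

```python
# Information gain of a split, given class counts of the parent and children.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, child_counts_list):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

parent     = [10, 10]
own_car    = [[6, 4], [4, 6]]                 # Yes / No
car_type   = [[1, 3], [8, 0], [1, 7]]         # Family / Sports / Luxury
student_id = [[1, 0]] * 10 + [[0, 1]] * 10    # c1..c20, one record each

print(gain_split(parent, own_car))      # about 0.03
print(gain_split(parent, car_type))     # about 0.62
print(gain_split(parent, student_id))   # 1.0 -- maximal gain, yet a useless split
```

Student ID gets the highest gain precisely because it produces many tiny, pure partitions, which is the disadvantage noted above.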
How to Split on Continuous Attributes?

For continuous attributes:
- Partition the continuous values of attribute A into a discrete set of intervals
- Create a new boolean attribute A_c by looking for a threshold c:

      A_c = true  if A >= c
            false otherwise

- How to choose c? One method is to try all possible splits.
31
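A minimal sketch of the "try all possible splits" idea, under the assumption that candidate thresholds are taken at midpoints between consecutive sorted values (the slide itself does not fix the candidate set). The values below reuse the Taxable Income and Cheat columns of the earlier training data purely for illustration.

```python
# Pick the threshold c with the highest information gain for a continuous attribute.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    best_c, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        c = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate threshold
        left  = [lab for v, lab in pairs if v < c]
        right = [lab for v, lab in pairs if v >= c]
        gain = entropy(labels) \
               - len(left) / len(pairs) * entropy(left) \
               - len(right) / len(pairs) * entropy(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]                      # in thousands
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(incomes, cheat))
```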
Person  Hair Length  Weight  Age  Class
Homer   0"           250     36   M
Marge   10"          150     34   F
Bart    2"           90      10   M
Lisa    6"           78      8    F
Maggie  4"           20      1    F
Abe     1"           170     70   M
Selma   8"           160     41   F
Otto    10"          180     38   M
Krusty  6"           200     45   M
Comic   8"           290     38   ?
32

    Entropy(S) = -(p / (p+n)) log2(p / (p+n)) - (n / (p+n)) log2(n / (p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Candidate split: Hair Length? (yes / no)