Data Mining Lecture 4: Classification Techniques (2)

Overview

Previous Lecture
• Classification Problem
• Classification based on Regression
• Distance-based Classification (KNN)

This Lecture
• Classification using Decision Trees
• Classification using Rules
• Quality of Classifiers
Classification Using Decision Trees

• A partitioning-based technique
  – Divides the search space into rectangular regions
• Each tuple is placed into a class based on the region within which it falls
• Internal nodes are associated with an attribute, and arcs with values of that attribute
• DT approaches differ in how the tree is built
• Algorithms: Hunt’s, ID3, C4.5, CART

Decision Tree

Given:
  – D = {t1, …, tn} where ti = <ti1, …, tih>
  – Database schema contains {A1, A2, …, Ah}
  – Classes C = {C1, …, Cm}
A Decision (or Classification) Tree is a tree associated with D such that
  – Each internal node is labeled with an attribute, Ai
  – Each arc is labeled with a predicate that can be applied to the attribute at its parent
  – Each leaf node is labeled with a class, Cj
Example of a Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label):

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single          70K             No
   4   Yes     Married         120K            No
   5   No      Divorced        95K             Yes
   6   No      Married         60K             No
   7   Yes     Divorced        220K            No
   8   No      Single          85K             Yes
   9   No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):

  Refund?
  ├─ Yes → NO
  └─ No → MarSt?
          ├─ Single, Divorced → TaxInc?
          │                     ├─ < 80K → NO
          │                     └─ > 80K → YES
          └─ Married → NO

Another Example of Decision Tree

  MarSt?
  ├─ Married → NO
  └─ Single, Divorced → Refund?
                        ├─ Yes → NO
                        └─ No → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
   1   Yes      Large    125K     No
   2   No       Medium   100K     No
   3   No       Small    70K      No
   4   Yes      Medium   120K     No
   5   No       Large    95K      Yes
   6   No       Medium   60K      No
   7   Yes      Large    220K     No
   8   No       Small    85K      Yes
   9   No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

• Induction: a Tree Induction algorithm is applied to the Training Set to learn a Model (the Decision Tree).
• Deduction: the Model is then applied to the Test Set to assign a class to each unlabeled record.
Apply Model to Test Data

Start from the root of the tree. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

• At the root, test Refund: the record has Refund = No, so follow the No branch to the MarSt node.
• At MarSt, the record has Marital Status = Married, so follow the Married branch, which ends in the leaf labeled NO.
• (Had the record been Single or Divorced, TaxInc would be tested next: < 80K → NO, > 80K → YES.)
• Assign Cheat to “No”.

The learned tree is applied in the same way to every record of the Test Set, completing the deduction step of the classification task.
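As an illustration of this model-application step, here is a minimal sketch (not from the slides) that hard-codes the first example tree as nested nodes and routes the test record from the root to a leaf; the dictionary layout and the classify() helper are my own choices.

    def classify(node, record):
        """Follow the tree from this node down to a leaf and return its class label."""
        while isinstance(node, dict):            # internal node
            attr = node["attr"]
            if "threshold" in node:              # continuous attribute: binary test
                branch = "<=" if record[attr] <= node["threshold"] else ">"
            else:                                # categorical attribute
                branch = record[attr]
            node = node["branches"][branch]
        return node                              # leaf: the class label itself

    # First example tree from the slides (Refund -> MarSt -> TaxInc, threshold 80K).
    taxinc = {"attr": "Taxable Income", "threshold": 80,
              "branches": {"<=": "No", ">": "Yes"}}
    tree = {"attr": "Refund",
            "branches": {"Yes": "No",
                         "No": {"attr": "Marital Status",
                                "branches": {"Married": "No",
                                             "Single": taxinc,
                                             "Divorced": taxinc}}}}

    record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
    print(classify(tree, record))                # -> No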
General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled with the default class, yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (a sketch of this procedure follows below).
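The recursion above can be sketched roughly as follows; this is my own illustration, with choose_split standing in as a hypothetical helper that picks the attribute test (for example by Gini or information gain, discussed later in the lecture).

    def hunt(records, labels, default_class, choose_split):
        """Grow a tree with Hunt's general procedure.
        records: list of dicts; labels: parallel list of class labels.
        choose_split(records, labels) returns (attr_name, outcome_fn) or None,
        where outcome_fn maps a record to a branch label such as 'Yes'/'No'."""
        if not records:                           # empty Dt: leaf with default class
            return default_class
        if len(set(labels)) == 1:                 # homogeneous Dt: leaf with that class
            return labels[0]
        split = choose_split(records, labels)
        if split is None:                         # no useful test left: majority class
            return max(set(labels), key=labels.count)
        attr_name, outcome_fn = split
        groups = {}                               # partition Dt by the test outcome
        for rec, lab in zip(records, labels):
            outcome = outcome_fn(rec)
            groups.setdefault(outcome, ([], []))
            groups[outcome][0].append(rec)
            groups[outcome][1].append(lab)
        return {"attr": attr_name,
                "branches": {o: hunt(r, l, default_class, choose_split)
                             for o, (r, l) in groups.items()}}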
Hunt’s Algorithm

Example on the Refund / Marital Status / Taxable Income training data shown earlier (default class: Don’t Cheat):

1. Start with a single node containing all ten records; label it Don’t Cheat.
2. Split on Refund:
   – Refund = Yes → Don’t Cheat
   – Refund = No → Don’t Cheat (this subset is still impure, so keep splitting)
3. Under Refund = No, split on Marital Status:
   – Single, Divorced → still mixed (Cheat and Don’t Cheat)
   – Married → Don’t Cheat
4. Under Single/Divorced, split on Taxable Income:
   – < 80K → Don’t Cheat
   – >= 80K → Cheat
Decision Tree Induction

• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
DT Split Areas

[Figure: the Gender (M/F) by Height (1.0 to 2.5) attribute space divided into rectangular split regions.]

How to Specify Test Condition?

• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  CarType → {Family} | {Sports} | {Luxury}
• Binary split: divides the values into two subsets; need to find the optimal partitioning (candidate partitions are enumerated in the sketch below).
  CarType → {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}

Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  Size → {Small} | {Medium} | {Large}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs {Large}, OR {Small} vs {Medium, Large}
• What about the split {Small, Large} vs {Medium}? It does not preserve the order of the ordinal values.
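A small sketch (my own, not part of the slides) of how the candidate binary partitions could be enumerated: a nominal attribute allows any two-subset partition of its values, while an ordinal attribute should only be cut between adjacent values so that the order is preserved.

    from itertools import combinations

    def nominal_binary_splits(values):
        """All ways to divide distinct nominal values into two non-empty subsets."""
        values = sorted(values)
        splits = []
        for r in range(1, len(values) // 2 + 1):
            for subset in combinations(values, r):
                rest = tuple(v for v in values if v not in subset)
                if r < len(rest) or subset < rest:   # count each partition only once
                    splits.append((subset, rest))
        return splits

    def ordinal_binary_splits(ordered_values):
        """Only order-preserving splits: a prefix versus the remaining suffix."""
        return [(tuple(ordered_values[:i]), tuple(ordered_values[i:]))
                for i in range(1, len(ordered_values))]

    print(nominal_binary_splits(["Family", "Sports", "Luxury"]))
    # three partitions: {Family} vs rest, {Luxury} vs rest, {Sports} vs rest
    print(ordinal_binary_splits(["Small", "Medium", "Large"]))
    # ({Small} vs {Medium, Large}) and ({Small, Medium} vs {Large})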
Splitting Based on Continuous Attributes

• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • considers all possible splits and finds the best cut
    • can be more compute intensive

(i) Binary split:     Taxable Income > 80K?  →  Yes / No
(ii) Multi-way split: Taxable Income?  →  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
Comparing Decision Trees

[Figure: two trees for the same data, one Balanced and one Deep.]

DT Induction Issues that affect Performance

• Choosing splitting attributes
• Ordering of splitting attributes
• Split points
• Tree structure
• Stopping criteria
• Training data (size of)
• Pruning
How to determine the Best Split

Before splitting: 10 records of class 0 (C0), 10 records of class 1 (C1). Three candidate test conditions:

  Own Car?     Yes → C0: 6, C1: 4      No → C0: 4, C1: 6
  Car Type?    Family → C0: 1, C1: 3   Sports → C0: 8, C1: 0   Luxury → C0: 1, C1: 7
  Student ID?  c1 → C0: 1, C1: 0  …  c10 → C0: 1, C1: 0   c11 → C0: 0, C1: 1  …  c20 → C0: 0, C1: 1

Which test condition is the best?

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity
How to Find the Best Split

Before splitting, compute the impurity of the parent node (class counts N00, N01); call it M0. For each candidate attribute test, e.g. A? producing children N1 and N2, or B? producing children N3 and N4, compute the impurity of each child (M1, M2, M3, M4) and combine the children of one test into a weighted impurity (M12 for A, M34 for B).

  Gain = M0 – M12  vs.  M0 – M34: choose the test with the larger reduction in impurity.

Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [ p(j | t) ]^2

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 − 1/nc) when records are equally distributed among all nc classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI

  GINI(t) = 1 − Σ_j [ p(j | t) ]^2

• C1: 0, C2: 6
  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Gini = 1 − P(C1)^2 − P(C2)^2 = 1 − 0 − 1 = 0
• C1: 1, C2: 5
  P(C1) = 1/6,  P(C2) = 5/6
  Gini = 1 − (1/6)^2 − (5/6)^2 = 0.278
• C1: 2, C2: 4
  P(C1) = 2/6,  P(C2) = 4/6
  Gini = 1 − (2/6)^2 − (4/6)^2 = 0.444

Splitting Based on GINI

• Used in CART
• When a node p is split into k partitions (children), the quality of the split is computed as:

  GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.
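For illustration, a short sketch (mine, not from the slides) that computes the node Gini and the weighted Gini of a split, reproducing the worked examples above:

    from collections import Counter

    def gini(labels):
        """GINI(t) = 1 - sum_j p(j|t)^2 over the class labels reaching a node."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    def gini_split(partitions):
        """Weighted Gini of a split: sum_i (n_i / n) * GINI(i).
        partitions: one list of class labels per child node."""
        n = sum(len(p) for p in partitions)
        return sum(len(p) / n * gini(p) for p in partitions)

    print(round(gini(["C2"] * 6), 3))               # -> 0.0
    print(round(gini(["C1"] + ["C2"] * 5), 3))      # -> 0.278
    print(round(gini(["C1"] * 2 + ["C2"] * 4), 3))  # -> 0.444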
Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought

  Parent: C1 = 6, C2 = 6, Gini = 0.500
  Split on B?:  N1 (Yes): C1 = 5, C2 = 2    N2 (No): C1 = 1, C2 = 4
  Gini(N1) = 1 − (5/7)^2 − (2/7)^2 = 0.408
  Gini(N2) = 1 − (1/5)^2 − (4/5)^2 = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing GINI Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

  Multi-way split (CarType):
          Family  Sports  Luxury
    C1      1       2       1
    C2      4       1       1
    Gini = 0.393

  Two-way split (find the best partition of values):
          {Sports, Luxury}  {Family}        {Sports}  {Family, Luxury}
    C1          3              1                2            2
    C2          2              4                1            5
    Gini = 0.400                          Gini = 0.419

Continuous Attributes: Computing GINI Index

• Use binary decisions based on one value, e.g. Taxable Income > 80K?
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

Continuous Attributes: Computing GINI Index

• For efficient computation, for each attribute:
  – Sort the attribute on its values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

  Example on the Taxable Income attribute of the training data above (Cheat is the class):

  Sorted values     60    70    75    85    90    95   100   120   125   220
  Cheat             No    No    No    Yes   Yes   Yes  No    No    No    No

  Split position    55    65    72    80    87    92    97   110   122   172   230
  ≤ : Yes            0     0     0     0     1     2     3     3     3     3     3
  ≤ : No             0     1     2     3     3     3     3     4     5     6     7
  > : Yes            3     3     3     3     2     1     0     0     0     0     0
  > : No             7     6     5     4     4     4     4     3     2     1     0
  Gini             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

  The best split position is 97, with the least Gini index of 0.300.
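A compact sketch (my own) of this sort-and-scan search, using midpoints between adjacent distinct values as candidate split points:

    def node_gini(counts, total):
        """Gini from a dict of class counts on one side of the split."""
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    def best_split(values, labels):
        """Return (split_point, weighted_gini) minimising the Gini of the split."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        left = {c: 0 for c in set(labels)}                  # counts on the <= side
        right = {c: labels.count(c) for c in set(labels)}   # counts on the > side
        best = (None, float("inf"))
        for i in range(n - 1):
            v, c = pairs[i]
            left[c] += 1                                    # move one record leftwards
            right[c] -= 1
            if v == pairs[i + 1][0]:                        # split only between distinct values
                continue
            split = (v + pairs[i + 1][0]) / 2
            nl, nr = i + 1, n - i - 1
            g = nl / n * node_gini(left, nl) + nr / n * node_gini(right, nr)
            if g < best[1]:
                best = (split, g)
        return best

    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (in K)
    cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    print(best_split(income, cheat))   # -> (97.5, 0.3), matching the least Gini of 0.300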
Information

• When all the marbles in the bowl are mixed up, little information is given.
• When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given.

DT Induction

• Decision Tree Induction is often based on Information Theory
• Use this approach with DT Induction!

Information/Entropy

Given probabilities p1, p2, …, ps whose sum is 1, Entropy is defined as:

  H(p1, p2, …, ps) = Σ_{i=1}^{s} p_i log(1/p_i)

[Figure: plot of log(1/p) as a function of p.]

Entropy

• Entropy measures the amount of randomness, surprise, or uncertainty.
• Goal in classification – no surprise – entropy = 0

[Figure: plot of H(p, 1 − p), the entropy of a two-class distribution.]
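A tiny sketch (mine) of this definition; base-10 logarithms are used because that is what the lecture’s worked example below appears to use:

    import math

    def entropy(probs, base=10):
        """H(p1, ..., ps) = sum_i p_i * log(1/p_i), skipping zero probabilities."""
        return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

    # Starting-state entropy of the height example (4 Short, 8 Medium, 3 Tall of 15):
    print(round(entropy([4/15, 8/15, 3/15]), 4))   # -> 0.4385 (the slide rounds to 0.4384)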
ID3

• Creates a decision tree using information theory concepts and tries to reduce the expected number of comparisons.
• ID3 chooses to split on the attribute that gives the highest information gain:

  Gain(D, A) = H(D) − Σ_i P(D_i) H(D_i)

  where splitting D on attribute A produces the subsets D_i, and P(D_i) = |D_i| / |D|.

Height Example Data

  Name       Gender  Height  Output1  Output2
  Kristina   F       1.60    Short    Medium
  Jim        M       2.02    Tall     Medium
  Maggie     F       1.90    Medium   Tall
  Martha     F       1.88    Medium   Tall
  Stephanie  F       1.71    Short    Medium
  Bob        M       1.85    Medium   Medium
  Kathy      F       1.60    Short    Medium
  Dave       M       1.72    Short    Medium
  Worth      M       2.12    Tall     Tall
  Steven     M       2.10    Tall     Tall
  Debbie     F       1.78    Medium   Medium
  Todd       M       1.95    Medium   Medium
  Kim        F       1.89    Medium   Tall
  Amy        F       1.81    Medium   Medium
  Wynette    F       1.75    Medium   Medium
ID3 Example (Output1)

• Starting-state entropy:
  4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using Gender:
  – Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  – Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  – Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  – Gain: 0.4384 − 0.34152 = 0.09688
• Gain using Height: 0.4384 − (2/15)(0.301) = 0.3983
  (after discretizing Height into ranges, only one range, holding 2 records split between two classes, remains impure; its entropy is log 2 = 0.301)
• Choose Height as the first splitting attribute
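The gender computation above can be reproduced with a short sketch (my own helper names):

    import math
    from collections import Counter

    def entropy10(labels):
        """Entropy of a list of class labels, with base-10 logs as in the lecture."""
        n = len(labels)
        return sum((c / n) * math.log(n / c, 10) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        """Gain = H(D) - sum over attribute values v of (|Dv|/|D|) * H(Dv)."""
        n = len(labels)
        groups = {}
        for lab, val in zip(labels, attribute_values):
            groups.setdefault(val, []).append(lab)
        remainder = sum(len(g) / n * entropy10(g) for g in groups.values())
        return entropy10(labels) - remainder

    output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short",
               "Short", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
    gender  = ["F", "M", "F", "F", "F", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F"]
    print(round(information_gain(output1, gender), 4))   # -> 0.0969 (slide: 0.09688)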
C4.5 Algorithm

• ID3 favors attributes with a large number of divisions (is vulnerable to overfitting)
• C4.5 is an improved version of ID3 that adds:
  – Missing data handling
  – Continuous data
  – Pruning
  – Rules
  – GainRatio:
    • Takes into account the cardinality of each split area
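As a sketch of that last point (my own code; the standard C4.5 definition of GainRatio is assumed), the gain is divided by the entropy of the partition sizes, so attributes that shatter the data into many small pieces are penalised:

    import math
    from collections import Counter

    def split_info(attribute_values, base=10):
        """Entropy of the partition sizes produced by the attribute."""
        n = len(attribute_values)
        return sum((c / n) * math.log(n / c, base)
                   for c in Counter(attribute_values).values())

    def gain_ratio(gain, attribute_values):
        si = split_info(attribute_values)
        return gain / si if si > 0 else 0.0

    # Gender splits the 15 height records into groups of 9 (F) and 6 (M):
    gender = ["F", "M", "F", "F", "F", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F"]
    print(round(gain_ratio(0.0969, gender), 2))   # -> 0.33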
CART: Classification and Regression Trees

• Creates a binary tree
• Uses entropy to choose the best splitting attribute and point
• Formula to choose split point, s, for node t:

  Φ(s | t) = 2 P_L P_R Σ_{j=1}^{m} | P(C_j | t_L) − P(C_j | t_R) |

• P_L, P_R: probability that a tuple in the training set will be on the left or right side of the tree.

CART Example

• At the start, there are six choices for the split point (right branch on equality):
  – Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
  – Φ(1.6) = 0
  – Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  – Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  – Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
  – Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8 (the largest Φ value)

Decision Tree Based Classification

• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets

Decision Boundary

[Figure: a two-class data set over attributes x and y in [0, 1], classified by the tree x < 0.43?, then y < 0.47? on the left branch and y < 0.33? on the right branch; each leaf is pure (counts 4:0, 0:4, 0:3, 4:0), and the induced regions are axis-parallel rectangles.]

• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique Decision Trees

• Test conditions may involve several attributes at once (e.g. a condition on x + y), giving decision boundaries that are not axis-parallel

Tree Replication