Data Mining Classification: Alternative Techniques
Lecture Notes for Chapter 5
Introduction to Data Mining, by Tan, Steinbach, Kumar (4/18/2004)

Rule-Based Classifier

– Classify records by using a collection of "if…then…" rules
– Rule: (Condition) → y
  – where Condition is a conjunction of attribute tests and y is the class label
  – LHS: rule antecedent or condition
  – RHS: rule consequent
– Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

– A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

The rule R1 covers the hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy

– Coverage of a rule: fraction of records that satisfy the antecedent of the rule
– Accuracy of a rule: fraction of the records that satisfy the antecedent that also satisfy the consequent of the rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
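The two definitions above can be sketched directly. A minimal illustration, using the tax dataset from the table; the dictionary encoding of records and the lambda antecedent are assumptions for the sketch, not part of the original notes.

```python
# Coverage and accuracy of a rule on the tax dataset above.
records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

def coverage_and_accuracy(antecedent, consequent, records):
    covered = [r for r in records if antecedent(r)]      # records satisfying the antecedent
    correct = [r for r in covered if r["Class"] == consequent]
    return len(covered) / len(records), len(correct) / len(covered)

# (Status=Single) -> No
cov, acc = coverage_and_accuracy(lambda r: r["Status"] == "Single", "No", records)
print(cov, acc)  # 0.4 0.5
```

Four of the ten records are Single (coverage 40%), and two of those four have class No (accuracy 50%), matching the slide.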
How does a Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

– A lemur triggers rule R3, so it is classified as a mammal
– A turtle triggers both R4 and R5
– A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier

– Mutually exclusive rules
  – Every record is covered by at most one rule
– Exhaustive rules
  – Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  – Each record is covered by at least one rule

From Decision Trees To Rules

  Refund?
  ├─ Yes → NO
  └─ No → Marital Status?
          ├─ {Single, Divorced} → Taxable Income?
          │                       ├─ < 80K → NO
          │                       └─ > 80K → YES
          └─ {Married} → NO

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

The rules are mutually exclusive and exhaustive, and the rule set contains exactly as much information as the tree.
Rules Can Be Simplified

  Refund?
  ├─ Yes → NO
  └─ No → Marital Status?
          ├─ {Single, Divorced} → Taxable Income?
          │                       ├─ < 80K → NO
          │                       └─ > 80K → YES
          └─ {Married} → NO

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification

– Rules are no longer mutually exclusive
  – A record may trigger more than one rule
  – Solution?
    – Ordered rule set
    – Unordered rule set: use voting schemes
– Rules are no longer exhaustive
  – A record may not trigger any rules
  – Solution?
    – Use a default class

Ordered Rule Set

– Rules are rank ordered according to their priority
  – An ordered rule set is known as a decision list
– When a test record is presented to the classifier
  – It is assigned to the class label of the highest ranked rule it has triggered
  – If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?

With the ordered rule set, the turtle is assigned to the class of R4 (Reptiles), the highest ranked rule it triggers.

Rule Ordering Schemes

– Rule-based ordering
  – Individual rules are ranked based on their quality
– Class-based ordering
  – Rules that belong to the same class appear together
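The decision-list behavior above can be sketched in a few lines. This is a minimal illustration using rules R1–R5 in rank order; the choice of Mammals as the default class is an assumption for the sketch, not something the notes specify.

```python
# An ordered rule set (decision list) with a default class.
RULES = [
    (lambda x: x["Give Birth"] == "no" and x["Can Fly"] == "yes", "Birds"),           # R1
    (lambda x: x["Give Birth"] == "no" and x["Live in Water"] == "yes", "Fishes"),    # R2
    (lambda x: x["Give Birth"] == "yes" and x["Blood Type"] == "warm", "Mammals"),    # R3
    (lambda x: x["Give Birth"] == "no" and x["Can Fly"] == "no", "Reptiles"),         # R4
    (lambda x: x["Live in Water"] == "sometimes", "Amphibians"),                      # R5
]

def classify(x, rules=RULES, default="Mammals"):
    for condition, label in rules:
        if condition(x):   # the highest ranked triggered rule wins
            return label
    return default         # no rule fired: fall back to the default class

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(turtle))  # Reptiles — R4 outranks R5
```

The turtle triggers both R4 and R5, but the ordering resolves the conflict in favor of R4; a record that triggers nothing (like the dogfish shark) gets the default class.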
Building Classification Rules

– Direct Method:
  – Extract rules directly from data
  – e.g.: RIPPER, CN2, Holte's 1R
– Indirect Method:
  – Extract rules from other classification models (e.g. decision trees, neural networks, etc.)
  – e.g.: C4.5rules

Direct Method: Sequential Covering

1. Start from an empty rule list
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met
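The four steps above can be sketched as a generic loop. This is only a skeleton: `learn_one_rule` and `covers` are hypothetical placeholders standing in for the real Learn-One-Rule procedure and the rule-matching test, and the toy demo at the bottom uses trivially simple "rules".

```python
# A high-level sketch of the sequential covering loop.
def sequential_covering(records, learn_one_rule, covers, min_coverage=1):
    rule_list = []                         # 1. start from an empty rule list
    remaining = list(records)
    while remaining:
        rule = learn_one_rule(remaining)   # 2. grow a rule on the remaining records
        covered = [r for r in remaining if covers(rule, r)]
        if len(covered) < min_coverage:    # 4. stopping criterion
            break
        rule_list.append(rule)
        remaining = [r for r in remaining  # 3. remove records covered by the rule
                     if not covers(rule, r)]
    return rule_list

# Toy demo: a "rule" is just a value, covering identical records.
rules = sequential_covering([1, 1, 2, 2, 3],
                            learn_one_rule=lambda recs: recs[0],
                            covers=lambda rule, rec: rec == rule)
print(rules)  # [1, 2, 3]
```

Each iteration grows one rule, peels off the records it covers, and continues on what is left, which is exactly why the order of learned rules is also a natural rank order for the resulting decision list.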
Example of Sequential Covering

[Figure: (i) the original data; (ii) step 1: rule R1 is grown; (iii) step 2: the records covered by R1 are removed and rule R2 is grown; (iv) step 3.]
Aspects of Sequential Covering

– Rule Growing
– Instance Elimination
– Rule Evaluation
– Stopping Criterion
– Rule Pruning

Rule Growing

– Two common strategies: general-to-specific (start from an empty rule and greedily add conjuncts) and specific-to-general (start from a positive record as a seed and greedily generalize it)
Rule Growing (Examples)

– CN2 Algorithm:
  – Start from an empty conjunct: {}
  – Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
  – Determine the rule consequent by taking the majority class of instances covered by the rule
– RIPPER Algorithm:
  – Start from an empty rule: {} => class
  – Add the conjunct that maximizes FOIL's information gain measure:
    – R0: {} => class (initial rule)
    – R1: {A} => class (rule after adding conjunct)
    – Gain(R0, R1) = p1 [ log(p1/(p1+n1)) − log(p0/(p0+n0)) ]
    – where p0, n0 are the numbers of positive/negative instances covered by R0, and p1, n1 the numbers of positive/negative instances covered by R1

FOIL's information gain: example

Suppose R0 covers 60 positive cases and 40 negative. We compare two possible refinements of R0:
1. R1 covers 50 positive and 10 negative cases.
2. R2 covers 30 positive and 5 negative cases.

Their information gains are:
  Gain(R0, R1) = 50 [ log2(50/60) − log2(60/100) ] ≈ 23.7
  Gain(R0, R2) = 30 [ log2(30/35) − log2(60/100) ] ≈ 15.4

Hence we prefer refinement R1, even though it has lower accuracy (50/60 ≈ 0.83 versus 30/35 ≈ 0.86).
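The gain formula and the worked example above translate directly into code; a minimal sketch using base-2 logarithms:

```python
from math import log2

# FOIL's information gain: reward a refinement that keeps many positives
# (the factor p1) while increasing the positive fraction of covered cases.
def foil_gain(p0, n0, p1, n1):
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

g1 = foil_gain(60, 40, 50, 10)   # refinement R1, ~23.7
g2 = foil_gain(60, 40, 30, 5)    # refinement R2, ~15.4
print(g1 > g2)                   # True: R1 is preferred
print(50 / 60 < 30 / 35)         # True: even though R1 is the less accurate rule
```

The `p1` factor is what biases the measure toward rules with high coverage; plain accuracy would have picked R2.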
Instance Elimination

– Why do we need to eliminate instances?
  – Otherwise, the next rule is identical to the previous rule
– Why do we remove positive instances?
  – Ensure that the next rule is different
– Why do we remove negative instances?
  – Prevent underestimating the accuracy of a rule
  – Compare rules R2 and R3 in the diagram

Example: if we don't remove the cases covered by R1, then R3 covers 8 positive and 4 negative cases; otherwise it covers 6 positive and 2 negative cases. For R2 it makes no difference: it covers 7 positive and 3 negative cases either way. So if we don't remove the covered cases, the accuracy of R3 is underestimated (8/12 instead of 6/8).
Rule Evaluation

For a rule R: condition ⇒ class = c, let
  n  : number of instances covered by the rule
  nc : number of instances of class c covered by the rule
  k  : number of classes
  p  : prior probability of class c

Metrics:
  – Accuracy   = nc / n
  – Laplace    = (nc + 1) / (n + k)
  – M-estimate = (nc + k·p) / (n + k)

Rule evaluation: example

Suppose we have a 2-class problem with 60 positive and 100 negative examples. Consider:
1. R1 covers 50 positive examples and 5 negative.
2. R2 covers 2 positive examples and no negative.

Accuracy of R1 is 50/55 = 0.91; accuracy of R2 is 2/2 = 1.

With the Laplace correction this becomes:
  R1: (50+1)/(55+2) = 0.89
  R2: (2+1)/(2+2) = 0.75

With the m-estimate (p = 60/160 = 3/8):
  R1: (50 + 2×3/8)/(55+2) = 0.89
  R2: (2 + 2×3/8)/(2+2) = 0.69
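A quick sketch of the three metrics, checked against the worked example (2 classes, prior p = 60/160 for the positive class):

```python
# Rule evaluation metrics: nc of the n covered instances have class c,
# k is the number of classes, p the prior probability of class c.
def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

k, p = 2, 60 / 160
print(accuracy(50, 55), accuracy(2, 2))                   # ≈0.91 and 1.0
print(laplace(50, 55, k), laplace(2, 2, k))               # ≈0.89 and 0.75
print(m_estimate(50, 55, k, p), m_estimate(2, 2, k, p))   # ≈0.89 and ≈0.69
```

Both corrected metrics shrink toward the prior as coverage drops, which is why the tiny-but-perfect rule R2 loses to R1 once either correction is applied.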
Stopping Criterion and Rule Pruning

– Stopping criterion
  – Compute the gain
  – If gain is not significant, stop growing the rule
– Rule pruning
  – Similar to post-pruning of decision trees
  – Reduced Error Pruning:
    – Remove one of the conjuncts in the rule
    – Compare the error rate on a validation set before and after pruning
    – If the error improves, prune the conjunct

Summary of Direct Method

– Grow a single rule
– Prune the rule (if necessary)
– Remove instances covered by the rule
– Add the rule to the current rule set
– Repeat
Direct Method: RIPPER

– For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
  – Learn rules for the positive class
  – The negative class will be the default class
– For a multi-class problem
  – Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
  – Learn the rule set for the smallest class first, treating the rest as the negative class
  – Repeat with the next smallest class as the positive class
– Growing a rule:
  – Start from an empty rule body
  – Add conjuncts as long as they improve FOIL's information gain
  – Stop when the rule no longer covers negative examples
  – Prune the rule immediately using incremental reduced error pruning
  – Measure for pruning: v = (p − n) / (p + n)
    – p: number of positive examples covered by the rule in the validation set
    – n: number of negative examples covered by the rule in the validation set
  – Pruning method: delete any final sequence of conditions that maximizes v

Pruning in RIPPER: example (cf. exercise 2)

Suppose we have the rule R: A ∧ B ⇒ class = c, and a validation set (not used for growing the rule) containing 500 positive examples (i.e. with class = c) and 500 negative examples. Among the validation cases that satisfy condition A, there are 200 positive and 50 negative examples. Among the cases that satisfy both A and B, there are 100 positive and 5 negative examples. Should condition B be pruned?
Compute v for each prefix of the rule:

LHS of rule   v = (p−n)/(p+n)
{}            (500−500)/1000 = 0
{A}           (200−50)/250 = 0.6
{A,B}         (100−5)/105 ≈ 0.9

Since {A,B} maximizes v, the rule is not pruned.
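The table above is easy to reproduce; a minimal sketch with the validation-set counts from the example:

```python
# RIPPER's pruning metric on the validation set: v = (p - n) / (p + n),
# where p/n are positive/negative validation cases covered by the rule.
def v(p, n):
    return (p - n) / (p + n)

candidates = {"{}": (500, 500), "{A}": (200, 50), "{A,B}": (100, 5)}
scores = {lhs: v(p, n) for lhs, (p, n) in candidates.items()}
for lhs, score in scores.items():
    print(lhs, score)
print(max(scores, key=scores.get))  # {A,B}: keep the full rule, prune nothing
```

Because the full condition {A,B} scores highest, no final sequence of conditions is deleted.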
Direct Method: RIPPER

– Building a rule set:
  – Use the sequential covering algorithm
    – Find the best rule that covers the current set of positive examples
    – Eliminate both positive and negative examples covered by the rule
  – Each time a rule is added to the rule set, compute the new description length
    – Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far (the description length depends on the accuracy and complexity of the rule set)

Indirect Methods

– Extract rules from another classification model, e.g. a decision tree (C4.5rules)
Indirect Method: C4.5rules

– Extract rules from an unpruned decision tree
– For each rule R: A → class = c,
  – consider an alternative rule R−: A− → class = c, where A− is obtained by removing one of the conditions in A
  – Compare the pessimistic error rate for R against all the R−
  – Prune if one of the R− has a lower pessimistic error rate
  – Repeat until we can no longer improve the generalization error

Rule Pruning in C4.5rules

R:  if A then class = c
R−: if A− then class = c

where A− has one condition less than A. Let X be the condition considered for removal from A. Make a cross-table of the cases covered by A−:

                 Class = c   Class ≠ c
  A− and X       Y1          E1
  A− and not X   Y2          E2

R covers Y1 + E1 cases and makes E1 errors; its pessimistic error estimate is UCF(E1, Y1 + E1). R− covers all Y1 + Y2 + E1 + E2 cases and makes E1 + E2 errors.
Rule Pruning in C4.5rules: example

Consider the rule

  if TSH > 6 ∧ FTI ≤ 64 ∧ TSH measured = true ∧ T4U measured = true ∧ thyroid surgery = true
  then class = negative

Suppose this rule covers 3 training cases, 2 of which have class negative. The pessimistic error estimate, using CF = 0.25, is U25%(1, 3) ≈ 68%, because P(E ≤ 1) is approximately 0.25 under a binomial distribution with N = 3 and an error probability of 0.68 on each draw.

Deleting each condition in turn gives:

  Condition deleted     Y1+Y2   E1+E2   Pessimistic error rate
  TSH > 6               3       1       55%
  FTI ≤ 64              6       1       34%
  TSH measured = t      2       1       68%
  T4U measured = t      2       1       68%
  thyroid surgery = t   3       59      97%

The lowest pessimistic error is obtained by deleting FTI ≤ 64. Since 34% is below the 68% of the full rule, that condition is dropped, and the process is repeated on the shortened rule.
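The pessimistic estimate UCF(E, N) can be sketched as the error probability p at which observing at most E errors in N cases still has probability CF under a binomial model, solved here by bisection. This is a reconstruction from the description in the text, not C4.5's exact implementation (which handles some boundary cases specially), so the values only approximately match the table above.

```python
from math import comb

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(e + 1))

def ucf(e, n, cf=0.25):
    """Upper confidence limit: the p with binom_cdf(e, n, p) == cf."""
    lo, hi = 0.0, 1.0
    for _ in range(60):               # binom_cdf is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(ucf(1, 3))   # close to the 68% for the original rule
print(ucf(1, 7))   # close to the 34% after deleting FTI <= 64
print(ucf(1, 4))   # close to the 55% after deleting TSH > 6
```

Note how the same single error counts much less against a rule that covers 7 cases than against one covering 3: that asymmetry is what makes deleting FTI ≤ 64 attractive.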
Indirect Method: C4.5rules

– Instead of ordering the rules, order subsets of rules (class ordering)
  – Each subset is a collection of rules with the same rule consequent (class)
  – Compute the description length of each subset
    – Description length = L(error) + g · L(model)
    – g is a parameter that takes into account the presence of redundant attributes in a rule set (default value 0.5)
  – Order the subsets according to increasing description length
Example

Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds

C4.5 versus C4.5rules versus RIPPER

C4.5 decision tree:
  Give Birth?
  ├─ Yes → Mammals
  └─ No → Live In Water?
          ├─ Yes → Fishes
          ├─ Sometimes → Amphibians
          └─ No → Can Fly?
                  ├─ Yes → Birds
                  └─ No → Reptiles

C4.5rules:
  (Give Birth=No, Can Fly=Yes) → Birds
  (Give Birth=No, Live in Water=Yes) → Fishes
  (Give Birth=Yes) → Mammals
  (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
  ( ) → Amphibians

RIPPER:
  (Live in Water=Yes) → Fishes
  (Have Legs=No) → Reptiles
  (Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
  (Can Fly=Yes, Give Birth=No) → Birds
  () → Mammals

Confusion matrices on the training data:

C4.5 and C4.5rules:
                     PREDICTED CLASS
                     Amphibians  Fishes  Reptiles  Birds  Mammals
  ACTUAL  Amphibians 2           0       0         0      0
  CLASS   Fishes     0           2       0         0      1
          Reptiles   1           0       3         0      0
          Birds      1           0       0         3      0
          Mammals    0           0       1         0      6

RIPPER:
                     PREDICTED CLASS
                     Amphibians  Fishes  Reptiles  Birds  Mammals
  ACTUAL  Amphibians 0           0       0         0      2
  CLASS   Fishes     0           3       0         0      0
          Reptiles   0           0       3         0      1
          Birds      0           0       1         2      1
          Mammals    0           2       1         0      4

Advantages of Rule-Based Classifiers

– As highly expressive as decision trees
– Easy to interpret
– Easy to generate
– Can classify new instances rapidly
– Performance comparable to decision trees