Data Mining Classification: Alternative Techniques. Rule-Based Classifier. Lecture Notes for Chapter 5, Introduction to Data Mining, by Tan, Steinbach, Kumar.

Rule-Based Classifier

- Classify records by using a collection of "if…then…" rules
- Rule: (Condition) → y
  - where
    - Condition is a conjunction of attribute tests
    - y is the class label
  - LHS: rule antecedent or condition
  - RHS: rule consequent
- Examples of classification rules:
  - (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  - (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
Application of Rule-Based Classifier

- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

The rule R1 covers the hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy

- Coverage of a rule:
  - Fraction of records that satisfy the antecedent of the rule
- Accuracy of a rule:
  - Fraction of the records that satisfy the antecedent that also satisfy the consequent of the rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
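The numbers above are easy to check in code. The following is a minimal Python sketch written for these notes (not part of the original slides; the tuple encoding of the records is an illustrative assumption) that computes coverage and accuracy of the rule (Status=Single) → No on the table above:

    records = [
        # (Refund, Marital Status, Taxable Income in K, Class)
        ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
        ("No", "Single", 70, "No"),   ("Yes", "Married", 120, "No"),
        ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
        ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
        ("No", "Married", 75, "No"),  ("No", "Single", 90, "Yes"),
    ]

    def coverage_and_accuracy(records, antecedent, consequent):
        # Coverage: fraction of all records that satisfy the antecedent.
        # Accuracy: fraction of covered records whose class equals the consequent.
        covered = [r for r in records if antecedent(r)]
        if not covered:
            return 0.0, 0.0
        correct = [r for r in covered if r[3] == consequent]
        return len(covered) / len(records), len(correct) / len(covered)

    cov, acc = coverage_and_accuracy(records, lambda r: r[1] == "Single", "No")
    print(f"Coverage = {cov:.0%}, Accuracy = {acc:.0%}")  # Coverage = 40%, Accuracy = 50%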

How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

A lemur triggers rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
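To make the covering relation concrete, here is a small Python sketch (written for these notes; the dictionary encoding of rules and records is an illustrative assumption) that checks which of R1-R5 cover each of the three test records above:

    rules = [
        ("R1", {"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),
        ("R2", {"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),
        ("R3", {"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),
        ("R4", {"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),
        ("R5", {"Live in Water": "sometimes"},                 "Amphibians"),
    ]

    records = {
        "lemur":         {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "no"},
        "turtle":        {"Blood Type": "cold", "Give Birth": "no",  "Can Fly": "no", "Live in Water": "sometimes"},
        "dogfish shark": {"Blood Type": "cold", "Give Birth": "yes", "Can Fly": "no", "Live in Water": "yes"},
    }

    def covers(condition, record):
        # A rule covers a record if every attribute test in its condition is satisfied.
        return all(record.get(attr) == value for attr, value in condition.items())

    for name, record in records.items():
        triggered = [rid for rid, cond, _ in rules if covers(cond, record)]
        print(name, "->", triggered or "no rule fires")
    # lemur -> ['R3'], turtle -> ['R4', 'R5'], dogfish shark -> no rule fires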

Characteristics of Rule-Based Classifier

- Mutually exclusive rules
  - Every record is covered by at most one rule
- Exhaustive rules
  - The classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  - Each record is covered by at least one rule

From Decision Trees To Rules

Decision tree:

    Refund?
      Yes → NO
      No  → Marital Status?
              {Single, Divorced} → Taxable Income?
                                     < 80K → NO
                                     > 80K → YES
              {Married}          → NO

Classification Rules:

(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

The rules are mutually exclusive and exhaustive. The rule set contains as much information as the tree.
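The conversion from a tree to rules is mechanical: each root-to-leaf path becomes one rule. A short Python sketch of that traversal follows (the nested-tuple tree encoding is an assumption made for illustration, not the book's data structure):

    tree = ("Refund", {
        "Yes": "No",                                   # leaf: class No
        "No": ("Marital Status", {
            "{Single, Divorced}": ("Taxable Income", {
                "<80K": "No",
                ">80K": "Yes",
            }),
            "{Married}": "No",
        }),
    })

    def tree_to_rules(node, conditions=()):
        # Every root-to-leaf path becomes one rule: (list of attribute tests, class label).
        if isinstance(node, str):                      # a leaf holds the class label
            return [(list(conditions), node)]
        attribute, branches = node
        rules = []
        for value, child in branches.items():
            rules += tree_to_rules(child, conditions + ((attribute, value),))
        return rules

    for conds, label in tree_to_rules(tree):
        lhs = ", ".join(f"{attr}={val}" for attr, val in conds)
        print(f"({lhs}) ==> {label}")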

Rules Can Be Simplified

Decision tree:

    Refund?
      Yes → NO
      No  → Marital Status?
              {Single, Divorced} → Taxable Income?
                                     < 80K → NO
                                     > 80K → YES
              {Married}          → NO

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Initial Rule:    (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No

Effect of Rule Simplification

- Rules are no longer mutually exclusive
  - A record may trigger more than one rule
  - Solution?
    - Ordered rule set
    - Unordered rule set: use voting schemes
- Rules are no longer exhaustive
  - A record may not trigger any rules
  - Solution?
    - Use a default class

Ordered Rule Set

- Rules are rank ordered according to their priority
  - An ordered rule set is known as a decision list
- When a test record is presented to the classifier
  - It is assigned to the class label of the highest-ranked rule it has triggered
  - If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?

Rule Ordering Schemes

- Rule-based ordering
  - Individual rules are ranked based on their quality
- Class-based ordering
  - Rules that belong to the same class appear together
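As a concrete illustration of the ordered rule set above, here is a small Python sketch (written for these notes; the rule encoding and the choice of default class are assumptions for illustration) that classifies the turtle with the decision list R1-R5:

    rules = [
        ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),       # R1
        ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),      # R2
        ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),     # R3
        ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),    # R4
        ({"Live in Water": "sometimes"},                 "Amphibians"), # R5
    ]

    def classify(record, rules, default="Mammals"):   # default class chosen arbitrarily here
        # Return the class of the first (highest-ranked) rule that fires,
        # or the default class if no rule fires.
        for condition, label in rules:
            if all(record.get(a) == v for a, v in condition.items()):
                return label
        return default

    turtle = {"Blood Type": "cold", "Give Birth": "no",
              "Can Fly": "no", "Live in Water": "sometimes"}
    print(classify(turtle, rules))  # R4 is the highest-ranked rule that fires: Reptiles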

Building Classification Rules

- Direct Method:
  - Extract rules directly from data
  - e.g.: RIPPER, CN2, Holte's 1R
- Indirect Method:
  - Extract rules from other classification models (e.g. decision trees, neural networks, etc.)
  - e.g.: C4.5rules

Direct Method: Sequential Covering

1. Start from an empty rule list
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
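The control flow of sequential covering is simple enough to sketch directly. In the following Python outline (written for these notes), learn_one_rule is a placeholder for the rule-growing procedure and rules are assumed to expose a covers(record) test; both are assumptions made for illustration, not the algorithm's actual interface:

    def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
        # Grow rules one at a time, removing the training records each new rule covers.
        rule_list = []
        remaining = list(records)
        while any(r["Class"] == target_class for r in remaining):
            rule = learn_one_rule(remaining, target_class)             # Step 2: grow a rule
            covered = [r for r in remaining if rule.covers(r)]
            if len(covered) < min_coverage:                            # Step 4: stopping criterion
                break
            rule_list.append(rule)
            remaining = [r for r in remaining if not rule.covers(r)]   # Step 3: remove covered records
        return rule_list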

Example of Sequential Covering

[Figure: (ii) Step 1: rule R1 is grown; (iii) Step 2: the records covered by R1 are removed and rule R2 is grown; (iv) Step 3: the process continues on the remaining records.]

Aspects of Sequential Covering

- Rule Growing
- Instance Elimination
- Rule Evaluation
- Stopping Criterion
- Rule Pruning

Rule Growing

- Two common strategies: general-to-specific (start from an empty rule and add conjuncts) and specific-to-general (start from a very specific rule and remove conjuncts).

[Figure: illustration of the two rule-growing strategies.]

Rule Growing (Examples)

- CN2 Algorithm:
  - Start from an empty conjunct: {}
  - Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
  - Determine the rule consequent by taking the majority class of instances covered by the rule

- RIPPER Algorithm:
  - Start from an empty rule: {} => class
  - Add the conjunct that maximizes FOIL's information gain measure:
    - R0: {} => class (initial rule)
    - R1: {A} => class (rule after adding a conjunct)
    - Gain(R0, R1) = p1 [ log2(p1/(p1+n1)) − log2(p0/(p0+n0)) ]
    - where
      p0: number of positive instances covered by R0
      n0: number of negative instances covered by R0
      p1: number of positive instances covered by R1
      n1: number of negative instances covered by R1

Rule growing: Ripper

Suppose R0 covers 60 positive cases and 40 negative. We compare two possible refinements of R0:
1. R1 covers 50 positive and 10 negative cases.
2. R2 covers 30 positive and 5 negative cases.

Their information gains are:
Gain(R0, R1) = 50 [ log2(50/60) − log2(60/100) ] ≈ 23.7
Gain(R0, R2) = 30 [ log2(30/35) − log2(60/100) ] ≈ 15.4

Hence we prefer refinement R1 even though it has lower accuracy (50/60 ≈ 0.83 versus 30/35 ≈ 0.86 for R2).
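The gains above can be reproduced with a few lines of Python (a sketch written for these notes, not code from the book):

    from math import log2

    def foil_gain(p0, n0, p1, n1):
        # FOIL's information gain: p1 * [log2(p1/(p1+n1)) - log2(p0/(p0+n0))]
        return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    p0, n0 = 60, 40                               # R0 covers 60 positive and 40 negative cases
    print(round(foil_gain(p0, n0, 50, 10), 1))    # R1: 23.7
    print(round(foil_gain(p0, n0, 30, 5), 1))     # R2: 15.4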

Instance Elimination

- Why do we need to eliminate instances?
  - Otherwise, the next rule is identical to the previous rule
- Why do we remove positive instances?
  - To ensure that the next rule is different
- Why do we remove negative instances?
  - To prevent underestimating the accuracy of a rule
  - Compare rules R2 and R3 in the diagram

If we don't remove the cases covered by R1, then R3 covers 8 positive cases and 4 negative cases. Otherwise it covers 6 positive and 2 negative cases. For R2 it doesn't make any difference: it covers 7 positive and 3 negative cases.

So if we don't remove the covered cases, the accuracy of R3 is underestimated (8/12 instead of 6/8).

Rule Evaluation

R: condition ⇒ class = c

- Metrics:
  - Accuracy   = nc / n
  - Laplace    = (nc + 1) / (n + k)
  - M-estimate = (nc + k p) / (n + k)

  where
    n  : number of instances covered by the rule
    nc : number of instances of class c covered by the rule
    k  : number of classes
    p  : prior probability of class c

Rule evaluation: example

Suppose we have a 2-class problem with 60 positive and 100 negative examples. Consider:
1. R1 covers 50 positive examples and 5 negative.
2. R2 covers 2 positive examples and no negative.

Accuracy of R1 is 50/55 = 0.91. Accuracy of R2 is 2/2 = 1.

With the Laplace correction this becomes:
R1: (50+1)/(55+2) = 0.89
R2: (2+1)/(2+2) = 0.75

With the m-estimate (p = 60/160 = 3/8):
R1: (50+2×3/8)/(55+2) = 0.89
R2: (2+2×3/8)/(2+2) = 0.69
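These calculations are easy to script. A small sketch written for these notes that reproduces the numbers above, with k = 2 classes and prior p = 60/160 for the positive class:

    def accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k=2):
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k=2, p=60 / 160):
        return (nc + k * p) / (n + k)

    for name, nc, n in [("R1", 50, 55), ("R2", 2, 2)]:
        print(name,
              round(accuracy(nc, n), 2),
              round(laplace(nc, n), 2),
              round(m_estimate(nc, n), 2))
    # R1 0.91 0.89 0.89
    # R2 1.0  0.75 0.69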

Stopping Criterion and Rule Pruning

- Stopping criterion
  - Compute the gain
  - If the gain is not significant, stop growing the rule
- Rule pruning
  - Similar to post-pruning of decision trees
  - Reduced Error Pruning:
    - Remove one of the conjuncts in the rule
    - Compare the error rate on the validation set before and after pruning
    - If the error improves, prune the conjunct

Summary of Direct Method

- Grow a single rule
- Prune the rule (if necessary)
- Remove the instances covered by the rule
- Add the rule to the current rule set
- Repeat

Direct Method: RIPPER

- For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
  - Learn rules for the positive class
  - The negative class will be the default class
- For a multi-class problem
  - Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
  - Learn the rule set for the smallest class first, treating the rest as the negative class
  - Repeat with the next smallest class as the positive class

Direct Method: RIPPER

- Growing a rule:
  - Start from an empty rule body
  - Add conjuncts as long as they improve FOIL's information gain
  - Stop when the rule no longer covers negative examples
  - Prune the rule immediately using incremental reduced error pruning
  - Measure for pruning: v = (p − n) / (p + n)
    - p: number of positive examples covered by the rule in the validation set
    - n: number of negative examples covered by the rule in the validation set
  - Pruning method: delete any final sequence of conditions that maximizes v

Pruning in Ripper: Example (cf. exc. 2)

Suppose we have the rule R: A and B ⇒ class=c, and a validation set (not used for growing the rule) that contains 500 positive examples (i.e. with class=c) and 500 negative examples. Among all cases in the validation set that satisfy condition A, there are 200 positive examples and 50 negative examples. Among all cases that satisfy both condition A and condition B, there are 100 positive examples and 5 negative examples. Should condition B be pruned?

LHS of Rule   v = (p − n) / (p + n)
{}            (500 − 500) / 1000 = 0
{A}           (200 − 50) / 250 = 0.6
{A,B}         (100 − 5) / 105 = 0.9

Since {A,B} maximizes v, the rule is not pruned.
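A short Python sketch (illustration only, written for these notes) of that pruning check, evaluating v for each candidate prefix of the rule on the validation counts given above:

    def v_metric(p, n):
        # RIPPER's pruning measure on the validation set: (p - n) / (p + n)
        return (p - n) / (p + n)

    candidates = {
        "{}":    (500, 500),   # empty LHS covers every validation case
        "{A}":   (200, 50),    # cases satisfying A
        "{A,B}": (100, 5),     # cases satisfying A and B
    }

    scores = {lhs: v_metric(p, n) for lhs, (p, n) in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores)          # {'{}': 0.0, '{A}': 0.6, '{A,B}': ~0.905}
    print("Keep:", best)   # {A,B} maximizes v, so condition B is not pruned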

Direct Method: RIPPER

- Building a rule set:
  - Use the sequential covering algorithm
    - Find the best rule that covers the current set of positive examples
    - Eliminate both positive and negative examples covered by the rule
  - Each time a rule is added to the rule set, compute the new description length
    - Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far (the description length depends on the accuracy and complexity of the rule)

Indirect Methods

Indirect Method: C4.5rules

- Extract rules from an unpruned decision tree
- For each rule, R: A → class=c,
  - consider an alternative rule R-: A- → class=c, where A- is obtained by removing one of the conditions in A
  - compare the pessimistic error rate for R against all the rules R-
  - prune if one of the R- has a lower pessimistic error rate
  - repeat until we can no longer improve the generalization error

Rule Pruning in C4.5rules

R:  if A  then class=c
R-: if A- then class=c

where A- has one condition less than A. Let X be the condition considered for removal from A. Make a cross-table:

                 Class=c   Class≠c
  A- and X       Y1        E1
  A- and Not X   Y2        E2

R covers Y1 + E1 cases and makes E1 errors. Its pessimistic error estimate is UCF(E1, Y1 + E1).

Rule Pruning in C4.5rules

Consider the rule

  if   TSH > 6
       FTI ≤ 64
       TSH measured = true
       T4U measured = true
       thyroid surgery = true
  then class negative

Suppose this rule covers 3 training cases, 2 of which have class negative. The pessimistic error estimate, using CF = 0.25, is U25%(1,3) = 68%, because P(E ≤ 1) is approximately 0.25 under a binomial distribution with N = 3 and a probability of error of 0.68 on each draw.

Rule Pruning in C4.5rules

Condition Deleted    Y1+Y2   E1+E2   Pessimistic Error Rate
TSH > 6              3       1       55%
FTI ≤ 64             6       1       34%
TSH measured=t       2       1       68%
T4U measured=t       2       1       68%
thyroid surgery=t    3       59      97%

The lowest pessimistic error is obtained by deleting FTI ≤ 64.
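The UCF(E, N) figure can be reproduced numerically: it is the error probability at which observing at most E errors in N cases has probability CF under a binomial model. Below is a small sketch written for these notes (not C4.5's actual code) that solves for that probability by bisection; it gives about 0.67 for U25%(1,3), in line with the roughly 68% quoted above, and about 0.34 for the "FTI ≤ 64" row of the table (7 cases, 1 error):

    from math import comb

    def binom_cdf(e, n, q):
        # P(X <= e) for X ~ Binomial(n, q)
        return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(e + 1))

    def pessimistic_error(e, n, cf=0.25, tol=1e-6):
        # Find q such that P(X <= e) = cf; the CDF decreases as q grows, so bisect.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binom_cdf(e, n, mid) > cf:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    print(round(pessimistic_error(1, 3), 2))   # ~0.67, close to the 68% quoted above
    print(round(pessimistic_error(1, 7), 2))   # ~0.34, as in the FTI <= 64 row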

Indirect Method: C4.5rules

- Instead of ordering the rules, order subsets of rules (class ordering)
  - Each subset is a collection of rules with the same rule consequent (class)
  - Compute the description length of each subset
    - Description length = L(error) + g L(model)
    - g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)
  - Order the subsets according to increasing description length

Example

Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds

C4.5 versus C4.5rules versus RIPPER

Decision tree (C4.5):

    Give Birth?
      Yes → Mammals
      No  → Live In Water?
              Yes       → Fishes
              Sometimes → Amphibians
              No        → Can Fly?
                            Yes → Birds
                            No  → Reptiles

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
() → Mammals

C4.5 versus C4.5rules versus RIPPER

C4.5 and C4.5rules:

                          PREDICTED CLASS
ACTUAL CLASS   Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians     2           0       0         0      0
Fishes         0           2       0         0      1
Reptiles       1           0       3         0      0
Birds          1           0       0         3      0
Mammals        0           0       1         0      6

RIPPER:

                          PREDICTED CLASS
ACTUAL CLASS   Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians     0           0       0         0      2
Fishes         0           3       0         0      0
Reptiles       0           0       3         0      1
Birds          0           0       1         2      1
Mammals        0           2       1         0      4
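For a quick comparison, the overall accuracies implied by the two confusion matrices above can be computed directly (a small check added for these notes, not part of the slides):

    c45rules = [
        [2, 0, 0, 0, 0],
        [0, 2, 0, 0, 1],
        [1, 0, 3, 0, 0],
        [1, 0, 0, 3, 0],
        [0, 0, 1, 0, 6],
    ]
    ripper = [
        [0, 0, 0, 0, 2],
        [0, 3, 0, 0, 0],
        [0, 0, 3, 0, 1],
        [0, 0, 1, 2, 1],
        [0, 2, 1, 0, 4],
    ]

    def overall_accuracy(matrix):
        # Sum of the diagonal (correct predictions) over the total number of records.
        correct = sum(matrix[i][i] for i in range(len(matrix)))
        total = sum(sum(row) for row in matrix)
        return correct / total

    print(overall_accuracy(c45rules))   # 16/20 = 0.80
    print(overall_accuracy(ripper))     # 12/20 = 0.60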

Advantages of Rule-Based Classifiers

- As highly expressive as decision trees
- Easy to interpret
- Easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees
