Data Mining Classification: Alternative Techniques
Lecture Notes for Chapter 5, Introduction to Data Mining, by Tan, Steinbach, Kumar

Rule-Based Classifier

- Classify records by using a collection of "if…then…" rules
- Rule: (Condition) → y
  - where
    - Condition is a conjunction of attribute tests
    - y is the class label
  - LHS: rule antecedent or condition
  - RHS: rule consequent
  - Examples of classification rules:
    (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
    (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

Rule-based Classifier (Example)

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Application of Rule-Based Classifier

- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

Given the rule set R1–R5 above:

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

- The rule R1 covers the hawk => Bird
- The rule R3 covers the grizzly bear => Mammal
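A minimal Python sketch (not from the textbook) of how such a rule set can be represented and applied; the dictionaries, attribute names, and helper names below are illustrative choices that mirror the vertebrate example:

# Minimal sketch: rules as (condition, class) pairs, where a condition is a
# dict of attribute -> required value.
RULES = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),       # R1
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),      # R2
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),     # R3
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),    # R4
    ({"Live in Water": "sometimes"},                "Amphibians"),  # R5
]

def covers(condition, record):
    """A rule covers a record if every attribute test in its condition holds."""
    return all(record.get(attr) == value for attr, value in condition.items())

def matching_rules(record):
    """Return the classes of all rules whose condition covers the record."""
    return [label for condition, label in RULES if covers(condition, record)]

hawk    = {"Blood Type": "warm", "Give Birth": "no",  "Can Fly": "yes", "Live in Water": "no"}
grizzly = {"Blood Type": "warm", "Give Birth": "yes", "Can Fly": "no",  "Live in Water": "no"}

print(matching_rules(hawk))     # ['Birds']    -> R1 covers the hawk
print(matching_rules(grizzly))  # ['Mammals']  -> R3 covers the grizzly bear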

Rule Coverage and Accuracy

- Coverage of a rule:
  - Fraction of all records that satisfy the antecedent of the rule
- Accuracy of a rule:
  - Fraction of the records covered by the rule (i.e., those satisfying the antecedent) that also satisfy the consequent

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

- Example: (Status = Single) → No
  - Coverage = 40%, Accuracy = 50% (see the sketch below)

How does a Rule-based Classifier Work?

Given the rule set R1–R5 above:

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

- A lemur triggers rule R3, so it is classified as a mammal
- A turtle triggers both R4 and R5
- A dogfish shark triggers none of the rules
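A short sketch, assuming the ten-record table above, that reproduces the coverage and accuracy of the rule (Status = Single) → No; the tuple layout is just for illustration:

# (Tid, Refund, Marital Status, Taxable Income, Class) -- the table above.
records = [
    (1, "Yes", "Single",   125, "No"),  (2, "No", "Married", 100, "No"),
    (3, "No",  "Single",    70, "No"),  (4, "Yes", "Married", 120, "No"),
    (5, "No",  "Divorced",  95, "Yes"), (6, "No", "Married",  60, "No"),
    (7, "Yes", "Divorced", 220, "No"),  (8, "No", "Single",    85, "Yes"),
    (9, "No",  "Married",   75, "No"),  (10, "No", "Single",   90, "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """Coverage: fraction of records satisfying the antecedent.
    Accuracy: fraction of covered records whose class matches the consequent."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    accuracy = sum(r[4] == consequent for r in covered) / len(covered)
    return coverage, accuracy

# Rule: (Status = Single) -> No
cov, acc = coverage_and_accuracy(records, lambda r: r[2] == "Single", "No")
print(cov, acc)  # 0.4 0.5  -> coverage 40%, accuracy 50%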

Characteristics of Rule-Based Classifier

- Mutually exclusive rules
  - The classifier contains mutually exclusive rules if the rules are independent of each other
  - Every record is covered by at most one rule
- Exhaustive rules
  - The classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  - Each record is covered by at least one rule

From Decision Trees To Rules

Decision tree:

Refund?
  Yes → NO
  No  → Marital Status?
          {Single, Divorced} → Taxable Income?
                                 < 80K → NO
                                 > 80K → YES
          {Married}          → NO

Classification Rules:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No

- The rules are mutually exclusive and exhaustive
- The rule set contains as much information as the tree

Rules Can Be Simplified

Using the same decision tree and the following training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

- Initial Rule: (Refund = No) ∧ (Status = Married) → No
- Simplified Rule: (Status = Married) → No

Effect of Rule Simplification

- Rules are no longer mutually exclusive
  - A record may trigger more than one rule
  - Solution?
    - Ordered rule set
    - Unordered rule set, with a voting scheme
- Rules are no longer exhaustive
  - A record may not trigger any rules
  - Solution?
    - Use a default class

Ordered Rule Set

- Rules are rank ordered according to their priority
  - An ordered rule set is known as a decision list
- When a test record is presented to the classifier
  - It is assigned to the class label of the highest ranked rule it has triggered
  - If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?

- The turtle triggers both R4 and R5; under this ordering R4 is ranked higher, so the turtle is classified as a reptile (see the decision-list sketch after the next slide)

Rule Ordering Schemes

- Rule-based ordering
  - Individual rules are ranked based on their quality
- Class-based ordering
  - Rules that belong to the same class appear together
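A small sketch, not from the textbook, of applying an ordered rule set with a default class; it repeats the RULES and covers() helpers from the earlier sketch so it runs on its own, and the default label "Unknown" is an illustrative choice:

# Decision list: scan rules in priority order, return the first match,
# fall back to a default class if nothing fires.
RULES = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),       # R1
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),      # R2
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),     # R3
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),    # R4
    ({"Live in Water": "sometimes"},                "Amphibians"),  # R5
]

def covers(condition, record):
    return all(record.get(a) == v for a, v in condition.items())

def classify(record, rules=RULES, default="Unknown"):
    for condition, label in rules:            # highest-ranked rule wins
        if covers(condition, record):
            return label
    return default                            # no rule fired -> default class

turtle  = {"Blood Type": "cold", "Give Birth": "no",
           "Can Fly": "no", "Live in Water": "sometimes"}
dogfish = {"Blood Type": "cold", "Give Birth": "yes",
           "Can Fly": "no", "Live in Water": "yes"}
print(classify(turtle))   # Reptiles  (R4 outranks R5 in this ordering)
print(classify(dogfish))  # Unknown   (no rule fires, default class is used)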

Building Classification Rules

- Direct Method:
  - Extract rules directly from data
  - e.g.: RIPPER, CN2, Holte's 1R
- Indirect Method:
  - Extract rules from other classification models (e.g. decision trees, neural networks, etc.)
  - e.g.: C4.5rules

Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met (a sketch of this loop follows below)
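A schematic sketch of the sequential covering loop above, under stated assumptions: the toy learn_one_rule here only picks the single attribute test with the highest accuracy for the target class, whereas real systems such as RIPPER or CN2 grow conjunctions greedily; all data and thresholds are illustrative.

def covers(condition, record):
    return all(record.get(a) == v for a, v in condition.items())

def learn_one_rule(records, target_class, min_accuracy=0.6):
    """Toy Learn-One-Rule: best single attribute=value test by accuracy."""
    best, best_acc = None, min_accuracy
    tests = {(a, r[a]) for r in records for a in r if a != "Class"}
    for attr, value in tests:
        covered = [r for r in records if r.get(attr) == value]
        if not covered:
            continue
        acc = sum(r["Class"] == target_class for r in covered) / len(covered)
        if acc > best_acc:
            best, best_acc = ({attr: value}, target_class), acc
    return best

def sequential_covering(records, target_class, max_rules=10):
    rules, remaining = [], list(records)
    while remaining and len(rules) < max_rules:
        rule = learn_one_rule(remaining, target_class)   # grow one rule
        if rule is None:                                 # stopping criterion met
            break
        rules.append(rule)
        # Remove the training records covered by the new rule.
        remaining = [r for r in remaining if not covers(rule[0], r)]
    return rules

data = [
    {"Give Birth": "yes", "Can Fly": "no",  "Class": "Mammals"},
    {"Give Birth": "yes", "Can Fly": "yes", "Class": "Mammals"},
    {"Give Birth": "no",  "Can Fly": "yes", "Class": "Birds"},
    {"Give Birth": "no",  "Can Fly": "no",  "Class": "Reptiles"},
]
print(sequential_covering(data, "Mammals"))  # [({'Give Birth': 'yes'}, 'Mammals')]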

Example of Sequential Covering

(Figures in the original slides: (i) the original data, (ii) Step 1: rule R1 is grown, (iii) Step 2: the records covered by R1 are removed and rule R2 is grown, (iv) Step 3: the process continues on the remaining records.)

Aspects of Sequential Covering

- Rule Growing
- Instance Elimination
- Rule Evaluation
- Stopping Criterion
- Rule Pruning

Rule Growing

- Two common strategies (shown as a figure in the original slides)

Rule Growing (Examples)

- CN2 Algorithm:
  - Start from an empty conjunct: {}
  - Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
  - Determine the rule consequent by taking the majority class of the instances covered by the rule
- RIPPER Algorithm:
  - Start from an empty rule: {} => class
  - Add conjuncts that maximize FOIL's information gain measure:
    - R0: {} => class (initial rule)
    - R1: {A} => class (rule after adding a conjunct)
    - Gain(R0, R1) = t × [ log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) ]
    - where
      t: number of positive instances covered by both R0 and R1
      p0: number of positive instances covered by R0
      n0: number of negative instances covered by R0
      p1: number of positive instances covered by R1
      n1: number of negative instances covered by R1
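A small sketch of FOIL's information gain as defined above; log base 2 is assumed here, and t is taken to equal p1 because R1 is obtained from R0 by adding a conjunct, so every record covered by R1 is also covered by R0. The counts in the example call are made up.

import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule R0 to R1.
    p0, n0: positive/negative training records covered by R0.
    p1, n1: positive/negative training records covered by R1."""
    t = p1   # positives covered by both rules (R1 specializes R0)
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Adding a conjunct that keeps many positives but sheds most negatives is rewarded:
print(round(foil_gain(p0=100, n0=400, p1=30, n1=10), 2))  # about 57.21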

Instance Elimination

- Why do we need to eliminate instances?
  - Otherwise, the next rule is identical to the previous rule
- Why do we remove positive instances?
  - To ensure that the next rule is different
- Why do we remove negative instances?
  - To prevent underestimating the accuracy of the rule
  - Compare rules R2 and R3 in the diagram (figure in the original slides)

Rule Evaluation

- Metrics (see the sketch after these slides):
  - Accuracy   = nc / n
  - Laplace    = (nc + 1) / (n + k)
  - M-estimate = (nc + k·p) / (n + k)
- where
  - n: number of instances covered by the rule
  - nc: number of covered instances that belong to the class predicted by the rule
  - k: number of classes
  - p: prior probability of the class

Stopping Criterion and Rule Pruning

- Stopping criterion
  - Compute the gain
  - If the gain is not significant, discard the new rule
- Rule pruning
  - Similar to post-pruning of decision trees
  - Reduced Error Pruning:
    - Remove one of the conjuncts in the rule
    - Compare the error rate on a validation set before and after pruning
    - If the error improves, prune the conjunct
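The three rule-evaluation metrics above as a small sketch; n, nc, k, and p follow the definitions in the Rule Evaluation slide, and the example numbers are made up:

def rule_accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# A rule covering n=10 records, nc=8 of which belong to its predicted class,
# in a k=2 class problem with prior p=0.5 for that class:
print(rule_accuracy(8, 10))       # 0.8
print(laplace(8, 10, 2))          # 0.75
print(m_estimate(8, 10, 2, 0.5))  # 0.75  (with p = 1/k it equals the Laplace estimate)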

Advantages of Rule-Based Classifiers

- As highly expressive as decision trees
- Easy to interpret
- Easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees

Instance-Based Classifiers

(Figure in the original slides: a set of stored cases with attributes Atr1 … AtrN and class labels A, B, C, and an unseen case whose attributes are compared against them.)

- Store the training records
- Use the training records to predict the class label of unseen cases

Instance Based Classifiers

- Examples:
  - Rote-learner
    - Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
  - Nearest neighbor
    - Uses the k "closest" points (nearest neighbors) for performing classification

Nearest Neighbor Classifiers

- Basic idea:
  - If it walks like a duck, quacks like a duck, then it's probably a duck
- (Figure: compute the distance from the test record to the training records, then choose k of the "nearest" records.)

Nearest-Neighbor Classifiers

- Requires three things
  - The set of stored records
  - A distance metric to compute the distance between records
  - The value of k, the number of nearest neighbors to retrieve
- To classify an unknown record:
  - Compute the distance to the other training records
  - Identify the k nearest neighbors
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor

(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x.)

- The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

Nearest Neighbor Classification

- Compute the distance between two points:
  - Euclidean distance: d(p, q) = sqrt( Σi (pi − qi)² )
- Determine the class from the nearest-neighbor list (see the sketch below)
  - Take the majority vote of the class labels among the k nearest neighbors
  - Optionally, weigh each vote according to distance
    - weight factor, w = 1 / d²

Nearest Neighbor Classification…

- Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large, the neighborhood may include points from other classes
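A compact k-NN sketch combining the pieces above: Euclidean distance, majority vote, and the optional 1/d² distance weighting. The toy training points are illustrative only.

import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x, k=3, weighted=False):
    """train: list of (point, label). Majority vote over the k nearest points,
    optionally weighting each vote by 1/d^2."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = Counter()
    for point, label in neighbors:
        d = euclidean(point, x)
        votes[label] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
         ((3.0, 3.0), "-"), ((3.2, 2.9), "-")]
print(knn_classify(train, (1.1, 1.0), k=3))                 # +
print(knn_classify(train, (2.6, 2.5), k=3, weighted=True))  # -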

Nearest Neighbor Classification…

- Scaling issues
  - Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  - Example:
    - height of a person may vary from 1.5 m to 1.8 m
    - weight of a person may vary from 90 lb to 300 lb
    - income of a person may vary from $10K to $1M

Nearest Neighbor Classification…

- k-NN classifiers are lazy learners
  - They do not build models explicitly
  - Unlike eager learners such as decision tree induction and rule-based systems
  - Classifying unknown records is relatively expensive
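A small illustration of why scaling matters here; min-max scaling is one common fix, and the specific height and income values below are hypothetical, only the attribute ranges mirror the example above.

# Without scaling, the income axis dominates the Euclidean distance; min-max
# scaling puts every attribute on a [0, 1] range. Values are illustrative.
def min_max_scale(value, lo, hi):
    return (value - lo) / (hi - lo)

ranges = {"height": (1.5, 1.8), "income": (10_000, 1_000_000)}
a = {"height": 1.55, "income": 400_000}
b = {"height": 1.78, "income": 410_000}

raw_diffs    = {k: abs(a[k] - b[k]) for k in a}
scaled_diffs = {k: abs(min_max_scale(a[k], *ranges[k]) -
                       min_max_scale(b[k], *ranges[k])) for k in a}
print(raw_diffs)     # the income difference (10000) swamps the height difference (0.23)
print(scaled_diffs)  # height ~0.77 vs income ~0.01 after scaling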

Bayes Classifier

- A probabilistic framework for solving classification problems
- Conditional probability:
  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)
- Bayes theorem:
  P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem

- Given:
  - A doctor knows that meningitis causes stiff neck 50% of the time
  - The prior probability of any patient having meningitis is 1/50,000
  - The prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what is the probability he/she has meningitis?
  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

Bayesian Classifiers

- Consider each attribute and class label as random variables
- Given a record with attributes (A1, A2, …, An)
  - The goal is to predict class C
  - Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
- Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers

- Approach:
  - Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
    P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)
  - Choose the value of C that maximizes P(C | A1, A2, …, An)
  - Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
- How to estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier

- Assume independence among the attributes Ai when the class is given:
  P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  - We can then estimate P(Ai | Cj) for all Ai and Cj
  - A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal

How to Estimate Probabilities from Data?

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

- Class prior: P(C) = Nc / N
  - e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  - where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  - Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0 (see the sketch below)
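A minimal sketch, assuming the ten-record table above, that estimates the class priors and the discrete conditional probabilities by relative frequency; it reproduces P(No) = 7/10, P(Status=Married | No) = 4/7, and P(Refund=Yes | Yes) = 0.

from collections import Counter

# (Refund, Marital Status, Class). Taxable Income is continuous and is
# handled separately (see the next slide).
records = [
    ("Yes", "Single", "No"),   ("No", "Married", "No"),   ("No", "Single", "No"),
    ("Yes", "Married", "No"),  ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),   ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

class_counts = Counter(cls for _, _, cls in records)
priors = {cls: n / len(records) for cls, n in class_counts.items()}

cond = Counter()                      # counts of (attribute, value, class)
for refund, status, cls in records:
    cond[("Refund", refund, cls)] += 1
    cond[("Status", status, cls)] += 1

def p(attr, value, cls):
    """Estimate P(attribute = value | class) by relative frequency."""
    return cond[(attr, value, cls)] / class_counts[cls]

print(priors)                         # {'No': 0.7, 'Yes': 0.3}
print(p("Status", "Married", "No"))   # 4/7, about 0.571
print(p("Refund", "Yes", "Yes"))      # 0.0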

How to Estimate Probabilities from Data?

- For continuous attributes:
  - Discretize the range into bins
    - one ordinal attribute per bin
    - violates the independence assumption
  - Two-way split: (A < v) or (A > v)
    - choose only one of the two splits as the new attribute
  - Probability density estimation (see the sketch below):
    - Assume the attribute follows a normal distribution
    - Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    - Once the probability distribution is known, use it to estimate the conditional probability P(Ai | c)

Naïve Bayes (Summary)

- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
- The independence assumption may not hold for some attributes
  - Use other techniques such as Bayesian Belief Networks (BBN)
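A sketch of the normal-density option for a continuous attribute, using the Taxable Income values for class No from the table above; the sample mean and variance come out to 110 and 2975 (in thousands), and the density at 120K is about 0.0072.

import math

def gaussian_density(x, mean, std):
    """Normal density used to approximate P(Ai = x | class)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Taxable Income values for class = No (in thousands).
income_no = [125, 100, 70, 120, 60, 220, 75]
n = len(income_no)
mean = sum(income_no) / n
var = sum((x - mean) ** 2 for x in income_no) / (n - 1)   # sample variance
std = math.sqrt(var)

print(mean, var)                                    # 110.0 2975.0
print(round(gaussian_density(120, mean, std), 4))   # about 0.0072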

Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0), where I(z) = 1 if z is true and 0 otherwise

- Output Y is 1 if at least two of the three inputs are equal to 1
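A tiny sketch that evaluates the stated weights and threshold on every row, reproducing the Y column of the truth table above:

def perceptron_output(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    """Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)."""
    return int(w[0] * x1 + w[1] * x2 + w[2] * x3 - t > 0)

rows = [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]
for x1, x2, x3 in rows:
    print(x1, x2, x3, perceptron_output(x1, x2, x3))
# Y is 1 exactly when at least two of the three inputs are 1.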

Artificial Neural Networks (ANN)

- The model is an assembly of inter-connected nodes and weighted links
- The output node sums up each of its input values according to the weights of its links
- The sum at the output node is compared against some threshold t
- Perceptron model:
  Y = I(Σi wi Xi − t)   or   Y = sign(Σi wi Xi − t)

General Structure of ANN

(Figure in the original slides.)

- Training an ANN means learning the weights of the neurons

Algorithm for learning ANN

- Initialize the weights (w0, w1, …, wk)
- Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
  - Objective function: E = Σi [Yi − f(wi, Xi)]²
  - Find the weights wi that minimize the above objective function
  - e.g., the backpropagation algorithm (see lecture notes)
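A minimal sketch of weight learning by gradient descent on the squared-error objective above, for a single sigmoid unit only; backpropagation generalizes this update to multi-layer networks. The learning rate, epoch count, and toy data (the "at least two of three" function from the truth table) are illustrative assumptions.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_single_unit(X, Y, lr=0.5, epochs=2000, seed=0):
    """Per-example gradient descent on E = sum_i (Y_i - f(w, X_i))^2
    for one sigmoid unit; w[0] is a bias weight."""
    random.seed(seed)
    w = [random.uniform(-0.1, 0.1) for _ in range(len(X[0]) + 1)]
    for _ in range(epochs):
        for x, y in zip(X, Y):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            out = sigmoid(z)
            delta = (y - out) * out * (1 - out)     # error gradient term (delta rule)
            w[0] += lr * delta
            for j, xj in enumerate(x):
                w[j + 1] += lr * delta * xj
    return w

X = [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]
Y = [0, 1, 1, 1, 0, 0, 1, 0]
w = train_single_unit(X, Y)
preds = [int(sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) > 0.5) for x in X]
print(preds)  # typically matches Y = [0, 1, 1, 1, 0, 0, 1, 0] once training has converged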

Support Vector Machines

- Find a linear hyperplane (decision boundary) that will separate the data
- (Figures in the original slides: one possible solution B1, another possible solution B2, and other possible solutions.)
- Which one is better, B1 or B2? How do you define "better"?

Support Vector Machines

- Find the hyperplane that maximizes the margin => B1 is better than B2
- The decision boundary is w · x + b = 0, with margin hyperplanes w · x + b = +1 and w · x + b = −1
- f(x) = 1 if w · x + b ≥ 1; f(x) = −1 if w · x + b ≤ −1
- Margin = 2 / ||w||  (the distance between the two margin hyperplanes)
- We want to maximize the margin
  - Which is equivalent to minimizing L(w) = ||w||² / 2
  - Subject to the following constraints:
    f(xi) = 1 if w · xi + b ≥ 1
    f(xi) = −1 if w · xi + b ≤ −1
- This is a constrained optimization problem
  - Numerical approaches exist to solve it (e.g., quadratic programming)

Support Vector Machines

- What if the problem is not linearly separable?
  - Introduce slack variables ξi
  - Need to minimize:
    L(w) = ||w||² / 2 + C Σ(i=1..N) ξi^k
  - Subject to:
    f(xi) = 1 if w · xi + b ≥ 1 − ξi
    f(xi) = −1 if w · xi + b ≤ −1 + ξi
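The slides solve this with quadratic programming; the sketch below is a different, commonly used route under stated assumptions. With k = 1, minimizing over the slack variables gives ξi = max(0, 1 − yi(w · xi + b)), so the soft-margin objective can be rewritten as a hinge loss and minimized by subgradient descent. The learning rate, epoch count, regularization handling, and toy data are illustrative.

def train_linear_svm(X, Y, C=1.0, lr=0.01, epochs=500):
    """Soft-margin linear SVM via subgradient descent on
    ||w||^2 / 2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
    Labels must be +1 / -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                      # inside the margin or misclassified
                w = [wj - lr * (wj - C * y * xj) for wj, xj in zip(w, x)]
                b = b + lr * C * y
            else:                               # only the regularizer pulls on w
                w = [wj - lr * wj for wj in w]
    return w, b

X = [(2.0, 2.0), (2.5, 1.5), (3.0, 2.5), (0.5, 0.5), (1.0, 0.2), (0.2, 1.0)]
Y = [+1, +1, +1, -1, -1, -1]
w, b = train_linear_svm(X, Y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X]
print(preds == Y)  # expected: True for this linearly separable toy set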

Nonlinear Support Vector Machines

- What if the decision boundary is not linear?
- Transform the data into a higher dimensional space

Ensemble Methods: General Idea

- Construct a set of classifiers from the training data
- Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

Why does it work?

- Suppose there are 25 base classifiers
  - Each classifier has an error rate ε = 0.35
  - Assume the classifiers are independent
  - The probability that the ensemble classifier makes a wrong prediction (i.e., that 13 or more of the 25 base classifiers are wrong) is:
    Σ(i=13..25) C(25, i) ε^i (1 − ε)^(25−i) = 0.06   (see the sketch below)
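A one-liner sketch that reproduces the 0.06 figure above (requires Python 3.8+ for math.comb):

from math import comb

eps = 0.35
# A majority vote over 25 independent base classifiers is wrong when 13 or
# more of them are wrong.
ensemble_error = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(ensemble_error, 2))  # 0.06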

Examples of Ensemble Methods

- How to generate an ensemble of classifiers?
  - Bagging
  - Boosting

Bagging

- Sampling with replacement:

Original Data        1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)    7   8   10  8   2   5   10  10  5   9
Bagging (Round 2)    1   4   9   1   2   3   2   7   3   2
Bagging (Round 3)    1   8   5   10  5   5   9   6   3   7

- Build a classifier on each bootstrap sample
- Each record has probability 1 − (1 − 1/n)^n of being selected for a given bootstrap sample (about 0.632 for large n)
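A small sketch of the sampling-with-replacement step; the seed and round count are arbitrary, and in a full bagging implementation a classifier would then be built on each sample:

import random

def bootstrap_rounds(n=10, rounds=3, seed=1):
    """Sampling with replacement: each round draws n record ids from 1..n."""
    rng = random.Random(seed)
    return [[rng.randint(1, n) for _ in range(n)] for _ in range(rounds)]

for r, sample in enumerate(bootstrap_rounds(), start=1):
    print(f"Bagging (Round {r}): {sample}")

# Fraction of distinct records expected in one bootstrap sample of size n:
n = 10
print(1 - (1 - 1 / n) ** n)   # ~0.651 for n=10; tends to 1 - 1/e ~ 0.632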

Boosting

- An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
  - Initially, all N records are assigned equal weights
  - Unlike bagging, the weights may change at the end of each boosting round
- Records that are wrongly classified will have their weights increased
- Records that are classified correctly will have their weights decreased

Original Data        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)   7   3   2   8   7   9   4   10  6   3
Boosting (Round 2)   5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)   4   4   8   10  4   5   4   6   3   4

- Example 4 is hard to classify
- Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds (a schematic reweighting sketch follows)
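A schematic sketch of the reweighting idea only: misclassified records gain weight, correctly classified records lose weight through renormalization, and the next round samples according to the new weights. The multiplicative factor below is illustrative; algorithms such as AdaBoost derive it from the round's error rate, which these slides do not cover.

import random

def update_weights(weights, misclassified, boost=2.0):
    """Multiply the weights of misclassified records by `boost`, then
    renormalize so the weights sum to 1."""
    new = [w * boost if i in misclassified else w for i, w in enumerate(weights)]
    total = sum(new)
    return [w / total for w in new]

def weighted_sample(weights, rng, size=None):
    """Draw the next round's training sample; heavier records are more likely."""
    size = size or len(weights)
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=size)

rng = random.Random(0)
weights = [1 / 10] * 10                                # all records start equal
weights = update_weights(weights, misclassified={3})   # record id 4 (index 3) was wrong
print([round(w, 3) for w in weights])                  # record 4 now carries ~0.182 of the weight
print(weighted_sample(weights, rng))                   # record 4 tends to appear more often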
