Classification Techniques (2)

Overview

Previous Lecture
• Classification Problem
• Classification based on Regression
• Distance-based Classification (KNN)

This Lecture
• Classification using Decision Trees
• Classification using Rules
• Quality of Classifiers

Classification Using Decision Trees

Given:
– D = {t1, …, tn} where ti = ⟨ti1, …, tih⟩
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}

A Decision (or Classification) Tree is a tree associated with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be applied to the attribute at its parent
– Each leaf node is labeled with a class, Cj

Decision Tree
• A partitioning-based technique
  – Divides the search space into rectangular regions
• Each tuple is placed into a class based on the region within which it falls
• Internal nodes are associated with an attribute, and arcs with values for that attribute
• DT approaches differ in how the tree is built
• Algorithms: Hunt's, ID3, C4.5, CART

Example of a Decision Tree

Training Data (Refund and Marital Status are categorical attributes, Taxable Income is continuous, Cheat is the class label):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree), with splitting attributes Refund, MarSt, TaxInc:
– Refund = Yes → NO
– Refund = No → test MarSt
  – MarSt = Married → NO
  – MarSt = Single or Divorced → test TaxInc
    – TaxInc < 80K → NO
    – TaxInc > 80K → YES

Another Example of Decision Tree

A different tree fits the same training data, this time splitting on MarSt first:
– MarSt = Married → NO
– MarSt = Single or Divorced → test Refund
  – Refund = Yes → NO
  – Refund = No → test TaxInc
    – TaxInc < 80K → NO
    – TaxInc > 80K → YES

There could be more than one tree that fits the same data!
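The first model above can be written directly as a small classification routine. The sketch below is illustrative only: the attribute names, the 80K threshold, and the branch structure come from the slide's example, while the function name and record encoding are my own.

```python
# Minimal sketch of the example decision tree from this slide.
# Records are encoded as dicts; income is kept in thousands.

def classify(record):
    """Return the predicted Cheat label for one record."""
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: test taxable income
    return "No" if record["TaxableIncome"] < 80 else "Yes"

sample_rows = [
    {"Tid": 1, "Refund": "Yes", "MaritalStatus": "Single",   "TaxableIncome": 125, "Cheat": "No"},
    {"Tid": 5, "Refund": "No",  "MaritalStatus": "Divorced", "TaxableIncome": 95,  "Cheat": "Yes"},
    {"Tid": 8, "Refund": "No",  "MaritalStatus": "Single",   "TaxableIncome": 85,  "Cheat": "Yes"},
]

for r in sample_rows:
    print(r["Tid"], classify(r), r["Cheat"])  # predictions match the labels for these rows
```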

Decision Tree Classification Task

(Figure: a Training Set is fed to a Tree Induction algorithm, the Induction / Learn Model step, which produces a Model, the decision tree. The Model is then applied to a Test Set whose class labels are unknown, the Apply Model / Deduction step.)

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Apply Model to Test Data

Start from the root of the tree. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Walking the decision tree with this record:
1. Root node, Refund? The record has Refund = No, so follow the No branch to the MarSt node.
2. MarSt node: the record has Marital Status = Married, so follow the Married branch.
3. The Married branch leads to the leaf NO, so assign Cheat to "No".
(The TaxInc node, with branches < 80K → NO and > 80K → YES, is reached only by Single or Divorced records.)

Decision Tree Classification Task

The overall task, shown again with this example: the Training Set (Tids 1 to 10) is used by the tree induction algorithm to learn the Model (the decision tree above), and the Model is then applied to the Test Set (Tids 11 to 15, Class = ?) to deduce their class labels.
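For a real data set, the Learn Model and Apply Model steps are usually delegated to a library. The sketch below uses scikit-learn's DecisionTreeClassifier on an encoded version of the toy training set purely to illustrate the fit/predict workflow; the numeric encoding of the categorical attributes and the variable names are my own assumptions, not part of the lecture.

```python
# Illustrative only: induction (fit) and deduction (predict) with scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Training set (Tids 1-10): Attrib1 (Yes/No), Attrib2 (Small/Medium/Large), Attrib3 (in K)
rows = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]

size_code = {"Small": 0, "Medium": 1, "Large": 2}            # simple ordinal encoding (assumption)
X = [[1 if a1 == "Yes" else 0, size_code[a2], a3] for a1, a2, a3, _ in rows]
y = [label for *_, label in rows]

model = DecisionTreeClassifier(criterion="gini", random_state=0)
model.fit(X, y)                                              # Induction / Learn Model

# Test set (Tids 11-15) with unknown class labels
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]
X_test = [[1 if a1 == "Yes" else 0, size_code[a2], a3] for a1, a2, a3 in test]
print(model.predict(X_test))                                 # Apply Model / Deduction
```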

General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t.
• General procedure (see the sketch below):
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
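A direct transcription of this procedure might look like the following recursive sketch. The helper names choose_best_test and apply_test are placeholders for the split-selection machinery discussed later in the lecture, not functions defined on the slides.

```python
# Hunt's algorithm as a recursive sketch of the procedure above.
# choose_best_test / apply_test stand in for an attribute-test selector
# (e.g. the Gini- or entropy-based criteria covered later).

def hunt(records, default_class, choose_best_test, apply_test):
    """records: list of (attributes, label) pairs that reach this node."""
    if not records:                               # empty Dt -> leaf with the default class
        return {"leaf": default_class}

    labels = {label for _, label in records}
    if len(labels) == 1:                          # pure Dt -> leaf with that class
        return {"leaf": labels.pop()}

    test = choose_best_test(records)              # attribute test that splits Dt
    children = {}
    for outcome, subset in apply_test(test, records).items():
        children[outcome] = hunt(subset, default_class,
                                 choose_best_test, apply_test)
    return {"test": test, "children": children}
```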

Hunt's Algorithm

Applied to the Refund / Marital Status / Taxable Income training data (Tids 1 to 10), the tree is refined step by step:
1. Start with a single node that predicts the default class: Don't Cheat.
2. The node is impure, so split on Refund: Refund = Yes → Don't Cheat; Refund = No → still impure (contains both classes).
3. Split the Refund = No subset on Marital Status: Married → Don't Cheat; Single or Divorced → still impure.
4. Split the Single/Divorced subset on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.

Decision Tree Induction

• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

DT Split Areas

(Figure: the search space over Gender (M, F) and Height (values between roughly 1.0 and 2.5) divided into rectangular decision regions.)

How to Specify Test Condition?

• Depends on attribute type
  – Nominal
  – Ordinal
  – Continuous
• Depends on the number of ways to split
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as distinct values.
  Example: CarType → {Family}, {Sports}, {Luxury}
• Binary split: divides the values into two subsets; need to find the optimal partitioning (see the enumeration sketch below).
  Example: CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as distinct values.
  Example: Size → {Small}, {Medium}, {Large}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Example: Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
• What about the split {Small, Large} vs. {Medium}? It violates the ordering of the attribute values.
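For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary partitions (ignoring empty sides and mirror duplicates). A minimal sketch, assuming nothing beyond the CarType example above, that enumerates them:

```python
# Enumerate candidate binary splits of a nominal attribute.
# For k values there are 2**(k-1) - 1 distinct two-subset partitions.
from itertools import combinations

def binary_partitions(values):
    values = sorted(values)
    anchor = values[0]                      # fix one value to avoid mirror duplicates
    rest = values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:                       # skip the trivial split with an empty side
                yield left, right

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# -> {Family} vs {Luxury, Sports}, {Family, Luxury} vs {Sports}, {Family, Sports} vs {Luxury}
```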

Splitting Based on Continuous Attributes

• Different ways of handling:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • considers all possible splits and finds the best cut
    • can be more compute-intensive

(Figure: (i) a binary split "Taxable Income > 80K?" with Yes/No branches; (ii) a multi-way split on Taxable Income into the ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.)
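The two discretization flavours mentioned above, equal-interval and equal-frequency bucketing, can be illustrated with pandas. This is only a sketch of the idea, not part of the lecture; the choice of 4 bins is arbitrary.

```python
# Equal-interval vs. equal-frequency discretization of a continuous attribute.
import pandas as pd

income = pd.Series([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # Taxable Income in K

equal_interval = pd.cut(income, bins=4)     # 4 buckets of equal width
equal_frequency = pd.qcut(income, q=4)      # 4 buckets with roughly equal record counts (quartiles)

print(equal_interval.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```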

Comparing Decision Trees

(Figure: two trees for the same data, a balanced tree vs. a deep, chain-like tree.)

DT Induction Issues that affect Performance

• Choosing splitting attributes
• Ordering of splitting attributes
• Split points
• Tree structure
• Stopping criteria
• Training data (size of)
• Pruning

How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1. Candidate test conditions:
– Own Car? (Yes / No): child class counts C0: 6, C1: 4 and C0: 4, C1: 6
– Car Type? (Family / Sports / Luxury): child class counts C0: 1, C1: 3; C0: 8, C1: 0; C0: 1, C1: 7
– Student ID? (c1 … c20): twenty children, each with a single record (C0: 1, C1: 0 or C0: 0, C1: 1)
Which test condition is the best?

How to determine the Best Split

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – C0: 5, C1: 5 is non-homogeneous, a high degree of impurity
  – C0: 9, C1: 1 is homogeneous, a low degree of impurity

How to Find the Best Split

(Figure: before splitting, the parent node's class counts N00/N01 give impurity M0. Attribute test A? produces nodes N1 and N2 with impurities M1 and M2, combined into M12; attribute test B? produces nodes N3 and N4 with impurities M3 and M4, combined into M34.)

Gain = M0 − M12 vs. M0 − M34: choose the test with the larger gain, i.e., the lower weighted child impurity.

Measure of Impurity: GINI

• Gini index for a given node t:
  GINI(t) = 1 − Σj [p(j | t)]²
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples (two classes, six records per node):
– C1: 0, C2: 6 → Gini = 0.000
– C1: 1, C2: 5 → Gini = 0.278
– C1: 2, C2: 4 → Gini = 0.444
– C1: 3, C2: 3 → Gini = 0.500

Examples for computing GINI

GINI(t) = 1 − Σj [p(j | t)]²

– C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − 0² − 1² = 0
– C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278
– C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444

Splitting Based on GINI

• Used in CART.
• When a node p is split into k partitions (children), the quality of the split is computed as
  GINIsplit = Σ(i = 1..k) (ni / n) GINI(i)
  where ni = number of records at child i, and n = number of records at node p.
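The two formulas above translate directly into code. The sketch below is a straightforward rendering (the function names are my own) and reproduces the 0.278 / 0.444 node values.

```python
# Gini index of a node and Gini of a split, as defined above.

def gini(class_counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, given the class counts at node t."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(children_counts):
    """GINI_split = sum_i (n_i / n) * GINI(child i)."""
    n = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / n * gini(counts) for counts in children_counts)

print(gini([0, 6]), gini([1, 5]), gini([2, 4]), gini([3, 3]))
# -> 0.0, approximately 0.278, approximately 0.444, 0.5
print(gini_split([[1, 4], [2, 1], [1, 1]]))   # multi-way CarType split from the next slide
```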

Binary Attributes: Computing GINI Index

• A binary attribute splits the node into two partitions.
• Effect of weighing partitions: larger and purer partitions are sought.

Example, split on B?:
– Parent: C1 = 6, C2 = 6, Gini = 0.500
– Node N1 (B = Yes): C1 = 5, C2 = 2; Node N2 (B = No): C1 = 1, C2 = 4
– Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
– Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
– Gini(Children) = (7/12) × 0.408 + (5/12) × 0.320 = 0.371

Categorical Attributes: Computing GINI Index

• For each distinct value, gather the counts of each class in the dataset.
• Use the count matrix to make decisions.

Multi-way split on CarType:
        Family  Sports  Luxury
  C1      1       2       1
  C2      4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
  {Sports, Luxury} vs. {Family}:  C1 = 3, 1;  C2 = 2, 4;  Gini = 0.400
  {Sports} vs. {Family, Luxury}:  C1 = 2, 2;  C2 = 1, 5;  Gini = 0.419

Continuous Attributes: Computing GINI Index

• Use a binary decision based on one value, e.g. Taxable Income > 80K?
• Several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
• Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A ≥ v.
• Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.

Continuous Attributes: Computing GINI Index (efficient computation)

• For each attribute:
  – Sort the attribute on its values.
  – Linearly scan these values, each time updating the count matrix and computing the Gini index.
  – Choose the split position that has the least Gini index.

Example (Taxable Income, class = Cheat), using the training data of Tids 1 to 10.
Sorted values with class: 60 No, 70 No, 75 No, 85 Yes, 90 Yes, 95 Yes, 100 No, 120 No, 125 No, 220 No.

Split position   Yes (≤, >)   No (≤, >)   Gini
55               0, 3         0, 7        0.420
65               0, 3         1, 6        0.400
72               0, 3         2, 5        0.375
80               0, 3         3, 4        0.343
87               1, 2         3, 4        0.417
92               2, 1         3, 4        0.400
97               3, 0         3, 4        0.300
110              3, 0         4, 3        0.343
122              3, 0         5, 2        0.375
172              3, 0         6, 1        0.400
230              3, 0         7, 0        0.420

The best split is Taxable Income ≤ 97, with Gini = 0.300.
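The efficient sorted-scan procedure can be sketched as follows. It reuses the gini helper from the earlier sketch and, on the Taxable Income example, finds the split near 97 with Gini of about 0.300; the function name, midpoint cut points, and return format are my own.

```python
# Efficient best-split search for a continuous attribute:
# sort once, then sweep candidate cut points while updating class counts.

def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def best_continuous_split(values, labels):
    data = sorted(zip(values, labels))             # sort the attribute on its values
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}                 # counts for records <= cut
    right = {c: labels.count(c) for c in classes}  # counts for records > cut
    best = (float("inf"), None)
    for i in range(len(data) - 1):
        v, lab = data[i]
        left[lab] += 1                             # move one record across the cut
        right[lab] -= 1
        if v == data[i + 1][0]:                    # only cut between distinct values
            continue
        cut = (v + data[i + 1][0]) / 2             # midpoint split position
        n_l, n_r = i + 1, len(data) - i - 1
        g = (n_l * gini(list(left.values())) +
             n_r * gini(list(right.values()))) / len(data)
        best = min(best, (g, cut))
    return best                                    # (gini, split position)

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_continuous_split(income, cheat))        # -> approximately (0.300, 97.5)
```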

Information

• When all the marbles in the bowl are mixed up, little information is given.
• When the marbles in the bowl are all from one class, and those in the other two classes are on either side, more information is given.

DT Induction

• Decision tree induction is often based on information theory.
• Use this approach with DT induction!

Information/Entropy

Given probabilities p1, p2, …, ps whose sum is 1, entropy is defined as
  H(p1, p2, …, ps) = Σ(i = 1..s) pi log(1/pi)

(Figure: plot of log(1/p) as a function of p.)

Entropy

• Entropy measures the amount of randomness, surprise, or uncertainty.
• Goal in classification: no surprise, i.e., entropy = 0.
(Figure: plot of the binary entropy H(p, 1 − p).)

ID3

• Creates a decision tree using information theory concepts and tries to reduce the expected number of comparisons.
• ID3 chooses to split on the attribute that gives the highest information gain:
  Gain(D, S) = H(D) − Σ(i = 1..s) P(Di) H(Di)
  i.e., the entropy of data set D minus the weighted entropy of the subsets D1, …, Ds produced by splitting on S.

Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.60    Short    Medium
Jim        M       2.02    Tall     Medium
Maggie     F       1.90    Medium   Tall
Martha     F       1.88    Medium   Tall
Stephanie  F       1.71    Short    Medium
Bob        M       1.85    Medium   Medium
Kathy      F       1.60    Short    Medium
Dave       M       1.72    Short    Medium
Worth      M       2.12    Tall     Tall
Steven     M       2.10    Tall     Tall
Debbie     F       1.78    Medium   Medium
Todd       M       1.95    Medium   Medium
Kim        F       1.89    Medium   Tall
Amy        F       1.81    Medium   Medium
Wynette    F       1.75    Medium   Medium

ID3 Example (Output1)

• Starting state entropy: 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using Gender:
  – Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  – Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  – Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  – Gain: 0.4384 − 0.34152 = 0.09688
• Gain using Height: 0.4384 − (2/15)(0.301) = 0.3983
• Choose Height as the first splitting attribute.

C4.5 Algorithm

• ID3 favors attributes with a large number of divisions (and is vulnerable to overfitting).
• C4.5 is an improved version of ID3 that adds:
  – Handling of missing data
  – Handling of continuous data
  – Pruning
  – Rules
  – GainRatio, which takes into account the cardinality of each split area
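The entropy and gain numbers in the ID3 example can be checked with a few lines of code. The sketch below uses base-10 logarithms, since that is what reproduces the 0.4384 starting entropy; the function names are my own, and small differences in the last digit come from rounding on the slide.

```python
# Reproduce the ID3 gain-by-Gender computation from the example above.
from math import log10

def entropy(counts):
    """H = sum p_i * log(1/p_i) over the non-zero class counts (base-10 logs)."""
    n = sum(counts)
    return sum((c / n) * log10(n / c) for c in counts if c > 0)

start = entropy([4, 8, 3])                 # Short, Medium, Tall under Output1
h_female = entropy([3, 6, 0])              # 9 females: 3 Short, 6 Medium, 0 Tall
h_male = entropy([1, 2, 3])                # 6 males: 1 Short, 2 Medium, 3 Tall
gain_gender = start - (9/15 * h_female + 6/15 * h_male)

print(start)                               # -> about 0.4385 (0.4384 on the slide)
print(gain_gender)                         # -> about 0.0969 (0.09688 on the slide)
```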

CART: Classification and Regression Trees

• Creates a binary tree.
• Uses entropy to choose the best splitting attribute and point.
• Formula to choose split point s for node t:
  Φ(s | t) = 2 PL PR Σ(j = 1..m) |P(Cj | tL) − P(Cj | tR)|
• PL, PR: probability that a tuple in the training set will be on the left or right side of the tree.

CART Example

• At the start, there are six choices for split point (right branch on equality):
  – Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
  – Φ(1.6) = 0
  – Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  – Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  – Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
  – Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8.

Decision Boundary

• Advantages:

1 0.9

Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets

x < 0.43?

0.8 0.7

Yes

No

0.6

y

– – – –

Φ(Gender) = 2(6/15)(9/15)(2/15 +4/15+3/15) = 0.224 Φ(1.6) = 0 Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32

y < 0.47?

0.5

y < 0.33?

0.4

Yes

0.3 0.2

:4 :0

0.1

No :0 :4

Yes

No

:0 :3

:4 :0

0 0

0.1

0.2

0.3

0.4

0.5

x

0.6

0.7

0.8

0.9

1

• Border line between two neighboring regions of different classes is known as decision boundary • Decision boundary is parallel to axes because test condition involves a single attribute at-a-time Data Mining Lecture 4: Classification 2

47

Data Mining Lecture 4: Classification 2

48

8

Oblique Decision Trees

• Test conditions may involve more than one attribute (e.g., a split of the form x + y < c), so the decision boundary is no longer parallel to the axes.

Tree Replication

(Figure: a subtree P that appears more than once in the tree.)
