DATA MINING DECISION TREE INDUCTION


Classification Techniques
• Linear Models
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

Another Decision Tree Example

Training data: the same 10-record table as above.

MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

More than one tree may perfectly fit the data.

Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Workflow: the Training Set is fed to a Tree Induction algorithm (Induction → Learn Model), which produces a Decision Tree model; the model is then applied to the Test Set (Apply Model → Deduction) to predict the missing Class labels.
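The induction/deduction workflow above can be reproduced with any off-the-shelf decision tree learner. Below is a minimal sketch using scikit-learn; the integer encodings for Attrib1 and Attrib2 are illustrative assumptions of mine (the library expects numeric feature matrices), not part of the slides.

```python
# Minimal sketch of the induction/deduction workflow using scikit-learn.
# The encodings for Attrib1 (Yes/No) and Attrib2 (size) are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

yes_no = {"Yes": 1, "No": 0}
size = {"Small": 0, "Medium": 1, "Large": 2}

# Training set (Tid 1-10): Attrib1, Attrib2, Attrib3 (income in thousands), Class
train = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]
X = [[yes_no[a], size[b], c] for a, b, c, _ in train]
y = [label for *_, label in train]

# Induction: learn the model from the training set
model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Deduction: apply the model to the test set (Tid 11-15)
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]
X_test = [[yes_no[a], size[b], c] for a, b, c in test]
print(model.predict(X_test))  # predicted Class labels for Tid 11-15
```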

Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree (the Refund/MarSt/TaxInc model shown earlier) and, at each node, follow the branch that matches the test record:
• Refund = No → take the "No" branch down to the MarSt node
• MarSt = Married → take the "Married" branch, which ends in the leaf NO
• Assign Cheat = "No" to the test record (the TaxInc node is never reached on this path).
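The walkthrough above is straightforward to express in code. The sketch below hard-codes that example tree as nested dictionaries (my own representation, not from the slides) and walks a single record from the root to a leaf.

```python
# A minimal sketch: the example tree encoded as nested dicts (an assumed
# representation) and a classify() helper that walks one record to a leaf.
tree = {
    "attribute": "Refund",
    "branches": {
        "Yes": "No",                      # leaf: Cheat = No
        "No": {
            "attribute": "MarSt",
            "branches": {
                "Married": "No",          # leaf: Cheat = No
                "Single": {"attribute": "TaxInc<80K",
                           "branches": {True: "No", False: "Yes"}},
                "Divorced": {"attribute": "TaxInc<80K",
                             "branches": {True: "No", False: "Yes"}},
            },
        },
    },
}

def classify(node, record):
    """Follow the branches that match the record until a leaf (a plain string) is hit."""
    while isinstance(node, dict):
        attr = node["attribute"]
        key = record["TaxIncome"] < 80 if attr == "TaxInc<80K" else record[attr]
        node = node["branches"][key]
    return node

record = {"Refund": "No", "MarSt": "Married", "TaxIncome": 80}
print(classify(tree, record))  # -> 'No'
```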

Decision Tree Terminology


Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
• John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical ID3 and C4.5 algorithms.

Decision Tree Classifier

[Photo: Ross Quinlan]
[Scatter plot: Antenna Length (y-axis, 1-10) vs. Abdomen Length (x-axis, 1-10) for labeled training insects]

Abdomen Length > 7.1?
  yes → Katydid
  no  → Antenna Length > 6.0?
          yes → Katydid
          no  → Grasshopper

Decision trees predate computers

[Figure: a pre-computer dichotomous key for identifying insects, with decision nodes "Antennae shorter than body?", "3 Tarsi?", and "Foretibia has ears?" (each with Yes/No branches) leading to the leaves Grasshopper, Cricket, Katydids, and Camel Cricket.]

Definition
• A decision tree is a classifier in the form of a tree structure:
  – Decision node: specifies a test on a single attribute
  – Leaf node: indicates the value of the target attribute
  – Arc/edge: a split on one attribute
  – Path: a disjunction of tests to make the final decision
• Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.

Decision Tree Classification
• Decision tree generation consists of two phases
  – Tree construction
    • At start, all the training examples are at the root
    • Partition examples recursively based on selected attributes
    • This can also be called supervised segmentation
    • This emphasizes that we are segmenting the instance space
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
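Both phases have counterparts in standard libraries. As a hedged illustration, scikit-learn's DecisionTreeClassifier grows the tree top-down and exposes pre-pruning controls (max_depth, min_samples_leaf) as well as cost-complexity post-pruning (ccp_alpha); the parameter values in the sketch below are arbitrary examples, not recommendations.

```python
# Illustrative use of pre-pruning and cost-complexity post-pruning in scikit-learn.
# The concrete parameter values here are arbitrary examples.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing branches early.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow fully, then prune back weak branches (cost-complexity pruning).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```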

Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

Example:

outlook?
  sunny    → humidity?
               high   → no
               normal → yes
  overcast → yes
  rain     → wind?
               strong → no
               weak   → yes

How do we Construct a Decision Tree?
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At start, all the training examples are at the root
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Why do we call this a greedy algorithm?
  – Because it makes locally optimal decisions (at each node).
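As a concrete illustration of the top-down, greedy recursion described above, here is a minimal sketch in Python. It assumes a list of (feature-dict, label) examples and a choose_best_attribute() heuristic (e.g., information gain) supplied by the caller; both names are my own, not from the slides.

```python
# A minimal sketch of top-down, greedy decision tree construction.
# `examples` is a list of (feature_dict, label) pairs; `choose_best_attribute`
# is any purity-based heuristic (e.g., information gain). Both are assumptions.
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    labels = [label for _, label in examples]
    # Stop: all examples in one class, or no attributes left (majority vote)
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]           # leaf node
    best = choose_best_attribute(examples, attributes)         # greedy, local choice
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the examples by the value of the chosen attribute
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        node["branches"][value] = build_tree(subset, remaining, choose_best_attribute)
    return node
```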

When Do we Stop Partitioning?
• All samples for a node belong to the same class
• No remaining attributes → majority voting is used to assign the class
• No samples left

How to Pick Locally Optimal Split
• Hunt's algorithm: recursively partition the training records into successively purer subsets.
• How to measure purity/impurity?
  – Entropy and the associated information gain
  – Gini
  – Classification error rate
    • Never used in practice, but good for understanding and simple exercises
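For reference, the three impurity measures just listed can all be computed from a node's class counts. A minimal sketch (the function names are my own):

```python
# The three node-impurity measures mentioned above, computed from class counts.
import math

def _probs(counts):
    total = sum(counts)
    return [c / total for c in counts]

def entropy(counts):
    return -sum(p * math.log2(p) for p in _probs(counts) if p > 0)

def gini(counts):
    return 1.0 - sum(p * p for p in _probs(counts))

def classification_error(counts):
    return 1.0 - max(_probs(counts))

# A perfectly mixed node vs. a nearly pure node:
for counts in ([5, 5], [9, 1]):
    print(counts, round(entropy(counts), 3), round(gini(counts), 3),
          round(classification_error(counts), 3))
```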

How to Determine Best Split

Before splitting: 10 records of class C0 and 10 records of class C1. Three candidate test conditions:

Own Car?
  Yes: C0: 6, C1: 4
  No:  C0: 4, C1: 6

Car Type?
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Student ID?
  c1:  C0: 1, C1: 0
  ...
  c10: C0: 1, C1: 0
  c11: C0: 0, C1: 1
  ...
  c20: C0: 0, C1: 1

Which test condition is the best? Why is Student ID a bad feature to use?

How to Determine Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity

Information Theory
• Think of playing "20 questions": I am thinking of an integer between 1 and 1,000 -- what is it?
  – What question will you ask first? Why?
• Entropy measures how much more information you need before you can identify the integer.
  – Initially, there are 1,000 possible values, which we assume are equally likely.
  – What is the maximum number of questions you need to ask?
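The connection to entropy can be made concrete: with 1,000 equally likely values and yes/no questions that each halve the candidate set, about log2(1000) ≈ 9.97 bits are needed, so 10 well-chosen questions always suffice. A quick check:

```python
# Number of yes/no questions needed to pin down one of 1,000 equally
# likely integers: ceil(log2(1000)) -- each ideal question halves the range.
import math

n = 1000
print(math.log2(n))             # ~9.966 bits of uncertainty (entropy of a uniform choice)
print(math.ceil(math.log2(n)))  # 10 questions are always enough
```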

Entropy
• The entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is

  $Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$

  where $p_1$ is the fraction of positive examples in S and $p_0$ is the fraction of negatives.
• If all examples are in one category, the entropy is zero (we define $0 \log 0 = 0$).
• If the examples are equally mixed ($p_1 = p_0 = 0.5$), the entropy is at its maximum of 1.
• For multi-class problems with c categories, entropy generalizes to

  $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$

Entropy for Binary Classification
• The entropy is 0 if the outcome is certain.
• The entropy is maximum if we have no knowledge of the system (or any outcome is equally possible).

[Figure: entropy of a 2-class problem plotted against the proportion of one of the two classes]

Information Gain in Decision Tree Induction
• Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.
• Assume that, using attribute A, the current set will be partitioned into some number of child sets.
• The encoding information that would be gained by branching on A is

  $Gain(A) = E(\text{current set}) - \sum E(\text{all child sets})$

• The summation in the formula above is a bit misleading: when doing the summation, we weight each child's entropy by the fraction of the total examples that fall in that child set. The same weighting applies to Gini and to the error rate as well.

Examples for Computing Entropy

  $Entropy(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$

NOTE: $p(j \mid t)$ is computed as the relative frequency of class j at node t.

• C1: 0, C2: 6
  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
• C1: 1, C2: 5
  P(C1) = 1/6, P(C2) = 5/6
  Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
• C1: 2, C2: 4
  P(C1) = 2/6, P(C2) = 4/6
  Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
• C1: 3, C2: 3
  P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
  Entropy = – (1/2) log2 (1/2) – (1/2) log2 (1/2) = –(1/2)(–1) – (1/2)(–1) = 1/2 + 1/2 = 1
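These worked values are easy to check numerically. The short sketch below recomputes each row directly from the definition (the helper name is my own).

```python
# Recompute the worked entropy examples directly from the definition.
import math

def entropy2(p1, p2):
    """Binary entropy from the two class probabilities (0*log2(0) treated as 0)."""
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

print(round(entropy2(0/6, 6/6), 2))  # 0.0
print(round(entropy2(1/6, 5/6), 2))  # 0.65
print(round(entropy2(2/6, 4/6), 2))  # 0.92
print(round(entropy2(3/6, 3/6), 2))  # 1.0
```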

How to Calculate log2(x)
• Many calculators only have buttons for log10(x) and loge(x) ("log" typically means log10)
• You can calculate the log for any base b as follows:
  – logb(x) = logk(x) / logk(b)
• Thus log2(x) = log10(x) / log10(2)
• Since log10(2) ≈ 0.301, just calculate the log base 10 and divide by 0.301 to get the log base 2.
• You can use this for HW if needed.
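In code, the same change-of-base trick looks like this (Python's math module also offers log2 directly):

```python
# Change-of-base: log2(x) = log10(x) / log10(2); math.log2 gives the same result.
import math

x = 6.0
print(math.log10(x) / math.log10(2))  # change-of-base formula
print(math.log2(x))                   # built-in, identical value
```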

Splitting Based on INFO...
• Information Gain:

  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$

  where parent node p is split into k partitions and $n_i$ is the number of records in partition i.

• Uses a weighted average of the child nodes, where each weight is based on the number of examples in that child.
• Used in the ID3 and C4.5 decision tree learners.
  – WEKA's J48 is a Java implementation of C4.5.
• Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
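A direct transcription of GAIN_split, with each child weighted by its share of the records (the function names are my own):

```python
# GAIN_split: parent entropy minus the weighted average entropy of the children.
import math

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent_counts, children_counts):
    """parent_counts: class counts at the parent; children_counts: one count list per child."""
    n = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / n) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_child_entropy

# The "Own Car?" split from the earlier slide: parent 10/10, children 6/4 and 4/6.
print(round(information_gain([10, 10], [[6, 4], [4, 6]]), 3))
```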

How to Split on Continuous Attributes?
• For continuous attributes:
  – Partition the continuous values of attribute A into a discrete set of intervals
  – Create a new boolean attribute $A_c$ by looking for a threshold c:

    $A_c = \begin{cases} \text{true} & \text{if } A \ge c \\ \text{false} & \text{otherwise} \end{cases}$

  – One method is to try all possible splits. How do we choose c?
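One common way to "try all possible splits" is to sort the distinct values and evaluate the information gain of a threshold at each midpoint between consecutive values. A minimal sketch (the helper names are my own):

```python
# Try all candidate thresholds for one continuous attribute and keep the one
# with the highest information gain. Candidates are midpoints between
# consecutive sorted values -- no other threshold changes the split.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best_c, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        gain = parent - (len(left) / len(pairs)) * entropy(left) \
                      - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Taxable Income from the earlier example (in thousands) with the Cheat labels.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))
```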

Person  Hair Length  Weight  Age  Class
Homer   0"           250     36   M
Marge   10"          150     34   F
Bart    2"           90      10   M
Lisa    6"           78      8    F
Maggie  4"           20      1    F
Abe     1"           170     70   M
Selma   8"           160     41   F
Otto    10"          180     38   M
Krusty  6"           200     45   M
Comic   8"           290     38   ?

  $Entropy(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$

Entropy(4F, 5M) = –(4/9) log2(4/9) – (5/9) log2(5/9) = 0.9911

Candidate split: Hair Length? (yes / no branches)