Decision Trees
Aarti Singh
Machine Learning 10-701/15-781, Oct 6, 2010

Learning a good prediction rule

• Learn a mapping from features X to label Y
• Best prediction rule
• Hypothesis space / function class:
  – Parametric classes (Gaussian, binomial, etc.)
  – Conditionally independent class densities (Naïve Bayes)
  – Linear decision boundary (logistic regression)
  – Nonparametric classes (histograms, nearest neighbor, kernel estimators, decision trees – today)
• Given training data, find a hypothesis/function in the hypothesis space that is close to the best prediction rule.

First …
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Decision Tree for Tax Fraud Detection

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

The tree:
  Refund?
    Yes → NO
    No  → MarSt?
            Married          → NO
            Single, Divorced → TaxInc?
                                 < 80K → NO
                                 > 80K → YES

• Each internal node: test one feature Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predict Y

Tracing the query down the tree: Refund = No, so take the No branch; MarSt = Married, so take the Married branch and reach a NO leaf. Assign Cheat to "No".
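A minimal sketch (not from the lecture) of one way to encode this tree and walk a query down it; the nested-dict layout, the feature names, and the ">= 80K" encoding of the income test are my own assumptions:

tax_tree = {
    "feature": "Refund",
    "branches": {
        "Yes": "NO",                                  # leaf: predict Cheat = No
        "No": {
            "feature": "MarSt",
            "branches": {
                "Married": "NO",
                "Single": {"feature": "TaxInc>=80K",
                           "branches": {True: "YES", False: "NO"}},
                "Divorced": {"feature": "TaxInc>=80K",
                             "branches": {True: "YES", False: "NO"}},
            },
        },
    },
}

def predict(tree, x):
    """Follow one root-to-leaf path: test the node's feature, take the branch
    matching the query's value, and repeat until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["feature"]]]
    return tree

query = {"Refund": "No", "MarSt": "Married", "TaxInc>=80K": True}
print(predict(tax_tree, query))   # -> "NO"; the income test is never reached for this query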

Decision Tree more generally…

[Figure: an axis-parallel partition of the feature space, with a 0/1 label in each cell]

• Features can be discrete, continuous or categorical
• Each internal node: test some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: predict Y

So far…
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Now…
• How do we learn a decision tree from training data?
• What is the decision on each leaf?


How to learn a decision tree

• Top-down induction [ID3, C4.5, CART, …]: grow the tree greedily from the root, choosing one feature to split on at a time (a sketch follows below)

[Figure: the Refund / MarSt / TaxInc tree from the running example]
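A compact sketch of this greedy recipe (ID3-style, categorical features only). The split score is the information gain developed on the next few slides; all helper names are my own, not code from the lecture:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, f):
    """H(Y) minus the conditional entropy of Y after splitting on feature f."""
    n = len(labels)
    cond = 0.0
    for v in set(r[f] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[f] == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def build_tree(rows, labels, features):
    # Stop when the node is pure or no features remain; predict the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    branches = {}
    for v in set(r[best] for r in rows):
        keep = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        branches[v] = build_tree([r for r, _ in keep], [y for _, y in keep],
                                 features - {best})
    return {"feature": best, "branches": branches}

# Usage on the small X1/X2 table that appears a few slides later:
rows = [{"X1": a, "X2": b} for a, b in zip("TTTTFFFF", "TFTFTFTF")]
print(build_tree(rows, list("TTTTTFFF"), {"X1", "X2"}))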

Which feature is best to split?

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

Split on X1:
  X1 = T → Y: 4 Ts, 0 Fs  (absolutely sure)
  X1 = F → Y: 1 T, 3 Fs   (kind of sure)

Split on X2:
  X2 = T → Y: 3 Ts, 1 F   (kind of sure)
  X2 = F → Y: 2 Ts, 2 Fs  (absolutely unsure)

Good split if we are more certain about the classification after the split – a uniform distribution of labels is bad.

Which feature is best to split?

Pick the attribute/feature which yields maximum information gain:

  arg max_i  [ H(Y) – H(Y | Xi) ]

  H(Y) – entropy of Y
  H(Y | Xi) – conditional entropy of Y given Xi

Entropy

• Entropy of a random variable Y:

  H(Y) = – Σ_y P(Y = y) log2 P(Y = y)

• More uncertainty, more entropy!

[Plot: entropy H(Y) of Y ~ Bernoulli(p) as a function of p – zero when Y is deterministic (p = 0 or 1), maximal for the uniform case p = 1/2]

• Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)
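A quick numerical check of the plotted curve (my own snippet): the binary entropy is 0 bits for a deterministic Y and peaks at 1 bit for the uniform case.

from math import log2

def binary_entropy(p):
    """H(Y) for Y ~ Bernoulli(p), in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}   H(Y) = {binary_entropy(p):.3f} bits")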

Andrew Moore’s Entropy in a Nutshell

• Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl
• High entropy: the values (locations of soup) are unpredictable – almost uniformly sampled throughout the dining room

Information Gain

• Advantage of an attribute = decrease in uncertainty
  – Entropy of Y before the split: H(Y)
  – Entropy of Y after splitting on Xi, weighted by the probability of following each branch:
      H(Y | Xi) = Σ_x P(Xi = x) H(Y | Xi = x)
• Information gain is the difference:  IG(Xi) = H(Y) – H(Y | Xi)
  Max information gain = min conditional entropy

Information Gain

For the table above:
  Split on X1:  X1 = T → 4 Ts, 0 Fs;  X1 = F → 1 T, 3 Fs
  Split on X2:  X2 = T → 3 Ts, 1 F;   X2 = F → 2 Ts, 2 Fs

IG(X1) – IG(X2) > 0: splitting on X1 removes more uncertainty about Y, so X1 is the better split.
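Checking the numbers for this table (my own arithmetic, not shown on the slide):

from math import log2
from collections import Counter

X1 = list("TTTTFFFF")
X2 = list("TFTFTFTF")
Y  = list("TTTTTFFF")

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        cond += len(subset) / n * entropy(subset)    # weight by branch probability
    return entropy(labels) - cond

print(f"IG(X1) = {info_gain(X1, Y):.3f} bits")   # ~0.549
print(f"IG(X2) = {info_gain(X2, Y):.3f} bits")   # ~0.049, so X1 wins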

Which feature is best to split? (recap)

Pick the attribute/feature which yields maximum information gain, arg max_i [ H(Y) – H(Y | Xi) ]: the feature which yields the maximum reduction in entropy provides the maximum information about Y.

Expressiveness of Decision Trees

• Decision trees can express any function of the input features
• E.g., for Boolean functions, each truth-table row maps to one root-to-leaf path (see the sketch below)
• There is a decision tree which perfectly classifies any training set, with one path to a leaf for each example
• But it won't generalize well to new examples – prefer to find more compact decision trees
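A concrete made-up instance of "truth table row → path to leaf": XOR of two Boolean features needs one leaf per row of its truth table.

xor_tree = {"feature": "A", "branches": {
    0: {"feature": "B", "branches": {0: 0, 1: 1}},
    1: {"feature": "B", "branches": {0: 1, 1: 0}},
}}

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["feature"]]]
    return tree

for a in (0, 1):
    for b in (0, 1):
        print(f"A={a} B={b} -> {predict(xor_tree, {'A': a, 'B': b})}")  # 4 rows, 4 leaves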

Decision Trees – Overfitting

One training example per leaf overfits; we need a compact/pruned decision tree.

Bias-Variance Tradeoff

[Figure: the ideal classifier, the average classifier, and classifiers based on different training data, shown for a coarse and a fine partition of the feature space]

• Coarse partition: bias large, variance small
• Fine partition: bias small, variance large

(A rough numerical illustration follows below.)
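A rough illustration of the tradeoff (my own construction, not the lecture's figure): piecewise-constant fits on a 1-D problem, which is the kind of partition a tree's leaves produce. The coarse partition barely changes across training sets but misses the target; the fine partition tracks it closely but wobbles.

import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)
truth = np.sin(2 * np.pi * grid)

def piecewise_fit(x, y, n_cells):
    """Predict on `grid` using the mean of y inside each of n_cells equal-width cells."""
    edges = np.linspace(0, 1, n_cells + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_cells - 1)
    means = np.array([y[idx == c].mean() if np.any(idx == c) else y.mean()
                      for c in range(n_cells)])
    return means[np.clip(np.digitize(grid, edges) - 1, 0, n_cells - 1)]

for n_cells in (2, 50):                        # coarse vs fine partition
    fits = []
    for _ in range(100):                       # 100 independent training sets
        x = rng.uniform(0, 1, 200)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
        fits.append(piecewise_fit(x, y, n_cells))
    fits = np.array(fits)
    bias_sq = np.mean((fits.mean(axis=0) - truth) ** 2)   # (average fit - truth)^2
    variance = np.mean(fits.var(axis=0))                  # spread across training sets
    print(f"{n_cells:3d} cells: squared bias ~ {bias_sq:.3f}, variance ~ {variance:.3f}")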

When to Stop?

• Many strategies for picking simpler trees:
  – Pre-pruning
    • Fixed depth
    • Fixed number of leaves
  – Post-pruning, e.g. via a chi-square test (sketched below)
    • Convert the decision tree to a set of rules
    • Eliminate variable tests in rules which are independent of the label (using a chi-square test for independence)
    • Simplify the rule set by eliminating unnecessary rules
  – Information criteria: MDL (Minimum Description Length)

[Figure: pruning a branch of the Refund / MarSt tree from the running example]
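A small sketch of the independence check used in rule post-pruning; the counts below are invented, and scipy's chi2_contingency performs the test itself.

from scipy.stats import chi2_contingency

# Contingency table of (Refund value) x (Cheat label) among the examples a rule covers.
#                  Cheat = No   Cheat = Yes
observed = [[40,          5],      # Refund = Yes
            [35,         20]]      # Refund = No

chi2, p_value, dof, expected = chi2_contingency(observed)
if p_value > 0.05:
    print("Refund looks independent of the label here -> drop it from the rule")
else:
    print(f"Keep the Refund test (p = {p_value:.3f})")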

Information Criteria

• Penalize complex models by introducing a cost:

  score = – log likelihood (data fit, for regression or classification) + cost

  where the cost term penalizes trees with more leaves.

Information Criteria – MDL

Penalize complex models based on their information content.

MDL (Minimum Description Length): the number of bits needed to describe f (its description length).

Example: binary decision trees
  k leaves => 2k – 1 nodes
  => 2k – 1 bits to encode the tree structure + k bits to encode the label of each leaf (0/1)
  e.g., 5 leaves => 9 bits to encode the structure
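A toy version of this accounting (my own sketch): 2k – 1 bits for the structure of a binary tree with k leaves, k bits for its leaf labels, plus bits to point out each misclassified training example. The error-coding term is an assumption; real MDL schemes differ in how they encode exceptions.

from math import ceil, log2

def description_length(n_leaves, n_errors, n_examples):
    """Bits to transmit a binary decision tree plus its mistakes on the training set."""
    structure_bits = 2 * n_leaves - 1      # shape of a full binary tree with k leaves
    label_bits = n_leaves                  # one 0/1 bit per leaf label
    error_bits = n_errors * ceil(log2(max(n_examples, 2)))   # index of each exception
    return structure_bits + label_bits + error_bits

# A bigger tree pays for its extra structure bits only if it removes enough errors.
print(description_length(n_leaves=5,  n_errors=10, n_examples=1000))   # 9 + 5 + 100 = 114
print(description_length(n_leaves=20, n_errors=0,  n_examples=1000))   # 39 + 20 + 0 = 59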

So far…
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Now…
• How do we learn a decision tree from training data?
• What is the decision on each leaf?


How to assign a label to each leaf

Classification – majority vote over the training examples in the leaf

Regression – constant / linear / polynomial fit to the training examples in the leaf
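A minimal sketch of both options (my own helper names; np.polyfit is one simple way to get the linear/polynomial fit):

from collections import Counter
import numpy as np

def classification_leaf(labels):
    """Majority vote over the training labels that reach this leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(y, x=None, degree=0):
    """Constant fit (the mean) by default; optionally a degree-`degree` polynomial in x."""
    if degree == 0 or x is None:
        return float(np.mean(y))
    return np.polyfit(x, y, degree)        # polynomial coefficients, highest power first

print(classification_leaf(["No", "No", "Yes", "No"]))            # -> "No"
print(regression_leaf([1.0, 2.0, 3.0]))                          # -> 2.0 (constant fit)
print(regression_leaf([1.0, 2.0, 3.0], x=[0, 1, 2], degree=1))   # -> [1. 1.] (linear fit)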

Regression Trees

[Figure: a regression tree whose root node splits on "Num Children ≥ 2"]