SVM and Decision Tree

SVM and Decision Tree
Le Song

Machine Learning I CSE 6740, Fall 2013

Which decision boundary is better?
Suppose the training samples are linearly separable. Then we can find a decision boundary which gives zero training error.
But there are many such decision boundaries. Which one is better?
[Figure: two linearly separable classes (Class 1 and Class 2) with one candidate linear decision boundary]

Compare two decision boundaries
Suppose we perturb the data: which boundary is more susceptible to error?

Constraints on data points
For all $x$ in class 2, $y = 1$ and $w^\top x + b \ge c$
For all $x$ in class 1, $y = -1$ and $w^\top x + b \le -c$
Or more compactly: $(w^\top x + b)\,y \ge c$
[Figure: the hyperplane $w^\top x + b = 0$ with the two classes on either side, each at distance at least $c$ along the direction of $w$]

Classifier margin
Pick two data points $x^1$ and $x^2$ which lie on the two dashed margin lines, respectively.
The margin is
$\gamma = \frac{w^\top}{\|w\|}\,(x^1 - x^2) = \frac{2c}{\|w\|}$
[Figure: the hyperplane $w^\top x + b = 0$ with $x^1$ (class 2 side) and $x^2$ (class 1 side) lying on the dashed lines $w^\top x + b = \pm c$]

Maximum margin classifier
Find the decision boundary $w$ as far from the data points as possible:
$\max_{w,b} \;\frac{2c}{\|w\|} \quad \text{s.t.} \quad y^i\,(w^\top x^i + b) \ge c, \;\; \forall i$
[Figure: same geometry as the previous slide]

Support vector machines with hard margin
$\min_{w,b} \;\|w\|^2 \quad \text{s.t.} \quad y^i\,(w^\top x^i + b) \ge 1, \;\; \forall i$

Convert to standard form:
$\min_{w,b} \;\frac{1}{2} w^\top w \quad \text{s.t.} \quad 1 - y^i\,(w^\top x^i + b) \le 0, \;\; \forall i$

The Lagrangian function:
$L(w, b, \alpha) = \frac{1}{2} w^\top w + \sum_{i=1}^m \alpha_i \left(1 - y^i\,(w^\top x^i + b)\right)$

Deriving the dual problem
$L(w, b, \alpha) = \frac{1}{2} w^\top w + \sum_{i=1}^m \alpha_i \left(1 - y^i\,(w^\top x^i + b)\right)$

Taking derivatives and setting them to zero:
$\frac{\partial L}{\partial w} = w - \sum_{i=1}^m \alpha_i y^i x^i = 0$
$\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y^i = 0$

Plug back the relation between $w$ and $b$
$L(w, b, \alpha) = \frac{1}{2}\Big(\sum_i \alpha_i y^i x^i\Big)^{\!\top}\Big(\sum_j \alpha_j y^j x^j\Big) + \sum_i \alpha_i \Big(1 - y^i\Big(\big(\sum_j \alpha_j y^j x^j\big)^{\!\top} x^i + b\Big)\Big)$

After simplification:
$L(w, b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$

The dual problem
$\max_\alpha \;\sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$
$\text{s.t.} \quad \alpha_i \ge 0,\; i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y^i = 0$

This is a constrained quadratic program: nice and convex, so the global maximum can be found.
$w$ can then be recovered as $w = \sum_{i=1}^m \alpha_i y^i x^i$. How about $b$?

Support vectors
Note the KKT condition: $\alpha_i \left(1 - y^i\,(w^\top x^i + b)\right) = 0$
For data points with $1 - y^i\,(w^\top x^i + b) < 0$: $\alpha_i = 0$
For data points with $1 - y^i\,(w^\top x^i + b) = 0$: $\alpha_i > 0$
Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).
[Figure: most points get $\alpha_i = 0$; only the points lying on the margin boundaries get nonzero weights, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$]

Computing b and obtaining the classifier
Pick any data point with $\alpha_i > 0$ and solve for $b$ using $1 - y^i\,(w^\top x^i + b) = 0$.
For a new test point $z$, compute
$w^\top z + b = \sum_{i \in \text{support vectors}} \alpha_i y^i\,(x^i)^\top z + b$
Classify $z$ as class 2 (the $y = 1$ class) if the result is positive, and as class 1 otherwise.
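This is not from the slides, but a minimal sketch of the procedure just described may help: it solves the hard-margin dual with a generic solver (SciPy's SLSQP, assumed available) on a small made-up, linearly separable dataset, recovers $w$ from the $\alpha_i$, recovers $b$ from a support vector, and classifies a test point.

```python
# Minimal sketch (not the lecture's code): solve the hard-margin dual with SciPy's
# SLSQP solver on a made-up, linearly separable toy dataset, then recover w and b.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class 2 (y = +1)
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])  # class 1 (y = -1)
y = np.array([1., 1., 1., -1., -1., -1.])
m = len(y)
K = np.outer(y, y) * (X @ X.T)             # K[i, j] = y_i y_j (x_i . x_j)

def neg_dual(alpha):                       # negate the dual so a minimizer maximizes it
    return 0.5 * alpha @ K @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(m), method='SLSQP',
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                 # any index with alpha_i > 0
b = y[sv] - w @ X[sv]                      # from 1 - y_i (w.x_i + b) = 0
z = np.array([1.5, 1.5])
print(np.sign(w @ z + b))                  # +1 -> class 2, -1 -> class 1
```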

Interpretation of support vector machines
The optimal $w$ is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression.
To compute the weights $\alpha_i$, and to use the support vector machine, we only need to specify the inner products (or kernel) between the examples, $(x^i)^\top x^j$.
We make decisions by comparing each new example $z$ with only the support vectors:
$y^* = \mathrm{sign}\Big(\sum_{i \in \text{support vectors}} \alpha_i y^i\,(x^i)^\top z + b\Big)$

Soft margin constraints
What if the data is not linearly separable? We allow points to violate the hard margin constraint:
$y\,(w^\top x + b) \ge 1 - \xi$
[Figure: the hyperplane $w^\top x + b = 0$ with margin boundaries at $\pm 1$; slack variables $\xi_1, \xi_2, \xi_3$ measure how far the violating points fall inside the margin]

Soft margin SVM
$\min_{w,b,\xi} \;\|w\|^2 + C\sum_{i=1}^m \xi^i \quad \text{s.t.} \quad y^i\,(w^\top x^i + b) \ge 1 - \xi^i,\;\; \xi^i \ge 0, \;\; \forall i$

Convert to standard form:
$\min_{w,b,\xi} \;\frac{1}{2} w^\top w + C\sum_{i=1}^m \xi^i \quad \text{s.t.} \quad 1 - y^i\,(w^\top x^i + b) - \xi^i \le 0,\;\; \xi^i \ge 0, \;\; \forall i$

The Lagrangian function:
$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + \sum_{i=1}^m \Big( C\xi^i + \alpha_i\big(1 - y^i\,(w^\top x^i + b) - \xi^i\big) - \beta_i \xi^i \Big)$

Deriving the dual problem
$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + \sum_{i=1}^m \Big( C\xi^i + \alpha_i\big(1 - y^i\,(w^\top x^i + b) - \xi^i\big) - \beta_i \xi^i \Big)$

Taking derivatives and setting them to zero:
$\frac{\partial L}{\partial w} = w - \sum_{i=1}^m \alpha_i y^i x^i = 0$
$\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y^i = 0$
$\frac{\partial L}{\partial \xi^i} = C - \alpha_i - \beta_i = 0$

Plug back the relations for $w$, $b$ and $\xi$
$L = \frac{1}{2}\Big(\sum_i \alpha_i y^i x^i\Big)^{\!\top}\Big(\sum_j \alpha_j y^j x^j\Big) + \sum_i \alpha_i \Big(1 - y^i\Big(\big(\sum_j \alpha_j y^j x^j\big)^{\!\top} x^i + b\Big)\Big)$

After simplification:
$L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$

The dual problem
$\max_\alpha \;\sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$
$\text{s.t.} \quad C - \alpha_i - \beta_i = 0,\; \alpha_i \ge 0,\; \beta_i \ge 0,\; i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y^i = 0$

The constraints $C - \alpha_i - \beta_i = 0$, $\alpha_i \ge 0$, $\beta_i \ge 0$ can be simplified to $0 \le \alpha_i \le C$.
This is a constrained quadratic program: nice and convex, so the global maximum can be found.
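As an aside not in the slides: the box constraint $0 \le \alpha_i \le C$ is exactly what off-the-shelf SVM solvers handle internally. A hedged sketch with scikit-learn's SVC (assumed installed) on a small made-up, non-separable dataset:

```python
# Sketch: a soft-margin linear SVM via scikit-learn's SVC; C is the slack penalty,
# and internally the dual variables satisfy 0 <= alpha_i <= C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [0.4, 0.6],   # first three rows: class +1
              [0.0, 0.5], [0.5, 0.0], [2.2, 1.8]])  # last three rows: class -1
y = np.array([1, 1, 1, -1, -1, -1])
# Two points are deliberately placed on the "wrong" side, so no hard margin exists.

clf = SVC(kernel='linear', C=1.0).fit(X, y)     # smaller C -> more slack tolerated
print('w =', clf.coef_[0], 'b =', clf.intercept_[0])
print('support vector indices:', clf.support_)  # the points with alpha_i > 0
print('prediction for [1.5, 1.5]:', clf.predict([[1.5, 1.5]]))
```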

Learning nonlinear decision boundary
Some problems are linearly separable; others, such as the XOR gate or speech recognition, are nonlinearly separable and require a nonlinear decision boundary.
[Figure: a linearly separable dataset vs. nonlinearly separable examples: the XOR gate and a speech recognition dataset]
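Not part of the deck, but as a quick illustration (assuming scikit-learn is available): an RBF-kernel SVM fits the XOR pattern that no linear boundary can separate, which is the kind of nonlinear boundary this slide refers to.

```python
# Sketch: an RBF-kernel SVM on the XOR gate; a linear SVM cannot separate these
# four points, but the kernelized soft-margin SVM can.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                       # XOR labels
clf = SVC(kernel='rbf', gamma=2.0, C=10.0).fit(X, y)
print(clf.predict(X))                              # should reproduce [-1  1  1 -1]
```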

A decision tree for Tax Fraud
Input: a vector of attributes $X$ = [Refund, MarSt, TaxInc]
Output: $Y$ = Cheating or Not

H as a procedure:
  Each internal node: tests one attribute $X_i$
  Each branch from a node: selects one value for $X_i$
  Each leaf node: predicts $Y$

The tree:
  Refund = Yes -> NO
  Refund = No  -> test MarSt
    MarSt = Married            -> NO
    MarSt = Single or Divorced -> test TaxInc
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES

Apply model to test data
Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch that matches the query at each node:
  Refund = No     -> go to the MarSt node
  MarSt = Married -> reach the leaf NO
Assign Cheat to "No".
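A small sketch, not from the slides, that hard-codes the Tax Fraud tree above as nested if/else tests and replays the traversal on this query record:

```python
# Sketch: the Tax Fraud tree as nested tests (Refund -> MarSt -> TaxInc).
def predict_cheat(refund, marital_status, taxable_income_k):
    if refund == 'Yes':
        return 'No'
    if marital_status == 'Married':
        return 'No'
    # Single or Divorced: fall through to the income test (income in thousands)
    return 'No' if taxable_income_k < 80 else 'Yes'

# Query record from the slides: Refund = No, Marital Status = Married, Income = 80K
print(predict_cheat('No', 'Married', 80))   # -> No
```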

Expressiveness of decision tree
Decision trees can express any function of the input attributes. E.g., for Boolean functions: truth table row → path to leaf.
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example. We prefer to find more compact decision trees.

Hypothesis spaces (model space)
How many distinct decision trees with $n$ Boolean attributes?
  = number of Boolean functions
  = number of distinct truth tables with $2^n$ rows
  = $2^{2^n}$
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
  Each attribute can be in (positive), in (negative), or out ⇒ $3^n$ distinct conjunctive hypotheses.

A more expressive hypothesis space:
  increases the chance that the target function can be expressed
  increases the number of hypotheses consistent with the training set ⇒ may get worse predictions

Decision tree learning

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

A tree induction algorithm learns a model (a decision tree) from the training set (induction); the model is then applied to the test set (deduction).

Example of a decision tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: a decision tree with splitting attributes Refund, MarSt, TaxInc:
  Refund = Yes -> NO
  Refund = No  -> MarSt
    MarSt = Married            -> NO
    MarSt = Single or Divorced -> TaxInc
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES

Another example of a decision tree

Using the same training data as the previous slide, a different tree also fits:
  MarSt = Married            -> NO
  MarSt = Single or Divorced -> Refund
    Refund = Yes -> NO
    Refund = No  -> TaxInc
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES

There could be more than one tree that fits the same data!

Top-Down Induction of Decision Tree
Main loop:
  1. A <- the "best" decision attribute for the next node
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant of the node
  4. Sort the training examples to the leaf nodes
  5. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes
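A compact sketch of this main loop, not from the slides: it assumes purely categorical attributes (here, Taxable Income is pre-binned at 80K by hand) and uses the information-gain criterion introduced a few slides later to pick the "best" attribute.

```python
# Sketch: recursive top-down induction on the Tax Fraud training data,
# choosing the split attribute by information gain (defined on later slides).
from collections import Counter
from math import log2

data = [  # (Refund, MarSt, TaxInc, Cheat), TaxInc binned at 80K
    ('Yes', 'Single',   '>=80K', 'No'),  ('No', 'Married',  '>=80K', 'No'),
    ('No',  'Single',   '<80K',  'No'),  ('Yes', 'Married', '>=80K', 'No'),
    ('No',  'Divorced', '>=80K', 'Yes'), ('No', 'Married',  '<80K',  'No'),
    ('Yes', 'Divorced', '>=80K', 'No'),  ('No', 'Single',   '>=80K', 'Yes'),
    ('No',  'Married',  '<80K',  'No'),  ('No', 'Single',   '>=80K', 'Yes'),
]
attrs = {'Refund': 0, 'MarSt': 1, 'TaxInc': 2}

def entropy(rows):
    total = len(rows)
    return -sum(c / total * log2(c / total)
                for c in Counter(r[-1] for r in rows).values())

def info_gain(rows, idx):
    total = len(rows)
    children = [[r for r in rows if r[idx] == v] for v in set(r[idx] for r in rows)]
    return entropy(rows) - sum(len(c) / total * entropy(c) for c in children)

def build(rows, remaining):
    labels = set(r[-1] for r in rows)
    if len(labels) == 1 or not remaining:           # perfectly classified, or no attributes left
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    name = max(remaining, key=lambda a: info_gain(rows, attrs[a]))  # the "best" attribute
    idx = attrs[name]
    rest = [a for a in remaining if a != name]
    return {name: {v: build([r for r in rows if r[idx] == v], rest)
                   for v in set(r[idx] for r in rows)}}

tree = build(data, list(attrs))
print(tree)   # nested dict; the root splits on MarSt here, so the learned tree
              # need not match the tree drawn earlier (several trees fit the data)
```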

Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

Issues:
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting

Splitting Based on Nominal Attributes
  Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}
  Binary split: divide the values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
  Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small}, {Medium}, {Large}
  Binary split: divide the values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}

Splitting Based on Continuous Attributes
Different ways of handling:
  Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  Binary decision: $(A < v)$ or $(A \ge v)$
    Consider all possible splits and find the best cut; can be more compute intensive
[Figure: (i) a binary split "Taxable Income > 80K?" with Yes/No branches; (ii) a multi-way split "Taxable Income?" with branches < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K]
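Not in the slides: a sketch of the exhaustive best-cut search for a binary decision $(A < v)$, run on the Taxable Income column of the earlier training data and scoring each candidate cut with the entropy-based information gain defined on the following slides.

```python
# Sketch: scan all candidate cuts (midpoints between sorted values) on a continuous
# attribute and keep the one with the largest information gain.
from collections import Counter
from math import log2

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]        # in thousands
cheat  = ['No','No','No','No','Yes','No','No','Yes','No','Yes']

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_for_cut(v):
    left  = [y for x, y in zip(income, cheat) if x < v]
    right = [y for x, y in zip(income, cheat) if x >= v]
    n = len(cheat)
    return entropy(cheat) - (len(left) / n * entropy(left) + len(right) / n * entropy(right))

xs = sorted(set(income))
cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]   # candidate thresholds
best = max(cuts, key=gain_for_cut)
print(best, gain_for_cut(best))                    # best binary cut and its gain
```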

How to determine the Best Split
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
  Homogeneous subsets -> low degree of impurity
  Non-homogeneous subsets -> high degree of impurity
Greedy approach: nodes with a homogeneous class distribution are preferred.
We need a measure of node impurity.

How to compare attributes? Entropy
Entropy $H(X)$ of a random variable $X$:
$H(X)$ is the expected number of bits needed to encode a randomly drawn value of $X$ (under the most efficient code). Information theory: the most efficient code assigns $-\log_2 P(X = i)$ bits to encode the message $X = i$. So the expected number of bits to code one random $X$ is
$H(X) = -\sum_i P(X = i) \log_2 P(X = i)$

Sample entropy
$S$ is a sample of training examples; $p_+$ is the proportion of positive examples in $S$, and $p_-$ is the proportion of negative examples in $S$. Entropy measures the impurity of $S$:
$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

Examples for computing Entropy

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = $-0 \log_2 0 - 1 \log_2 1 = 0$

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Entropy = $-(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.65$

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Entropy = $-(2/6)\log_2(2/6) - (4/6)\log_2(4/6) = 0.92$
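A tiny check of the three rows above (standard library only; the tuples are the C1/C2 class counts from the table):

```python
# Sketch: entropy of a node from its class counts.
from math import log2

def entropy(counts):
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]   # 0 log 0 is taken to be 0
    return -sum(p * log2(p) for p in ps)

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 2))    # -> 0.0, 0.65, 0.92
```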

How to compare attributes? Conditional Entropy
Conditional entropy of variable $X$ given variable $Y$:
  Given a specific value $Y = v$, the entropy of $X$ is
  $H(X \mid Y = v) = -\sum_i P(X = i \mid Y = v) \log_2 P(X = i \mid Y = v)$
  The conditional entropy $H(X \mid Y)$ of $X$ is the average of $H(X \mid Y = v)$ over $v$:
  $H(X \mid Y) = \sum_v P(Y = v)\, H(X \mid Y = v)$
Mutual information (aka information gain) of $X$ given $Y$:
  $I(X; Y) = H(X) - H(X \mid Y)$

Information Gain
Information gain (after splitting a node):
$\mathrm{GAIN}_{\text{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$
where the parent node $p$ with $n$ samples is split into $k$ partitions, and $n_i$ is the number of records in partition $i$.
This measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (i.e., maximizes GAIN).
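For concreteness (not from the slides): a sketch computing $\mathrm{GAIN}_{\text{split}}$ for two candidate root splits of the earlier Tax Fraud training data; here the Marital Status split removes more entropy than the Refund split.

```python
# Sketch: information gain of a categorical split on the Tax Fraud data.
from collections import Counter
from math import log2

cheat  = ['No','No','No','No','Yes','No','No','Yes','No','Yes']
refund = ['Yes','No','No','Yes','No','No','Yes','No','No','No']
marst  = ['Single','Married','Single','Married','Divorced','Married',
          'Divorced','Single','Married','Single']

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(attr, labels):
    n = len(labels)
    parts = [[y for a, y in zip(attr, labels) if a == v] for v in set(attr)]
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts)

print('GAIN(Refund) =', round(gain(refund, cheat), 3))   # ~0.19
print('GAIN(MarSt)  =', round(gain(marst, cheat), 3))    # ~0.28 -> preferred split
```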

Problem of splitting using information gain
Disadvantage: it tends to prefer splits that result in a large number of partitions, each being small but pure.

Gain Ratio:
$\mathrm{GainRATIO}_{\text{split}} = \frac{\mathrm{GAIN}_{\text{split}}}{\mathrm{SplitINFO}}, \qquad \mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}$

This adjusts the information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitionings (a large number of small partitions) are penalized. Used in C4.5; designed to overcome the disadvantage of information gain.
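A follow-on sketch, using the same Tax Fraud data as the previous block: the gain ratio for the Marital Status split, and for a pathological ID-like attribute with one record per value, whose large SplitINFO shrinks its otherwise perfect gain.

```python
# Sketch: GainRATIO = GAIN / SplitINFO; compare an ordinary split with an ID-like one.
from collections import Counter
from math import log2

cheat = ['No','No','No','No','Yes','No','No','Yes','No','Yes']
marst = ['Single','Married','Single','Married','Divorced','Married',
         'Divorced','Single','Married','Single']
tid   = [str(i) for i in range(10)]   # one record per value: pure but tiny children

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr, labels):
    n = len(labels)
    parts = [[y for a, y in zip(attr, labels) if a == v] for v in set(attr)]
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts)
    return gain, split_info, gain / split_info

print(gain_ratio(marst, cheat))   # gain ~0.28, SplitINFO ~1.52
print(gain_ratio(tid, cheat))     # gain ~0.88, but SplitINFO ~3.32 drags the ratio down
```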

Stopping Criteria for Tree Induction
  Stop expanding a node when all the records belong to the same class
  Stop expanding a node when all the records have similar attribute values
  Early termination (to be discussed later)

Decision Tree Based Classification
Advantages:
  Inexpensive to construct
  Extremely fast at classifying unknown records
  Easy to interpret for small-sized trees
  Accuracy is comparable to other classification techniques for many simple data sets

Example: C4.5
  Simple depth-first construction
  Uses information gain
  Sorts continuous attributes at each node
  Needs the entire data to fit in memory; unsuitable for large datasets (would need out-of-core sorting)
  You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
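C4.5 itself is the C program at the URL above. As a loose modern stand-in (an assumption on my part, not the lecture's recommendation), scikit-learn's DecisionTreeClassifier implements CART with an entropy criterion and can be run on a one-hot encoding of the Tax Fraud data:

```python
# Hedged sketch: CART with criterion='entropy' (not C4.5 proper) on the Tax Fraud data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Refund=Yes, MarSt=Single, MarSt=Divorced, MarSt=Married, TaxableIncome (K)
X = np.array([[1,1,0,0,125],[0,0,0,1,100],[0,1,0,0,70],[1,0,0,1,120],[0,0,1,0,95],
              [0,0,0,1,60],[1,0,1,0,220],[0,1,0,0,85],[0,0,0,1,75],[0,1,0,0,90]])
y = ['No','No','No','No','Yes','No','No','Yes','No','Yes']

clf = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(clf, feature_names=['Refund', 'Single', 'Divorced', 'Married', 'TaxInc']))
print(clf.predict([[0, 0, 0, 1, 80]]))   # the earlier query record; expected: ['No']
```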