SVM and Decision Tree
Le Song
Machine Learning I, CSE 6740, Fall 2013
Which decision boundary is better?
Suppose the training samples are linearly separable. We can find a decision boundary which gives zero training error.
But there are many such decision boundaries. Which one is better?
[Figure: two linearly separable classes (Class 1 and Class 2) with several candidate decision boundaries, all achieving zero training error]
Compare two decision boundaries
Suppose we perturb the data. Which boundary is more susceptible to error?
Constraints on data points
For all x in class 2, y = 1 and w^T x + b ≥ c
For all x in class 1, y = −1 and w^T x + b ≤ −c
Or more compactly: (w^T x + b) y ≥ c
[Figure: separating hyperplane w^T x + b = 0 with normal vector w, and the two classes at distance c on either side]
Classifier margin
Pick two data points x1 and x2 which lie on the two dashed margin lines respectively:
  w^T x1 + b = c
  w^T x2 + b = −c
The margin is
  γ = (1/||w||) w^T (x1 − x2) = 2c / ||w||
[Figure: hyperplane w^T x + b = 0 with x1 and x2 on the opposite margin lines]
Maximum margin classifier
Find the decision boundary w as far from the data points as possible:
  max_{w,b} 2c / ||w||
  s.t. y_i (w^T x_i + b) ≥ c, ∀i
[Figure: maximum margin hyperplane w^T x + b = 0 with margin points x1 and x2]
Support vector machines with hard margin
  min_{w,b} ||w||^2
  s.t. y_i (w^T x_i + b) ≥ 1, ∀i
Convert to standard form:
  min_{w,b} (1/2) w^T w
  s.t. 1 − y_i (w^T x_i + b) ≤ 0, ∀i
The Lagrangian function:
  L(w, b, α) = (1/2) w^T w + Σ_i α_i (1 − y_i (w^T x_i + b))
Deriving the dual problem
  L(w, b, α) = (1/2) w^T w + Σ_i α_i (1 − y_i (w^T x_i + b))
Taking derivatives and setting them to zero:
  ∂L/∂w = w − Σ_i α_i y_i x_i = 0
  ∂L/∂b = −Σ_i α_i y_i = 0
Plug back the relation between w and b
  L(w, b, α) = (1/2) (Σ_i α_i y_i x_i)^T (Σ_j α_j y_j x_j) + Σ_i α_i (1 − y_i ((Σ_j α_j y_j x_j)^T x_i + b))
After simplification:
  L(w, b, α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
The dual problem
  max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
  s.t. α_i ≥ 0, i = 1, …, m
       Σ_i α_i y_i = 0
This is a constrained quadratic program: nice and convex, and the global maximum can be found.
w can then be recovered as w = Σ_i α_i y_i x_i. How about b?
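In practice this dual is handed to a QP solver; as a sketch of what such a solve produces, here is a tiny projected-gradient-ascent version on a hypothetical two-point dataset (the data, step size, and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical toy data: one point per class, linearly separable.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Q_ij = y_i y_j x_i^T x_j, so the dual is: max_a sum(a) - 0.5 a^T Q a.
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Projected gradient ascent: a simple stand-in for a real QP solver.
a = np.zeros(len(y))
for _ in range(2000):
    a += 0.1 * (1.0 - Q @ a)        # ascend the dual objective
    a = np.clip(a, 0.0, None)       # enforce a_i >= 0
    a -= y * (y @ a) / (y @ y)      # project onto sum_i a_i y_i = 0

w = (a * y) @ X                     # recover w = sum_i a_i y_i x_i
print(np.round(a, 3), np.round(w, 3))  # [0.25 0.25] [0.5 0.5]
```

The recovered w = (0.5, 0.5) gives margin 2/||w|| = 2√2, exactly the distance between the two points, as expected.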
Support vectors
Note the KKT condition: α_i (1 − y_i (w^T x_i + b)) = 0
For data points with 1 − y_i (w^T x_i + b) < 0, α_i = 0
For data points with 1 − y_i (w^T x_i + b) = 0, α_i > 0
Call the training data points whose α_i's are nonzero the support vectors (SV).
[Figure: only the points on the margin lines have nonzero multipliers, e.g. α1 = 0.8, α6 = 1.4, α8 = 0.6; all other α_i = 0]
Computing b and obtaining the classifier
Pick any data point with α_i > 0 and solve for b with 1 − y_i (w^T x_i + b) = 0.
For a new test point z, compute
  w^T z + b = Σ_{i ∈ support vectors} α_i y_i x_i^T z + b
Classify z as class 1 if the result is positive, and class 2 otherwise.
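The test-time computation only touches the support vectors. A minimal sketch, where the support-vector values are made up for illustration (they happen to match the toy dual solution α_i = 0.25 on two points, not the slides' figure):

```python
import numpy as np

def svm_decision(z, sv_x, sv_y, sv_alpha, b):
    # w^T z + b expanded as a sum over the support vectors only.
    return sum(a_i * y_i * (x_i @ z)
               for x_i, y_i, a_i in zip(sv_x, sv_y, sv_alpha)) + b

# Hypothetical support vectors (illustrative values).
sv_x = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_y = np.array([1.0, -1.0])
sv_alpha = np.array([0.25, 0.25])
b = 0.0

z = np.array([2.0, 0.5])
label = 1 if svm_decision(z, sv_x, sv_y, sv_alpha, b) > 0 else -1
print(label)  # 1
```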
Interpretation of support vector machines
The optimal w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression.
To compute the weights α_i, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples, x_i^T x_j.
We make decisions by comparing each new example z with only the support vectors:
  y* = sign(Σ_{i ∈ support vectors} α_i y_i x_i^T z + b)
Soft margin constraints
What if the data is not linearly separable? We will allow points to violate the hard margin constraint:
  (w^T x + b) y ≥ 1 − ξ
[Figure: separating hyperplane w^T x + b = 0 with slack variables ξ1, ξ2, ξ3 measuring how far points fall on the wrong side of the margin]
Soft margin SVM
  min_{w,b,ξ} ||w||^2 + C Σ_{i=1}^m ξ_i
  s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i
Convert to standard form:
  min_{w,b,ξ} (1/2) w^T w + C Σ_i ξ_i
  s.t. 1 − y_i (w^T x_i + b) − ξ_i ≤ 0, −ξ_i ≤ 0, ∀i
The Lagrangian function:
  L(w, b, ξ, α, β) = (1/2) w^T w + C Σ_i ξ_i + Σ_i α_i (1 − y_i (w^T x_i + b) − ξ_i) − Σ_i β_i ξ_i
Deriving the dual problem
  L(w, b, ξ, α, β) = (1/2) w^T w + C Σ_i ξ_i + Σ_i α_i (1 − y_i (w^T x_i + b) − ξ_i) − Σ_i β_i ξ_i
Taking derivatives and setting them to zero:
  ∂L/∂w = w − Σ_i α_i y_i x_i = 0
  ∂L/∂b = −Σ_i α_i y_i = 0
  ∂L/∂ξ_i = C − α_i − β_i = 0
Plug back the relations of w, b and ξ
Substitute w = Σ_i α_i y_i x_i; the condition C − α_i − β_i = 0 makes all the ξ terms cancel:
  L = (1/2) (Σ_i α_i y_i x_i)^T (Σ_j α_j y_j x_j) + Σ_i α_i (1 − y_i ((Σ_j α_j y_j x_j)^T x_i + b))
After simplification:
  L = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
The dual problem
  max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
  s.t. C − α_i − β_i = 0, α_i ≥ 0, β_i ≥ 0, i = 1, …, m
       Σ_i α_i y_i = 0
The constraints C − α_i − β_i = 0, α_i ≥ 0, β_i ≥ 0 can be simplified to C ≥ α_i ≥ 0.
This is a constrained quadratic program: nice and convex, and the global maximum can be found.
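Compared with the hard-margin dual solve, the only change is the upper bound: α is now clipped into the box [0, C]. Reusing the hypothetical two-point example with a deliberately small C shows the bound binding — the unconstrained optimum α_i = 0.25 gets capped at C = 0.1 (toy data and constants are illustrative assumptions):

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # same hypothetical toy data
y = np.array([1.0, -1.0])
C = 0.1                                     # small C: the box constraint will bind

Q = (y[:, None] * X) @ (y[:, None] * X).T
a = np.zeros(len(y))
for _ in range(2000):
    a += 0.1 * (1.0 - Q @ a)                # ascend the dual objective
    a = np.clip(a, 0.0, C)                  # the soft-margin box: C >= a_i >= 0
    a -= y * (y @ a) / (y @ y)              # project onto sum_i a_i y_i = 0
print(np.round(a, 3))  # [0.1 0.1]
```

Both multipliers sit exactly at C, which is how violating or margin-limited points show up in the soft-margin solution.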
Learning nonlinear decision boundary
Some problems are linearly separable, but many are not, e.g. the XOR gate and speech recognition.
[Figure: a linearly separable dataset vs. a nonlinearly separable one (XOR)]
A decision tree for Tax Fraud
Input: a vector of attributes X = [Refund, MarSt, TaxInc]
Output: Y = Cheating or Not
H as a procedure:
  Each internal node: tests one attribute X_i
  Each branch from a node: selects one value for X_i
  Each leaf node: predicts Y

  Refund?
  ├─ Yes → NO
  └─ No → MarSt?
          ├─ Married → NO
          └─ Single, Divorced → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES
Apply model to test data
Query record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
  I.   Test the root attribute, Refund.
  II.  Refund = No, so follow the "No" branch to MarSt.
  III. Test MarSt.
  IV.  MarSt = Married, so follow the "Married" branch.
  V.   Reach the leaf NO: assign Cheat to "No".
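The walkthrough is literally a chain of attribute tests. A minimal sketch of the tax-fraud tree as code (income assumed to be in thousands, and the boundary case of exactly 80K sent down the "< 80K" branch — both assumptions on our part, since the walkthrough never reaches the income test):

```python
def classify(record):
    """Walk the tax-fraud decision tree: Refund, then MarSt, then TaxInc."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands; ties go to "No").
    return "Yes" if record["TaxInc"] > 80 else "No"

# The query record from the walkthrough: Refund = No, Married, 80K.
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # No
print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 90}))   # Yes
```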
Expressiveness of decision trees
Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row maps to a path to a leaf.
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example. But we prefer to find more compact decision trees.
Hypothesis spaces (model space)
How many distinct decision trees with n Boolean attributes?
  = number of Boolean functions of n variables
  = number of distinct truth tables with 2^n rows
  = 2^(2^n)
E.g., with 6 Boolean attributes there are 2^64 = 18,446,744,073,709,551,616 trees.
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out, giving 3^n distinct conjunctive hypotheses.
A more expressive hypothesis space:
  increases the chance that the target function can be expressed
  increases the number of hypotheses consistent with the training set, so it may give worse predictions
Decision tree learning

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A tree induction algorithm learns a model (a decision tree) from the training set (induction); the model is then applied to the test set (deduction):

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example of a decision tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree (splitting attributes at the internal nodes):
  Refund?
  ├─ Yes → NO
  └─ No → MarSt?
          ├─ Married → NO
          └─ Single, Divorced → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES
Another example of a decision tree
For the same training data:
  MarSt?
  ├─ Married → NO
  └─ Single, Divorced → Refund?
                        ├─ Yes → NO
                        └─ No → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES
There could be more than one tree that fits the same data!
Top-Down Induction of Decision Trees
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes
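The main loop above is essentially ID3. A self-contained sketch, using information gain as the "best attribute" criterion and the 10-record tax table from these slides, with Taxable Income pre-discretized at 80K (the discretization is our simplification, since this sketch handles only categorical attributes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    def gain(a):
        rem = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - rem
    return max(attrs, key=gain)

def build(rows, labels, attrs):
    if len(set(labels)) == 1:                  # perfectly classified: STOP
        return labels[0]
    if not attrs:                              # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)    # the "best" decision attribute
    children = {}
    for v in set(r[a] for r in rows):          # one descendant per value of a
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        children[v] = build([rows[i] for i in idx], [labels[i] for i in idx],
                            [x for x in attrs if x != a])
    return (a, children)

def predict(tree, record):
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[record[attr]]
    return tree

# Tax training data (Taxable Income discretized: Low < 80K <= High).
cols = ("Refund", "MarSt", "TaxInc")
data = [("Yes", "Single", "High", "No"),   ("No", "Married", "High", "No"),
        ("No", "Single", "Low", "No"),     ("Yes", "Married", "High", "No"),
        ("No", "Divorced", "High", "Yes"), ("No", "Married", "Low", "No"),
        ("Yes", "Divorced", "High", "No"), ("No", "Single", "High", "Yes"),
        ("No", "Married", "Low", "No"),    ("No", "Single", "High", "Yes")]
rows = [dict(zip(cols, d[:3])) for d in data]
labels = [d[3] for d in data]

tree = build(rows, labels, list(cols))
print(tree[0])  # MarSt: the highest information gain at the root
```

On this discretized table the root split is MarSt, reproducing the "another example" tree rather than the Refund-rooted one — a concrete reminder that more than one tree fits the same data.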
Tree induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
  Determine how to split the records: how to specify the attribute test condition, and how to determine the best split?
  Determine when to stop splitting.
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType ∈ {Family, Sports, Luxury}.
Binary split: divide the values into two subsets; we need to find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as distinct values, e.g. Size ∈ {Small, Medium, Large}.
Binary split: divide the values into two subsets; we need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
Splitting Based on Continuous Attributes
Different ways of handling:
  Discretization to form an ordinal categorical attribute:
    Static: discretize once at the beginning.
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  Binary decision: (A < t) or (A ≥ t); consider all possible splits and find the best cut. Can be more compute intensive.
Examples:
  (i) Binary split: Taxable Income > 80K? Yes / No
  (ii) Multi-way split: Taxable Income ∈ {< 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K}
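"Consider all possible splits and find the best cut" can be sketched as a scan over midpoints between adjacent sorted values, scoring each candidate threshold by the weighted entropy of the two sides. Run on the Taxable Income column from these slides, it illustrates that the entropy-minimizing cut on the raw labels need not be the 80K used in the example tree:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-labels.count(c) / n * log2(labels.count(c) / n)
               for c in set(labels))

def best_threshold(values, labels):
    """Scan candidate cuts t between adjacent sorted values; keep the t
    minimizing the weighted entropy of the partitions A < t and A >= t."""
    pairs = sorted(zip(values, labels))
    best_t, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

# Taxable Income (in K) and Cheat labels from the 10-record training table.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))  # (97.5, 0.6)
```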
How to determine the best split
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
A homogeneous class distribution has a low degree of impurity; a non-homogeneous one has a high degree of impurity.
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.
How to compare attributes? Entropy
Entropy H(X) of a random variable X:
  H(X) = −Σ_i P(X = i) log2 P(X = i)
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
Information theory: the most efficient code assigns −log2 P(X = i) bits to encode the message X = i, so the expected number of bits to code one random X is Σ_i P(X = i) · (−log2 P(X = i)).
Sample entropy
S is a sample of training examples; p+ is the proportion of positive examples in S, and p− is the proportion of negative examples. Entropy measures the impurity of S:
  Entropy(S) = −p+ log2 p+ − p− log2 p−
Examples for computing entropy
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = −0 log2 0 − 1 log2 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.92
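The three worked examples can be checked directly (a minimal sketch: class counts in, bits out):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node, given its per-class record counts."""
    n = sum(counts)
    return sum(-c / n * log2(c / n) for c in counts if c > 0)  # 0 log 0 = 0

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```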
How to compare attributes? Conditional entropy
Conditional entropy of variable X given variable Y:
Given a specific value Y = v, the entropy of X is
  H(X | Y = v) = −Σ_i P(X = i | Y = v) log2 P(X = i | Y = v)
The conditional entropy H(X | Y) of X is the average of H(X | Y = v):
  H(X | Y) = Σ_v P(Y = v) H(X | Y = v)
The mutual information (aka information gain) of X given Y:
  I(X; Y) = H(X) − H(X | Y)
Information Gain
Information gain after splitting a node:
  GAIN_split = Entropy(p) − Σ_{i=1}^k (n_i / n) Entropy(i)
The parent node p with n samples is split into k partitions; n_i is the number of records in partition i.
This measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
Problem of splitting using information gain
Disadvantage: information gain tends to prefer splits that result in a large number of partitions, each being small but pure.
Gain ratio:
  GainRATIO_split = GAIN_split / SplitINFO
  SplitINFO = −Σ_{i=1}^k (n_i / n) log2 (n_i / n)
This adjusts the information gain by the entropy of the partitioning (SplitINFO): a higher-entropy partitioning (a large number of small partitions) is penalized. Used in C4.5; designed to overcome the disadvantage of information gain.
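Both criteria are short formulas over counts. A sketch that scores splitting the 10-record tax table on Refund (at the parent, 3 cheaters and 7 non-cheaters; Refund = Yes gives [0 Yes, 3 No] and Refund = No gives [3 Yes, 4 No]); note that SplitINFO is just the entropy of the partition sizes:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * log2(c / n) for c in counts if c > 0)

def gain_and_ratio(parent_counts, partitions):
    """GAIN_split and GainRATIO for a split of a node into the given
    partitions (each partition is its own list of per-class counts)."""
    n = sum(parent_counts)
    gain = entropy(parent_counts) - sum(sum(p) / n * entropy(p)
                                        for p in partitions)
    split_info = entropy([sum(p) for p in partitions])  # entropy of sizes
    return gain, gain / split_info

# Splitting on Refund: parent [3 Yes, 7 No]; children [0, 3] and [3, 4].
g, gr = gain_and_ratio([3, 7], [[0, 3], [3, 4]])
print(round(g, 2), round(gr, 2))  # 0.19 0.22
```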
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the same class.
Stop expanding a node when all the records have similar attribute values.
Early termination (to be discussed later).
Decision Tree Based Classification
Advantages:
  Inexpensive to construct
  Extremely fast at classifying unknown records
  Easy to interpret for small-sized trees
  Accuracy comparable to other classification techniques for many simple data sets
Example: C4.5
  Simple depth-first construction
  Uses information gain
  Sorts continuous attributes at each node
  Needs the entire data set to fit in memory, so it is unsuitable for large datasets (would need out-of-core sorting)
You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz