UU - IT - UDBL
DATA MINING – 1DL360 Fall 2010
An introductory class in data mining
http://www.it.uu.se/edu/course/homepage/infoutv/ht10
Erik Zeitler, UDBL, Dept. of IT, Uppsala University, Sweden
Erik Zeitler
2010-09-15
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation (Tan, Steinbach, Kumar ch. 4)
Erik Zeitler Department of Information Technology Uppsala University, Uppsala, Sweden
Definition of classification
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
Illustrating Classification Task

Training set (Learn Model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small     70K     No
4    Yes      Medium   120K     No
5    No       Large     95K     Yes
6    No       Medium    60K     No
7    Yes      Large    220K     No
8    No       Small     85K     Yes
9    No       Medium    75K     No
10   No       Small     90K     Yes

Test set (Apply Model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small     55K     ?
12   Yes      Medium    80K     ?
13   Yes      Large    110K     ?
14   No       Small     95K     ?
15   No       Large     67K     ?
Examples of classification tasks
• Predict tumor cells as benign or malignant
• Classify credit card transactions as legitimate or fraudulent
• Classify secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorize news stories as finance, weather, entertainment, sports, etc.
Classification techniques
• Nearest Neighbour methods
• Decision Tree methods
• Rule-based methods
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Instance-based classifiers (ch 5.2)
(Figure: a set of stored cases with attributes Atr1 … AtrN and a class label A/B/C, and an unseen case with the same attributes.)
• Store the training records
• Use the training records to predict the class label of unseen cases
Instance-based classifiers
• Examples:
  – Rote-learner
    • Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
  – Nearest neighbor
    • Uses the k “closest” points (nearest neighbors) to perform classification
kNN classifier intuition
• If you don’t know what you are,
• look at the nearest ones around you:
• you are probably of the same kind.
Classifying you using kNN
• Each one of you belongs to a group: [F | STS | IT | Int Masters | Exchange | Other]
• Classify yourself using 1-NN and 3-NN: look at your k nearest neighbors!
• How do we select our distance measure?
• How do we decide which of 1-NN and 3-NN is best?
A basic kNN classifier implementation
• Input:
  – A test point x
  – The set of known points P
  – Number of neighbors k
  – Distance function
• Output:
  – Class belonging c
• Implementation:
  1. Find the set of k points N ⊂ P that are nearest to x
  2. Count the number of occurrences of each class in N
  3. c = the class to which the most points in N belong (tie break?)
• Complexity is O(size(P)) for each tuple to be classified
• Reduce complexity to O(log(size(P))) by using database indexes
• Rule of thumb: k ≤ sqrt(q). Commercial algorithms use a default of 10
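The three implementation steps above can be sketched in Python (a minimal sketch: the sample points, the Euclidean distance choice, and the insertion-order tie break are illustrative assumptions, not part of the slides):

```python
from collections import Counter
import math

def knn_classify(x, P, k, dist=math.dist):
    """Classify test point x from labeled points P = [(point, class), ...]."""
    # 1. Find the k points in P nearest to x
    neighbors = sorted(P, key=lambda pc: dist(x, pc[0]))[:k]
    # 2. Count the number of occurrences of each class among the neighbors
    votes = Counter(c for _, c in neighbors)
    # 3. Return the majority class (Counter breaks ties by first occurrence)
    return votes.most_common(1)[0][0]

P = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify((1.1, 0.9), P, k=3))  # nearest 3 are two A's and one B -> "A"
```

Sorting all of P is the naive O(size(P)) scan the slide mentions; an index over P would replace the `sorted` call.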
More kNN classifier intuition
• If it walks and sounds like a duck → then it must be a duck
• If it walks and sounds like a cow → then it must be a cow
Walking and talking
• Assume that a duck
  – has a step length of 5…15 cm
  – quacks at 600…700 Hz
• Assume that a cow
  – has a step length of 30…60 cm
  – moos at 100…200 Hz
Cows and ducks in a plot
(Scatter plot: step length, stdev ≈ 30, against frequency, stdev ≈ 300.)
Normalize each attribute, either:
• subtract the mean, divide by the stdev, or
• subtract the min, divide by (max – min)
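The two normalization schemes can be sketched as follows (a minimal sketch; the sample step lengths are illustrative):

```python
def z_score(values):
    """Subtract the mean, divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def min_max(values):
    """Subtract the min, divide by (max - min); maps the range onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

steps = [5, 10, 15, 30, 45, 60]   # cm: ducks and cows on one common scale
print(min_max(steps))              # first value becomes 0.0, last becomes 1.0
```

Either scheme stops the high-variance attribute (frequency) from dominating the distance.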
Enter the chicken
Nearest-Neighbor Classifiers
• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute the distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor around a test point x.)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification
• Compute the distance between two points:
  – Minkowski distance (Euclidean: r = 2):
    M(p, q, r) = (Σ_i |p_i − q_i|^r)^(1/r)
• Determine the class from the nearest-neighbor list
  – take the majority vote of class labels among the k nearest neighbors
  – weigh the vote according to distance
    • weight factor w = 1/d²
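A sketch of the Minkowski distance and of the distance-weighted vote (assuming w = 1/d², as on the slide; the sample neighbors are illustrative):

```python
def minkowski(p, q, r=2):
    """Minkowski distance; r = 2 gives the Euclidean distance."""
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

def weighted_vote(x, neighbors):
    """neighbors = [(point, class), ...]; each vote weighted by w = 1/d^2."""
    scores = {}
    for point, cls in neighbors:
        d = minkowski(x, point)
        scores[cls] = scores.get(cls, 0.0) + 1.0 / d ** 2
    return max(scores, key=scores.get)

print(minkowski((0, 0), (3, 4)))  # 5.0
```

With w = 1/d², one very close neighbor can outvote several distant ones.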
Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…
• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  – Example:
    • height of a person may vary from 1.5 m to 1.8 m
    • weight of a person may vary from 90 lb to 300 lb
    • income of a person may vary from $10K to $1M
Nearest Neighbor Classification…
• Problem with the Euclidean measure:
  – High-dimensional data: curse of dimensionality
  – Can produce counter-intuitive results, e.g.
    111111111110 vs 011111111111 → d = 1.4142
    100000000000 vs 000000000001 → d = 1.4142
• Solution: normalize the vectors to unit length
Nearest neighbor classification…
• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive
    • Typically O(q), q being the database size
    • O(log(q)) if an index is utilized, but indexing takes O(q log(q)) time to build (once and for all) and O(q) additional space
Decision tree induction (ch 4.3)
• Many algorithms:
  – Hunt’s algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
Decision tree classification task
(Same training set, Tid 1–10, and test set, Tid 11–15, as in the classification-task illustration above: learn the model from the training set, then apply it to the test set; here the model is a decision tree.)
Example of a decision tree

Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Cheat):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes

Model: Decision Tree (splitting attributes):
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES
Another example of decision tree
(Same training data as above.)
MarSt?
  Married → NO
  Single, Divorced → Refund?
    Yes → NO
    No → TaxInc?
      < 80K → NO
      > 80K → YES
There could be more than one tree that fits the same data!
Decision tree classification task
(Training and test tables repeated as before: learn a decision tree from Tid 1–10, then apply it to Tid 11–15.)
General structure of Hunt’s algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
• Recursively apply the procedure to each subset
(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data above.)
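The procedure can be sketched recursively (a minimal sketch; real implementations choose the split attribute by an impurity measure, whereas here the split choice is simplified to "first attribute that still separates the data"):

```python
from collections import Counter

def hunt(records, attrs, default):
    """records: list of (features_dict, cls). Returns a nested tree or a label."""
    if not records:                       # empty set -> leaf with default class
        return default
    classes = {c for _, c in records}
    if len(classes) == 1:                 # all records in one class -> leaf
        return classes.pop()
    majority = Counter(c for _, c in records).most_common(1)[0][0]
    for a in attrs:                       # pick an attribute that splits the data
        values = {f[a] for f, _ in records}
        if len(values) > 1:
            rest = [b for b in attrs if b != a]
            return {a: {v: hunt([(f, c) for f, c in records if f[a] == v],
                                rest, majority)
                        for v in values}}
    return majority                       # nothing left to split on -> majority

data = [({"Refund": "Yes"}, "No"), ({"Refund": "No"}, "Yes"),
        ({"Refund": "No"}, "Yes")]
print(hunt(data, ["Refund"], "No"))
```

The recursion mirrors the three cases on the slide: pure node, empty node, mixed node.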
Hunt’s algorithm (on the Cheat training data)
Step 1: a single leaf: Don’t Cheat
Step 2: split on Refund
  Yes → Don’t Cheat
  No → Don’t Cheat
Step 3: refine the Refund = No branch by splitting on Marital Status
  Married → Don’t Cheat
  Single, Divorced → Cheat
Step 4: refine the Single, Divorced branch by splitting on Taxable Income
  < 80K → Don’t Cheat
  >= 80K → Cheat
Tree induction
• Greedy strategy
  – Split the records based on an attribute test that (locally) optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
How to specify the test condition?
• Depends on attribute type
  – Nominal (categories)
  – Ordinal (ordered categories)
  – Continuous (interval split)
• Depends on the number of ways to split
  – 2-way split
  – Multi-way split
Splitting based on nominal attributes
• Multi-way split: use as many partitions as distinct values.
  CarType → {Family}, {Sports}, {Luxury}
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}
Splitting based on ordinal attributes
• Multi-way split: use as many partitions as distinct values.
  Size → {Small}, {Medium}, {Large}
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}
• What about this split? {Small, Large} vs {Medium} (it does not preserve the order)
Splitting based on continuous attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • consider all possible splits and find the best cut
    • can be more computationally intensive
Splitting based on continuous attributes
Questions
• Why divide by d² and not by d in vote weighting?
  – In fact, there are different proposals out there: d^(1/2), d, d², …
  – A higher degree will further suppress votes from distant points
  → Why not compare some different distance weightings in the assignment!
• Tree-based classifiers
  – Find the best split attribute
  – Find the best attribute split boundary
  – Meaningful (”interesting”) splits
How to find the best split
Before splitting: counts C0: N00, C1: N01, with impurity M0.
Candidate split A? → nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2, combined into M12.
Candidate split B? → nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4, combined into M34.
Compare Gain = M0 – M12 vs M0 – M34.
How to determine the best split
Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?
How to determine the best split
• Greedy approach:
  – Child nodes with a homogeneous class distribution (maximum discrimination) are preferred
• Need a measure of node impurity of a data set:
  – Non-homogeneous: high degree of impurity
  – Homogeneous: low degree of impurity
Measures of node impurity
• Gini index: GINI(t) = 1 − Σ_j [p(j|t)]²
• Entropy: Entropy(t) = − Σ_j p(j|t) log p(j|t)
• Misclassification error: Error(t) = 1 − max_i P(i|t)
Measure of impurity: GINI
• Gini index for a given node t:
  GINI(t) = 1 − Σ_j [p(j|t)]²  ∈ [0, 1 − 1/nc]
  – p(j|t) is the relative frequency of class j at node t
  – Maximum (1 − 1/nc) when records are equally distributed among all nc classes → least interesting information
  – Minimum (0) when all records belong to one class → most interesting information
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
Examples of computing GINI: GINI(t) = 1 − Σ_j [p(j|t)]²
• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278
• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444
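The computation above, sketched over per-class record counts:

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([3, 3]), 3))  # 0.5
```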
Splitting based on GINI
• When a node p is split into k partitions (children), the quality of the split is computed as
  GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)
  where n_i = number of records at child i, and n = number of records at node p.
Binary attributes: computing GINI index
• Splits into two partitions
• Effect of weighing: large and pure partitions
Parent: C1 = 6, C2 = 6, Gini = 0.500
Split on B? → N1: C1 = 5, C2 = 2; N2: C1 = 1, C2 = 4
Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
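The weighted-children computation can be sketched as follows (the `gini` helper implements GINI(t) = 1 − Σ p² from the earlier slide):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: one per-class count list per partition."""
    n = sum(sum(c) for c in children)
    # weight each child's impurity by its share of the records
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371
```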
Categorical attributes: computing Gini index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions
Multi-way split (CarType):
      Family  Sports  Luxury
  C1  1       2       1
  C2  4       1       1
  Gini = 0.393
Two-way splits (find the best partition of values):
  {Sports, Luxury} vs {Family}: C1 = 3 / 1, C2 = 2 / 4, Gini = 0.400
  {Family, Luxury} vs {Sports}: C1 = 2 / 2, C2 = 5 / 1, Gini = 0.419
Continuous attributes: computing Gini index
• Use binary decisions based on one value
• Several choices for the splitting value
  – the distinct values in the data are meaningful split points
• Each splitting value v has a count matrix associated with it
  – class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v:
  – for each v, scan the database to gather the count matrix and compute its Gini index
  – computationally inefficient! Repetition of work.
(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data above.)
Continuous attributes: computing Gini index...
• For efficient computation: for each attribute,
  – sort the attribute on values
  – linearly scan these values, each time updating the count matrix and computing the Gini index
  – choose the split position that has the least Gini index
Sorted values (Taxable Income): 60, 70, 75, 85, 90, 95, 100, 120, 125, 220
Cheat:                          No, No, No, Yes, Yes, Yes, No, No, No, No
Split positions: 55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230
Gini:            0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420
→ the best split is at 97, with Gini = 0.300
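The linear-scan search can be sketched as follows (a sketch: candidate cuts are taken as midpoints between consecutive distinct values, so the best cut comes out as 97.5 rather than the rounded 97 shown in the table; the boundary positions 55 and 230 are skipped):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """Scan sorted (value, label) pairs once, keeping counts on each side."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n, best = len(pairs), (float("inf"), None)
    for i in range(1, n):
        v, c = pairs[i - 1]
        left[c] += 1; right[c] -= 1        # move one record to the left side
        if pairs[i][0] == v:               # only cut between distinct values
            continue
        cut = (v + pairs[i][0]) / 2
        g = (i / n * gini(list(left.values()))
             + (n - i) / n * gini(list(right.values())))
        best = min(best, (g, cut))
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
g, cut = best_split(income, cheat)
print(round(g, 3), cut)  # 0.3 97.5
```

One sort plus one scan replaces the repeated full-table scans of the naive method.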
Alternative splitting criteria based on INFO
• Entropy at a given node t:
  Entropy(t) = − Σ_j p(j|t) log2 p(j|t)
  (NOTE: p(j|t) is the relative frequency of class j at node t.)
  – Measures the homogeneity of a node
    • Maximum (log2 nc) when records are equally distributed among all classes → least interesting information
    • Minimum (0.0) when all records belong to one class → most interesting information
  – Entropy-based computations are similar to the GINI index computations
Examples of computing Entropy: Entropy(t) = − Σ_j p(j|t) log2 p(j|t)
• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
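A sketch of the entropy computation, with 0 log 0 taken as 0:

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class record counts (0 log 0 := 0)."""
    n = sum(counts)
    return sum(-c / n * log2(c / n) for c in counts if c)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```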
Splitting based on INFO...
• Information Gain:
  GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)
  Parent node p is split into k partitions; n_i is the number of records in partition i
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure
Splitting based on INFO...
• Gain Ratio:
  GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = − Σ_{i=1}^{k} (n_i / n) log (n_i / n)
  Parent node p is split into k partitions; n_i is the number of records in partition i
  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5
  – Designed to overcome the disadvantage of Information Gain
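Both quantities can be sketched together (base-2 logs; the entropy helper over class counts and the example counts are illustrative):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * log2(c / n) for c in counts if c)

def info_gain(parent, children):
    """GAIN_split = Entropy(parent) - weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    """Divide the gain by SplitINFO, the entropy of the partition sizes."""
    n = sum(parent)
    split_info = sum(-sum(c) / n * log2(sum(c) / n) for c in children)
    return info_gain(parent, children) / split_info

children = [[3, 0], [1, 2]]            # two partitions of a [4, 2] parent
print(round(info_gain([4, 2], children), 3))  # 0.459
```

With two equal-sized partitions SplitINFO = 1, so the gain ratio equals the gain; many tiny partitions drive SplitINFO up and the ratio down.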
Splitting criteria based on classification error
• Classification error at a node t:
  Error(t) = 1 − max_i P(i|t)
• Measures the misclassification error made by a node
  • Maximum (1 − 1/nc) when records are equally distributed among all classes → least interesting information
  • Minimum (0.0) when all records belong to one class → most interesting information
Examples of computing Error: Error(t) = 1 − max_i P(i|t)
• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 – max(0, 1) = 1 – 1 = 0
• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6
• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
Comparison among splitting criteria for a 2-class problem
(Figure: Gini, entropy, and misclassification error as functions of the fraction of class-1 records.)
Stopping criteria for tree induction
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination (e.g. when the resulting class set is too small)
Decision-tree-based classification
• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
Model evaluation (ch 4.5)
• Metrics for performance evaluation
  – How to evaluate the performance of a model?
• Methods for performance evaluation
  – How to obtain reliable estimates?
Metrics for performance evaluation
• Focus on the predictive capability of a model
  – rather than how fast it classifies or builds models, scalability, etc.
• Confusion matrix:
                       PREDICTED CLASS
                       Class=Yes   Class=No
  ACTUAL   Class=Yes   a (TP)      b (FN)
  CLASS    Class=No    c (FP)      d (TN)
  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for performance evaluation…
(Confusion matrix as above: a = TP, b = FN, c = FP, d = TN.)
• Most widely used metric:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
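Accumulating the confusion matrix and the accuracy from predictions can be sketched as follows (the sample labels are illustrative):

```python
def confusion(actual, predicted, pos="Yes"):
    """Return (TP, FN, FP, TN) counts for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == pos:
            tp, fn = tp + (p == pos), fn + (p != pos)
        else:
            fp, tn = fp + (p == pos), tn + (p != pos)
    return tp, fn, fp, tn

def accuracy(actual, predicted):
    tp, fn, fp, tn = confusion(actual, predicted)
    return (tp + tn) / (tp + fn + fp + tn)

y_true = ["Yes", "Yes", "No", "No", "No"]
y_pred = ["Yes", "No", "No", "No", "Yes"]
print(confusion(y_true, y_pred))   # (1, 1, 1, 2)
print(accuracy(y_true, y_pred))    # 0.6
```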
Limitation of Accuracy • Consider a 2-class problem – Number of Class 0 examples = 9990 – Number of Class 1 examples = 10
• If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % – Accuracy is misleading because model does not detect any class 1 example
Cost Matrix
                       PREDICTED CLASS
  C(i|j)               Class=Yes    Class=No
  ACTUAL   Class=Yes   C(Yes|Yes)   C(No|Yes)
  CLASS    Class=No    C(Yes|No)    C(No|No)
C(i|j): cost of misclassifying a class j example as class i
Errata
Slide 46 had wrong GINI computations when shown in class on Sep 13 (GINI(t) = 1 − Σ_j [p(j|t)]²).
Was: Gini(N1) = 1 – (5/6)² – (2/6)² = 0.194; should be: Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Was: Gini(N2) = 1 – (1/6)² – (4/6)² = 0.528; should be: Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
Binary attributes: computing GINI index
(Repeat of the corrected slide: Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408, Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320, Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371.)
Computing Cost of Classification
Cost matrix C(i|j):            PREDICTED
                               +     −
  ACTUAL   +                   −1    100
           −                   1     0
Model M1 confusion matrix:     PREDICTED
                               +     −
  ACTUAL   +                   150   40
           −                   60    250
  Accuracy = 80 %, Cost = 3910
Model M2 confusion matrix:     PREDICTED
                               +     −
  ACTUAL   +                   250   45
           −                   5     200
  Accuracy = 90 %, Cost = 4255
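The cost computation above, sketched in code (both matrices ordered actual × predicted over the classes (+, −)):

```python
def total_cost(confusion, cost):
    """Element-wise multiply-and-sum of confusion counts and per-cell costs."""
    return sum(n * c for crow, nrow in zip(cost, confusion)
                     for c, n in zip(crow, nrow))

cost = [[-1, 100],   # actual +: C(+|+), C(-|+)
        [1, 0]]      # actual -: C(+|-), C(-|-)
m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]
print(total_cost(m1, cost), total_cost(m2, cost))  # 3910 4255
```

M2 is more accurate yet more costly, because false negatives are 100 times as expensive as false positives here.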
Cost vs Accuracy
Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Count matrix as before (a = TP, b = FN, c = FP, d = TN); the cost matrix has p on the diagonal and q off it.
Proof:
  N = a + b + c + d
  Accuracy = (a + d)/N
  Cost = p(a + d) + q(b + c)
       = p(a + d) + q(N − a − d)
       = qN − (q − p)(a + d)
       = N [q − (q − p) × Accuracy]
Alternative performance metrics
• Precision (p) = a / (a + c): proportion of correctly predicted ‘Yes’ to all predicted ‘Yes’
• Recall (r) = a / (a + b): proportion of correctly predicted ‘Yes’ to all actual ‘Yes’
• F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c): harmonic mean of precision and recall
• Generalized: Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
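Precision, recall, and F-measure from the confusion-matrix cells a, b, c (a sketch with illustrative counts):

```python
def precision(a, c):
    return a / (a + c)               # correct 'Yes' over all predicted 'Yes'

def recall(a, b):
    return a / (a + b)               # correct 'Yes' over all actual 'Yes'

def f_measure(a, b, c):
    return 2 * a / (2 * a + b + c)   # harmonic mean of precision and recall

a, b, c = 8, 2, 2                    # TP, FN, FP
print(precision(a, c), recall(a, b), f_measure(a, b, c))  # 0.8 0.8 0.8
```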
Alternative performance metrics
• Sensitivity (true positive rate), TPR = TP / (TP + FN)
• Specificity (true negative rate), TNR = TN / (TN + FP)
• False positive rate, FPR = FP / (TN + FP)
• False negative rate, FNR = FN / (TP + FN)
• Precision (p) = TP / (TP + FP)
• Recall (r) = TP / (TP + FN) (= TPR)
• F1 = 2rp / (r + p) = 2TP / (2TP + FP + FN)
• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)
Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
  – Class distribution
  – Cost of misclassification
  – Size of training and test sets
Learning Curve
• A learning curve shows how accuracy changes with varying sample size
• Requires a sampling scheme for creating the learning curve:
  – Arithmetic sampling (Langley et al, 1996)
  – Geometric sampling (Provost et al, 1999)
• Effect of small sample size:
  – Bias in the estimate
  – Variance of the estimate
Methods of Estimation
• Holdout
  – Reserve 2/3 for training and 1/3 for testing
• Random subsampling
  – Repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k – 1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Stratified sampling
  – partition the population (randomly, or based on attribute values)
  – oversampling: increase the sampling fraction of rare sub-groups
• Bootstrap
  – Sampling with replacement
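The k-fold scheme can be sketched as follows (a minimal sketch; `train` and `evaluate` stand in for any classifier-specific functions, and the majority-class "classifier" below is only a toy):

```python
from collections import Counter

def cross_validate(data, k, train, evaluate):
    """Average the test score over k train/test rounds."""
    folds = [data[i::k] for i in range(k)]      # k disjoint subsets
    scores = []
    for i in range(k):
        test = folds[i]
        training = [x for j, f in enumerate(folds) for x in f if j != i]
        model = train(training)                  # train on k-1 partitions
        scores.append(evaluate(model, test))     # test on the remaining one
    return sum(scores) / k

data = [(0, "A"), (1, "A"), (2, "A"), (3, "B"), (4, "A"), (5, "B")]
train = lambda recs: Counter(c for _, c in recs).most_common(1)[0][0]
evaluate = lambda cls, recs: sum(c == cls for _, c in recs) / len(recs)
print(cross_validate(data, 3, train, evaluate))
```

Setting k = len(data) turns the same function into leave-one-out estimation.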