DATA MINING – 1DL360 Fall 2010

An introductory class in data mining
http://www.it.uu.se/edu/course/homepage/infoutv/ht10

Erik Zeitler
UDBL, Dept. of IT, Uppsala University, Sweden
2010-09-15


Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation (Tan, Steinbach, Kumar ch. 4)


Definition of classification
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.


Illustrating Classification Task

Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
 1   Yes      Large    125K     No
 2   No       Medium   100K     No
 3   No       Small     70K     No
 4   Yes      Medium   120K     No
 5   No       Large     95K     Yes
 6   No       Medium    60K     No
 7   Yes      Large    220K     No
 8   No       Small     85K     Yes
 9   No       Medium    75K     No
10   No       Small     90K     Yes

→ Learn Model

Test set (Apply Model to assign the unknown Class labels):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small     55K     ?
12   Yes      Medium    80K     ?
13   Yes      Large    110K     ?
14   No       Small     95K     ?
15   No       Large     67K     ?


Examples of classification tasks
• Predict tumor cells as benign or malignant
• Classify credit card transactions as legitimate or fraudulent
• Classify secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorize news stories as finance, weather, entertainment, sports, etc.


Classification techniques
• Nearest Neighbour methods
• Decision Tree methods
• Rule-based methods
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines


Instance-based classifiers (ch 5.2)
• Store the training records
• Use the training records to predict the class label of unseen cases

[Figure: a set of stored cases (Atr1 … AtrN, Class) and an unseen case to be classified]


Instance-based classifiers
• Examples:
  – Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
  – Nearest neighbor: uses the k “closest” points (nearest neighbors) for performing classification


kNN classifier intuition
• If you don’t know what you are,
• look at the nearest ones around you:
• you are probably of the same kind.


Classifying you using kNN
• Each one of you belongs to a group:
  – [F | STS | IT | Int Masters | Exchange | Other]
• Classify yourself using 1-NN and 3-NN
  – Look at your k nearest neighbors!
• How do we select our distance measure?
• How do we decide which of 1-NN and 3-NN is best?


A basic kNN classifier implementation
• Input:
  – A test point x
  – The set of known points P
  – A distance function
  – Number of neighbors k
• Output:
  – Class belonging c
• Implementation:
  1. Find the set of k points N ⊂ P that are nearest to x
  2. Count the number of occurrences of each class in N
  3. c = the class to which the most points in N belong (tie break?)
• Complexity is O(size(P)) for each tuple to be classified
• Reduce complexity by using database indexes: O(log(size(P)))
• Rule of thumb: k ≤ sqrt(q). Commercial algorithms use a default of 10
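The three implementation steps above can be sketched in Python (a minimal linear-scan version, O(size(P)) per query as noted; the names `knn_classify` and `euclidean` are illustrative):

```python
import math
from collections import Counter

def knn_classify(x, P, k, dist):
    """Classify test point x from labeled points P = [(point, class), ...]."""
    # 1. Find the set of k points in P that are nearest to x
    neighbors = sorted(P, key=lambda pc: dist(x, pc[0]))[:k]
    # 2. Count the occurrences of each class among the neighbors
    votes = Counter(c for _, c in neighbors)
    # 3. c = the class with the most points (ties broken arbitrarily here)
    return votes.most_common(1)[0][0]

def euclidean(a, b):
    return math.dist(a, b)
```

A database index (e.g., a k-d tree) would replace the `sorted` scan to obtain the O(log(size(P))) behaviour mentioned above.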


More kNN classifier intuition
• If it walks and sounds like a duck → then it must be a duck
• If it walks and sounds like a cow → then it must be a cow


Walking and talking
• Assume that a duck
  – has step length 5…15 cm
  – quacks at 600…700 Hz
• Assume that a cow
  – has step length 30…60 cm
  – moos at 100…200 Hz


Cows and ducks in a plot

[Scatter plot: step length (stdev ≈ 30) against frequency (stdev ≈ 300)]

Normalize:
• subtract mean, divide by stdev
• subtract min, divide by (max – min)
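Both normalizations listed above are one-liners; a sketch in plain Python (no libraries assumed):

```python
def zscore(xs):
    # subtract mean, divide by stdev (population standard deviation)
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def minmax(xs):
    # subtract min, divide by (max - min)
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```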


Enter the chicken


Nearest-Neighbor Classifiers
• Requires three things
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute the distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)


Definition of Nearest Neighbor

[Figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k nearest neighbors of a record x are the data points that have the k smallest distances to x.


Nearest Neighbor Classification
• Compute the distance between two points:
  – Minkowski distance (Euclidean: r = 2)

    M(p, q, r) = ( Σi |pi − qi|^r )^(1/r)

• Determine the class from the nearest-neighbor list
  – take the majority vote of class labels among the k nearest neighbors
  – weigh the vote according to distance
    • weight factor w = 1/d²
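The distance and the distance-weighted vote can be sketched as follows (using the 1/d² weighting from the slide; function names are illustrative):

```python
from collections import defaultdict

def minkowski(p, q, r=2):
    # M(p, q, r) = (sum_i |p_i - q_i|^r)^(1/r); r = 2 gives Euclidean distance
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

def weighted_vote(neighbors):
    # neighbors: list of (distance, class); each vote is weighted by w = 1/d^2
    scores = defaultdict(float)
    for d, c in neighbors:
        scores[c] += 1.0 / d ** 2
    return max(scores, key=scores.get)
```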


Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes


Nearest Neighbor Classification…
• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  – Example:
    • height of a person may vary from 1.5 m to 1.8 m
    • weight of a person may vary from 90 lb to 300 lb
    • income of a person may vary from $10K to $1M


Nearest Neighbor Classification…
• Problem with the Euclidean measure:
  – High-dimensional data
    • curse of dimensionality
  – Can produce counter-intuitive results:

    111111111110 vs 011111111111   d = 1.4142
    100000000000 vs 000000000001   d = 1.4142

  • Solution: normalize the vectors to unit length


Nearest neighbor classification…
• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive
    • Typically O(q), q being the database size
    • O(log(q)) if an index is utilized, but indexing takes O(q log(q)) time to build (once and for all) and O(q) additional space


Decision tree induction (ch 4.3)
• Many algorithms:
  – Hunt’s algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT


Decision tree classification task

(Same training set, Tids 1–10, and test set, Tids 11–15, as in “Illustrating Classification Task”: Learn Model, then Apply Model.)


Example of a decision tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):

Refund
├─ Yes → NO
└─ No → MarSt
        ├─ Single, Divorced → TaxInc
        │                     ├─ < 80K → NO
        │                     └─ > 80K → YES
        └─ Married → NO

Another example of decision tree

(Same training data as on the previous slide.)

MarSt
├─ Married → NO
└─ Single, Divorced → Refund
                      ├─ Yes → NO
                      └─ No → TaxInc
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!

Decision tree classification task

(Again the same training and test tables: Learn Model → Decision Tree → Apply Model.)

General structure of Hunt’s algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
• Recursively apply the procedure to each subset

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data.)
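A minimal recursive sketch of the procedure (the attribute to split on is chosen naively here; the impurity-based criteria discussed later pick the best one; all names are illustrative):

```python
from collections import Counter

def hunt(records, attributes, default):
    """records: list of (attribute_dict, label); returns a nested-dict tree or a label."""
    if not records:                        # D_t is empty -> leaf with the default class
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:              # all records in the same class -> leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                     # no attribute test left -> majority leaf
        return majority
    a = attributes[0]                      # naive choice of splitting attribute
    tree = {}
    for v in set(x[a] for x, _ in records):
        subset = [(x, y) for x, y in records if x[a] == v]
        tree[v] = hunt(subset, attributes[1:], majority)
    return {a: tree}
```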

Hunt’s algorithm (on the Refund / Marital Status / Taxable Income / Cheat data)

Step 1: a single leaf node: Don’t Cheat.

Step 2: split on Refund:
Refund
├─ Yes → Don’t Cheat
└─ No → Don’t Cheat

Step 3: split the Refund = No branch on Marital Status:
Refund
├─ Yes → Don’t Cheat
└─ No → Marital Status
        ├─ Single, Divorced → Cheat
        └─ Married → Don’t Cheat

Step 4: split the Single/Divorced branch on Taxable Income:
Refund
├─ Yes → Don’t Cheat
└─ No → Marital Status
        ├─ Single, Divorced → Taxable Income
        │                     ├─ < 80K → Don’t Cheat
        │                     └─ >= 80K → Cheat
        └─ Married → Don’t Cheat

Tree induction
• Greedy strategy
  – Split the records based on an attribute test that (locally) optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting



How to specify the test condition?
• Depends on attribute type
  – Nominal (categories)
  – Ordinal (ordered categories)
  – Continuous (interval split)
• Depends on the number of ways to split
  – 2-way split
  – Multi-way split


Splitting based on nominal attributes
• Multi-way split: use as many partitions as distinct values.
    CarType → {Family}, {Sports}, {Luxury}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
    CarType → {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}

Splitting based on ordinal attributes
• Multi-way split: use as many partitions as distinct values.
    Size → {Small}, {Medium}, {Large}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
    Size → {Small, Medium} vs {Large}, or {Small} vs {Medium, Large}
• What about this split? {Small, Large} vs {Medium} (it groups non-adjacent values, violating the order)

Splitting based on continuous attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by
      – equal-interval bucketing
      – equal-frequency bucketing (percentiles)
      – clustering
  – Binary decision: (A < v) or (A ≥ v)
    • consider all possible splits and find the best cut
    • can be more computationally intensive


Splitting based on continuous attributes


Questions
• Why divide by d² and not by d in vote weighting?
  – In fact, there are different proposals out there: d^(1/2), d, d², ...
  – A higher degree will further suppress votes from distant points
  → Why not compare some different distance weightings in the assignment!
• Tree-based classifiers
  – Find the best split attribute
  – Find the best attribute split boundary
  → Meaningful (“interesting”) splits

Tree induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting


How to find the best split

Before splitting: class counts C0 = N00 and C1 = N01, with impurity M0.

Candidate split A? produces nodes N1 (counts N10, N11; impurity M1) and N2 (counts N20, N21; impurity M2); candidate split B? produces N3 (N30, N31; M3) and N4 (N40, N41; M4). M12 and M34 are the weighted impurities of the children for the two splits.

Gain = M0 – M12 vs M0 – M34

How to determine the best split

Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?


How to determine the best split
• Greedy approach:
  – Child nodes with a homogeneous class distribution (maximum discrimination) are preferred
• Need a measure of node impurity of a data set:
  – Non-homogeneous: high degree of impurity
  – Homogeneous: low degree of impurity

Measures of node impurity
• Gini index:              GINI(t) = 1 − Σj [p(j | t)]²
• Entropy:                 Entropy(t) = − Σj p(j | t) log(p(j | t))
• Misclassification error: Error(t) = 1 − max_i P(i | t)


Measure of impurity: GINI
• Gini index for a given node t:

    GINI(t) = 1 − Σj [p(j | t)]²  ∈ [0, 1 − 1/nc]

  – p(j | t) is the relative frequency of class j at node t
  – Maximum when records are equally distributed among all classes → least interesting information
  – Minimum (0.0) when all records belong to one class → most interesting information

  C1 = 0, C2 = 6: Gini = 0.000      C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444      C1 = 3, C2 = 3: Gini = 0.500

Examples for computing GINI

    GINI(t) = 1 − Σj [p(j | t)]²

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Gini = 1 − 0² − 1² = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
    Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
    Gini = 1 − (2/6)² − (4/6)² = 0.444
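The computations above fit in a small helper (class counts in, impurity out; a sketch):

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, with p(j|t) the fraction of class j at node t
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)
```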


Splitting based on GINI
• When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1..k} (ni / n) GINI(i)

  where ni = number of records at child i, and n = number of records at node p.
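The split quality is the size-weighted average of the children’s Gini indices; a sketch (each child is a class-count list):

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    # children: one class-count list per partition; weights are n_i / n
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)
```

For the B? split on the next slide, `gini_split([[5, 2], [1, 4]])` evaluates to ≈ 0.371.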


Binary attributes: computing the GINI index
• Splits into two partitions
• Effect of weighing: large and pure partitions are preferred

Parent: C1 = 6, C2 = 6, Gini = 0.500
Split B? → N1: C1 = 5, C2 = 2; N2: C1 = 1, C2 = 4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical attributes: computing the Gini index
• For each distinct value, gather the counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split (CarType):
        Family  Sports  Luxury
  C1      1       2       1
  C2      4       1       1
  Gini = 0.393

Two-way splits (find the best partition of values):
  {Sports, Luxury} vs {Family}: C1 = 3 / 1, C2 = 2 / 4, Gini = 0.400
  {Family, Luxury} vs {Sports}: C1 = 2 / 2, C2 = 1 / 5, Gini = 0.419

Continuous attributes: computing the Gini index
• Use binary decisions based on one value
• Several choices for the splitting value
  – the distinct values in the data are meaningful split points
• Each splitting value v has a count matrix associated with it
  – class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v:
  – for each v, scan the database to gather the count matrix and compute its Gini index
  – computationally inefficient! Repetition of work.

(Illustrated on the Taxable Income attribute of the Refund / Marital Status / Cheat training data.)

Continuous attributes: computing the Gini index…
• For efficient computation: for each attribute,
  – Sort the attribute on its values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

Sorted Taxable Income values (class): 60 (No), 70 (No), 75 (No), 85 (Yes), 90 (Yes), 95 (Yes), 100 (No), 120 (No), 125 (No), 220 (No)

Candidate split positions: 55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230

Gini at each position: 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420 – the best split is at 97 (Gini = 0.300).
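The sorted linear scan can be sketched as follows (the counts for the A < v side are updated incrementally instead of rescanning the database per candidate; names are illustrative, and midpoints between adjacent values are used as split positions):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """One linear scan over sorted candidate midpoints; returns (split_value, gini)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    total = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}
    best_v, best_g = None, float("inf")
    for i in range(len(pairs) - 1):
        left[pairs[i][1]] += 1                 # move one record to the A < v side
        if pairs[i][0] == pairs[i + 1][0]:
            continue                            # no cut between equal values
        v = (pairs[i][0] + pairs[i + 1][0]) / 2
        nl, nr = i + 1, len(pairs) - i - 1
        g = (nl * gini([left[c] for c in classes])
             + nr * gini([total[c] - left[c] for c in classes])) / len(pairs)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g
```

On the Taxable Income data above it finds the cut between 95 and 100 with split Gini 0.300, matching the table.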

Alternative splitting criteria based on INFO
• Entropy at a given node t:

    Entropy(t) = − Σj p(j | t) log(p(j | t))

  (NOTE: p(j | t) is the relative frequency of class j at node t)
  – Measures the homogeneity of a node
    • Maximum (log nc) when records are equally distributed among all classes → least interesting information
    • Minimum (0.0) when all records belong to one class → most interesting information
  – Entropy-based computations are similar to the GINI index computations


Examples for computing Entropy

    Entropy(t) = − Σj p(j | t) log₂(p(j | t))

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Entropy = − 0 log₂ 0 − 1 log₂ 1 = − 0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
    Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
    Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
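The same computation as a helper (0 log 0 is taken as 0, matching the first case; a sketch):

```python
import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) log2 p(j|t); empty classes contribute 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)
```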


Splitting based on INFO...
• Information Gain:

    GAIN_split = Entropy(p) − Σ_{i=1..k} (ni / n) Entropy(i)

  Parent node p is split into k partitions; ni is the number of records in partition i
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure


Splitting based on INFO...
• Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO
    SplitINFO = − Σ_{i=1..k} (ni / n) log(ni / n)

  Parent node p is split into k partitions; ni is the number of records in partition i
  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5
  – Designed to overcome the disadvantage of Information Gain
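Information Gain and Gain Ratio as defined on this and the previous slide (a sketch; `parent` and each child are class-count lists):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain_ratio(parent, children):
    # GAIN_split = Entropy(p) - sum_i (n_i/n) Entropy(i); then divide by SplitINFO
    n = sum(parent)
    gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    split_info = -sum(sum(c) / n * math.log2(sum(c) / n) for c in children)
    return gain / split_info
```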


Splitting criteria based on classification error
• Classification error at a node t:

    Error(t) = 1 − max_i P(i | t)

• Measures the misclassification error made by a node
  – Maximum (1 − 1/nc) when records are equally distributed among all classes → least interesting information
  – Minimum (0.0) when all records belong to one class → most interesting information


Examples for computing Error

    Error(t) = 1 − max_i P(i | t)

C1 = 0, C2 = 6: Error = 1 − max(0, 1) = 1 − 1 = 0
C1 = 1, C2 = 5: Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
C1 = 2, C2 = 4: Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3

Comparison among splitting criteria for a 2-class problem:

[Figure: the three impurity measures plotted as functions of the class fraction p]


Tree induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting


Stopping criteria for tree induction
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination (e.g., when the resulting record set becomes too small)


Decision-tree-based classification
• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets


Model evaluation (ch 4.5)
• Metrics for performance evaluation
  – How to evaluate the performance of a model?
• Methods for performance evaluation
  – How to obtain reliable estimates?


Metrics for performance evaluation
• Focus on the predictive capability of a model
  – rather than how fast it is at classifying or building models, scalability, etc.
• Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes        a           b
  CLASS    Class=No         c           d

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for performance evaluation…

(Confusion matrix as above: a = TP, b = FN, c = FP, d = TN.)

• Most widely used metric:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitation of accuracy
• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example


Cost Matrix

                        PREDICTED CLASS
  C(i|j)                Class=Yes     Class=No
  ACTUAL   Class=Yes    C(Yes|Yes)    C(No|Yes)
  CLASS    Class=No     C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i


Errata
Slide 46 had wrong GINI computations when shown in class on Sep 13.

Was:                                 Should be:
Gini(N1) = 1 − (5/6)² − (2/6)²       Gini(N1) = 1 − (5/7)² − (2/7)²
         = 0.194                              = 0.408
Gini(N2) = 1 − (1/6)² − (4/6)²       Gini(N2) = 1 − (1/5)² − (4/5)²
         = 0.528                              = 0.320

Binary attributes: computing the GINI index (the corrected slide, shown again)

Computing Cost of Classification

Cost matrix C(i|j):
              PREDICTED
              +     −
  ACTUAL  +  −1    100
          −   1      0

Model M1 confusion matrix:          Model M2 confusion matrix:
              PREDICTED                           PREDICTED
              +     −                             +     −
  ACTUAL  +  150    40                ACTUAL  +  250    45
          −   60   250                        −    5   200

M1: Accuracy = 80%, Cost = 3910     M2: Accuracy = 90%, Cost = 4255
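The cost of each model is the element-wise product of its confusion matrix with the cost matrix; the slide’s numbers can be reproduced directly (matrix layout as above: rows = actual +/−, columns = predicted +/−):

```python
def classification_cost(conf, cost):
    # conf[i][j]: count of actual class i predicted as class j; cost[i][j]: C(j|i)
    return sum(conf[i][j] * cost[i][j] for i in range(2) for j in range(2))

cost = [[-1, 100], [1, 0]]     # C(+|+) = -1, C(-|+) = 100, C(+|-) = 1, C(-|-) = 0
m1 = [[150, 40], [60, 250]]    # model M1's confusion matrix
m2 = [[250, 45], [5, 200]]     # model M2's confusion matrix
```

`classification_cost(m1, cost)` gives 3910 and `classification_cost(m2, cost)` gives 4255, matching the slide.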

Cost vs Accuracy

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Count matrix: a, b / c, d (as before); cost matrix: p, q / q, p.

Proof:
N = a + b + c + d
Accuracy = (a + d) / N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]

Alternative performance metrics
• Precision (p) = a / (a + c): proportion of correctly predicted ‘Yes’ among all predicted ‘Yes’
• Recall (r) = a / (a + b): proportion of correctly predicted ‘Yes’ among all actual ‘Yes’
• F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c): harmonic mean of precision and recall
• Generalized:

    Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
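The three metrics follow directly from the confusion-matrix cells a (TP), b (FN), and c (FP); a sketch:

```python
def prf(a, b, c):
    # a = TP, b = FN, c = FP, as in the confusion matrix above
    precision = a / (a + c)
    recall = a / (a + b)
    f = 2 * recall * precision / (recall + precision)
    return precision, recall, f
```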

Alternative performance metrics
• Sensitivity (true positive rate), TPR = TP / (TP + FN)
• Specificity (true negative rate), TNR = TN / (TN + FP)
• False positive rate, FPR = FP / (TN + FP)
• False negative rate, FNR = FN / (TP + FN)
• Precision (p) = TP / (TP + FP)
• Recall (r) = TP / (TP + FN) (= TPR)
• F1 = 2rp / (r + p) = 2TP / (2TP + FP + FN)
• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)

Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
  – Class distribution
  – Cost of misclassification
  – Size of training and test sets


Learning Curve
• A learning curve shows how accuracy changes with varying sample size
• Requires a sampling scheme for creating the learning curve:
  – Arithmetic sampling (Langley et al., 1996)
  – Geometric sampling (Provost et al., 1999)
• Effect of small sample size:
  – Bias in the estimate
  – Variance of the estimate

Methods of Estimation
• Holdout
  – Reserve 2/3 for training and 1/3 for testing
• Random subsampling
  – Repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k − 1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Stratified sampling
  – Partition the population (randomly, or based on attribute values)
  – Oversampling: increase the sampling fraction of rare sub-groups
• Bootstrap
  – Sampling with replacement
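The k-fold scheme above can be sketched as follows (`train_fn` builds a model from records and `accuracy_fn` scores it on a test set; both are caller-supplied, illustrative names):

```python
import random

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint, shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, k, train_fn, accuracy_fn):
    # Train on k-1 folds, test on the held-out fold; average the k scores
    folds = kfold_indices(len(records), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [records[j] for j in test_idx]
        train = [records[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / k
```

Leave-one-out is the special case k = len(records); stratified variants would shuffle within each class instead of globally.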