Information Retrieval and Data Mining, Summer Semester 2015, TU Kaiserslautern. Prof. Dr.-Ing. Sebastian Michel, Databases and Information Systems Group (AG DBIS), http://dbis.informatik.uni-kl.de/
Chapter VI: Classification
1. Motivation and Definitions
2. Decision Trees
3. Bayes Classifier
4. Support Vector Machines (only as teaser)
Tan, Steinbach & Kumar, Chapter 8
1. Classification: Example

A decision tree for the concept buys_computer, indicating whether a customer at an electronics shop is likely to purchase a computer (source: Han & Kamber):

age?
├─ youth → student?
│   ├─ no → no
│   └─ yes → yes
├─ middle_aged → yes
└─ senior → credit_rating?
    ├─ fair → no
    └─ excellent → yes
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
© Tan,Steinbach, Kumar
Examples of Classification Tasks
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Classifying persons into tax evaders and tax payers
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm induces a model from the training set (induction).

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The learned model is then applied to the test set to assign a class to each record (deduction).
Classification Model Evaluation
• Much the same measures as with IR methods
  – Focus on accuracy and error rate
  – But also precision, recall, F-scores, …

Confusion matrix (rows: actual class, columns: predicted class):

                   Predicted Class = 1   Predicted Class = 0
Actual Class = 1          f11                   f10
Actual Class = 0          f01                   f00
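From the confusion matrix counts, accuracy and error rate can be computed directly. A minimal sketch, with the f_ij notation taken from the slide (the counts in the usage example are made up):

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of correctly classified records (diagonal of the matrix)."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Fraction of misclassified records (off-diagonal entries)."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# e.g., 40 + 45 correct and 10 + 5 wrong out of 100 test records
acc = accuracy(40, 10, 5, 45)
err = error_rate(40, 10, 5, 45)
```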
Overview of Classification Techniques
• Decision-tree-based methods
• Rule-based methods
• Naïve Bayes
• Support Vector Machines
• …
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

Refund
├─ Yes → NO
└─ No → MarSt
    ├─ Single, Divorced → TaxInc
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO
2. Decision Trees

An alternative tree for the same training data:

MarSt
├─ Married → NO
└─ Single, Divorced → Refund
    ├─ Yes → NO
    └─ No → TaxInc
        ├─ < 80K → NO
        └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The same framework as before, with the model being a decision tree: a tree induction algorithm learns the tree from the training set (induction); the tree is then applied to the test set to classify its records (deduction).
Apply Model to Test Data

Test record:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:

Refund
├─ Yes → NO
└─ No → MarSt
    ├─ Single, Divorced → TaxInc
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO
Apply Model to Test Data (result)
For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K), the path Refund = No → MarSt = Married ends in the leaf NO: assign Cheat to “No”.
Classifying a Record with a Decision Tree
• Given a decision tree, how do we classify a test record?
• Start at the root node, apply its test condition to the record, and follow the matching branch.
• If this leads to an internal node, again apply its test condition and follow the matching branch.
• Repeat until a leaf node is reached; assign the class of that leaf node to the record.
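The traversal above can be sketched in a few lines of Python, assuming a hypothetical tree representation of nested dicts (internal nodes carry an attribute and one branch per outcome; leaves are class labels; TaxInc is pre-discretized into a boolean test):

```python
def classify(node, record):
    """Walk the tree from the root until a leaf is reached."""
    while isinstance(node, dict):          # internal node
        attribute = node["attribute"]      # the node's test condition
        node = node["branches"][record[attribute]]  # follow matching branch
    return node                            # leaf: the class label

# The example tree for the Cheat data
tax_subtree = {"attribute": "TaxInc<80K",
               "branches": {True: "No", False: "Yes"}}
tree = {"attribute": "Refund",
        "branches": {"Yes": "No",
                     "No": {"attribute": "MarSt",
                            "branches": {"Married": "No",
                                         "Single": tax_subtree,
                                         "Divorced": tax_subtree}}}}

record = {"Refund": "No", "MarSt": "Married", "TaxInc<80K": True}
label = classify(tree, record)  # the Married branch ends in the leaf "No"
```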
Constructing a Decision Tree
• There are exponentially many decision trees for a given set of training data.
• Finding the optimal tree is computationally infeasible.
• Instead, use greedy algorithms: a series of local split operations grows the tree. This is not optimal, but there are efficient algorithms that create sufficiently accurate trees.
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled with the default class yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
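The three cases above can be sketched as a short recursion, under simplifying assumptions (categorical attributes only, the split attribute is simply the next unused one, and the field names are illustrative):

```python
from collections import Counter

def hunt(records, attributes, default):
    """Grow a decision tree following Hunt's three cases."""
    if not records:                            # empty D_t: leaf with default class
        return default
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:                 # homogeneous D_t: leaf labeled y_t
        return classes[0]
    if not attributes:                         # no test left: majority class
        return Counter(classes).most_common(1)[0][0]
    attr = attributes[0]                       # attribute test to split D_t
    majority = Counter(classes).most_common(1)[0][0]
    branches = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        branches[value] = hunt(subset, attributes[1:], majority)
    return {"attribute": attr, "branches": branches}

toy = [{"Refund": "Yes", "class": "No"},
       {"Refund": "No", "class": "Yes"},
       {"Refund": "No", "class": "Yes"}]
tree = hunt(toy, ["Refund"], "No")
```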
Hunt’s Algorithm
• Start with a single leaf labeled with the most frequent class as the default class: Don’t Cheat.
Hunt’s Algorithm (2)
• Split on Refund:

Refund
├─ Yes → Don’t Cheat
└─ No → Don’t Cheat
Hunt’s Algorithm (3)
• Split the Refund = No branch on Marital Status:

Refund
├─ Yes → Don’t Cheat
└─ No → Marital Status
    ├─ Single, Divorced → Cheat
    └─ Married → Don’t Cheat
Hunt’s Algorithm (4)
• Split the Single/Divorced branch on Taxable Income:

Refund
├─ Yes → Don’t Cheat
└─ No → Marital Status
    ├─ Single, Divorced → Taxable Income
    │   ├─ < 80K → Don’t Cheat
    │   └─ >= 80K → Cheat
    └─ Married → Don’t Cheat
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
How to Specify the Test Condition?
• Depends on the attribute type:
  – Nominal
  – Ordinal
  – Continuous
• Depends on the number of ways to split:
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g., CarType → Family | Sports | Luxury.
• Binary split: divide the values into two subsets and find the optimal partitioning, e.g., CarType → {Sports, Luxury} | {Family}, or CarType → {Family, Luxury} | {Sports}.
Splitting Based on Continuous Attributes
• Different ways of handling continuous attributes:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • consider all possible splits and find the best cut
    • can be more compute-intensive
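The “consider all possible splits” step can be sketched as a scan over candidate cut points, here using GINI as the impurity criterion and the taxable-income column of the slide's training data (function names are illustrative):

```python
def gini(labels):
    """GINI impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Scan candidate thresholds v (midpoints between consecutive distinct
    values) and return the cut (A < v) with the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        v = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [c for a, c in pairs if a < v]
        right = [c for a, c in pairs if a >= v]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_v, best_impurity = v, weighted
    return best_v, best_impurity

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
cut, impurity = best_cut(incomes, cheat)  # best cut separates 95K and 100K
```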
Splitting Based on Continuous Attributes
• (i) Binary split: Taxable Income > 80K? → Yes / No
• (ii) Multi-way split: Taxable Income? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.

Three candidate test conditions:
• Own Car? — Yes: C0: 6, C1: 4; No: C0: 4, C1: 6
• Car Type? — Family: C0: 1, C1: 3; Sports: C0: 8, C1: 0; Luxury: C0: 1, C1: 7
• Student ID? — c1: C0: 1, C1: 0; …; c10: C0: 1, C1: 0; c11: C0: 0, C1: 1; …; c20: C0: 0, C1: 1

Which test condition is the best?
How to Determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity
Selecting the Best Split
• Let p(i | t) be the fraction of records belonging to class i at node t.
• The best split is selected based on the degree of impurity of the child nodes:
  – p(0 | t) = 0 and p(1 | t) = 1 has high purity
  – p(0 | t) = 1/2 and p(1 | t) = 1/2 has the smallest purity (highest impurity)
• Intuition: high purity ⇒ small value of the impurity measure ⇒ better split
Example of Purity
(figure: two class distributions, one with high impurity and one with high purity)

Impurity Measures
For a node t with class fractions p(i | t):
• Entropy(t) = − Σi p(i | t) log2 p(i | t)
• Gini(t) = 1 − Σi p(i | t)²
Examples for Computing Entropy
• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = − 0 log2 0 − 1 log2 1 = − 0 − 0 = 0
• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65
• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
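The entropy computations above can be checked with a short helper (a sketch; by convention 0·log 0 is treated as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts."""
    n = sum(counts)
    probs = [c / n for c in counts]
    return -sum(p * log2(p) for p in probs if p > 0)

# Reproduce the three slide examples
e0 = entropy([0, 6])  # 0
e1 = entropy([1, 5])  # 0.65
e2 = entropy([2, 4])  # 0.92
```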
Examples for Computing GINI
• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278
• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444
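The GINI values above follow the same pattern (a sketch):

```python
def gini(counts):
    """GINI impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Reproduce the three slide examples
g0 = gini([0, 6])  # 0
g1 = gini([1, 5])  # 0.278
g2 = gini([2, 4])  # 0.444
```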
Comparing Conditions
• The quality of a split is the change in impurity, called the gain Δ of the test condition:

  Δ = I(p) − Σj=1..k (N(vj) / N) · I(vj)

  where
  – I(·) is the impurity measure
  – k is the number of attribute values
  – p is the parent node and vj is the j-th child node
  – N is the total number of records at the parent node
  – N(vj) is the number of records associated with child node vj
• Maximizing the gain ⇔ minimizing the weighted average impurity measure of the child nodes
• If I() = Entropy(), then Δ = Δinfo is called the information gain
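The gain formula can be sketched directly (here with entropy as I(), so the result is the information gain; applied to the Own Car? split from the earlier slide):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain(parent_counts, child_counts_list, impurity=entropy):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child)
                   for child in child_counts_list)
    return impurity(parent_counts) - weighted

# Own Car? split: parent (10, 10), children (6, 4) and (4, 6)
delta = gain([10, 10], [[6, 4], [4, 6]])
```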
How to Find the Best Split
Before splitting, the parent node has class counts (N00, N01) and impurity M0. Candidate test A? yields child nodes N1 (counts N10, N11, impurity M1) and N2 (counts N20, N21, impurity M2), with weighted impurity M12. Candidate test B? yields child nodes N3 (counts N30, N31, impurity M3) and N4 (counts N40, N41, impurity M4), with weighted impurity M34.

Gain = M0 − M12 vs. M0 − M34
Problems of Maximizing Δ
(figure: among the candidate splits from the earlier example, the Student ID split yields the highest purity)
Problems of Maximizing Δ
• Impurity measures favor attributes with a large number of values.
• A test condition with a large number of outcomes might not be desirable:
  – the number of records in each partition is too small to make predictions.
• Solution 1: gain ratio = Δinfo / SplitInfo, with SplitInfo = − Σi=1..k P(vi) log2 P(vi), where P(vi) is the fraction of records at child vi and k is the total number of splits.
• Solution 2: restrict the splits to binary.
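Solution 1 can be sketched as follows; note how SplitInfo grows with the number of partitions, penalizing many-valued attributes:

```python
from math import log2

def split_info(child_sizes):
    """SplitInfo = - sum_i P(v_i) log2 P(v_i) over the child partitions."""
    n = sum(child_sizes)
    return -sum((s / n) * log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(delta_info, child_sizes):
    """Gain ratio = information gain / SplitInfo."""
    return delta_info / split_info(child_sizes)

# SplitInfo doubles when a 2-way split becomes a uniform 4-way split
two_way = split_info([10, 10])      # 1.0
four_way = split_info([5, 5, 5, 5]) # 2.0
```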
Stopping Criteria for Tree Induction
• Stop expanding a node when all its records belong to the same class.
• Stop expanding a node when all its records have the same or similar attribute values; in this case the majority class wins.
Overfitting and Tree Pruning
• A common problem with decision trees is that the tree might be too tightly tailored to the training data (and thus possibly to noise in the data).
  – Good: the error on the training data might be very low.
  – But what about previously unseen test data?
• Idea: avoid the tree becoming too fine-grained.
• Solution 1: stop splitting nodes early (i.e., preprocessing).
• Solution 2: build the tree regularly and then prune parts of it (i.e., postprocessing).
Example: Training Data
(figure: example of overfitting due to noisy training data; starred records carry the wrong class; source: Tan, Steinbach, Kumar)

Example: Two Different Decision Trees
(figure: two decision trees, M1 and M2, fit to the training data)

Example: Test Data
Let’s see how the trees M1 and M2 perform on training and test data:
• M1: 0% error on training data, but 30% error on test data!
• M2: 20% error on training data, but 10% error on test data!
(table source: Tan, Steinbach, Kumar)
3. (Naive) Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)

• Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem
• Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time.
  – The prior probability of any patient having meningitis is 1/50,000.
  – The prior probability of any patient having a stiff neck is 1/20.
• If a patient has a stiff neck, what’s the probability he/she has meningitis?

  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Bayesian Classifiers
• Consider each attribute and the class label as random variables.
• Given a record with attributes (A1, A2, …, An):
  – The goal is to predict the class C.
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
• Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:

    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An).
  – This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C).
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:

  P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

  – We can then estimate P(Ai | Cj) for all Ai and Cj.
  – A new point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal.
How to Estimate Probabilities from Data?
(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
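The count-based estimate P(Ai | Ck) = |Aik| / Nc can be sketched over the table above (dictionary field names are illustrative):

```python
def cond_prob(records, attribute, value, cls):
    """Estimate P(attribute = value | class) as |A_ik| / N_c."""
    in_class = [r for r in records if r["Evade"] == cls]
    matching = [r for r in in_class if r[attribute] == value]
    return len(matching) / len(in_class)

# The ten training records from the slide (income omitted: it is continuous)
rows = [("Yes", "Single", "No"), ("No", "Married", "No"),
        ("No", "Single", "No"), ("Yes", "Married", "No"),
        ("No", "Divorced", "Yes"), ("No", "Married", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),
        ("No", "Married", "No"), ("No", "Single", "Yes")]
records = [{"Refund": r, "Status": s, "Evade": e} for r, s, e in rows]

p_married_no = cond_prob(records, "Status", "Married", "No")  # 4/7
p_refund_yes = cond_prob(records, "Refund", "Yes", "Yes")     # 0
```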
How to Estimate Probabilities from Data?
• For continuous attributes:
  – Discretize the range into bins
    • one ordinal attribute per bin
    • violates the independence assumption
  – Two-way split: (A < v) or (A > v)
    • choose only one of the two splits as the new attribute
  – Probability density estimation:
    • Assume the attribute follows a normal distribution
    • Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    • Once the probability distribution is known, use it to estimate the conditional probability P(Ai | c)
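The density-estimation option can be sketched as follows, using the normal density with the sample mean and variance from the next slide (for Income given Class=No: mean 110, variance 2975):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, variance):
    """Gaussian density used as an estimate of P(A_i = x | c)."""
    return exp(-(x - mean) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

# P(Income = 120K | Class = No) with the parameters estimated from the data
p_income_no = normal_pdf(120, mean=110, variance=2975)  # roughly 0.0072
```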
How to Estimate Probabilities from Data?
• Normal distribution:

  P(Ai | c) = 1 / √(2π σ²ic) · exp(−(Ai − μic)² / (2σ²ic))

  – one distribution for each (Ai, ci) pair
• For (Income, Class=No):
  – sample mean = 110
  – sample variance = 2975
Example of Naïve Bayes Classifier
Given a test record: X = (Refund = No, Married, Income = 120K)

Naïve Bayes classifier (estimates from the training data):
P(Refund=Yes | No) = 3/7        P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0         P(Refund=No | Yes) = 1
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/3
P(Marital Status=Divorced | Yes) = 1/3
P(Marital Status=Married | Yes) = 0
For taxable income:
  if class = No:  sample mean = 110, sample variance = 2975
  if class = Yes: sample mean = 90,  sample variance = 25

P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
                = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes)
                 = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X)
=> Class = No
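The worked example can be reproduced with a short script (Gaussian density for the income attribute, the slide's counts for the discrete attributes):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, variance):
    """Gaussian estimate of P(Income = x | class)."""
    return exp(-(x - mean) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

# P(class) * P(X | class) for X = (Refund=No, Married, Income=120K),
# with priors P(No) = 7/10 and P(Yes) = 3/10 from the training data
p_no = (7 / 10) * (4 / 7) * (4 / 7) * normal_pdf(120, 110, 2975)
p_yes = (3 / 10) * 1 * 0 * normal_pdf(120, 90, 25)  # P(Married|Yes) = 0
prediction = "No" if p_no > p_yes else "Yes"
```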
4. Support Vector Machines
• Idea: find a linear hyperplane (decision boundary) that separates the data.
(figure: two linearly separable classes of points)
Support Vector Machines
(figures: one possible separating hyperplane B1, another possible solution B2, and many other possible solutions)
Support Vector Machines
• Which one is better? B1 or B2?
• How do you define “better”?
Support Vector Machines
• Find the hyperplane that maximizes the margin (the distance between the boundary hyperplanes b11 and b12, or b21 and b22) => B1 is better than B2.
Support Vector Machines
• Decision boundary: w · x + b = 0
• Margin hyperplanes: w · x + b = 1 and w · x + b = −1

  f(x) = 1 if w · x + b ≥ 1
  f(x) = −1 if w · x + b ≤ −1

• Margin = 2 / ||w||
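The resulting decision function and margin can be sketched directly, assuming w and b are already known (how to learn them is beyond this teaser; the hyperplane in the example is made up):

```python
def svm_decision(w, b, x):
    """f(x) = sign(w . x + b) for a learned hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def margin(w):
    """Distance between w.x+b = 1 and w.x+b = -1: 2 / ||w||."""
    return 2 / sum(wi ** 2 for wi in w) ** 0.5

# Hypothetical separating hyperplane x1 + x2 = 3
w, b = [1.0, 1.0], -3.0
label = svm_decision(w, b, [4.0, 4.0])  # point above the boundary
```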
Summary: Data Mining
• Frequent itemset and association rule mining:
  – Apriori principle and algorithm
• Clustering:
  – K-means
  – Hierarchical clustering
  – DBSCAN (density-based clustering)
• Classification:
  – Decision trees
  – Naïve Bayes classifier
  – Support Vector Machines (SVMs)