Information Retrieval and Data Mining
Summer Semester 2015, TU Kaiserslautern

Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Chapter VI: Classification

1. Motivation and Definitions
2. Decision Trees
3. Bayes Classifier
4. Support Vector Machines (only as a teaser)

Based on: Tan, Steinbach & Kumar, Chapter 8

1. Classification: Example

A decision tree for the concept buys_computer, indicating whether a customer
at an electronic shop is likely to purchase a computer:

age?
├─ youth → student?
│   ├─ no  → no
│   └─ yes → yes
├─ middle_aged → yes
└─ senior → credit_rating?
    ├─ fair      → no
    └─ excellent → yes

source: Han & Kamber

Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set
    is divided into training and test sets: the training set is used to build the model
    and the test set is used to validate it (see the sketch below).
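As a concrete illustration, here is a minimal sketch of the train/test protocol,
assuming scikit-learn and made-up toy data (neither is part of the slides):

```python
# Minimal sketch of the train/test protocol, assuming scikit-learn and toy data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95],
     [0, 60], [1, 220], [0, 85], [0, 75], [0, 90]]   # attribute vectors
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]                   # class labels

# Divide the records into a training set (build the model) and a test set (validate it).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # learn the model
print(accuracy_score(y_test, model.predict(X_test)))      # estimate accuracy on unseen data
```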

Examples of Classification Tasks

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Classifying persons into tax evaders and tax payers

Illustrating the Classification Task

Training Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
----+---------+---------+---------+------
 1  | Yes     | Large   | 125K    | No
 2  | No      | Medium  | 100K    | No
 3  | No      | Small   | 70K     | No
 4  | Yes     | Medium  | 120K    | No
 5  | No      | Large   | 95K     | Yes
 6  | No      | Medium  | 60K     | No
 7  | Yes     | Large   | 220K    | No
 8  | No      | Small   | 85K     | Yes
 9  | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
----+---------+---------+---------+------
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Flow: the learning algorithm performs induction on the training set to learn a
model; the model is then applied (deduction) to the test set to predict the
unknown classes.

Classification Model Evaluation

• Much the same measures as with IR methods
  – Focus on accuracy and error rate
  – But also precision, recall, F-scores, …

Confusion matrix (fij = number of records of actual class i predicted as class j):

                          Predicted class
                          Class = 1 | Class = 0
Actual   Class = 1 |      f11      | f10
class    Class = 0 |      f01      | f00

Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00); error rate = 1 − accuracy.
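For instance, accuracy and error rate follow directly from the four counts
(hypothetical numbers, not from the slides):

```python
# Accuracy and error rate from the confusion matrix (hypothetical counts).
f11, f10, f01, f00 = 40, 10, 5, 45
accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)   # fraction predicted correctly
error_rate = 1 - accuracy
print(accuracy, error_rate)                        # 0.85 0.15 (up to float rounding)
```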

Overview of Classification Techniques

• Decision-tree-based methods
• Rule-based methods
• Naïve Bayes
• Support Vector Machines
• …

Example of a Decision Tree

Training data:

Tid | Refund | Marital Status | Taxable Income | Cheat
----+--------+----------------+----------------+------
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model: decision tree with splitting attributes Refund, MarSt, TaxInc:

Refund?
├─ Yes → NO
└─ No  → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO

2. Decision Trees

The same training data also admits a different tree, with MarSt tested first:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No  → TaxInc?
        ├─ < 80K → NO
        └─ > 80K → YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

The general classification flow from before, instantiated with trees: a tree
induction algorithm learns a decision tree from the training set (induction);
the tree is then applied to the test set to predict the missing class labels
(deduction). Training and test sets are the same as on the earlier slide.

Apply Model to Test Data

Test record:

Refund | Marital Status | Taxable Income | Cheat
-------+----------------+----------------+------
No     | Married        | 80K            | ?

Start from the root of the tree and follow the branches matching the record:

1. Refund = No      → take the "No" branch to MarSt
2. MarSt = Married  → take the "Married" branch to a leaf
3. The leaf is labeled NO → assign Cheat = "No"

Classifying a Record with a Decision Tree

• Given a decision tree, how do we classify a test record?
• Start at the root node, apply its test condition to the record, and follow the matching branch.
• If this leads to an internal node, apply that node's test condition and follow a branch again.
• Repeat until a leaf node is reached; then assign the class label of the leaf to the record.
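A minimal traversal sketch in Python, assuming a simple dict-based tree
encoding (the slides prescribe no concrete representation):

```python
# Minimal sketch: classify a record by walking a dict-encoded decision tree.
def classify(node, record):
    if not isinstance(node, dict):           # leaf node: a plain class label
        return node
    branch = node["test"](record)            # apply the node's test condition
    return classify(node["children"][branch], record)

# The Refund / MarSt / TaxInc tree from the earlier slide:
tree = {
    "test": lambda r: r["Refund"],
    "children": {
        "Yes": "No",
        "No": {
            "test": lambda r: "Married" if r["MarSt"] == "Married" else "Single/Divorced",
            "children": {
                "Married": "No",
                "Single/Divorced": {
                    "test": lambda r: "<80K" if r["TaxInc"] < 80 else ">80K",
                    "children": {"<80K": "No", ">80K": "Yes"},
                },
            },
        },
    },
}

print(classify(tree, {"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> No
```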


Constructing a Decision Tree

• There are exponentially many decision trees consistent with the training data.
• Finding the optimal tree is computationally infeasible.
• Instead, use greedy algorithms: a series of local split operations grows the tree.
  This is not optimal, but there are efficient algorithms that create sufficiently accurate trees.

General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t.
• General procedure (a sketch follows below):
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled with the default class yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split
    the data into smaller subsets, and recursively apply the procedure to each subset.

(Running example: the Refund / Marital Status / Taxable Income / Cheat training data from before.)
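A rough Python sketch of the three cases, assuming dict-shaped records and a
placeholder split-selection rule (real implementations choose the split using
an impurity measure, introduced below):

```python
# Rough sketch of Hunt's algorithm; the split-selection rule is a placeholder.
from collections import Counter

def hunt(records, attributes, default_class):
    if not records:                          # case 2: Dt empty -> default class
        return default_class
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:               # case 1: all records in one class -> leaf
        return classes[0]
    majority = Counter(classes).most_common(1)[0][0]
    if not attributes:                       # no test left -> majority class
        return majority
    attr = attributes[0]                     # placeholder: choose the best split here
    node = {"attr": attr, "children": {}}
    for value in {r[attr] for r in records}: # case 3: split and recurse
        subset = [r for r in records if r[attr] == value]
        node["children"][value] = hunt(subset, attributes[1:], majority)
    return node

data = [{"Refund": "Yes", "class": "No"}, {"Refund": "No", "class": "Yes"}]
print(hunt(data, ["Refund"], "No"))   # {'attr': 'Refund', 'children': {...}}
```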

Hunt's Algorithm (1)

Start with the most frequent class as the default class: a single leaf node
predicting "Don't Cheat" (7 of the 10 training records have Cheat = No).

Hunt's Algorithm (2)

Split on Refund:

Refund?
├─ Yes → Don't Cheat
└─ No  → Don't Cheat

Hunt's Algorithm (3)

Refine the Refund = No branch by splitting on Marital Status:

Refund?
├─ Yes → Don't Cheat
└─ No  → Marital Status?
    ├─ Single, Divorced → Cheat
    └─ Married          → Don't Cheat

Hunt's Algorithm (4)

Refine the Single/Divorced branch by splitting on Taxable Income:

Refund?
├─ Yes → Don't Cheat
└─ No  → Marital Status?
    ├─ Married → Don't Cheat
    └─ Single, Divorced → Taxable Income?
        ├─ < 80K  → Don't Cheat
        └─ >= 80K → Cheat

Tree Induction

• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

How to Specify the Test Condition?

• Depends on the attribute type:
  – Nominal
  – Ordinal
  – Continuous
• Depends on the number of ways to split:
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are distinct values, e.g.,

  CarType? → {Family}, {Sports}, {Luxury}

• Binary split: divide the values into two subsets; the optimal partitioning must be found, e.g.,

  CarType? → {Sports, Luxury} vs. {Family}     or     {Family, Luxury} vs. {Sports}

Splitting Based on Continuous Attributes

• Different ways of handling continuous attributes:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing,
      equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • Consider all possible splits and find the best cut (see the sketch below)
    • Can be more compute-intensive
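A sketch of the exhaustive best-cut search, assuming GINI as the impurity
measure and midpoints between consecutive sorted values as candidate cuts
(a common convention, not mandated by the slides):

```python
# Find the best binary cut (A < v) vs. (A >= v) for a continuous attribute.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n                      # fraction of class 1
    return 1.0 - p**2 - (1 - p)**2

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        v = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate cut point
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if w < best_impurity:                            # keep the lowest weighted impurity
            best_v, best_impurity = v, w
    return best_v, best_impurity

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]    # Taxable Income (K)
cheat  = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]                  # Cheat = Yes -> 1
print(best_cut(income, cheat))                           # -> (97.5, 0.3)
```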

Splitting Based on Continuous Attributes (Examples)

(i)  Binary split:    Taxable Income > 80K?  →  Yes / No
(ii) Multi-way split: Taxable Income?  →  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


How to Determine the Best Split?

Before splitting: 10 records of class C0 and 10 records of class C1.
Candidate test conditions:

Own Car?     Yes → C0: 6, C1: 4          No → C0: 4, C1: 6
Car Type?    Family → C0: 1, C1: 3       Sports → C0: 8, C1: 0       Luxury → C0: 1, C1: 7
Student ID?  c1 → C0: 1, C1: 0   …   c10 → C0: 1, C1: 0   c11 → C0: 0, C1: 1   …   c20 → C0: 0, C1: 1

Which test condition is the best?

How to Determine the Best Split?

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:

  C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  C0: 9, C1: 1 → homogeneous, low degree of impurity

Selecting the Best Split

• Let p(i | t) be the fraction of records belonging to class i at node t.
• The best split is selected based on the degree of impurity of the child nodes:
  – p(0 | t) = 0 and p(1 | t) = 1 means high purity
  – p(0 | t) = 1/2 and p(1 | t) = 1/2 means the lowest purity (highest impurity)
• Intuition: high purity ⇒ small value of the impurity measure ⇒ better split

[Figure: example class distributions illustrating high impurity vs. high purity.]

Impurity Measures

For a node t with class fractions p(i | t), two common impurity measures
(both used in the examples that follow) are:

  Entropy(t) = − Σi p(i | t) log2 p(i | t)

  GINI(t) = 1 − Σi [p(i | t)]²

Examples for Computing Entropy

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) ≈ 0.65

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) ≈ 0.92
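The same computations in code (a small helper, not from the slides):

```python
# Entropy of a class-count distribution, matching the examples above.
from math import log2

def entropy(counts):
    n = sum(counts)
    # By convention, 0 * log2(0) counts as 0, so zero counts are skipped.
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # ~0.650
print(entropy([2, 4]))  # ~0.918
```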

Examples for Computing GINI

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² ≈ 0.278

C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² ≈ 0.444
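And the corresponding GINI helper (again a small sketch, not from the slides):

```python
# GINI index of a class-count distribution, matching the examples above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0
print(gini([1, 5]))  # ~0.278
print(gini([2, 4]))  # ~0.444
```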

Comparing Test Conditions

• The quality of a split is the change in impurity, called the gain Δ of the test condition:

  Δ = I(p) − Σ(j=1..k) [N(vj) / N] · I(vj)

  where
  – I(·) is the impurity measure
  – k is the number of attribute values
  – p is the parent node and vj is the j-th child node
  – N is the total number of records at the parent node
  – N(vj) is the number of records associated with child node vj

• Maximizing the gain ⇔ minimizing the weighted average impurity of the child nodes
• If I(·) = Entropy(·), then Δ = Δinfo is called the information gain
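A small sketch of the gain computation (with entropy as I(·), i.e., the
information gain); the "Own Car?" numbers come from the earlier slide:

```python
# Gain of a test condition: parent impurity minus weighted child impurity.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gain(parent, children, impurity=entropy):
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# "Own Car?" from the earlier slide: parent (10, 10) splits into (6, 4) and (4, 6).
print(gain([10, 10], [[6, 4], [4, 6]]))  # ~0.029
```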

How to Find the Best Split

Before splitting, the class counts are C0 = N00 and C1 = N01, with impurity M0.

• Test condition A? splits the records into nodes N1 (counts N10, N11) and N2 (counts N20, N21);
  their impurities M1 and M2 combine into the weighted impurity M12.
• Test condition B? splits the records into nodes N3 (counts N30, N31) and N4 (counts N40, N41);
  their impurities M3 and M4 combine into the weighted impurity M34.

Compare Gain = M0 − M12 vs. M0 − M34 and choose the test condition with the larger gain.


Problems of Maximizing Δ

• Impurity measures favor attributes with a large number of values.
• A test condition with a large number of outcomes might not be desirable:
  – the number of records in each partition is too small to make predictions
    (e.g., the Student ID split above yields pure but single-record partitions).
• Solution 1: gain ratio = Δinfo / SplitInfo, where

  SplitInfo = − Σ(i=1..k) P(vi) log2 P(vi)

  P(vi) is the fraction of records at child vi and k is the total number of splits.
• Solution 2: restrict the splits to binary.
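A sketch of the gain ratio (helper functions assumed, as before); note how the
20-way Student ID split is penalized by its large SplitInfo:

```python
# Gain ratio = information gain / SplitInfo; penalizes many-valued splits.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    n = sum(parent)
    weights = [sum(c) / n for c in children]
    info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    split_info = sum(-w * log2(w) for w in weights if w > 0)
    return info_gain / split_info

# 20-way Student ID split: each child is pure but holds a single record.
student_id = [[1, 0]] * 10 + [[0, 1]] * 10
print(gain_ratio([10, 10], student_id))        # gain 1.0 / SplitInfo ~4.32 ≈ 0.23
print(gain_ratio([10, 10], [[6, 4], [4, 6]]))  # "Own Car?": ~0.029 / 1.0 ≈ 0.029
```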


Stopping Criteria for Tree Induction

• Stop expanding a node when all of its records belong to the same class.
• Stop expanding a node when all of its records have the same or similar attribute values;
  in this case the majority class wins.

Overfitting and Tree Pruning

• A common problem with decision trees is that the tree might be too tightly tailored
  to the training data (and thus possibly to noise in the data).
  – Good: the error on the training data might be very low.
  – But what about previously unseen test data?
• Idea: avoid the tree becoming too fine-grained.
• Solution 1: stop splitting nodes early (i.e., preprocessing).
• Solution 2: build the tree regularly and then prune parts of it (i.e., postprocessing).

Example: Training Data

[Figure: example training data containing noisy records, marked with *) wrong class,
used below to illustrate overfitting.]

Example: Two Different Decision Trees

[Figure: two decision trees, M1 and M2, built from the noisy training data.]

Example: Test Data

Let's see how the trees M1 and M2 perform on the training and test data:

• M1: 0% error on the training data, but 30% error on the test data!
• M2: 20% error on the training data, but only 10% error on the test data!

(table source: Tan, Steinbach & Kumar)

3. (Naïve) Bayes Classifier

• A probabilistic framework for solving classification problems
• Conditional probability:

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)

• Bayes' theorem:

  P(C | A) = P(A | C) · P(C) / P(A)

Example of Bayes' Theorem

• Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time.
  – The prior probability of any patient having meningitis is 1/50,000.
  – The prior probability of any patient having a stiff neck is 1/20.
• If a patient has a stiff neck, what is the probability that he/she has meningitis?

  P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

Bayesian Classifiers

• Consider each attribute and the class label as random variables.
• Given a record with attributes (A1, A2, …, An):
  – The goal is to predict class C.
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
• Can we estimate P(C | A1, A2, …, An) directly from the data?

Bayesian Classifiers (2)

• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C
    using Bayes' theorem:

    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An).
  – Since the denominator does not depend on C, this is equivalent to choosing
    the value of C that maximizes P(A1, A2, …, An | C) · P(C).
• How to estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:

  P(A1, A2, …, An | Cj) = P(A1 | Cj) · P(A2 | Cj) · … · P(An | Cj)

• Then P(Ai | Cj) can be estimated for all Ai and Cj separately.
• A new point is classified as Cj if P(Cj) · Πi P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?

(Training data: the Refund / Marital Status / Taxable Income table from before,
with Refund and Marital Status categorical, Taxable Income continuous, and
class attribute Evade.)

• Class prior: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai that belong to class Ck
  – Examples:
    P(Status = Married | No) = 4/7
    P(Refund = Yes | Yes) = 0

How to Estimate Probabilities from Data? (Continuous Attributes)

• For continuous attributes:
  – Discretize the range into bins
    • one ordinal attribute per bin
    • violates the independence assumption
  – Two-way split: (A < v) or (A > v)
    • choose only one of the two splits as the new attribute
  – Probability density estimation:
    • Assume the attribute follows a normal distribution.
    • Use the data to estimate the parameters of the distribution
      (e.g., mean and standard deviation).
    • Once the probability distribution is known, use it to estimate
      the conditional probability P(Ai | c).

How to Estimate Probabilities from Data? (Normal Distribution)

• Model each continuous attribute with a normal distribution, one for each (Ai, ci) pair:

  P(Ai | cj) = 1 / sqrt(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

• For (Income, Class = No):
  – sample mean = 110
  – sample variance = 2975
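Evaluating this density in code (a small helper, not from the slides)
reproduces the value used on the next slide:

```python
# Gaussian likelihood for a continuous attribute.
from math import sqrt, pi, exp

def normal_pdf(x, mean, var):
    return 1.0 / sqrt(2 * pi * var) * exp(-(x - mean) ** 2 / (2 * var))

# P(Income = 120 | Class = No) with sample mean 110 and sample variance 2975:
print(normal_pdf(120, 110, 2975))  # ~0.0072
```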

Example of a Naïve Bayes Classifier

Given a test record: X = (Refund = No, Marital Status = Married, Income = 120K)

Estimated probabilities:
  P(Refund = Yes | No) = 3/7                   P(Refund = No | No) = 4/7
  P(Refund = Yes | Yes) = 0                    P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/7        P(Marital Status = Divorced | No) = 1/7
  P(Marital Status = Married | No) = 4/7
  P(Marital Status = Single | Yes) = 2/3       P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes) = 0
For Taxable Income:
  if class = No:  sample mean = 110, sample variance = 2975
  if class = Yes: sample mean = 90,  sample variance = 25

P(X | Class = No)  = P(Refund = No | No) × P(Married | No) × P(Income = 120K | No)
                   = 4/7 × 4/7 × 0.0072 = 0.0024

P(X | Class = Yes) = P(Refund = No | Yes) × P(Married | Yes) × P(Income = 120K | Yes)
                   = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) · P(No) > P(X | Yes) · P(Yes), it follows that P(No | X) > P(Yes | X)
⇒ Class = No
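An end-to-end sketch of this example (toy code over the ten training records;
attribute handling as described above):

```python
# Naive Bayes over the ten training records: discrete attributes by counting,
# Taxable Income via a per-class normal distribution.
from math import sqrt, pi, exp

train = [  # (Refund, Marital Status, Income in K, Evade)
    ("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes"),
]

def normal_pdf(x, mean, var):
    return 1.0 / sqrt(2 * pi * var) * exp(-(x - mean) ** 2 / (2 * var))

def score(c, refund, marital, income):
    rows = [r for r in train if r[3] == c]
    prior = len(rows) / len(train)                          # P(C) = Nc / N
    p_refund = sum(r[0] == refund for r in rows) / len(rows)
    p_marital = sum(r[1] == marital for r in rows) / len(rows)
    incomes = [r[2] for r in rows]
    mean = sum(incomes) / len(incomes)
    var = sum((x - mean) ** 2 for x in incomes) / (len(incomes) - 1)
    return prior * p_refund * p_marital * normal_pdf(income, mean, var)

for c in ("No", "Yes"):
    print(c, score(c, "No", "Married", 120))  # "No" wins; "Yes" scores 0
```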

4. Support Vector Machines

Idea: Find a linear hyperplane (decision boundary) that will separate the data.

Support Vector Machines: Possible Solutions

[Figure: the same data can be separated by many different hyperplanes,
e.g., B1, B2, and others.]

Support Vector Machines

• Which hyperplane is better, B1 or B2?
• How do you define "better"?

Support Vector Machines

[Figure: hyperplane B1 with margin boundaries b11 and b12, and hyperplane B2
with margin boundaries b21 and b22; the distance between a hyperplane's two
boundaries is its margin.]

• Find the hyperplane that maximizes the margin ⇒ B1 is better than B2.

Support Vector Machines

The hyperplane B1 is defined by  w · x + b = 0,  with margin boundaries

  b11:  w · x + b = +1
  b12:  w · x + b = −1

Classification rule:

  f(x) = +1  if  w · x + b ≥ 1
  f(x) = −1  if  w · x + b ≤ −1

Margin = 2 / ‖w‖2
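A tiny numeric illustration of the decision rule and the margin, with made-up
weights (the slides derive no concrete numbers):

```python
# Linear decision rule on w·x + b, and margin 2 / ||w||_2.
from math import sqrt

w, b = [2.0, 1.0], -5.0   # hypothetical hyperplane parameters

def f(x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return 0                                   # inside the margin: not classified

margin = 2.0 / sqrt(sum(wi * wi for wi in w))  # 2 / ||w||_2
print(f([4.0, 0.0]), f([1.0, 1.0]), margin)    # 1 -1 ~0.894
```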

Summary: Data Mining

• Frequent itemset and association rule mining:
  – Apriori principle and algorithm
• Clustering:
  – K-means
  – Hierarchical clustering
  – DBSCAN (density-based clustering)
• Classification:
  – Decision trees
  – Naïve Bayes classifier
  – Support Vector Machines (SVMs)