
Data Mining I
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

Keith E. Emmert
Tarleton State University

October 2, 2012

Outline

- Basic Concepts
- Frequent Itemset Mining Methods: Apriori
- Which Patterns Are Interesting? Pattern Evaluation Methods


Frequent Patterns


A frequent pattern is a pattern that appears often in a data set.

- A frequent itemset is a set of items that appears often in a data set (computer, monitor, surge protector, and a computer game).
- A frequent sequential pattern is a subsequence that appears often in a data set (first a computer, then computer game #1, then computer game #2).
- A frequent structured pattern is a substructure, constructed from a data set, that appears often (subgraphs, subtrees, etc.).


Market Basket Analysis


Market basket analysis uses customer purchases to develop associations between the different items purchased. This allows “big-brother” to produce more effective

- product placement and catalog design (computer games are placed near computers). Note that product placement in a store, in a catalog, and online may all differ, since different types of people shop in different ways;
- cross-marketing strategies: products and services of other companies that complement yours (you only sell computers, so partner with a company that sells cool computer games!);
- customer shopping behavior analyses (a computer was purchased, so they're going to buy a game; print computer game coupons at checkout!).


Some Basic Terms


Let I = {I1, . . . , Im} be a (universal) set of items.

- A transaction, T ⊆ I, is a random variable.
- D is the set of all transactions in a given time period.
- A ⊆ T is an itemset. If |A| = k, then it is called a k-itemset.

The occurrence frequency of an itemset is the number of transactions that contain the itemset; it is also known as the frequency, support count, count, or absolute support of the itemset. We wish to study which itemsets imply that other itemsets are also obtained. For itemsets A and B, we use the following notation:

- Pr(A ∪ B) denotes the probability that a transaction contains the union of the sets A and B, that is, it contains all items in both A and B.
- Pr(A or B) denotes the probability that a transaction contains A, B, or both.


Association Rules: Support


Suppose A, B ⊆ I are itemsets, A, B ≠ ∅, and A ∩ B = ∅. We wish to study “customers” who obtain A that also tend to obtain B. This is denoted by A ⇒ B and is called an association rule.

Support(A ⇒ B) = Pr(A ∪ B) is the percentage of transactions in D that contain both A and B. So

Support(A ⇒ B) = (# transactions that contain both A and B) / |D|.

Note that Support(A ⇒ B) = Support(B ⇒ A) = Support(A ∪ B). We shall also call this the relative support of the itemset A ∪ B.


Association Rules: Confidence


Define Confidence(A ⇒ B) = Pr(B | A), the percentage of transactions in D containing A that also contain B. Recall that for any itemset Q, the number of transactions that contain Q is called Occurrence Frequency(Q) = Support Count(Q). So

Confidence(A ⇒ B) = Pr(B | A)
                  = (# transactions that contain both A and B) / (# transactions that contain A)
                  = Support(A ∪ B) / Support(A)
                  = Support Count(A ∪ B) / Support Count(A).

In general, Confidence(A ⇒ B) ≠ Confidence(B ⇒ A).
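
As a quick illustration (not from the slides), here is a minimal Python sketch of these definitions; the helper names are mine:

    def support_count(itemset, transactions):
        """Number of transactions that contain every item of `itemset`."""
        s = set(itemset)
        return sum(1 for t in transactions if s <= set(t))

    def support(A, B, transactions):
        """Support(A => B): fraction of transactions containing both A and B."""
        return support_count(set(A) | set(B), transactions) / len(transactions)

    def confidence(A, B, transactions):
        """Confidence(A => B) = Support Count(A u B) / Support Count(A)."""
        return (support_count(set(A) | set(B), transactions)
                / support_count(A, transactions))

For the four-transaction database used in the Apriori example below, D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], this gives confidence({1}, {3}, D) = 2/2 = 1 but confidence({3}, {1}, D) = 2/3, illustrating the asymmetry.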


Example


Suppose that A contains a cat and B contains catnip. Then we can write

A ⇒ B [Support = 4%, Confidence = 40%]

and conclude that

- 4% of all transactions include both a cat and their drug of choice;
- 40% of all transactions that included a cat also included their drug of choice.

Of course, ⇒ is not a commutative operator, so B ⇒ A would have a different confidence (though, as noted above, the same support).


Frequent Itemsets and Strong Association Rules

- An itemset Q is called a frequent itemset if Support(Q) = Pr(Q) (the number of transactions that contain Q divided by the total number of transactions) is at least a prespecified minimum support threshold. Equivalently, the relative support of Q is at least the minimum support threshold if and only if its absolute support is at least the corresponding minimum support count threshold.
- The set of frequent k-itemsets is denoted by Lk.
- For two itemsets A and B, if Support(A ⇒ B) is at least a minimum support threshold and Confidence(A ⇒ B) is at least a minimum confidence threshold, then the association rule A ⇒ B is called strong.


When are Association Rules Strong?


Recall that the support count of an itemset is the number of transactions that contain the itemset. If we know

1. the number of transactions, |D|,
2. Support Count(A),
3. Support Count(B),
4. Support Count(A ∪ B),

then we can compute

- Confidence(A ⇒ B) = Support Count(A ∪ B) / Support Count(A),
- Confidence(B ⇒ A) = Support Count(A ∪ B) / Support Count(B),
- Support(A ⇒ B) = Support Count(A ∪ B) / |D|,
- Support(B ⇒ A) = Support(A ⇒ B).

So it is easy to derive the rules A ⇒ B as well as B ⇒ A and check whether or not they are strong.


Association Rule Mining and a Problem


A “simple” two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, minSup.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

The problem:

- If I is a frequent itemset, then each nonempty A ⊆ I is frequent.
- Hence, P(I)\{∅} contains all frequent nonempty subsets of I.
- Note that |P(I)\{∅}| = 2^|I| − 1 is the number of frequent itemsets contained in I, a possibly huge number.


Some More Terms


Let Q be an itemset and D the set of transactions, and fix a minimum support threshold.

- Q is closed in D if there is no itemset R with Q ⊊ R such that Support Count(Q) = Support Count(R).
- Q is a closed frequent itemset in D if Q is both closed and frequent.
- Q is a maximal frequent itemset in D if Q is frequent and there is no frequent itemset R with Q ⊊ R.
- C is the set of closed frequent itemsets of D.
- M is the set of maximal frequent itemsets of D.

Note that

- C, along with the support count of each itemset in C, allows us to derive the whole set of frequent itemsets with their support counts.
- However, M with the support counts of each itemset in M does NOT (in general) allow us to derive the support counts of all frequent itemsets.


Example


Let the (universal) set of items be I = {a1, . . . , a100} and let the transaction set be D = {Q = {a1, . . . , a100}, R = {a1, . . . , a50}}. Let the minimum support count be 1.

- Support Count(Q) = 1.
- Support Count(R) = 2, because R ⊆ R and R ⊆ Q.
- Clearly both are frequent itemsets.
- Q is closed because there is no itemset Y with Q ⊊ Y.
- R is closed because there is no itemset Y with R ⊊ Y and Support Count(R) = Support Count(Y).
- So C = {Q, R}.
- Since R ⊊ Q and Q is a frequent itemset, R is not a maximal frequent itemset.
- Clearly Q is a maximal frequent itemset.
- So M = {Q}.


Example Continued


As before, let I = {a1, . . . , a100}, D = {Q = {a1, . . . , a100}, R = {a1, . . . , a50}}, and let the minimum support count be 1. Recall that

- Support Count(Q) = 1 and Support Count(R) = 2 (because R ⊆ R and R ⊆ Q),
- C = {Q, R} and M = {Q}.

Hence:

- For any itemset U ⊆ Q, if U contains some aj with j > 50, then Support Count(U) = 1 (U is contained only in Q); otherwise Support Count(U) = 2 (U ⊆ R). So knowing the support counts of the itemsets in C yields the support count of every frequent itemset.
- In contrast, we cannot compute Support Count({a1, a2}) given only the support count of M.


Apriori

- The Apriori (“prior knowledge”) algorithm was developed by Rakesh Agrawal and Ramakrishnan Srikant (Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994). It is used to find the frequent itemsets for Boolean association rules.
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
- Corollary: if I is not frequent, then for every item a, I ∪ {a} is not frequent.


The Basic Idea


Suppose we have Lk−1, the frequent (k − 1)-itemsets, for k ≥ 2, and assume the items within each transaction are sorted in lexicographic order. Goal: find Lk.

- The join step, Ck = Lk−1 ⋈ Lk−1: join Lk−1 with itself to form candidate k-itemsets, where ⋈ means:
  - l1, l2 ∈ Lk−1 are joinable if l1[n] = l2[n] for n = 1, . . . , k − 2 and l1[k − 1] < l2[k − 1];
  - the condition l1[k − 1] < l2[k − 1] ensures that no duplicates are generated;
  - the new k-itemset is {l1[1], l1[2], . . . , l1[k − 1], l2[k − 1]}.
- The prune step: Lk ⊆ Ck, but some candidate k-itemsets may not be frequent.
  - Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
  - Hence any candidate k-itemset in Ck that has a (k − 1)-subset not in Lk−1 cannot be frequent and must be removed!


Example


Suppose our transaction database is as given below, with minSup = 2 (a support count). Find the frequent itemsets.

TID    Items
T001   1, 3, 4
T002   2, 3, 5
T003   1, 2, 3, 5
T004   2, 5

C1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

Pruning {4} (support 1 < 2) yields L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3


Example


With the same database and minSup = 2, join L1 with itself to get C2 and count supports:

C2:
Itemset  Support
{1, 2}   1
{1, 3}   2
{1, 5}   1
{2, 3}   2
{2, 5}   3
{3, 5}   2

Discarding the infrequent candidates {1, 2} and {1, 5} yields L2:
Itemset  Support
{1, 3}   2
{2, 3}   2
{2, 5}   3
{3, 5}   2

Note that C2 contains all possible joins, and the prune step removes nothing, because every 1-subset is a frequent 1-itemset.


Example


With the same database and minSup = 2:

C3:                      L3:
Itemset    Support       Itemset    Support
{2, 3, 5}  2       =⇒    {2, 3, 5}  2

Note that C3 does not contain

- {1, 2, 3}, because {1, 2} is not a frequent 2-itemset;
- {1, 2, 5}, because {1, 2} is not a frequent 2-itemset;
- {1, 3, 5}, because {1, 5} is not a frequent 2-itemset.

Further joins are not possible, that is, C4 = L3 ⋈ L3 = ∅, and the Apriori algorithm terminates.


Apriori Algorithm


input:  D, a database of transactions;
        minSup, the minimum support count threshold.
output: L, the frequent itemsets in D.

begin
    L1 = findFrequent1Itemsets(D)
    for (k = 2; Lk−1 ≠ ∅; k++) do
        Ck = apriori.gen(Lk−1)
        for each transaction t ∈ D do
            Ct = subset(Ck, t)            /* candidates contained in t */
            for each c ∈ Ct do
                c.count++
            end
        end
        Lk = {c ∈ Ck | c.count ≥ minSup}
    end
    return L = ∪k Lk
end
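
The pseudocode translates almost line for line into Python. The sketch below is mine (not from the slides): it folds the join and prune steps of apriori.gen into the main loop and represents itemsets as frozensets; min_sup is a support count, as above.

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return a dict mapping each frequent itemset (a frozenset) to its support count."""
        transactions = [frozenset(t) for t in transactions]
        counts = {}
        for t in transactions:                       # find the frequent 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: n for s, n in counts.items() if n >= min_sup}
        result = dict(L)
        k = 2
        while L:
            prev = sorted(tuple(sorted(s)) for s in L)
            candidates = set()
            for i in range(len(prev)):               # join step: L(k-1) joined with itself
                for j in range(i + 1, len(prev)):
                    l1, l2 = prev[i], prev[j]
                    if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                        c = frozenset(l1 + (l2[k - 2],))
                        # prune step: every (k-1)-subset must itself be frequent
                        if all(frozenset(s) in L for s in combinations(c, k - 1)):
                            candidates.add(c)
            counts = {c: sum(1 for t in transactions if c <= t)
                      for c in candidates}           # one scan of D per level
            L = {s: n for s, n in counts.items() if n >= min_sup}
            result.update(L)
            k += 1
        return result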


Procedure apriori.gen


input:  Lk−1, the frequent (k − 1)-itemsets
output: Ck, the candidate k-itemsets

begin
    for each l1 ∈ Lk−1 do
        for each l2 ∈ Lk−1 do
            if (l1[1] = l2[1]) ∧ · · · ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1]) then
                c = l1 ⋈ l2                        /* join step: generate candidate */
                if has.infrequent.subset(c, Lk−1) then
                    delete c                       /* prune step */
                else
                    add c to Ck
                end
            end
        end
    end
    return Ck
end


Procedure has.infrequent.subset


input:  c, a candidate k-itemset;
        Lk−1, the frequent (k − 1)-itemsets
output: TRUE or FALSE

begin
    for each (k − 1)-subset s of c do
        if s ∉ Lk−1 then
            return TRUE
        end
    end
    return FALSE
end


Generating Strong Association Rules A ⇒ B

- Assume that a minimum support, minSup, and a minimum confidence, minConf, are given, and D is the set of transactions. We want

  Support(A ⇒ B) = Support Count(A ∪ B) / |D| ≥ minSup and
  Confidence(A ⇒ B) = Support Count(A ∪ B) / Support Count(A) ≥ minConf.

- For each frequent itemset I (consider each frequent itemset from L2, L3, . . .), generate all nonempty proper subsets of I.
- For every nonempty proper subset s of I, output the rule s ⇒ (I \ s) if Support Count(I) / Support Count(s) ≥ minConf.
- Support(s ⇒ (I \ s)) ≥ minSup holds automatically for every rule generated this way, since we use only subsets of frequent itemsets.
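
A Python sketch of this generation loop (names are mine), assuming freq maps each frequent itemset to its support count, e.g. the output of the apriori sketch above:

    from itertools import combinations

    def strong_rules(freq, min_conf):
        """Yield (s, I \\ s, confidence) for each rule s => (I \\ s) meeting min_conf."""
        for itemset, count in freq.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):         # nonempty proper subsets s of I
                for sub in combinations(itemset, r):
                    s = frozenset(sub)
                    conf = count / freq[s]           # freq[s] exists by the Apriori property
                    if conf >= min_conf:
                        yield s, itemset - s, conf

Support need not be rechecked here: every rule generated from a frequent itemset automatically meets minSup.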


Example Association Rules from L2

Suppose our transaction database is as given below, with minSup = 2 and minConf = 0.7.

TID    Items
T001   1, 3, 4
T002   2, 3, 5
T003   1, 2, 3, 5
T004   2, 5

L1: {1} (support 2); {2}, {3}, {5} (support 3).
L2: {1, 3} (support 2), {2, 3} (2), {2, 5} (3), {3, 5} (2).

The rules from L2:

Rule         Confidence
{1} ⇒ {3}    2/2 = 100%
{2} ⇒ {3}    2/3 = 66.7%
{2} ⇒ {5}    3/3 = 100%
{3} ⇒ {5}    2/3 = 66.7%
{3} ⇒ {1}    2/3 = 66.7%
{3} ⇒ {2}    2/3 = 66.7%
{5} ⇒ {2}    3/3 = 100%
{5} ⇒ {3}    2/3 = 66.7%

Hence {1} ⇒ {3}, {2} ⇒ {5}, and {5} ⇒ {2} are the strong association rules from L2.


Example Association Rules from L3

With the same database, minSup = 2, and minConf = 0.7, consider the single frequent 3-itemset {2, 3, 5} ∈ L3 (support 2). The rules from L3:

Rule             Confidence
{2} ⇒ {3, 5}     2/3 = 66.7%
{3} ⇒ {2, 5}     2/3 = 66.7%
{5} ⇒ {2, 3}     2/3 = 66.7%
{2, 3} ⇒ {5}     2/2 = 100%
{2, 5} ⇒ {3}     2/3 = 66.7%
{3, 5} ⇒ {2}     2/2 = 100%

Hence {2, 3} ⇒ {5} and {3, 5} ⇒ {2} are strong.


When Strong Association Rules Go Bad


Suppose

- the number of transactions is |D| = 10000,
- 6000 transactions contain computer games,
- 7500 transactions contain videos,
- 4000 transactions contain both computer games and videos, and
- based on minSup = 0.3 and minConf = 0.6, Apriori generated the association rule

  Computer Games ⇒ Videos [Support = 40%, Confidence = 66%].

This is a strong association rule! But the overall probability of purchasing videos is 7500/10000 = 75%, which is higher than the rule's 66% confidence. This is bad: it appears that computer games and videos are negatively correlated, since purchasing a computer game decreases the likelihood of purchasing a video. (Indeed, lift = 0.40/(0.60 · 0.75) ≈ 0.89 < 1; see the next slide.)


Correlation: Lift


The idea is that the occurrence of A (based on the transactions A appears in) is independent of the occurrence of B if Pr(A ∪ B) = Pr(A) Pr(B). Lift is defined to be

lift(A, B) = Pr(A ∪ B) / (Pr(A) Pr(B)) = Pr(B | A) / Pr(B) = Confidence(A ⇒ B) / Support(B),

where Support(B) is the relative support Pr(B). Note that lift(A, B) = lift(B, A).

- If lift(A, B) < 1, then the occurrence of A is negatively correlated with the occurrence of B.
- If lift(A, B) > 1, then the occurrence of A is positively correlated with the occurrence of B.
- If lift(A, B) = 1, then A and B are independent of each other, and any association rule between these variables may not be useful.
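
A one-function Python sketch, reusing the support_count helper sketched earlier (so that helper is an assumption here):

    def lift(A, B, transactions):
        """lift(A, B) = Pr(A u B) / (Pr(A) Pr(B)); values above 1 suggest positive correlation."""
        n = len(transactions)
        p_ab = support_count(set(A) | set(B), transactions) / n
        p_a = support_count(A, transactions) / n
        p_b = support_count(B, transactions) / n
        return p_ab / (p_a * p_b)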


A Tiny Example


Suppose you have noted the following transactions. Our minimum support count threshold is 3 and our minimum confidence threshold is 50%.

TID    Items
T001   {1, 2, 3, 4}
T002   {1, 2}
T003   {2, 3, 4}
T004   {2, 3}
T005   {1, 2, 4}
T006   {3, 4}
T007   {2, 4}

L1:                 L2:
Itemset  Support    Itemset  Support
{1}      3          {1, 2}   3
{2}      6          {2, 3}   3
{3}      4          {2, 4}   4
{4}      5          {3, 4}   3

Note that no 3-itemset made the minimum support threshold of 3, so L3 = ∅.
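
Assuming the apriori and strong_rules sketches from earlier, this example can be reproduced directly:

    D = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
    freq = apriori(D, min_sup=3)
    # freq contains {1}: 3, {2}: 6, {3}: 4, {4}: 5,
    #               {1,2}: 3, {2,3}: 3, {2,4}: 4, {3,4}: 3  (keys are frozensets)
    rules = list(strong_rules(freq, min_conf=0.5))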


Example - The Association Rules Sorted by Lift


A ⇒ B          Conf(A ⇒ B)   Support(B)/|D|   Lift(A, B)
{1} ⇒ {2}      3/3           6/7              7/6
{2} ⇒ {1}      1/2           3/7              7/6
{1} ⇒ {2, 4}   2/3           4/7              7/6
{2, 4} ⇒ {1}   1/2           3/7              7/6
{3} ⇒ {4}      3/4           5/7              21/20
{4} ⇒ {3}      3/5           4/7              21/20
{4} ⇒ {2}      4/5           6/7              14/15
{2} ⇒ {4}      2/3           5/7              14/15
{1} ⇒ {4}      2/3           5/7              14/15
{1, 2} ⇒ {4}   2/3           5/7              14/15
{2, 3} ⇒ {4}   2/3           5/7              14/15
{3} ⇒ {2}      3/4           6/7              7/8
{2} ⇒ {3}      1/2           4/7              7/8
{2, 4} ⇒ {3}   1/2           4/7              7/8
{3, 4} ⇒ {2}   2/3           6/7              7/9

Here Lift(A, B) = Conf(A ⇒ B) / (Support(B)/|D|); since lift is symmetric, A ⇒ B and B ⇒ A always share the same lift.


Example - Some Conclusions

- {1} ⇒ {2} has the highest lift, 7/6 > 1, so items 1 and 2 are positively correlated; moreover, its confidence is 100%.
  - Offering a coupon for item 1 may entice the buyer to purchase item 2. The price of item 2 might be slightly increased to further enhance revenue from the bundled purchase.
  - Buying a cat is followed by buying cat litter.
- {2} ⇒ {1} shares that lift (lift is symmetric), but its confidence is only 50%.
  - So coupons for item 2 will probably not entice someone to purchase item 1!
  - Buying cat litter need not be followed by buying a cat.
- Rules with lift below one, such as {2} ⇒ {4} with lift 14/15, involve negatively correlated items; promoting one is unlikely to help sales of the other.


Other Pattern Evaluation Measures: χ² Test for Independence

- For the χ² test we need observed and expected values. The statistic is

  χ² = Σ (observed − expected)² / expected.

- The hypotheses are

  H0: the variables are independent,
  Ha: the variables are related.

- The degrees of freedom are df = (Rows − 1)(Columns − 1).
- Reject H0 when χ² exceeds the critical value χ²α.
- Some statisticians suggest that each cell count in a contingency table should be at least 5.


Other Pattern Evaluation Measures: χ² Test for Independence

The typical contingency table for A ⇒ B is shown below. Ā represents transactions that do not contain itemset A, and B̄ represents transactions that do not contain B. Parentheses hold the expected counts.

        A          Ā
B       c1 (d1)    c2 (d2)
B̄       c3 (d3)    c4 (d4)

- A χ² value close to zero indicates a lack of evidence against independence (fail to reject H0).
- Large χ² values indicate dependence (reject H0).
- If the observed value of the (A, B) cell is smaller than the expected value, then there is a negative correlation in the rule A ⇒ B.
- If the observed value is larger than the expected value, then there is a positive correlation in the rule A ⇒ B.


Other Pattern Evaluation Measures: All Confidence


Given two itemsets A and B,

All Conf(A, B) = Support(A ∪ B) / max{Support(A), Support(B)}
             = min{Pr(A | B), Pr(B | A)}
             = min{Conf(B ⇒ A), Conf(A ⇒ B)}.

This is the minimum confidence of the two association rules A ⇒ B and B ⇒ A. The measure is anti-monotonic (the Apriori property): if an itemset cannot pass a minimum all-confidence threshold, none of its supersets can. This is very useful for pruning.


Other Pattern Evaluation Measures: Max Confidence


Given two itemsets A and B,

Max Conf(A, B) = Support(A ∪ B) / min{Support(A), Support(B)}
             = max{Pr(A | B), Pr(B | A)}
             = max{Conf(B ⇒ A), Conf(A ⇒ B)}.

This is the maximum confidence of the two association rules A ⇒ B and B ⇒ A. The measure is monotonic: if an itemset's max-confidence is no less than γ, then the max-confidence of all of its supersets is as well.


Other Pattern Evaluation Measures: Kulczynski


Given two itemsets A and B,

Kulc(A, B) = (1/2)(Pr(A | B) + Pr(B | A))
           = (1/2)(Conf(B ⇒ A) + Conf(A ⇒ B)).

This is the average confidence (the arithmetic mean) of the two association rules A ⇒ B and B ⇒ A. It has neither the monotonicity nor the anti-monotonicity property.


Other Pattern Evaluation Measures: Cosine


Given two itemsets A and B,

Cosine(A, B) = Pr(A ∪ B) / √(Pr(A) Pr(B))
             = √(Pr(A | B) Pr(B | A))
             = √(Conf(B ⇒ A) Conf(A ⇒ B)).

This is called a harmonized lift measure: the square root in the denominator ensures that the cosine value is influenced only by the supports of A, B, and A ∪ B. It is also the geometric mean of the two confidences. (The generalized mean M^−1(A, B), defined on a later slide, gives the harmonic mean.) It has neither the monotonicity nor the anti-monotonicity property.
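
All four measures depend only on the support counts of A, B, and A ∪ B, so they are easy to sketch in Python (names mine; a, b, ab below are those three counts). Note that the total number of transactions never appears, which is exactly the null-invariance property discussed later:

    from math import sqrt

    def all_conf(a, b, ab):
        return ab / max(a, b)

    def max_conf(a, b, ab):
        return ab / min(a, b)

    def kulc(a, b, ab):
        return (ab / a + ab / b) / 2

    def cosine(a, b, ab):
        return ab / sqrt(a * b)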


Mathematical Generalized Mean

See “Re-examination of interestingness measures in pattern mining: a unified framework” by Wu, Chen, & Han, 2009.

The mathematical generalized mean is defined by

M^k(A, B) = ( (Pr(A | B)^k + Pr(B | A)^k) / 2 )^(1/k),

where k ∈ R.

Theorem. For all k ∈ R, M^k satisfies the following properties:

P1  M^k(A, B) ∈ [0, 1].
P2  M^k monotonically decreases with Support(A) (or Support(B)) when Support(A ∪ B) and Support(B) (or Support(A)) remain constant.
P3  M^k is symmetric under item permutations.
P4  M^k is invariant to scaling, i.e., multiplying Support(A ∪ B), Support(A), and Support(B) by a common scaling factor does not affect the measure.
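
A direct numerical sketch (the function name is mine); the k = 0 branch implements the limiting case, which is the geometric mean:

    def generalized_mean(p1, p2, k):
        """M^k of p1 = Pr(A | B) and p2 = Pr(B | A)."""
        if k == 0:
            return (p1 * p2) ** 0.5          # limit as k -> 0: geometric mean (Cosine)
        return ((p1 ** k + p2 ** k) / 2) ** (1 / k)

For example, generalized_mean(0.091, 0.909, -60) ≈ 0.09 and generalized_mean(0.091, 0.909, 60) ≈ 0.90, numerically approaching the all-confidence and max-confidence limits stated in the theorem on the next slide, while k = 1 gives the Kulczynski value 0.5.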


Using the Mathematical Generalized Mean

Theorem. Recall that for two itemsets A and B the mathematical generalized mean is

M^k(A, B) = ( (Pr(A | B)^k + Pr(B | A)^k) / 2 )^(1/k),

where k ∈ R. The following hold:

- All Conf(A, B) = lim_{k→−∞} M^k(A, B) = min{Pr(A | B), Pr(B | A)}
- Cosine(A, B) = lim_{k→0} M^k(A, B) = √(Pr(A | B) Pr(B | A))
- Kulc(A, B) = lim_{k→1} M^k(A, B) = (Pr(A | B) + Pr(B | A)) / 2
- Max Conf(A, B) = lim_{k→∞} M^k(A, B) = max{Pr(A | B), Pr(B | A)}


Common Properties of the All Confidence, Max Confidence, Kulczynski, and Cosine Measures

- Each measure is influenced only by Pr(A | B) and Pr(B | A) (that is, by the supports of A, B, and A ∪ B), not by the total number of transactions.
- Each returns a value between 0 and 1, with larger values indicating a stronger correlation.
- For all itemsets A and B,

  All Conf(A, B) ≤ Cosine(A, B) ≤ Kulc(A, B) ≤ Max Conf(A, B).


Null Transactions


A null-transaction is a transaction that contains none of the itemsets being examined. A measure is called null-invariant if it is not influenced by null-transactions.

- Often, null-transactions vastly outnumber the transactions containing the individual itemsets being examined.
- Lift and χ² are sensitive to null-transactions.
- The all-confidence, max-confidence, Kulczynski, and cosine measures remove the influence of null-transactions, and hence they are null-invariant.


Example


Suppose we have milk and coffee under study. Let m denote milk and c denote coffee. Then c̄ denotes transactions that do not contain coffee, m̄ denotes transactions that do not contain milk, and m̄c̄ denotes the null-transactions that contain neither milk nor coffee.

              milk    no milk    Σ_row
  coffee      mc      m̄c         c
  no coffee   mc̄      m̄c̄         c̄
  Σ_col       m       m̄          Σ

  Data Set    mc      m̄c      mc̄       m̄c̄
  D1          10000   1000    1000     100000
  D2          10000   1000    1000     100
  D3          100     1000    1000     100000
  D4          1000    1000    1000     100000
  D5          1000    100     10000    100000
  D6          1000    10      100000   100000


Example

With the data sets above:

- Milk and coffee are positively associated in D1 and D2, since for either data set
  Support(mc) = 10000 > Support(mc̄) = 1000 and Support(mc) = 10000 > Support(m̄c) = 1000.
- Milk and coffee are negatively associated in D3, since
  Support(mc) = 100 < Support(mc̄) = 1000 and Support(mc) = 100 < Support(m̄c) = 1000.
- Milk and coffee are neutral in D4, since
  Support(mc) = 1000 = Support(mc̄) and Support(mc) = 1000 = Support(m̄c).


Lift and χ² - Sensitive to Null Transactions


Data Set    mc      m̄c      mc̄       m̄c̄       χ²       lift
D1          10000   1000    1000     100000   90557    9.26
D2          10000   1000    1000     100      0        1
D3          100     1000    1000     100000   670      8.44
D4          1000    1000    1000     100000   24740    25.75
D5          1000    100     10000    100000   8173     9.18
D6          1000    10      100000   100000   965      1.97

- The association of m and c is positive in D1 and D2, negative in D3, and neutral in D4.
- However, D1 and D2 produce dramatically different results under both χ² and lift! This is due to sensitivity to the null-transactions m̄c̄.
- For D3, the negative association is not indicated (both measures suggest a positive one).
- For D4, a very strong positive association is indicated, even though m and c are neutral!


All Confidence and Max Confidence


Recall that the association of m and c is positive in D1 and D2, negative in D3, and neutral in D4. For the remaining data sets:

- D5: Support(mc)/Support(c) = 1000/(1000 + 100) = 0.909, so c ⇒ m should occur, while Support(mc)/Support(m) = 1000/(1000 + 10000) = 0.091, so m ⇒ c probably won't occur.
- D6: Support(mc)/Support(c) = 1000/(1000 + 10) = 0.990, so c ⇒ m should occur, while Support(mc)/Support(m) = 1000/(1000 + 100000) = 0.0099, so m ⇒ c probably won't occur.


All Confidence and Max Confidence


Data Set    mc      m̄c      mc̄       m̄c̄       AllConf   MaxConf
D1          10000   1000    1000     100000   0.91      0.91
D2          10000   1000    1000     100      0.91      0.91
D3          100     1000    1000     100000   0.09      0.09
D4          1000    1000    1000     100000   0.5       0.5
D5          1000    100     10000    100000   0.09      0.91
D6          1000    10      100000   100000   0.01      0.99

- The association of m and c is positive in D1 and D2, negative in D3, and neutral in D4; in D5 and D6, c ⇒ m should occur while m ⇒ c probably won't.
- AllConf and MaxConf agree on D1 through D4.
- AllConf and MaxConf give opposite results for D5 and D6.


Kulczynski and Cosine


Data Set    mc      m̄c      mc̄       m̄c̄       Kulc   Cosine
D1          10000   1000    1000     100000   0.91   0.91
D2          10000   1000    1000     100      0.91   0.91
D3          100     1000    1000     100000   0.09   0.09
D4          1000    1000    1000     100000   0.5    0.5
D5          1000    100     10000    100000   0.5    0.29
D6          1000    10      100000   100000   0.5    0.10

- The association of m and c is positive in D1 and D2, negative in D3, and neutral in D4; in D5 and D6, c ⇒ m probably yes and m ⇒ c probably no.
- Kulc and Cosine agree on D1 through D4.
- Kulc views D5 and D6 (perhaps correctly?) as neutral.
- Cosine views D5 and D6 as negatively associated.


When are A and B “Controversial”?


To measure how far “out of whack” A and B are, we want a function IR(A, B) with the following properties:

Q1  IR(A, B) ∈ [0, 1], with
    IR(A, B) = 0 ⟺ Pr(A | B) = Pr(B | A), and
    IR(A, B) = 1 ⟺ |Pr(A | B) − Pr(B | A)| = 1.
Q2  IR(A, B) = IR(B, A) (symmetry).
Q3  IR(A, B) monotonically decreases as Support(AB) increases while Support(AB̄) and Support(ĀB) are fixed. (IR decreases as A and B share more common transactions.)
Q4  If Support(AB) and Support(ĀB) are fixed, then
    if Sup(AB̄) ≥ Sup(ĀB), IR increases as Sup(AB̄) increases;
    if Sup(AB̄) ≤ Sup(ĀB), IR increases as Sup(AB̄) decreases.
    Thus IR increases as Sup(AB̄) and Sup(ĀB) move apart. Similar results hold for Sup(ĀB) due to symmetry.


Tiny Set Theory


For any two sets U and V, we have U = (U ∩ V) ∪ (U ∩ V̄). Hence:

- Support(A) = Support(AB) + Support(AB̄)
- Support(B) = Support(AB) + Support(ĀB)
- Support(A) − Support(B) = Support(AB̄) − Support(ĀB)


Imbalance Ratio

The imbalance ratio, which assesses the imbalance of two itemsets A and B, is

IR(A, B) = |Support(A) − Support(B)| / (Support(A) + Support(B) − Support(A ∪ B))
         = |Support(AB̄) − Support(ĀB)| / (Support(AB) + Support(AB̄) + Support(ĀB)).

- The numerator is the (absolute) difference between the supports of the itemsets A and B.
- The denominator is the number of transactions containing A or B (or both).
- If Conf(A ⇒ B) = Conf(B ⇒ A), then IR(A, B) = 0; otherwise, the further the two rules are out of agreement, the larger the ratio.
- IR is independent of the number of null-transactions as well as the total number of transactions.


Imbalance Ratio


Data Set    mc      m̄c      mc̄       m̄c̄       IR      Kulc
D1          10000   1000    1000     100000   0       0.91
D2          10000   1000    1000     100      0       0.91
D3          100     1000    1000     100000   0       0.09
D4          1000    1000    1000     100000   0       0.5
D5          1000    100     10000    100000   0.89    0.5
D6          1000    10      100000   100000   0.99    0.5

- The association of m and c is positive in D1 and D2, negative in D3, and neutral in D4; in D5 and D6, c ⇒ m probably yes and m ⇒ c probably no.
- D5: IR(m, c) = |(1000 + 10000) − (1000 + 100)| / ((1000 + 10000) + (1000 + 100) − 1000) = 9900/11100 ≈ 0.89, and similarly D6: IR(m, c) = 99990/101010 ≈ 0.99. Both indicate a large imbalance.
- Thus Kulc and IR together give a better picture: D4 is neutral and balanced, while D5 and D6 are neutral by Kulc but highly unbalanced, because c ⇒ m is “yes” while m ⇒ c is “no”.


Homework


Chapter 6, Problems: 8, 11, 14
