Association Mining (aka Affinity Grouping)
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases, relational databases, and other information repositories.
• Applications:
  – Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples. Rule form: If (Something) Then (Something Else), i.e., (Body) → (Head); [support (or coverage), confidence (or accuracy)].
  – buys(“beer”) → buys(“chips”); [0.5%, 60%]
  – major(“CS”) ^ takes(“DB”) → grade(“A”); [1%, 75%]
Association Rules: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (e.g., purchased by a customer during a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., (78% of) people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  – Home Electronics ⇒ * (What other products should the store stock up on?)
  – Attached mailing in direct marketing
The Useful, the Trivial and the Inexplicable
• Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also buying Snickers, Butterfinger, or M&M candy.
• Customers who purchase maintenance agreements are very likely to purchase large appliances.
• When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners.
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(“SQLServer”) ^ buys(“DMBook”) → buys(“DBMiner”); [0.2%, 60%]
  – age(“30...39”) ^ income(“42...48K”) → buys(“PC”); [1%, 75%]
• Single-dimensional vs. multi-dimensional associations (see the examples above)
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of chips?
• Various extensions
  – Correlation and causality analysis
    • Association does not necessarily imply correlation or causality
  – Maxpatterns (a “frequent” pattern none of whose superpatterns is “frequent”)
  – Closed itemsets (an itemset such that no proper superset of it is contained in every transaction that contains it, i.e., no proper superset has the same support)
Choosing the Correct Set of Items
• At a grocery store
  – Items might be Milk, Sugar, Fruit, Coffee, Frozen pizza
• If you’re the manager of frozen foods within the grocery store (or you run a pizzeria…)
  – Items might be Extra Cheese, Onions, Peppers, Mushrooms, Sausage
Product Hierarchies
[Figure: a product hierarchy running from more general to more detailed levels — Frozen Foods branches into Frozen Desserts, Frozen Dinners, and Frozen Veggies; Frozen Desserts into Frozen Yogurt, Ice Cream, and Frozen Fruitbars; Ice Cream into Vanilla, Chocolate, Strawberry, Rocky Road, Pistachio, and Other; Frozen Veggies into Peas, Carrots, Mixed, and Other. The most detailed level consists of brands, sizes, and stock keeping units (SKUs).]
Some Notation
• Itemset ≡ set of items
• k-itemset ≡ itemset that consists of k items
  – E.g., {beer, chips, salsa} is a 3-itemset
• Frequency (σ), sometimes also called support count, of an itemset ≡ number of transactions in the dataset that contain the itemset
• Support (s), sometimes called relative support ≡ the frequency divided by the total number of transactions in the dataset
• Frequent itemset ≡ an itemset whose support exceeds some minimum support threshold min_sup (or equivalently, whose support count exceeds some minimum support count threshold)
• Lk ≡ set of all frequent k-itemsets
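To make the notation concrete, here is a minimal Python sketch of frequency (support count), support, and the frequent-itemset test. The four transactions and the threshold min_sup = 0.5 are invented purely for illustration.

```python
from itertools import combinations

# Hypothetical transaction database: each transaction is a set of items
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "salsa"},
]

def frequency(itemset, transactions):
    """Support count (sigma): number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support (s): frequency divided by the total number of transactions."""
    return frequency(itemset, transactions) / len(transactions)

min_sup = 0.5  # arbitrary minimum support threshold for this illustration
items = sorted(set().union(*transactions))

# L2: all frequent 2-itemsets
L2 = [set(pair) for pair in combinations(items, 2)
      if support(set(pair), transactions) >= min_sup]

print(frequency({"beer", "chips"}, transactions))  # 2
print(support({"beer", "chips"}, transactions))    # 0.5
print(L2)  # [{'beer', 'chips'}, {'chips', 'salsa'}]
```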
Binary Representation
• Each row corresponds to a transaction and each column to an item
• If the item is present in the transaction the entry is 1, otherwise it is 0
• Ignores non-binary information such as quantities, prices paid, etc.

  TID   A  B  C  D  E  F
  2000  1  1  1  0  0  0
  1000  1  0  1  0  0  0
  4000  1  0  0  1  0  0
  5000  0  1  0  0  1  1
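A small sketch of how such a binary matrix could be built from raw transactions in Python; the TIDs and items match the table above, and the print format is only illustrative.

```python
# The same four transactions as in the table; quantities and prices are discarded.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

items = sorted(set().union(*transactions.values()))  # ['A', 'B', 'C', 'D', 'E', 'F']

print("TID   " + "  ".join(items))
for tid, basket in transactions.items():
    row = "  ".join("1" if item in basket else "0" for item in items)
    print(f"{tid}  {row}")
```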
Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy beer, customers who buy chips, and customers who buy both.]
• Find all the rules X ⇒ Y with minimum support and confidence
  – support, s ≡ probability that a transaction contains {X AND Y} = σ(X & Y) ÷ |T|
  – confidence, c ≡ conditional probability that a transaction having {X} also contains {Y}, i.e., σ(X & Y) ÷ σ(X)

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

• Let minimum support be 50% and minimum confidence 50%; then we have
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)
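The two rules above can be checked directly. The following sketch recomputes their support and confidence from the same four transactions (TIDs are kept only as comments, since only set membership matters).

```python
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(X | Y) / support(X)

print(support({"A", "C"}))        # 0.5     -> both rules have 50% support
print(confidence({"A"}, {"C"}))   # 0.666... for A => C
print(confidence({"C"}, {"A"}))   # 1.0      for C => A
```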
Rule Measures: Why?
• Support: important because a rule with low support
  – may occur just by chance
  – is probably not of much practical interest, since it happens so rarely
• Confidence: important because it
  – measures the reliability of a rule’s inference
  – estimates the conditional probability of the consequent (head), given the antecedent (body)
Mining Association Rules
• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ min_sup
  2. Rule Generation
     – From each frequent itemset, generate strong (high-confidence) rules, i.e., those with confidence higher than some threshold
     – Each rule is a binary partitioning of a frequent itemset (see the sketch below)
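As a sketch of step 2, the snippet below enumerates every binary partition of a hypothetical frequent itemset into an antecedent and a consequent; computing confidences and keeping only the strong rules is omitted for brevity.

```python
from itertools import combinations

def candidate_rules(frequent_itemset):
    """Yield every binary partition (antecedent, consequent) of a frequent itemset."""
    items = set(frequent_itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            yield set(antecedent), items - set(antecedent)

# A hypothetical frequent 3-itemset yields 2^3 - 2 = 6 candidate rules;
# rule generation would keep only those whose confidence exceeds the threshold.
for body, head in candidate_rules({"beer", "chips", "salsa"}):
    print(body, "=>", head)
```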
How Does it Work?
• Basic idea – brute-force counting!
Example: Six customers of Expedia.com have purchased the following:
  Customer 1: Flight, Cruise, Hotel
  Customer 2: Flight, Car, Hotel
  Customer 3: Package
  Customer 4: Flight, Car
  Customer 5: Flight, Hotel
  Customer 6: Flight, Cruise

Co-Occurrence Matrix
            Hotel  Car  Flight  Cruise  Package
  Hotel       3     1     3       1        0
  Car         1     2     2       0        0
  Flight      3     2     5       2        0
  Cruise      1     0     2       2        0
  Package     0     0     0       0        1
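A minimal brute-force counting sketch in Python for the six baskets above; the item counts give the diagonal of the co-occurrence matrix and the pair counts give the off-diagonal entries.

```python
from itertools import combinations
from collections import Counter

# The six Expedia baskets listed above
baskets = [
    {"Flight", "Cruise", "Hotel"},
    {"Flight", "Car", "Hotel"},
    {"Package"},
    {"Flight", "Car"},
    {"Flight", "Hotel"},
    {"Flight", "Cruise"},
]

item_counts = Counter()   # diagonal of the co-occurrence matrix
pair_counts = Counter()   # off-diagonal entries (unordered pairs)
for basket in baskets:
    item_counts.update(basket)
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

print(item_counts["Flight"])              # 5
print(pair_counts[("Flight", "Hotel")])   # 3
print(pair_counts[("Cruise", "Flight")])  # 2
```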
An Example
Min. support 33.33%; Min. confidence 50%

Co-Occurrence Matrix (as above)
            Hotel  Car  Flight  Cruise  Package
  Hotel       3     1     3       1        0
  Car         1     2     2       0        0
  Flight      3     2     5       2        0
  Cruise      1     0     2       2        0
  Package     0     0     0       0        1

  Frequent Itemset    Support
  {Hotel}             50%
  {Car}               33%
  {Flight}            83%
  {Cruise}            33%
  {Flight, Hotel}     50%
  {Flight, Car}       33%
  {Flight, Cruise}    33%

For the rule Cruise ⇒ Flight:
  Support = P(Cruise & Flight) = support({Cruise, Flight}) = 0.33
  Confidence = P(Flight | Cruise) = P(Flight & Cruise) ÷ P(Cruise)
             = support({Cruise, Flight}) ÷ support({Cruise}) = 0.33 / 0.33 = 100%
Frequent Itemset Generation
• Frequent itemset generation is computationally expensive
[Figure: itemset lattice over five items A–E, from the null (empty) itemset at the top down to ABCDE at the bottom. Source: Tan, Steinbach and Kumar]
• Given d items, there are 2^d − 1 possible candidate itemsets
Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d − 1
  – Total number of possible association rules:

      R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

    where C(n, r) is the binomial coefficient “n choose r”.
  – If d = 6, R = 602 rules. If d = 100, R ≈ 5.15 × 10^47 rules!
Source: Tan, Steinbach and Kumar
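The closed form can be sanity-checked against a direct enumeration over antecedent size k and consequent size j (a small Python sketch; math.comb supplies the binomial coefficient):

```python
from math import comb

def rule_count_direct(d):
    """Enumerate antecedent size k and consequent size j directly."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

def rule_count_closed_form(d):
    return 3 ** d - 2 ** (d + 1) + 1

print(rule_count_direct(6), rule_count_closed_form(6))  # 602 602
print(rule_count_closed_form(100))                      # about 5.15e47
```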
Key to Mining Frequent Itemsets: The Apriori Principle
• Every subset of a frequent itemset must also be frequent:
  – i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – As a corollary, if an itemset is infrequent then so are all of its supersets
• This holds because of the following anti-monotone property of support: ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• Iteratively find the frequent itemsets L1, L2, …, Lk with cardinality from 1 to k (k-itemsets), from the candidate itemsets C1, C2, …, Ck respectively.
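A quick empirical illustration of the anti-monotone property on an invented toy transaction list; checking one-item extensions is enough, since any superset is reached by adding one item at a time.

```python
from itertools import combinations

# Invented toy transactions
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def s(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

# Adding an item to any itemset never increases its support,
# so s(X) >= s(Y) whenever X is a subset of Y.
items = sorted(set().union(*transactions))
for k in range(1, len(items)):
    for X in combinations(items, k):
        for extra in items:
            if extra not in X:
                assert s(X) >= s(X + (extra,))
print("anti-monotone property holds on this toy dataset")
```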
Frequent Itemset Generation via the Apriori Principle
[Figure: the same itemset lattice over items A–E. Once an itemset is found to be infrequent, all of its supersets are pruned from the search space; the remaining nodes are the frequent itemsets. Source: Tan, Steinbach and Kumar]
Illustrating the Apriori Principle
Minimum support (count) = 3

C1 (candidate 1-itemsets):
  Item     Count
  Bread      4
  Coke       2
  Milk       4
  Beer       3
  Diaper     4
  Eggs       1

C2 (candidate 2-itemsets) — no need to generate candidates involving Coke or Eggs:
  Itemset           Count
  {Bread, Milk}       3
  {Bread, Beer}       2
  {Bread, Diaper}     3
  {Milk, Beer}        2
  {Milk, Diaper}      3
  {Beer, Diaper}      3

C3 (candidate 3-itemsets):
  Itemset                  Count
  {Bread, Milk, Diaper}      3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Source: Tan, Steinbach and Kumar
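The candidate-count comparison can be reproduced directly:

```python
from math import comb

# Candidate counts for 6 items: exhaustive enumeration of all 1-, 2- and
# 3-itemsets versus the candidates generated with support-based pruning.
brute_force = comb(6, 1) + comb(6, 2) + comb(6, 3)  # 6 + 15 + 20 = 41
with_pruning = 6 + 6 + 1                            # |C1| + |C2| + |C3| = 13
print(brute_force, with_pruning)                    # 41 13
```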
The Apriori Algorithm
• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
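The pseudo-code above can be turned into a short runnable sketch. This is one possible Python rendering (not the course's reference implementation), using a set-based join and the subset-pruning test; the example transactions are the ones from the support/confidence slide.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise frequent-itemset mining, following the pseudo-code above.

    transactions: list of sets of items.
    Returns a dict mapping each frequent itemset (frozenset) to its support count.
    """
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support_count}
    frequent = {s: counts[s] for s in Lk}

    k = 1
    while Lk:
        # Join step: (k+1)-candidates from unions of frequent k-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        # Count the surviving candidates in one pass over the database
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        Lk = {c for c, n in cand_counts.items() if n >= min_support_count}
        frequent.update({c: cand_counts[c] for c in Lk})
        k += 1
    return frequent

# Transactions from the earlier support/confidence example
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(transactions, min_support_count=2))
# e.g. {frozenset({'A'}): 3, frozenset({'B'}): 2, frozenset({'C'}): 2,
#       frozenset({'A', 'C'}): 2}  (dict order may vary)
```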
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order (e.g., lexicographic).
• Step 1 (Fk-1 × Fk-1 method): self-joining Lk-1
  – Insert into Ck, from Lk-1 (p) and Lk-1 (q), all cases where item1(p) = item1(q), item2(p) = item2(q), …, itemk-2(p) = itemk-2(q) and itemk-1(p) < itemk-1(q)
• Step 2 (pruning): delete any candidate in Ck that has an infrequent (k-1)-subset
• For rule generation, candidate rules are merged from two rules that share the same prefix in the rule consequent
  – E.g., join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
  – Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
Source: Tan, Steinbach and Kumar
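A sketch of the Fk-1 × Fk-1 join in Python, assuming each frequent (k-1)-itemset is kept as a sorted tuple of items; the example L3 is hypothetical, and the prune step would follow as in the Apriori sketch above.

```python
def generate_candidates(L_prev):
    """F(k-1) x F(k-1) join: merge two frequent (k-1)-itemsets whose first
    k-2 items agree; each itemset is represented as a sorted tuple of items."""
    L_prev = sorted(L_prev)
    candidates = []
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            # item1..item(k-2) must match, and item(k-1) of p must precede that of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    return candidates

# Hypothetical frequent 3-itemsets; only ABC and ABD share their first two items
L3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D")]
print(generate_candidates(L3))  # [('A', 'B', 'C', 'D')]
```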