Association Mining (aka Affinity Grouping) and Association Rules: Basic Concepts

Author: Kathlyn Lindsey

University of Pittsburgh

Association Mining (aka Affinity Grouping)
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of entries in transaction databases, relational databases, and other information repositories.
• Applications:
  – Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples.
  – Rule form: If (Something) Then (Something Else), i.e., (Body) → (Head); [support (or coverage), confidence (or accuracy)]
  – buys(“beer”) → buys(“chips”); [0.5%, 60%]
  – major(“CS”) ^ takes(“DB”) → grade(“A”); [1%, 75%]

Industrial Engineering


Association Rules: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (e.g., purchased by a customer during a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., (78% of) people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (What the store should do to boost Maintenance Agreement sales)
  – Home Electronics ⇒ * (What other products should the store stock up on?)
  – Attached mailing in direct marketing


The Useful, the Trivial and the Inexplicable
• Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also buying Snickers, Butterfinger or M&M candy.
• Customers who purchase maintenance agreements are very likely to purchase large appliances.
• When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners.


Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(“SQLServer”) ^ buys(“DMBook”) → buys(“DBMiner”); [0.2%, 60%]
  – age(“30...39”) ^ income(“42...48K”) → buys(“PC”); [1%, 75%]
• Single dimension vs. multiple dimensional associations (see example above)
• Single level vs. multiple-level analysis
  – What brands of beers are associated with what brands of chips?
• Various extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Maxpatterns (a “frequent” pattern but one for which any superpattern is not “frequent”)
  – Closed Itemsets (a set of items such that no proper superset has the same support, i.e., no additional item appears in every transaction that contains these items)


Choosing the Correct Set of Items
• At a grocery store
  – Items might be Milk, Sugar, Fruit, Coffee, Frozen pizza
• If you’re the manager of frozen foods within the grocery store (or you run a pizzeria…)
  – Items might be Extra Cheese, Onions, Peppers, Mushrooms, Sausage


Product Hierarchies

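[Figure: product hierarchy tree running from “more general” at the top to “more detailed” at the bottom. Node labels: Frozen Foods; Frozen Desserts, Frozen Veggies, Frozen Dinners; Frozen Yogurt, Ice Cream, Frozen Fruitbars; Peas, Carrots, Mixed, Other; Vanilla, Strawberry, Chocolate, Rocky Road, Pistachio, Other; and, at the most detailed level, brands, sizes, and stock keeping units (SKUs).]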


Some Notation
• Itemset ≡ set of items
• k-itemset ≡ itemset that consists of k items
  – E.g., {beer, chips, salsa} is a 3-itemset
• Frequency (σ), sometimes also called support count or count of an itemset ≡ no. of transactions in the dataset that contain the itemset
• Support (s), sometimes called relative support ≡ the frequency divided by the total no. of transactions in the dataset
• Frequent Itemset ≡ an itemset whose support is at least some minimum support threshold min_sup (or equivalently, whose support count is at least some minimum support count threshold)
• Lk ≡ set of all frequent k-itemsets


Binary Representation
• Each row corresponds to a transaction and each column to an item
• If an item is present in the transaction the entry is 1, otherwise it is 0
• Ignores non-binary information such as quantities, prices paid, etc.

TID    A  B  C  D  E  F
2000   1  1  1  0  0  0
1000   1  0  1  0  0  0
4000   1  0  0  1  0  0
5000   0  1  0  0  1  1
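As a rough illustration of the binary representation, the sketch below encodes the four example transactions (TIDs 2000, 1000, 4000, 5000 with the item lists from the deck's "Items Bought" table) as 0/1 rows. The function name `to_binary_rows` is a hypothetical helper, not something from the slides.

```python
# Transactions from the deck's example (TID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

# One column per distinct item, in alphabetical order (A..F).
items = sorted(set().union(*transactions.values()))


def to_binary_rows(transactions, items):
    """Return {TID: [0/1, ...]} with one entry per item column (hypothetical helper)."""
    return {
        tid: [1 if item in bought else 0 for item in items]
        for tid, bought in transactions.items()
    }


binary = to_binary_rows(transactions, items)
for tid, row in binary.items():
    print(tid, *row)
```

Quantities and prices are deliberately absent from the input, which mirrors the slide's point that this representation ignores them.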


Rule Measures: Support and Confidence
[Figure: Venn diagram with regions “Customer buys beer”, “Customer buys chips”, and “Customer buys both”]

Find all the rules X ⇒ Y with minimum confidence and support
  – support, s, probability that a transaction contains {X AND Y} = σ(X&Y) ÷ |T|
  – confidence, c, conditional probability that a transaction having {X} also contains {Y}, i.e. σ(X&Y) ÷ σ(X)

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

With minimum support 50% and minimum confidence 50%, we have
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)
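The two measures can be checked by hand on the four example transactions. The following is a minimal sketch, assuming support is σ(X&Y) ÷ |T| and confidence is σ(X&Y) ÷ σ(X); the function names are illustrative only.

```python
# The four example transactions (TIDs 2000, 1000, 4000, 5000).
transactions = [
    {"A", "B", "C"},  # TID 2000
    {"A", "C"},       # TID 1000
    {"A", "D"},       # TID 4000
    {"B", "E", "F"},  # TID 5000
]


def support_count(itemset, transactions):
    """sigma: number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Relative support: support count divided by the number of transactions."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(body, head, transactions):
    """Estimate of P(head | body) = sigma(body and head) / sigma(body)."""
    return support_count(body | head, transactions) / support_count(body, transactions)


s_ac = support({"A", "C"}, transactions)
c_a_to_c = confidence({"A"}, {"C"}, transactions)
c_c_to_a = confidence({"C"}, {"A"}, transactions)
print(f"A => C: support {s_ac:.1%}, confidence {c_a_to_c:.1%}")
print(f"C => A: support {s_ac:.1%}, confidence {c_c_to_a:.1%}")
```

Both rules share the same support because they are built from the same itemset {A, C}; only the denominator of the confidence changes.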


Rule Measures: Why?
• Support: Important because a rule with low support
  – may occur just by chance
  – is probably not of much practical interest since it happens so rarely
• Confidence: Important because it
  – measures the reliability of a rule’s inference
  – estimates the conditional probability of the consequent, given the antecedent


Mining Association Rules
• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ min_sup
  2. Rule Generation
     – From each frequent itemset generate strong (high confidence) rules, i.e. those with confidence higher than some threshold.
     – Each rule is a binary partitioning of a frequent itemset
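Step 2 can be sketched as follows: every way of splitting a frequent itemset into a non-empty body and a non-empty head is one candidate rule. This is only an illustration, assuming the itemset occurs in at least one transaction and treating the threshold as inclusive (confidence ≥ min_conf); `rules_from_itemset` is a hypothetical name.

```python
from itertools import combinations

# Example transactions with items A..F (TIDs 2000, 1000, 4000, 5000).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def rules_from_itemset(itemset, transactions, min_conf):
    """List (body, head, confidence) for each binary partition meeting min_conf."""
    itemset = frozenset(itemset)

    def count(s):
        return sum(1 for t in transactions if s <= t)

    full_count = count(itemset)  # assumed > 0 because itemset is frequent
    rules = []
    for size in range(1, len(itemset)):  # body sizes 1 .. |itemset| - 1
        for body in combinations(sorted(itemset), size):
            body = frozenset(body)
            head = itemset - body
            conf = full_count / count(body)
            if conf >= min_conf:
                rules.append((set(body), set(head), conf))
    return rules


rules = rules_from_itemset({"A", "C"}, transactions, min_conf=0.5)
for body, head, conf in rules:
    print(sorted(body), "=>", sorted(head), f"{conf:.1%}")
```

A k-itemset yields 2^k − 2 such partitions, which is why rule generation is only run on itemsets that already passed the support threshold.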


How Does it Work?
• Basic Idea - Brute Force Counting!
Example: Six customers of Expedia.com have purchased the following:
  Customer 1: Flight, Cruise, Hotel
  Customer 2: Flight, Car, Hotel
  Customer 3: Package
  Customer 4: Flight, Car
  Customer 5: Flight, Hotel
  Customer 6: Flight, Cruise

Co-Occurrence Matrix

          Hotel  Car  Flight  Cruise  Package
Hotel       3     1     3       1       0
Car         1     2     2       0       0
Flight      3     2     5       2       0
Cruise      1     0     2       2       0
Package     0     0     0       0       1
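The brute-force counting behind the co-occurrence matrix can be reproduced in a few lines. This is a sketch under the assumption that the six purchases are Customer 1: Flight, Cruise, Hotel; 2: Flight, Car, Hotel; 3: Package; 4: Flight, Car; 5: Flight, Hotel; 6: Flight, Cruise. The diagonal holds each product's own count.

```python
# Purchases of the six example customers, in customer order.
purchases = [
    {"Flight", "Cruise", "Hotel"},
    {"Flight", "Car", "Hotel"},
    {"Package"},
    {"Flight", "Car"},
    {"Flight", "Hotel"},
    {"Flight", "Cruise"},
]
products = ["Hotel", "Car", "Flight", "Cruise", "Package"]


def co_occurrence(purchases, products):
    """matrix[i][j] = number of customers who bought both products[i] and products[j]."""
    return [
        [sum(1 for p in purchases if a in p and b in p) for b in products]
        for a in products
    ]


matrix = co_occurrence(purchases, products)
for name, row in zip(products, matrix):
    print(f"{name:<8}", *row)
```

The matrix is symmetric, so only the upper triangle needs to be stored or inspected.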


An Example
Min. support 33.33%
Min. confidence 50%

          Hotel  Car  Flight  Cruise  Package
Hotel       3     1     3       1       0
Car               2     2       0       0
Flight                  5       2       0
Cruise                          2       0
Package                                 1

Frequent Itemset    Support
{Hotel}             50%
{Car}               33%
{Flight}            83%
{Cruise}            33%
{Flight,Hotel}      50%
{Flight,Car}        33%
{Flight,Cruise}     33%

For rule Cruise ⇒ Flight:
Support = P(Cruise & Flight) = support({Cruise, Flight}) = 0.33
Confidence = P(Flight|Cruise) = P(Flight & Cruise) ÷ P(Cruise) = support({Cruise, Flight}) ÷ support({Cruise}) = 0.33/0.33 = 100%
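The frequent-itemset table can be reproduced by exhaustively testing every itemset. The sketch below assumes the same six Expedia purchases and reads "33.33%" as exactly 1/3, using `Fraction` so that 2/6 is not lost to rounding; the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations

purchases = [
    {"Flight", "Cruise", "Hotel"},
    {"Flight", "Car", "Hotel"},
    {"Package"},
    {"Flight", "Car"},
    {"Flight", "Hotel"},
    {"Flight", "Cruise"},
]
min_support = Fraction(1, 3)  # assumption: 33.33% means one third


def frequent_itemsets(purchases, min_support):
    """Brute force: test every non-empty itemset against min_support."""
    items = sorted(set().union(*purchases))
    n = len(purchases)
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for p in purchases if set(combo) <= p)
            if Fraction(count, n) >= min_support:
                found[frozenset(combo)] = Fraction(count, n)
    return found


freq = frequent_itemsets(purchases, min_support)
for itemset, s in freq.items():
    print(sorted(itemset), f"{float(s):.0%}")

# Confidence of Cruise => Flight = support({Cruise, Flight}) / support({Cruise})
conf_cruise_flight = freq[frozenset({"Cruise", "Flight"})] / freq[frozenset({"Cruise"})]
print("confidence(Cruise => Flight) =", conf_cruise_flight)
```

With five products this loop visits only 31 itemsets; the next slides explain why the same approach does not scale.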


Frequent Itemset Generation
[Figure: itemset lattice over the items A, B, C, D, E, from null at the top through all 1-, 2-, 3- and 4-itemsets down to ABCDE. Source: Tan, Steinbach and Kumar]

Frequent itemset generation is computationally expensive
Given d items, there are $2^d - 1$ possible candidate itemsets


Computational Complexity
• Given d unique items:
  – Total number of itemsets = $2^d - 1$
  – Total number of possible association rules:

$$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$$

If d=6, R = 602 rules. If d=100, R = $5.15 \times 10^{47}$ rules!
Source: Tan, Steinbach and Kumar
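The rule count can be sanity-checked numerically. The sketch below evaluates the double sum directly (choose k items for the body, then a non-empty head from the remaining d − k items) and compares it with the closed form 3^d − 2^(d+1) + 1; the function names are illustrative.

```python
from math import comb


def itemset_count(d):
    """Number of non-empty itemsets over d items."""
    return 2**d - 1


def rule_count_by_sum(d):
    """Double sum: k-item body, then a non-empty head of j items from the remaining d-k."""
    return sum(
        comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d)
    )


def rule_count_closed_form(d):
    """Closed form 3^d - 2^(d+1) + 1."""
    return 3**d - 2 ** (d + 1) + 1


print(itemset_count(5))               # size of the A..E lattice, excluding null
print(rule_count_by_sum(6))           # rules for d = 6
print(f"{float(rule_count_closed_form(100)):.2e}")  # rules for d = 100
```

Python integers are arbitrary precision, so the d = 100 value is computed exactly before being formatted.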


Key to Mining Frequent Itemsets: The Apriori Principle
• A subset of a frequent itemset must also be a frequent itemset:
  – i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – As a corollary, if an itemset is infrequent then so are all of its supersets
• Holds because of the following anti-monotone property: ∀X,Y : (X⊆Y) ⇒ s(X)≥s(Y)
• Iteratively find frequent itemsets L1,L2,…,Lk with cardinality from 1 to k (k-itemset), from candidate itemsets C1,C2,…,Ck respectively.
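The anti-monotone property can be verified exhaustively on a small dataset. This sketch reuses the four example transactions with items A to F and checks s(X) ≥ s(Y) for every pair with X ⊆ Y; it is a numerical check on one dataset, not a proof.

```python
from itertools import combinations

transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]
items = sorted(set().union(*transactions))


def support(itemset):
    """Relative support of itemset in the example transactions."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Every non-empty itemset over the six items (2^6 - 1 = 63 of them).
all_itemsets = [
    frozenset(c) for k in range(1, len(items) + 1) for c in combinations(items, k)
]

# Collect any pair X subset-of Y where the subset has strictly lower support.
violations = [
    (x, y) for x in all_itemsets for y in all_itemsets
    if x <= y and support(x) < support(y)
]
print("itemsets checked:", len(all_itemsets), "violations:", len(violations))
```

The property holds in general because any transaction that contains Y necessarily contains every subset X of Y.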


Frequent Itemset Generation via Apriori Principle
[Figure: itemset lattice over the items A, B, C, D, E. Once an itemset is found to be infrequent, all of its supersets are pruned from the lattice. Source: Tan, Steinbach and Kumar]


Illustrating Apriori Principle
Minimum Support = 3

C1: Items (1-itemsets)
Item      Count
Bread     4
Coke      2
Milk      4
Beer      3
Diaper    4
Eggs      1

C2: Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs)
Itemset            Count
{Bread,Milk}       3
{Bread,Beer}       2
{Bread,Diaper}     3
{Milk,Beer}        2
{Milk,Diaper}      3
{Beer,Diaper}      3

C3: Triplets (3-itemsets)
Itemset                  Count
{Bread,Milk,Diaper}      3

If every subset is considered, $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 41$
With support-based pruning, 6 + 6 + 1 = 13
Source: Tan, Steinbach and Kumar
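The two candidate totals (41 versus 13) can be recomputed from the counts shown in the C1 and C2 tables alone. The sketch below assumes "Minimum Support = 3" means a support count of at least 3 and uses only the counts printed on the slide, since the underlying transactions are not listed here.

```python
from itertools import combinations
from math import comb

min_count = 3  # assumption: an itemset is frequent when its count is >= 3

# Counts copied from the C1 and C2 tables.
item_counts = {"Bread": 4, "Coke": 2, "Milk": 4, "Beer": 3, "Diaper": 4, "Eggs": 1}
pair_counts = {
    frozenset({"Bread", "Milk"}): 3,
    frozenset({"Bread", "Beer"}): 2,
    frozenset({"Bread", "Diaper"}): 3,
    frozenset({"Milk", "Beer"}): 2,
    frozenset({"Milk", "Diaper"}): 3,
    frozenset({"Beer", "Diaper"}): 3,
}

# Without pruning: every 1-, 2- and 3-itemset over the 6 items is a candidate.
brute_force = sum(comb(len(item_counts), k) for k in (1, 2, 3))

# With pruning: pairs are built only from frequent items,
# triplets only when all three of their pairs are frequent.
frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
c2 = [frozenset(p) for p in combinations(frequent_items, 2)]
frequent_pairs = {p for p in c2 if pair_counts[p] >= min_count}
c3 = [
    frozenset(t)
    for t in combinations(frequent_items, 3)
    if all(frozenset(p) in frequent_pairs for p in combinations(t, 2))
]
pruned = len(item_counts) + len(c2) + len(c3)

print("brute force:", brute_force, "with pruning:", pruned)
print("C3:", [sorted(c) for c in c3])
```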


The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
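The pseudo-code above can be turned into a short runnable sketch. This version makes two simplifying assumptions that are not in the slides: the threshold is an absolute support count (`min_count`), and the join simply takes unions of two frequent k-itemsets that have k+1 items, rather than an ordered prefix join. It is run on the six Expedia purchases from the earlier example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset (sketch)."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: unions of two frequent k-itemsets that contain k+1 items.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Count each surviving candidate against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


purchases = [
    {"Flight", "Cruise", "Hotel"},
    {"Flight", "Car", "Hotel"},
    {"Package"},
    {"Flight", "Car"},
    {"Flight", "Hotel"},
    {"Flight", "Cruise"},
]
result = apriori(purchases, min_count=2)  # 2 of 6 customers = 33.33%
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this data all three candidate 3-itemsets are pruned before any counting, because each contains an infrequent pair.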


How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1 (Fk-1×Fk-1 method): self-joining Lk-1
  – insert into Ck from Lk-1 (p), Lk-1 (q)
  – identify all cases where item1(p)=item1(q), item2(p)=item2(q), …, itemk-2(p)=itemk-2(q) and itemk-1(p) < itemk-1(q)
[…]
• […] in the rule consequent
• join(CD=>AB,BD=>AC) would produce the candidate rule D => ABC
• Prune rule D=>ABC if its subset AD=>BC does not have high confidence
Source: Tan, Steinbach and Kumar
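Returning to the Lk-1 self-join for candidate itemsets, a minimal sketch is given below. It assumes each frequent (k-1)-itemset is stored as a sorted tuple, joins two tuples when they agree on their first k-2 items, and then prunes any candidate with an infrequent (k-1)-subset. The function name `join_and_prune` and the example input (the four pairs with count ≥ 3 from the Bread/Milk/Diaper illustration) are choices made for this example.

```python
from itertools import combinations


def join_and_prune(frequent_prev):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    candidates = []
    for p, q in combinations(prev, 2):
        # Join: the first k-2 items agree and the last item of p precedes that of q.
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            cand = p + (q[-1],)
            # Prune: every (k-1)-subset of the candidate must itself be frequent.
            if all(sub in prev_set for sub in combinations(cand, len(p))):
                candidates.append(cand)
    return candidates


# Frequent pairs (count >= 3) from the Bread/Milk/Diaper illustration, as sorted tuples.
frequent_pairs = [
    ("Bread", "Milk"),
    ("Bread", "Diaper"),
    ("Diaper", "Milk"),
    ("Beer", "Diaper"),
]
c3 = join_and_prune(frequent_pairs)
print(c3)
```

Requiring a shared prefix and an ordered last item means each candidate is produced exactly once, instead of once per pair of its subsets.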
