Chapter 8, Sequence Data Mining

Author: Marlene Horn

11/8/2016

CSI 4352, Introduction to Data Mining

Chapter 8, Sequence Data Mining

Young-Rae Cho, Associate Professor, Department of Computer Science, Baylor University

Topics

Single Sequence Mining
• Frequent sequence pattern mining
  • Finding sub-sequences that frequently occur in a sequence
  • Example: ACTTCGATGGAGCCAGTCGCGAAATTCGACTAGATCG

Sequence Dataset Mining
• Frequent sequence pattern mining
  • Finding sub-sequences that frequently occur among sequences
• Sequence data clustering
  • Grouping similar sequences
• Sequence data classification
  • Classifying a new sequence

id  sequence
1   ACTTCG
2   GGAGC
3   CGTCGAT
4   ATCGATCGC
5   TCGACT
6   AGATCGC
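As a rough illustration of frequent pattern mining in a single sequence, the sketch below counts contiguous windows of a fixed length in the DNA string from the slide. Note the simplification: it only looks at consecutive windows (substrings), which is a narrower case than general sub-sequences. The function name `frequent_kmers` and the parameters `k` and `min_count` are assumptions made for this example.

```python
from collections import Counter

def frequent_kmers(sequence: str, k: int, min_count: int) -> dict[str, int]:
    """Count every contiguous length-k window; keep those seen at least min_count times."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# The single-sequence example from the slide.
dna = "ACTTCGATGGAGCCAGTCGCGAAATTCGACTAGATCG"
result = frequent_kmers(dna, k=3, min_count=3)
```

Here `result` maps each length-3 window that occurs at least three times to its count; for instance, the window `TCG` is among them.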

Applications

Examples
• Customer shopping sequential patterns
  • e.g., first buy a computer, then a CD-ROM, and then a printer within 3 months
• Stock market changes
• Web-log patterns
• Medical treatment records
• Gene or protein sequences
• Natural disaster records

Challenges
• Finding the complete set of patterns satisfying the minimum support threshold
• Developing efficient and scalable algorithms
• Incorporating various kinds of user-specific constraints

Problem Definition

Scope
• Transaction data → Sequential transaction data
• Frequent itemset patterns → (Frequent) Sequential patterns

Definitions
• Sequence: an ordered list of items, e.g., x =
• Sub-sequence of x: an ordered sequence of items from x
  • Not necessarily consecutive (different from a substring)
• Frequent sequence pattern
  • < a c d > is a frequent sequence pattern with 75% minimum support
• Sequence database with SIDs 10, 20, 30 and 40
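The definitions of sub-sequence and support can be made concrete with a small sketch. The four-sequence database `db` below is hypothetical (it is not the database from the slide), chosen so that < a c d > reaches 75% support; the function names are also assumptions.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must appear after earlier ones.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of database sequences that contain pattern as a sub-sequence."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

# Hypothetical database of four sequences over the items a, b, c, d.
db = ["abcd", "acbd", "bacd", "dcba"]
acd_support = support("acd", db)   # three of the four sequences contain a, c, d in order
```

With a minimum support of 75%, < a c d > is frequent in this toy database, while < a b > (contained in only two sequences) is not.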

Extended Problem Definition

Extended Definitions
• Consider “or” at a single position
• Sequence: an ordered list of elements, where each element is a set of items, e.g., x =
• Sub-sequence of x
• Frequent sequence pattern
  • is a frequent sequence pattern with 75% minimum support
• Sequence database with SIDs 10, 20, 30 and 40
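Under the extended definition, a pattern element matches a sequence element when it is a subset of it. The sketch below checks containment with a greedy left-to-right scan; the sequence `x` is an illustrative assumption, not the one from the slide, and the function name `contains` is hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern is a sub-sequence of sequence, where both are lists of item sets.

    Each pattern element must be a subset of a distinct sequence element,
    and the matched elements must keep their order.
    """
    pos = 0
    for element in pattern:
        # Skip sequence elements that do not include every item of this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True

# Illustrative sequence: <a (abc) (ac) d (cf)>
x = [set("a"), set("abc"), set("ac"), set("d"), set("cf")]
```

For example, `contains(x, [set("a"), set("bc"), set("d")])` holds, while `contains(x, [set("d"), set("a")])` does not because no element after d contains a.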

Properties of Sequence Patterns

Properties
• Anti-monotonic property → Apriori algorithm
• If a sequence S is not frequent, then none of the super-sequences of S are frequent

Example (min_sup = 75%)
• If is infrequent, so are and and
• Sequence database with SIDs 10, 20, 30 and 40
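The anti-monotonic property can be written as an inequality. In the sketch below, the notation is introduced here for illustration and does not appear on the slide: sup(·) denotes support and S ⊑ S' means that S is a sub-sequence of S'.

```latex
S \sqsubseteq S' \;\Longrightarrow\; \mathrm{sup}(S') \le \mathrm{sup}(S)
\qquad\text{hence}\qquad
\mathrm{sup}(S) < \mathit{min\_sup} \;\Longrightarrow\; \mathrm{sup}(S') < \mathit{min\_sup}
```

The first implication holds because every database sequence that contains S' also contains S; the second is what allows Apriori-style pruning of super-sequences.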

GSP Algorithm

GSP
• Generalized Sequential Pattern mining

Algorithm
(1) Initially, find all frequent length-1 sequences (= frequent 1-itemsets)
(2) Generate candidate length-(k+1) sequences from frequent length-k sequences
(3) Count support for each candidate sequence to select frequent sequences
(4) Repeat (2) and (3) until no frequent sequence or no candidate is found
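The four steps can be sketched for the simple case in which every element is a single item. This is a simplified illustration under that assumption, not the full GSP of Srikant and Agrawal (which also handles itemset elements and time constraints); the function names and the toy database are hypothetical, and `min_sup` is assumed to be a positive fraction.

```python
from itertools import product

def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def gsp(database, min_sup):
    """Level-wise mining of frequent sequences of single items; patterns are tuples."""
    min_count = min_sup * len(database)

    def frequent(candidates):
        # Step (3): one pass over the database per level to count support.
        return {c for c in candidates
                if sum(is_subsequence(c, s) for s in database) >= min_count}

    # Step (1): frequent length-1 sequences.
    level = frequent({(item,) for s in database for item in s})
    result = set(level)
    while level:  # Step (4): stop when no frequent sequence or candidate remains.
        # Step (2): join a and b when a without its first item equals b without its last.
        candidates = {a + b[-1:] for a, b in product(level, repeat=2) if a[1:] == b[:-1]}
        # Apriori pruning: every sub-sequence obtained by dropping one item must be frequent.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in level for i in range(len(c)))}
        level = frequent(candidates)
        result |= level
    return result

# Hypothetical database; with min_sup = 75% a pattern needs 3 of the 4 sequences.
db = ["abcd", "acbd", "bacd", "dcba"]
patterns = gsp(db, 0.75)
```

On this toy database the only frequent length-3 sequence is < a c d >, reached through the frequent length-2 sequences < a c > and < c d >.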

Length-1 Sequences

Example (min_sup = 75%)
• Sequence database with SIDs 10, 20, 30 and 40
• Candidate Length-1 Sequences
• Frequent Length-1 Sequences

Length-2 Sequences
• Candidate & Frequent Length-2 Sequences, from the sequence database with SIDs 10, 20, 30 and 40

Length-3 Sequences
• Candidate & Frequent Length-3 Sequences, from the same sequence database

Summary of GSP Algorithm

Strength
• Apriori pruning

Weakness
• Generates a huge set of candidate sequences
• Requires multiple scans of the database
• Inefficient for mining long sequential patterns
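To give a sense of scale for the candidate explosion, the sketch below counts length-2 candidates under the extended definition, where two items may form either an ordered pair of elements or a single two-item element. The function name is hypothetical, and the figure of 1000 frequent items is only an illustrative assumption.

```python
def num_length2_candidates(n: int) -> int:
    """Length-2 candidates generated from n frequent items.

    n * n ordered sequences <x y> (including <x x>), plus
    n * (n - 1) / 2 single-element sequences <(x y)> with x != y.
    """
    return n * n + n * (n - 1) // 2

candidates_for_1000_items = num_length2_candidates(1000)  # 1,000,000 + 499,500
```

Each of these candidates must then be counted against the database, which is why the number of scans and the candidate set size dominate the cost.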

References
• Agrawal, R. and Srikant, R., “Mining sequential patterns.” In Proceedings of ICDE (1995)
• Srikant, R. and Agrawal, R., “Mining sequential patterns: Generalizations and performance improvements.” In Proceedings of EDBT (1996)

PrefixSpan Algorithm

PrefixSpan
• Prefix-projected sequential pattern mining

Prefix
• Suppose all items in an element are listed alphabetically.
• Given x = <e1 e2 … en> and y = <e'1 e'2 … e'm> (m ≤ n), y is a prefix of x if and only if
  (1) ei = e'i for i ≤ m−1
  (2) e'm ⊆ em
  (3) all items in (em − e'm) are alphabetically after those in e'm
• , , and are prefixes of

Main Idea
• Keep track of prefixes (instead of all candidate sequences) from the sequence database
• Project their suffixes into projected databases
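The three prefix conditions translate directly into a small check. In this sketch a sequence is a list of elements and each element is a tuple of single-character items; the sequence `x` is an illustrative assumption (not taken from the slide), elements are assumed non-empty, and the function name `is_prefix` is hypothetical.

```python
def is_prefix(y, x):
    """Check whether y is a prefix of x under the three conditions of the definition."""
    m, n = len(y), len(x)
    if m == 0 or m > n:
        return False
    # (1) the first m-1 elements must be identical
    if any(set(y[i]) != set(x[i]) for i in range(m - 1)):
        return False
    last_y, last_x = set(y[-1]), set(x[m - 1])
    # (2) the last element of y must be a subset of the m-th element of x
    if not last_y <= last_x:
        return False
    # (3) the leftover items must come alphabetically after every item of y's last element
    return all(item > max(last_y) for item in last_x - last_y)

# Illustrative sequence: <a (abc) (ac) d (cf)>
x = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]
```

With this `x`, both < a a > and < a (ab) > pass the check, while < a b > fails condition (3) because the leftover item a does not come after b.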

Projected Database

Definition
• The set of suffixes of the maximal sub-sequences that have a given prefix

Example
• The projected database of a prefix, taken from the sequence database with SIDs 10, 20, 30 and 40
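Building a projected database can be sketched for the simple case of single-item elements, with each sequence stored as a string. The database `db` is hypothetical, the function name `project` is an assumption, and empty suffixes are dropped here as a design choice since they cannot extend a pattern.

```python
def project(database, prefix):
    """Return the suffix of each sequence after the earliest match of prefix."""
    projected = []
    for seq in database:
        pos = 0
        for item in prefix:
            # str.find returns -1 when the item is missing, which makes pos == 0.
            pos = seq.find(item, pos) + 1
            if pos == 0:
                break  # this sequence does not contain the prefix
        else:
            if seq[pos:]:
                projected.append(seq[pos:])
    return projected

# Hypothetical database of four sequences over the items a, b, c, d.
db = ["abcd", "acbd", "bacd", "dcba"]
a_projected = project(db, "a")
ac_projected = project(db, "ac")
```

Matching the prefix at its earliest position keeps each suffix as long as possible, so no later occurrence of an item is lost.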

PrefixSpan Algorithm

Algorithm
(1) Find all frequent length-1 sequences
(2) Divide the search space by prefix: for each frequent length-1 sequence, the maximal sub-sequences having it as a prefix
(3) Construct a projected database for each search space
(4) Find subsets of sequential patterns recursively
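The recursion in steps (1) to (4) can be sketched as follows for sequences of single items stored as strings. This is a minimal illustration under that assumption, not the full algorithm of Pei et al. (which also grows patterns within an itemset element); the function names and the toy database are hypothetical.

```python
from collections import Counter

def prefixspan(database, min_sup):
    """Return {pattern: support count} for all frequent sequences of single items."""
    min_count = min_sup * len(database)
    patterns = {}

    def mine(prefix, projected):
        # Count each item at most once per suffix in the projected database.
        counts = Counter(item for suffix in projected for item in set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + item
            patterns[pattern] = count
            # Project again: keep what follows the first occurrence of item.
            next_db = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_db)

    mine("", list(database))
    return patterns

# Hypothetical database; with min_sup = 75% a pattern needs 3 of the 4 sequences.
db = ["abcd", "acbd", "bacd", "dcba"]
found = prefixspan(db, 0.75)
```

No candidate sequences are generated: each recursive call only counts the items that actually occur in its projected database.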

PrefixSpan Algorithm

Example (min_sup = 75%)
• Sequence database with SIDs 10, 20, 30 and 40
• Tables of patterns and their projected databases, built recursively from each frequent prefix

Summary of PrefixSpan Algorithm

Strength
• Efficient (the major cost is constructing projected databases)
• Projected databases keep shrinking rapidly

Weakness
• Searches frequencies redundantly

References
• Pei, J., et al., “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth.” In Proceedings of ICDE (2001)

Constraint-Based Sequence Mining

Constraint Types
• Prefix anti-monotonic
  • If a sequence s violates the constraint C, then so does any sequence having s as a prefix
• Prefix monotonic
  • If a sequence s satisfies the constraint C, then so does any sequence having s as a prefix

Application
• Push prefix anti-monotonic constraints for early pruning
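Pushing a prefix anti-monotonic constraint into prefix-growth mining can be sketched as below: as soon as a pattern violates the constraint, the branch is abandoned, since every sequence with that prefix would violate it too. The miner is a simplified single-item version, and the function names, the `satisfies` callback, the maximum-length constraint and the toy database are all assumptions for illustration.

```python
from collections import Counter

def constrained_prefixspan(database, min_count, satisfies):
    """Prefix-growth mining that stops growing any pattern violating the constraint.

    `satisfies` must be prefix anti-monotonic for the pruning to be safe.
    """
    patterns = {}

    def grow(prefix, projected):
        counts = Counter(item for suffix in projected for item in set(suffix))
        for item, count in sorted(counts.items()):
            pattern = prefix + item
            # Early pruning: skip infrequent patterns and constraint violations alike.
            if count < min_count or not satisfies(pattern):
                continue
            patterns[pattern] = count
            grow(pattern, [s[s.index(item) + 1:] for s in projected if item in s])

    grow("", list(database))
    return patterns

# Hypothetical database and constraint: patterns may have at most two items.
db = ["abcd", "acbd", "bacd", "dcba"]
short_patterns = constrained_prefixspan(db, min_count=3, satisfies=lambda p: len(p) <= 2)
```

A maximum-length constraint is prefix anti-monotonic because extending a pattern that is already too long can never make it shorter.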

Reference
• Pei, J., Han, J. and Wang, W., “Mining sequential patterns with constraints in large databases.” In Proceedings of CIKM (2002)

Questions?

Lecture slides are available on the course website: www.ecs.baylor.edu/faculty/cho/4352
