Mining Periodic Frequent Patterns using Period Summary and Map-Reduce


Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science and Engineering by Research

by Anirudh Alampally 201202173 [email protected]

International Institute of Information Technology Hyderabad - 500 032, INDIA Dec 2017

Copyright © Anirudh Alampally, 2017. All Rights Reserved.

Dedicated to my beloved family

Acknowledgments

For acquainting me with research and helping me figure out how to grow as a researcher, I shall be everlastingly grateful to Prof. P. Krishna Reddy. I thank him for the continuous moral support when I needed it the most. His rigorous process of reviewing research papers before submitting them to conferences taught me how to present my ideas. I express my gratitude to Dr. Uday Kiran from the University of Tokyo, who has put great effort into providing guidance and shaping my research. The diverse knowledge he has in various areas helped me understand a lot of things. I am grateful to Venkatesh for the many productive discussions we had, and to Kumara Swamy, Amulya, Raghav, Kavya and Mamatha for helping me write research papers. I am indebted to my friend Pujitha for her moral support and encouragement even in times of despair. I thank Murali, Sankeerth, Alok, Ashrith, Jayanth, and Chetan for the wonderful memories they gave me, especially the FIFA sessions. I would like to thank everyone in IIIT who helped me make numerous memories which I will cherish forever. Finally, I would like to thank God for giving me his blessings and strength at all times.


Abstract

The field of data mining has evolved to help organizations find interesting knowledge from databases. Under data mining, different knowledge extraction approaches like association rules, classification and clustering have been proposed. Association rules are generated by analyzing transactional databases using the support and confidence interestingness measures to identify the most important relationships among the items. Discovering frequent patterns is a key step in association rule mining. A pattern is called a frequent pattern if it satisfies the user-defined minimum support threshold. Periodic-frequent patterns belong to the class of frequent patterns and capture the temporal dimension of a pattern. In this thesis, we made efforts to improve the performance of mining periodic-frequent patterns. Periodic-frequent patterns are an important class of regularities that exist in a transactional database. A pattern is called periodic-frequent if it satisfies the thresholds of minimum support and maximum periodicity. Here, support is the frequency and periodicity is the maximum inter-arrival time of the pattern. In the literature, an algorithm called periodic-frequent pattern growth (PF-growth) was proposed, which extracts the complete set of periodic-frequent patterns from transactional databases. Here, the transactional database is initially compressed into a prefix-tree like structure called the periodic-frequent pattern tree (PF-tree), which is recursively mined to extract the patterns. The PF-tree explicitly maintains the transaction identifier information of each transaction. This enables the calculation of periodicity, but it also increases the PF-tree size. In recent times, the amount of data generated by e-Commerce and social-networking sites is very large. Mining periodic-frequent patterns from these datasets requires large amounts of main memory and fast processing power. We propose the following two approaches to overcome these issues: (i) mining periodic-frequent patterns using the notion of period summary and (ii) a Map-Reduce based approach to extract periodic-frequent patterns. In the first approach, a novel compression technique called period summary is introduced to reduce the amount of memory consumed to extract periodic-frequent patterns. It is based on the observation that the periodic-frequent patterns can also be extracted by storing period summaries instead of transaction identifiers in the tree structure. The performance is improved as the memory required to store the period summaries is significantly less than the memory required to store the list of transaction identifiers. The process of merging period summaries is also introduced to find the support and periodicity values of a candidate pattern. Apart from memory optimization, the time consumed has also been lowered. This is because the linear time complexity for finding a candidate pattern's periodicity using a list of

transaction identifiers has been reduced to a constant-time operation using period summaries. Experimental results on three real-world datasets show that the proposed approach using period summary performs better than the existing approach both in terms of memory and time. Further, in the second approach, a Map-Reduce based approach to extract periodic-frequent patterns is proposed. As real-world datasets are very large, extraction of periodic-frequent patterns on a single machine takes a lot of time and, in some cases, becomes difficult because of memory constraints. The proposed approach is scalable and utilizes multiple machines by considering the Map-Reduce framework. It consists of two Map-Reduce phases. In the first phase, all one-sized periodic-frequent patterns are derived, and in the second phase, independent trees are constructed on different machines. These independent trees are mined in parallel on different machines to find the complete set of periodic-frequent patterns. Further, we propose the notion of partition summary to reduce the amount of data shuffled among the machines, which improves the performance. Experiments on Apache Spark's distributed environment show that the proposed approach speeds up as the number of machines increases. Also, the notion of partition summary significantly reduces the amount of data shuffled among the machines. Overall, the process of mining periodic-frequent patterns from transactional databases has been made efficient, both in terms of memory and time. This paves the way for various e-Commerce and social-networking sites to use the knowledge of periodic-frequent patterns in areas such as customer relation management, inventory management and recommendation systems.

Contents

1 Introduction
  1.1 Periodic frequent patterns
  1.2 Performance issues with the existing approach
  1.3 Overview of the proposed approaches
    1.3.1 Mining periodic frequent patterns using period summary
    1.3.2 Map-Reduce based approach to mine periodic frequent patterns
  1.4 Contributions of the thesis
  1.5 Organization of the thesis

2 Related Work
  2.1 Frequent patterns
  2.2 Periodic frequent patterns
  2.3 Parallel mining of frequent patterns
  2.4 How proposed approach is different
  2.5 Summary

3 Background
  3.1 Frequent pattern mining using FP-growth
    3.1.1 Structure of FP-tree
    3.1.2 Construction of FP-tree
    3.1.3 Mining of FP-tree
  3.2 Model of periodic-frequent patterns
  3.3 Periodic frequent pattern growth
    3.3.1 Structure of PF-tree
    3.3.2 Construction of PF-tree
    3.3.3 Mining of PF-tree
  3.4 Map-Reduce framework
    3.4.1 Combine function
  3.5 Parallel FP-growth using Map-Reduce
  3.6 Summary

4 Mining Periodic Frequent Patterns Using Period Summary
  4.1 Motivation
  4.2 Basic Idea - Period Summary
  4.3 Proposed approach - Period summary growth
    4.3.1 PS-Tree - Structure and Construction
      4.3.1.1 Structure of PS-Tree
      4.3.1.2 Construction of PS-tree
    4.3.2 Mining of PS-tree
  4.4 Experimental results
    4.4.1 Datasets description
    4.4.2 Performance evaluation
    4.4.3 Scalability performance
    4.4.4 Discussion
  4.5 Summary

5 Map-Reduce based approach to mine Periodic Frequent Patterns
  5.1 Motivation
  5.2 Basic Idea
    5.2.1 Map-Reduce based periodic-frequent pattern mining
    5.2.2 Improvement using the notion of Partition summary
  5.3 Proposed approach
    5.3.1 First Map-Reduce phase - Finding one sized periodic-frequent patterns
    5.3.2 Second Map-Reduce phase - Construction of the PPF-tree
    5.3.3 Mining of PPF-trees
  5.4 Improving the performance using the notion of partition summary
  5.5 Experimental results
    5.5.1 Datasets description
    5.5.2 Performance evaluation
  5.6 Summary

6 Conclusions and Future work

7 Related Publications

Bibliography

List of Figures

1.1 Application of Frequent patterns in eCommerce - Amazon
3.1 A sample dataset and corresponding ordered frequent itemsets. (Image taken as a reference from paper [18])
3.2 Constructed FP-tree. (Image taken as a reference from paper [18])
3.3 Construction of PF-list. (a) After scanning first transaction (b) After scanning second transaction (c) After scanning all transactions (d) Final sorted list of periodic-frequent items of size 1.
3.4 Construction of PF-tree. (a) PF-list (b) After scanning first transaction (c) After scanning second transaction (d) After scanning all transactions.
3.5 Mining using PFP-growth algorithm. (a) PF-tree after removing ‘b’ (b) Prefix tree of ‘b’ (c) Conditional Tree of ‘b’.
3.6 Mining using PFP-growth algorithm. (a) PF-tree after removing ‘d’ (b) Prefix tree of ‘d’ (c) Conditional Tree of ‘d’.
3.7 Map-Reduce framework
3.8 Map-Reduce execution model using combine function
4.1 Occurrence timeline for pattern ‘cad’
4.2 Occurrence timeline for pattern ‘cad’ using the notion of period summary
4.3 Construction of PS-list. (a) After scanning first transaction (b) After scanning second transaction (c) After scanning all transactions (d) Final sorted list of periodic-frequent items of size 1.
4.4 Construction of PS-Tree. (a) PS-list (b) After scanning first transaction (c) After scanning second transaction and (d) After scanning all transactions.
4.5 Mining using PS-growth algorithm. (a) PS-Tree after removing ‘b’ (b) Prefix tree of ‘b’ (c) Conditional Tree of ‘b’.
4.6 Mining using PS-growth algorithm. (a) PS-Tree after removing ‘d’ (b) Prefix tree of ‘d’ (c) Conditional Tree of ‘d’.
4.7 Comparative analysis by varying maxPer. Figures (1) - (3) show maxPer vs Number of patterns, Figures (4) - (6) show maxPer vs Tree memory, Figures (7) - (9) show maxPer vs Total memory and Figures (10) - (12) show maxPer vs Time
4.8 Comparative analysis by varying minSup. Figures (1) - (3) show minSup vs Number of patterns, Figures (4) - (6) show minSup vs Tree memory, Figures (7) - (9) show minSup vs Total memory and Figures (10) - (12) show minSup vs Time
4.9 Scalability of PS-tree (Memory requirements)
5.1 Single machine
5.2 Tree distributed between two machines
5.3 First phase of parallel periodic frequent pattern mining
5.4 Second phase of parallel periodic frequent pattern mining
5.5 Construction of PPF-list (a) tid-lists of all the items (b) support of each item (c) periodicity of each item
5.6 Construction of PF-list on a single machine. (a) After scanning first transaction (b) After scanning second transaction (c) After scanning all transactions (d) Final sorted list of periodic-frequent items of size 1.
5.7 Construction of PPF-tree in partition 0 (a) tree after scanning the first transaction (b) tree after scanning the second transaction (c) tree after scanning the eighth transaction and (d) tree after scanning the entire database
5.8 Construction of PPF-tree in partition 1 (a) tree after scanning the first transaction (b) tree after scanning the second transaction (c) tree after scanning the eighth transaction and (d) tree after scanning the entire database
5.9 Construction of PPF-tree (a) tree at partition-id 0 (b) tree at partition-id 1
5.10 Construction of PF-tree on a single machine (a) After scanning the first transaction (b) After scanning the second transaction (c) After scanning all transactions.
5.11 Mining using PPF-growth algorithm for suffix item ‘a’ in partition 1. (a) Prefix tree of ‘a’, PTa (b) Conditional tree of ‘a’, CTa (c) PPF-tree after removing ‘a’.
5.12 Mining using PPF-growth algorithm for suffix item ‘d’ in partition 0. (a) Prefix tree of ‘d’, PTd (b) Conditional tree of ‘d’, CTd (c) PPF-tree after removing ‘d’.
5.13 Mining using PF-growth algorithm for suffix item ‘a’. (a) Prefix tree of ‘a’, PTa (b) Conditional tree of ‘a’, CTa (c) PF-tree after removing ‘a’.
5.14 Mining using PF-growth algorithm for suffix item ‘d’. (a) Prefix tree of ‘d’ (b) Conditional Tree of ‘d’ (c) PF-tree after removing ‘d’.
5.15 Construction of PPF-list using partition summaries in partition 0 (a) after scanning first transaction (b) after scanning second transaction (c) after scanning all the transactions in partition 0 (eight transactions)
5.16 Construction of PPF-list using partition summaries in partition 1 (a) after scanning ninth transaction (b) after scanning tenth transaction (c) after scanning all the transactions in partition 1 (seven transactions)
5.17 Merging of the partition summaries (a) Merged partition summaries (b) PPF-list sorted based on support (c) PPF-list after filtering the items which failed to satisfy minSup and maxPer constraints
5.18 Number of Machines vs Time Consumed for different datasets at different threshold values
5.19 Total amount of data shuffled for different datasets with and without PS

List of Tables

3.1 Nomenclature of various terms used in periodic-frequent pattern model
3.2 A running example of a transactional database with continuation in Chapter 4
3.3 Patterns generated using PFP-growth Approach for transactional database in Table 3.2
4.1 Patterns generated using PS-growth Approach
4.2 Datasets description
5.1 A running example of a Transactional Database
5.2 Patterns generated using PFP-growth Approach for transactional database in Table 5.1
5.3 Datasets description

Chapter 1

Introduction

Data mining (also known as knowledge discovery) is the extraction of hidden pattern information from large datasets using methods at the intersection of database systems, artificial intelligence, and statistics. Modern organizations view data mining as an important framework to transform data into intelligence. Currently, data mining is being used in a wide range of industry applications, such as customer relation management, marketing, surveillance, fraud detection and scientific discovery. The process of data mining mainly involves the extraction of association rules, finding clusters, regression, dimensionality reduction, classification, etc. Several efforts [15, 32, 9, 37, 36] have been made in recent years in these fields to come up with efficient methods to extract interesting knowledge.

Association rules are generated by analyzing transactional databases using the support and confidence interestingness measures to identify the most important relationships among the items. Discovering frequent patterns is a key step in association rule mining. The process of extracting frequent patterns from transactional databases is known as frequent pattern mining. Frequent pattern mining is an important data mining task for various decision making techniques. The process of frequent pattern extraction finds interesting information about the associations among the items in a transactional database that co-occur frequently. Apart from association rule discovery [2], it has also been used in different techniques such as sequential pattern mining [4], correlation analysis, classification and clustering. A classic example of a frequent pattern is ‘sweater’ and ‘gloves’: both these items occur together frequently in an apparel store’s transactional database. This kind of knowledge is helpful for proper inventory management for vendors and also on e-Commerce websites. Figure 1.1 shows the application of frequent patterns on the Amazon website, which recommends to users that the items Moto G5, a tempered glass for the Moto G5 and a back cover case for the Moto G5 are frequently bought together by customers.

The process of frequent pattern mining does not take into consideration the temporal occurrence information of the patterns. So, the temporal occurrences of frequent patterns have been exploited as an interestingness measure to extract a class of user-interest based frequent patterns, called periodic-frequent patterns. A pattern is called periodic-frequent if it satisfies the thresholds of minimum support and maximum periodicity. Minimum support controls the minimum number of transactions in which the pattern should be present, and maximum periodicity controls the maximum time difference between two consecutive transactions in which the pattern is present. In this thesis, we have put efforts to improve the memory and run-time performance of mining periodic-frequent patterns.

Figure 1.1: Application of Frequent patterns in eCommerce - Amazon

1.1

Periodic frequent patterns

Tanbeer et al. [40] were the first ones to introduce the notion of periodic interestingness of a pattern. The temporal periodicity of pattern appearances is used for measuring the interestingness of a frequent pattern. In the preceding example, we mentioned that ‘sweater’ and ‘gloves’ can be considered a frequent pattern. But it can be noted that these items are bought together mostly in the winter season and are of much less interest to vendors in other seasons (summer and monsoon). Thus, the temporal occurrences of patterns are crucial in many real-world applications. A classic application to demonstrate the usefulness of these patterns is market-basket analysis. It analyzes how regularly a set of items is being purchased by the customers. An example of a periodic-frequent pattern is as follows:

{Bread, Butter} [support = 10%, periodicity = 1 hour]

The above pattern demonstrates that the items ‘Bread’ and ‘Butter’ have been purchased by 10% of the customers, and the maximum time interval between any two consecutive purchases containing both of these items is not more than an hour. This predictive behavior of the customers’ purchases can facilitate the vendors in proper inventory management.

Periodic-frequent pattern mining has many applications. In a retail market, among all frequently sold products, the user may be interested only in the regularly sold products as opposed to the seasonal items. Besides, for improved web site design or web administration, an administrator may be interested in the click sequences of regularly and heavily hit web pages. Also, in genetic data analysis [46], the set of all genes that not only appear frequently but also co-occur at regular intervals in a DNA sequence may carry more significance. In stock market trading, the set of stock indices that rise periodically may be of special interest to companies and individuals. For e-Commerce applications, it helps in improving the performance of recommender systems [38].

Tanbeer et al. [40] also described a pattern-growth algorithm, called periodic-frequent pattern growth (PF-growth), to find the complete set of periodic-frequent patterns from transactional databases. Using periodic-frequent pattern mining, one can ask queries like ‘extract all patterns which appear in 10% of the transactions and occur at least once in every 20 minutes’. The algorithm consists of three steps: (i) finding one-sized periodic-frequent patterns, (ii) constructing the periodic-frequent pattern tree (PF-tree) and (iii) recursively mining the PF-tree to discover the complete set of periodic-frequent patterns. Research efforts [24, 5, 23, 21, 22, 35] have been made in the literature to extend the model of periodic-frequent patterns, either to make it applicable to real-world datasets or to make it more efficient. In this thesis, we address the performance issues of the periodic-frequent pattern mining model.

1.2

Performance issues with the existing approach

Following are the performance issues with the existing approach:

i. The PF-tree maintains a transaction-id list at each path’s tail-node of the tree to enable the calculation of the periodic interestingness of a pattern. In real-world scenarios, datasets acquired from Twitter, Amazon, etc. are very large, and maintaining transaction-ids in the tree requires more memory. Because of the increased memory, it becomes difficult to extract patterns, as the huge tree has to be sent as an argument in the recursive mining function.

ii. The existing algorithm focuses on single-machine computation to extract periodic-frequent patterns. As real-world datasets are very large, extraction of periodic-frequent patterns on a single machine takes a lot of time and, in some cases, becomes difficult because of memory constraints.

1.3

Overview of the proposed approaches

In this section, we give a brief overview of how we solve the two issues discussed above. In the first sub-section, we introduce the notion of period summary to extract periodic-frequent patterns in an efficient manner, and in the second sub-section, we introduce a scalable algorithm to extract periodic-frequent patterns using Map-Reduce.

1.3.1

Mining periodic frequent patterns using period summary

In the existing method to extract periodic-frequent patterns, the transaction-ids of each pattern have to be stored at their respective tail-nodes of the PF-tree. This not only increases the tree size but also increases the overall memory and runtime requirements of the PF-growth algorithm, because the huge PF-tree has to be sent as an argument in the recursive call to find all periodic-frequent patterns. Extraction of periodic-frequent patterns from voluminous databases requires ample main memory for building and mining the PF-tree.

To reduce the total memory consumed for mining periodic-frequent patterns, we propose an efficient approach based on the notion of period summary. It is based on the observation that it is also possible to extract periodic-frequent patterns by maintaining period summaries instead of transaction-ids. The period summary captures the interval information, along with the periodicity within that interval, for a list of transaction-ids of a candidate pattern. The period summaries are stored in a prefix-tree like structure called the Period-Summary tree (PS-tree). We propose a pattern-growth algorithm on the PS-tree, called PS-growth, to find the complete set of periodic-frequent patterns in a transactional database. We also propose various merging conditions on period summaries, by which we determine whether a pattern is periodic or not. With the proposed approach, it is possible to achieve improved performance, as the memory required to store period summaries is significantly less than the memory required to store transaction-ids. Also, there is an improvement in the time consumed, as the linear time complexity for finding a pattern’s periodicity using transaction-id lists has been reduced to a constant-time operation using period summaries. As a result, the proposed approach can be extended to mine periodic-frequent patterns from datasets of increased size. Experimental results on three real-world datasets (mushroom, twitter and retail) show that the proposed approach is memory and run-time efficient when compared to the existing approaches.
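To make the notion concrete, the following is a minimal Python sketch of one plausible period-summary representation and its constant-time merge. The tuple layout (first tid, last tid, support, periodicity within the interval), the merge rule and all names are illustrative assumptions made here; the actual structure and merging conditions are defined in Chapter 4.

```python
# A hedged sketch of a period summary and its constant-time merge, assuming a
# summary describes one run of occurrences of a pattern. Illustrative only;
# Chapter 4 defines the thesis's actual structure and merging conditions.
from dataclasses import dataclass

@dataclass
class PeriodSummary:
    first: int   # tid of the first occurrence covered by this summary
    last: int    # tid of the last occurrence covered by this summary
    sup: int     # number of occurrences covered by this summary
    per: int     # maximum gap between consecutive occurrences inside it

def merge(left: PeriodSummary, right: PeriodSummary) -> PeriodSummary:
    """Merge two summaries of adjacent tid ranges (left precedes right).
    The only new gap introduced is the one between the two ranges, so the
    merged periodicity is available in constant time."""
    gap = right.first - left.last
    return PeriodSummary(left.first, right.last,
                         left.sup + right.sup,
                         max(left.per, right.per, gap))

# e.g. occurrences {1, 3, 4} and {6, 7} merge into one summary covering 1..7
print(merge(PeriodSummary(1, 4, 3, 2), PeriodSummary(6, 7, 2, 1)))
```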

1.3.2

Map-Reduce based approach to mine periodic frequent patterns

The existing periodic-frequent pattern mining algorithm constructs a single PF-tree and mines it recursively to extract patterns on a single machine. In recent times, technology has moved to cloud architectures, in which building scalable algorithms that utilize multiple machines is a necessity. Moreover, the datasets provided by e-Commerce and social-networking sites are very large, and mining periodic-frequent patterns from these datasets using a single machine is time consuming. There is also a potential problem where the PF-tree being mined recursively does not fit into main memory. So, the investigation of efficient methods to parallelize periodic-frequent pattern mining in a distributed environment is a research issue.

The development of parallel algorithms on large-scale shared-nothing environments, such as clusters of machines, has attracted a lot of attention, since such environments are a promising platform for high performance data mining. However, a parallel algorithm for a complex data structure like the PF-tree is much harder to implement compared to a sequential program or a shared-memory parallel system. Mining periodic-frequent patterns can be distributed mainly because of its divide-and-conquer nature. In the existing approach, the processing of the conditional pattern base for a suffix item is independent of the processing of the conditional pattern base for any other suffix item. Therefore, it is natural to consider this as the execution unit for parallel processing. Hence, we propose an efficient Map-Reduce based algorithm called Parallel Periodic Frequent pattern growth (PPF-growth) for mining periodic-frequent patterns, which runs on a cluster of machines. In the proposed approach, the database is initially divided into multiple partitions and distributed to independent machines. The database partitions then undergo two Map-Reduce phases. In the first phase, all the one-sized periodic-frequent patterns are extracted, and in the second phase, independent local parallel periodic-frequent pattern trees (PPF-trees) are constructed on each machine and mined recursively to extract patterns. In addition, an improved approach with the notion of partition summary is introduced to reduce the amount of data shuffled among the machines. Experiments on Apache Spark’s distributed environment show that the proposed approach speeds up as the number of machines increases. Also, the notion of partition summary significantly reduces the amount of data shuffled among the machines.
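The first Map-Reduce phase can be pictured with a small PySpark-flavoured sketch like the one below, which maps each transaction to (item, tid) pairs and reduces the grouped tid-lists to a (support, periodicity) pair per item. The toy dataset, the threshold values and all identifiers are assumptions made for illustration; this is only a sketch of the kind of computation the first phase performs, not the thesis's PPF-growth implementation.

```python
# A minimal PySpark sketch of a "find one-sized periodic-frequent items" phase:
# map each transaction (tid, items) to (item, tid) pairs, group tids per item,
# and compute support/periodicity per item. Illustrative assumptions only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phase1-sketch").getOrCreate()
sc = spark.sparkContext

# toy database: list of (tid, set of items); tids double as timestamps
tdb = [(1, {"a", "c", "d", "g"}), (2, {"c", "e", "f"}), (3, {"a", "c", "d"})]
last_tid = max(tid for tid, _ in tdb)
minSup, maxPer = 2, 2

def support_and_periodicity(tids):
    """Support is the number of occurrences; periodicity is the largest gap
    between consecutive occurrences, including the database borders."""
    tids = sorted(tids)
    gaps = [tids[0]] + [b - a for a, b in zip(tids, tids[1:])] + [last_tid - tids[-1]]
    return len(tids), max(gaps)

one_sized = (
    sc.parallelize(tdb)
      .flatMap(lambda t: [(item, t[0]) for item in t[1]])   # map: (item, tid)
      .groupByKey()                                         # shuffle tid-lists per item
      .mapValues(support_and_periodicity)                   # reduce: (sup, per)
      .filter(lambda kv: kv[1][0] >= minSup and kv[1][1] <= maxPer)
      .collect()
)
print(one_sized)
```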

1.4

Contributions of the thesis

The major contributions of this thesis are as follows:

i. The notion of period summary to extract periodic-frequent patterns in an efficient manner.

ii. Different merging operations on period summaries for finding a candidate pattern’s periodicity and support.

iii. A Map-Reduce algorithm for mining periodic-frequent patterns on a distributed environment.

iv. The notion of partition summary to reduce the amount of data shuffled among the machines in the Map-Reduce algorithm to mine periodic-frequent patterns.

v. Through experimental results on various datasets, we show that the proposed ideas perform better than the existing methods.

1.5

Organization of the thesis

The rest of the thesis is organized as follows. Chapter 2 discusses the work done in the literature related to periodic-frequent pattern mining and shows how the proposed approach is different from other approaches in the literature. Chapter 3 explains the background of frequent pattern mining, periodic-frequent pattern mining, Map-Reduce and parallel FP-growth using Map-Reduce. Chapter 4 presents the notion of period summary and shows the improvement of the proposed approach over the existing approaches for mining periodic-frequent patterns. Chapter 5 presents an approach to mine periodic-frequent patterns using the Map-Reduce framework. Chapter 6 summarizes the thesis and discusses the conclusions and future prospects in this direction of research related to periodic-frequent pattern mining. Finally, we present the list of related publications and references at the end.


Chapter 2

Related Work

Research efforts are being made in the field of data mining to mine interesting knowledge from transactional databases. Frequent pattern mining was first introduced to extract knowledge from transactional databases and has been extended to various fields based on the application. Periodic-frequent pattern mining is one of the extensions of frequent pattern mining. In this chapter, we summarize the research work that has attempted to address various issues of periodic-frequent pattern mining. The organization of this chapter is as follows. In Section 2.1, we describe the background related to frequent pattern mining. In Section 2.2, we explain the background of periodic-frequent pattern mining, its variations and the efforts made to improve the performance of periodic-frequent pattern mining. Further, in Section 2.3, we explain the work related to the parallelization of FP-growth. In Section 2.4, we explain how the proposed approach is different from the existing approaches and finally, in Section 2.5, we provide the summary of this chapter.

2.1

Frequent patterns

Agrawal et al. [2] were the first ones to propose the idea of extracting interesting knowledge from transactional databases by mining frequently occurring patterns. They used the notion of support to decide whether a pattern is interesting or not. All the patterns which satisfy the minimum support constraint specified by the user are known as frequent patterns. A straightforward naive approach to mine frequent patterns was first proposed in [2]: enumerate all the combinations of the items in a database (2^n combinations, where n is the number of items) and count their frequencies to find the support. The number of combinations increases exponentially with the number of items. To solve this problem, Agrawal et al. [3] introduced an algorithm called Apriori for mining frequent patterns efficiently from transactional databases, along with its applications in association rules and correlations. The downward closure property of the support measure was the key contribution for reducing the computational cost of Apriori. The downward closure property says that if any k-sized pattern is not frequent in the database, its (k + 1)-sized super-patterns can never be frequent. However, the computational cost of the Apriori algorithm is still high, as it scans the entire database to count the frequency of k-sized patterns in every iteration. Also, the runtime increases exponentially depending on the number of different items in the transactional database.

Han et al. [18] proposed the frequent pattern growth (FP-growth) algorithm to mine frequent patterns from transactional databases in a memory- and time-efficient manner. Within two database scans, the algorithm compresses the database into a compact tree structure called the frequent pattern tree (FP-tree). It stores the prefix information of every pattern along with the support at each node of the tree. A recursive mining approach is used to extract patterns from the FP-tree. The performance gain achieved by FP-growth is mainly due to the nature of the FP-tree, which stores only the frequent items in support-descending order. We can note that the FP-growth algorithm scans the database only twice, whereas the Apriori algorithm scans the database once for every k-sized pattern. The FP-growth algorithm extracts frequent patterns in an efficient manner compared to Apriori, but the number of frequent patterns generated by both algorithms for lower minimum support values grows exponentially. Also, the patterns generated are purely based on support, and the temporal occurrences of a pattern are not considered. To address this problem, in this thesis, we focus on one of the interestingness measures, called periodicity, which captures the time dimension of a pattern. It enables some early pruning techniques, which reduce the resultant pattern set and also reduce the computational cost.
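For illustration, the following is a minimal Apriori-style sketch of the level-wise search with downward-closure pruning described above. It is a textbook-style reconstruction, not the code of [3], and the toy transactions are assumptions.

```python
# A minimal Apriori-style sketch illustrating downward closure: (k+1)-sized
# candidates are generated only from frequent k-sized patterns, and every
# candidate with a non-frequent k-sized subset is pruned before counting.
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        # count supports with one pass over the database for this level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        level_frequent = {c: s for c, s in counts.items() if s >= min_sup}
        frequent.update(level_frequent)
        # candidate generation: join frequent k-patterns, prune by subsets
        prev = list(level_frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in level_frequent
                        for s in combinations(c, len(c) - 1))]
    return frequent

print(apriori([{"a", "c", "d"}, {"c", "e"}, {"a", "c", "d"}, {"a", "b"}], min_sup=2))
```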

2.2

Periodic frequent patterns

Tanbeer et al. [40] were the first ones to introduce the notion of periodic interestingness of a pattern. The temporal periodicity of pattern appearances is used for measuring the interestingness of a frequent pattern. They also proposed a pattern-growth algorithm for mining periodic-frequent patterns, called Periodic Frequent Pattern growth (PF-growth). The algorithm compresses the transactional database into a prefix-tree like structure called the periodic-frequent pattern tree (PF-tree), just like in FP-growth. The PF-tree is recursively mined to extract the complete set of periodic-frequent patterns. In this approach, the PF-tree maintains the transaction-id information in the tail-node of the tree to enable the calculation of the periodicity value of a candidate pattern. The model of periodic-frequent patterns follows the downward closure property, which says that if any k-sized pattern is not periodic-frequent, then none of its (k+1)-sized super-sets can be periodic-frequent.

Amphawan et al. [6] focused on finding the top-k periodic-frequent patterns. The periodic patterns with the k highest supports are known as top-k periodic-frequent patterns. This was proposed to make the mining process efficient. It is a single-pass algorithm and uses a best-first search strategy without any threshold on the support constraint. Another advantage the authors mention is that selecting the right minSup is difficult, and this difficulty is avoided in their model. But again, selecting the right k remains a problem with the proposed model.

To fit Tanbeer’s model [40] to real-world datasets, Kiran et al. [24] extended this model to discover those frequent patterns that exhibit partial periodic behavior in a database. Tanbeer’s model has a strict constraint on the periodicity of a pattern, i.e., if the pattern is not periodic even once in the entire database, it is regarded as non-periodic. This condition is eased in the proposed model, where a pattern can be regarded as partially periodic even if it fails the periodicity condition a few times. The user specifies a lower bound on how many times a pattern can be non-periodic in a transactional database. Given a transactional database and the user-defined thresholds of minimum support, maximum period and minimum periodic-ratio, this model discovers all partial periodic-frequent patterns that have support and periodic-ratio no less than the minimum support and minimum periodic-ratio, respectively. Alternatively, Rashid et al. [35] employed the standard deviation of periods as a criterion to assess the periodic behavior of frequent patterns. The discovered patterns are known as regular frequent patterns. In this model as well, the transactional database is compressed into a tree structure called the regular-frequent pattern tree (RF-tree), just like in PF-growth. Just like the PF-tree, the RF-tree also maintains the transaction-ids in the tail-node of the tree. The RF-tree is recursively mined to extract the complete set of regular-frequent patterns. Venkatesh et al. [41] enhanced Tanbeer’s model to address the rare item problem. In many businesses, even the rarely purchased items contribute to the profit. The usage of a single minSup and maxPer leads to the rare item problem. The problem is addressed by using the concept of closeness of a pattern’s support and periodicity to those of its items, thus making the discovery of periodic-frequent patterns consisting of both frequent and rare items feasible. Uday et al. [23] extended Tanbeer’s model to extract the knowledge of recurring patterns from time series databases. A pattern is called a recurring pattern if it exhibits periodic behavior only for particular time intervals within a series, unlike periodic patterns, which exhibit periodic behavior throughout the database. As an initial step, the time series database is converted into a transactional database. This database is compressed into a prefix-tree structure called the recurrent pattern tree (RP-tree), which is recursively mined to extract the complete set of recurring patterns. As recurring patterns do not follow the downward closure property, the estimated maximum recurrence is used to prune the search space. The RP-tree also maintains the transaction-ids in the tail-node of the tree. These recurring patterns are helpful in various applications, like detecting weather patterns.

All these approaches are variations of periodic-frequent pattern mining and innovations in pruning methods. However, in this thesis, we propose an innovation to improve the performance in terms of memory and time. So, the contribution is orthogonal to the above-mentioned approaches. In the next two paragraphs, we explain the existing methods to extract periodic-frequent patterns in an efficient manner.

To improve the runtime of mining periodic-frequent patterns, Kiran and Kitsuregawa [21] have suggested a greedy-search technique to determine the periodic interestingness of a pattern. In the existing method, to find the periodicity of a candidate pattern, one has to iterate over the complete list of transaction-ids, which takes O(n) time (n is the size of the list). The proposed greedy-search technique breaks out of the loop as soon as the pattern is found to be non-periodic.
It resulted in a best-case O(1) and worst-case O(n) time complexity to find a candidate item’s periodicity, compared to the O(n) time complexity of the existing method. This optimization reduced the total time consumed compared to the existing approach to mine periodic-frequent patterns from transactional databases.

Amphawan et al. [5] introduced interval lists to reduce the memory requirements of mining periodic-frequent patterns from transactional databases. In Amphawan’s model, the transactional timeline is divided into intervals of size maximum period (maxPer). The interval information, along with the support, is stored whenever a pattern occurs in that interval. Instead of storing the transaction-ids in the tail-node, the interval list is stored, resulting in lower memory consumption. Experiments on real-world datasets show a reduction in the total memory consumed compared to the existing approach to extract periodic-frequent patterns.

An association rule is considered cyclic if the rule has the minimum confidence and support at regular time intervals. Ozden et al. [31] have proposed an approach to extract cyclic association rules from transactional databases. They have investigated the periodic interestingness of a pattern to discover cyclic association rules. The database is divided into non-overlapping subsets with respect to time. Cyclic association rules are the ones which appear in at least a certain number of subsets. Periodic-frequent patterns are not to be confused with cyclic patterns; the difference is that this model is restrictive and finds only those rules which appear in every cycle. Sequential pattern mining [4] has also been proposed, which extracts all those sequences which occur frequently. Periodic-frequent pattern mining is different from sequential pattern mining for two main reasons. Firstly, sequential pattern mining considers only sequential data. Secondly, it relies on the support threshold, which is the only constraint to be satisfied by all frequent patterns; the periodic interestingness of a pattern is not considered in that study.

2.3

Parallel mining of frequent patterns

In the preceding section, we discussed the FP-growth algorithm proposed by Han et al. [18] to mine frequent patterns in a memory and time efficient manner. Yet, this algorithm (FP-growth) is focused on single-machine computation for mining frequent patterns. Osmar et al. [44] proposed an algorithm to parallelize the FP-growth algorithm across multiple threads on the cores of a single machine. They exploited the property of FP-growth that the mining of patterns for a suffix item is independent of that for any other suffix item; this can be considered as a basic unit of parallelization. This approach parallelizes FP-growth on a single machine with multiple cores but does not address the issue of very large databases. There is a potential problem where the entire database may not fit on a single machine, or the constructed FP-tree may not fit into the main memory of a single machine.

To overcome the issue of large databases which do not fit on a single machine and to increase the scalability of mining frequent patterns, Pramudiono et al. [33] proposed a distributed variant of the FP-growth algorithm, which runs over a cluster of commodity machines. Here, the task is divided among multiple machines, making it scalable for datasets of any size. They were able to achieve good scalability over a cluster of 128 machines. The communication between the machines is done using send and receive commands.

Dean et al. [13] proposed the Map-Reduce framework to make distributed processing of large datasets simple. Users specify a set of map and reduce steps, which utilize multiple machines to do the task. The Hadoop framework, an Apache-sponsored project, includes Map-Reduce APIs in a Java-based programming language. Hadoop has proven its good scalability and automatic fault recovery when working with a cluster of machines. Taking advantage of this, Li et al. [27] proposed a parallel FP-growth approach for query recommendation in a Map-Reduce environment using Hadoop. The proposed method consists of two Map-Reduce phases. In the first Map-Reduce phase, the F-list is constructed, and in the second Map-Reduce phase, independent local FP-trees are constructed on different machines. These independent local FP-trees are mined in parallel to extract the complete set of frequent patterns. They were able to achieve a near-linear speed-up in the total time consumed with the increase in the number of machines.

2.4

How proposed approach is different

Tanbeer’s existing model suffers from high memory consumption because it stores the transaction-id information of every transaction in the PF-tree structure. In this thesis, one of the chapters focuses on reducing the total memory consumed to extract periodic-frequent patterns using the notion of period summary. Amphawan’s model [5] also focuses on reducing the memory consumed by PF-growth. However, the model proposed in this thesis is different from Amphawan’s model, as the size of an interval is not restricted to maxPer and can expand as long as the merged transaction-ids remain periodic. Different merging conditions on period summaries are also proposed, which help in finding the periodicity and support values of a candidate pattern. Also, finding a candidate item’s periodicity takes linear time (O(n)) in Amphawan’s model, which is reduced to constant time (O(1)) using the proposed period summary technique. Experimental results show that the proposed method performs better than all the existing models.

Parallel FP-growth using Map-Reduce [27] cannot be extended to mine periodic-frequent patterns. The difference between PF-growth and FP-growth lies in the tree structure and also in the storage of additional transaction-id (tid) information, which is crucial for computing the periodicity in PF-growth. Moreover, efficient methods to distribute these large arrays (tid-lists) in PF-growth need investigation. The proposed method consists of two Map-Reduce phases, and in both phases we have included the steps to process transaction-ids. To our knowledge, the existing literature has not addressed the problem of parallel mining of periodic-frequent patterns in a distributed environment, and we claim to be the first to do so.

2.5

Summary

In this chapter, we have explained the existing work related to periodic-frequent pattern mining. It can be noted that the existing approaches to mine periodic-frequent patterns from transactional databases consume a large amount of memory and take a long time, and only a few efforts have been made to make periodic-frequent pattern mining efficient. In the upcoming chapters, we explain the proposed ideas, which are more efficient than the existing ones.


Chapter 3

Background

As part of the background, we explain a few approaches in more detail. We first explain the frequent pattern growth algorithm proposed by Han et al. [18] for mining frequent patterns. Later, we explain the model of periodic-frequent patterns and the periodic-frequent pattern growth algorithm proposed by Tanbeer et al. [40]. Finally, we explain the Map-Reduce framework [13] and parallel FP-growth using Map-Reduce [27], and conclude the chapter with a summary.

3.1

Frequent pattern mining using FP-growth

Agrawal et al. [3] first introduced the Apriori algorithm to extract frequent patterns from transactional databases. However, the time taken by this algorithm is high because it scans the entire database for each of the k-sized patterns (k ∈ [1, n], where n is the number of items in the database). Han et al. [18] proposed frequent pattern growth (FP-growth) to extract frequent patterns in an efficient manner. FP-growth consists of the following two steps: (i) compressing the transactional database into a prefix-tree like structure called the frequent pattern tree (FP-tree) and (ii) recursively mining the FP-tree using the pattern-growth approach to discover the complete set of frequent patterns.

3.1.1

Structure of FP-tree

The structure of the FP-tree contains an FP-list and a prefix-tree. An FP-list consists of the item name (Item) and support (sup). Here, support is the frequency of the item in the transactional database. The items in the FP-list are sorted in descending order of support to facilitate high compactness. Each node in the FP-tree keeps track of the support value and also maintains parent-child relationships (refer to Figure 3.2). The header table maintains the list of frequent items sorted based on support, and each item in the header table contains node links to all the occurrences of that item (refer to Figure 3.2).

Figure 3.1: A sample dataset and corresponding ordered frequent itemsets. (Image taken as a reference from paper [18])

Figure 3.2: Constructed FP-tree. (Image taken as a reference from paper [18])

3.1.2

Construction of FP-tree

Two scans of the database are required to represent the database in a compact structure, the FP-tree. In the first database scan, the FP-list is constructed, which contains the frequent items of size 1. Since frequent patterns follow the downward closure property (if a pattern is not frequent, then none of its super-sets are frequent), frequent patterns of unit length play a key role. Therefore, the first database scan to extract one-sized frequent patterns is an important step. Now only these items in the FP-list will be used for the construction of the prefix tree, and the insertion of a branch is done in the same order as in the FP-list. In the second database scan, the items in the FP-list take part in the construction of the FP-tree. The items which are not present in the FP-list are not considered while inserting a transaction into the prefix tree. While inserting a transaction into the tree, the support of the nodes in that branch is incremented by one. Figure 3.1 shows a sample dataset and the corresponding frequent itemsets for minSup = 3. Figure 3.2 shows the FP-tree constructed from the sample dataset.

Example 1 Consider the first two columns of Figure 3.1 as a transactional database with minSup = 3. The third column in Figure 3.1 contains the filtered transactions, which contain only the frequent items. Each filtered transaction is inserted into the FP-tree in the same order as in the header table. The support value at the corresponding node is incremented for every occurrence.
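A minimal Python sketch of the FP-tree node and the insertion step described above is given below. It is an illustrative re-implementation rather than the code of [18]; the item order and the toy transactions are assumptions.

```python
# A minimal sketch of FP-tree construction: each node keeps a support count,
# and inserting a transaction increments the counts along its path.
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item        # item name ('root' for the root node)
        self.support = 0        # number of transactions passing through this node
        self.parent = parent
        self.children = {}      # item -> FPNode

def insert_transaction(root, sorted_items):
    """Insert one filtered transaction whose items are already sorted in
    FP-list (support-descending) order, incrementing supports along the path."""
    node = root
    for item in sorted_items:
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
        node = node.children[item]
        node.support += 1

# toy usage: filtered transactions sorted by an assumed support-descending order
order = {"c": 0, "a": 1, "d": 2, "b": 3}
root = FPNode("root")
for txn in [["c", "a", "d"], ["c", "a", "b"], ["c", "a", "d"]]:
    insert_transaction(root, sorted(txn, key=order.get))
```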

3.1.3

Mining of FP-tree

To discover frequent patterns from the FP-tree, FP-growth employs the following steps:

– Choosing the last item i in the FP-list as an initial suffix item, its prefix-tree (denoted as PTi) is constructed. This constitutes the prefix sub-paths of the nodes labeled i.

– For each item j in PTi, we find its support and determine whether it is a frequent pattern by comparing its support against minSup. If ij is a frequent pattern, then we consider j to be frequent in PTi.
– Choosing every frequent item j in PTi, we construct its conditional tree, CTij, and mine it recursively to discover the patterns.
– After finding all frequent patterns for a suffix item i, we prune it from the original FP-tree and push the corresponding nodes’ support to their parent nodes.

We repeat the above steps until the FP-list becomes empty.
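The recursion above can be sketched compactly by projecting conditional pattern bases directly, instead of building conditional FP-trees. The following simplified Python illustration captures the pattern-growth idea under that assumption; it is not the actual FP-growth implementation.

```python
# A simplified pattern-growth recursion over conditional databases; each
# transaction is a tuple of items kept in one fixed (support-descending)
# order, so every pattern is generated exactly once. Illustrative only.
from collections import defaultdict

def pattern_growth(cond_db, suffix, min_sup, results):
    """cond_db: list of (items_tuple, count) pairs conditioned on `suffix`."""
    counts = defaultdict(int)
    for items, count in cond_db:
        for item in items:
            counts[item] += count
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = sup
        # conditional pattern base of `item`: the prefix of each transaction
        # that precedes `item`, analogous to the prefix tree PT_item
        projected = [(items[:items.index(item)], count)
                     for items, count in cond_db if item in items]
        projected = [(p, c) for p, c in projected if p]
        pattern_growth(projected, pattern, min_sup, results)

# usage: filtered transactions already sorted in support-descending order
db = [(("c", "a", "d"), 1), (("c", "a", "d"), 1), (("c", "e"), 1), (("c", "a", "b"), 1)]
results = {}
pattern_growth(db, frozenset(), min_sup=2, results=results)
print(results)   # {c}:4, {a}:3, {a,c}:3, {d}:2, {c,d}:2, {a,d}:2, {a,c,d}:2
```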

3.2

Model of periodic-frequent patterns

The basic model of periodic-frequent patterns described by Tanbeer et al. [40] is as follows. Let I = {i_1, i_2, ..., i_n}, n ≥ 1, be a set of items. A set X = {i_j, ..., i_k} ⊆ I, where j ≤ k and j, k ∈ [1, n], is called a pattern (or an itemset). A transaction t = (tid, Y) is a tuple, where tid represents a transaction-id (or timestamp) and Y is a pattern. A transactional database (TDB) over I is a set of transactions, i.e., TDB = {t_1, t_2, ..., t_m}, m = |TDB|, where |TDB| represents the size of TDB in total number of transactions. If X ⊆ Y, it is said that t contains X, and such a transaction-id is denoted as tid^X_j, j ∈ [1, m]. Let TID^X = {tid^X_j, ..., tid^X_k}, j, k ∈ [1, m] and j ≤ k, be the set of all transaction-ids where X occurs in TDB. The support of a pattern X is the number of transactions containing X in TDB, denoted as Sup(X). Therefore, Sup(X) = |TID^X|. Let tid^X_i and tid^X_j, i, j ∈ [1, m − 1], be two consecutive transaction-ids where X has appeared in TDB. The period of a pattern X is the number of transactions or the time difference between tid^X_j and tid^X_i. Let P^X = {p^X_1, p^X_2, ..., p^X_r}, r = Sup(X) + 1, be the complete set of periods of X in TDB. The periodicity of a pattern X is the maximum difference between any two adjacent occurrences of X, denoted as Per(X) = max(p^X_1, p^X_2, ..., p^X_r). A pattern X is a periodic-frequent pattern if Sup(X) ≥ minSup and Per(X) ≤ maxPer, where minSup and maxPer represent the user-defined thresholds on support and periodicity, respectively. Both the support and periodicity of a pattern can be expressed as a percentage of |TDB|. Table 3.1 lists the nomenclature of the different terms used in the periodic-frequent pattern model. Example 2 illustrates this model using the transactional database shown in Table 3.2.

Table 3.1: Nomenclature of various terms used in periodic-frequent pattern model

  Terminology                                          Symbol
  The complete set of items                            I
  Transactional database                               TDB
  The transaction-id of a transaction containing X     tid^X_i
  The set of all transaction-ids containing X          TID^X
  The support of X                                     Sup(X)
  The user-defined minimum support                     minSup
  The periodicity of X                                 Per(X)
  The user-defined maximum period                      maxPer

Table 3.2: A running example of a transactional database with continuation in Chapter 4

  TID  Items    TID  Items    TID  Items    TID  Items
  1    acdg     6    abcd     11   acde     16   befg
  2    cef      7    acdf     12   beg      17   abcde
  3    acd      8    abcd     13   acdg     18   beg
  4    abcde    9    acde     14   bef      19   acdeg
  5    bf       10   befg     15   acd      20   bg

The periods for this pattern are: 1 (= 1 − tid_i), 2 (= 3 − 1), 1 (= 4 − 3), 2 (= 6 − 4), 1 (= 7 − 6), 1 (= 8 − 7), 1 (= 9 − 8), 2 (= 11 − 9), 2 (= 13 − 11), 2 (= 15 − 13), 2 (= 17 − 15), 2 (= 19 − 17) and 1 (= tid_l − 19), where tid_i = 0 represents the start of the database and tid_l = 20 represents the last transaction in the transactional database. The periodicity of acd is Per(acd) = max(1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1) = 2. If the user specifies minSup = 10 and maxPer = 4, the pattern acd is a periodic-frequent pattern because Sup(acd) ≥ minSup and Per(acd) ≤ maxPer.
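To make these definitions concrete, the following is a minimal Python sketch (not part of the original model; the function and variable names are illustrative) that computes the support and periodicity of a pattern directly from its tid-list, following the convention that the first period is measured from tid 0 and the last period up to |TDB|.

def support_and_periodicity(tid_list, db_size):
    # Compute Sup(X) and Per(X) from the sorted tid-list of a pattern X.
    support = len(tid_list)
    periods = []
    prev = 0                        # the first period is measured from tid 0
    for tid in tid_list:
        periods.append(tid - prev)
        prev = tid
    periods.append(db_size - prev)  # the last period ends at |TDB|
    return support, max(periods)

# Pattern 'acd' of Table 3.2
tids = [1, 3, 4, 6, 7, 8, 9, 11, 13, 15, 17, 19]
print(support_and_periodicity(tids, 20))  # (12, 2)

With minSup = 10 and maxPer = 4, the checks Sup(acd) ≥ minSup and Per(acd) ≤ maxPer confirm that ‘acd’ is periodic-frequent.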

3.3 Periodic-frequent pattern growth

Tanbeer et al. [40] proposed periodic-frequent pattern growth (PF-growth) to extract periodic-frequent patterns from transactional databases. PF-growth consists of the following two steps: (i) compressing the transactional database into a prefix-tree-like structure called the periodic-frequent pattern tree (PF-tree) and (ii) recursively mining the PF-tree using the pattern-growth approach to discover the complete set of periodic-frequent patterns. The algorithm accepts a transactional database (TDB), minSup and maxPer as inputs and outputs the complete set of periodic-frequent patterns. As an example, we consider Table 3.2 as our transactional database with minSup = 10 and maxPer = 4.

Figure 3.3: Construction of PF-list. (a) After scanning first transaction (b) After scanning second transaction (c) After scanning all transactions (d) Final sorted list of periodic-frequent items of size 1.

3.3.1 Structure of PF-tree

The structure of the PF-tree consists of a PF-list and a prefix-tree. A PF-list contains three fields: item name (Item), support (sup) and periodicity (per). The items in the PF-tree are sorted in descending order of support to facilitate high compactness. Two types of nodes are maintained in the PF-tree: ordinary nodes and tail-nodes. The former is similar to the node used in an FP-tree, whereas the latter represents the last item of a sorted transaction. However, the nodes in the PF-tree do not maintain the support count as in an FP-tree. Instead, the transaction-ids (tids) of the transactions represented by a branch are explicitly maintained at the tail-node of that branch. The tids are maintained in the tail-node in order to calculate the periodicity of a pattern.

3.3.2 Construction of PF-tree

Two scans of the database are required to represent the database in the compact PF-tree structure. In the first database scan, the PF-list is constructed, which contains the periodic-frequent items of size 1 (one-sized patterns or 1-patterns). Since periodic-frequent patterns satisfy the downward closure property, periodic-frequent patterns of unit length play a key role. Therefore, the first database scan, which extracts the one-sized periodic-frequent patterns, is an important step. Algorithm 1 describes the procedure for constructing the PF-list. Figure 3.3(a)-(c) shows the construction of the PF-list after scanning the first transaction, the second transaction and the entire database, respectively. Figure 3.3(d) shows the sorted list of one-sized periodic-frequent items discovered from the PF-list. We prune the items e, f and g from the list as these items do not satisfy the minSup and maxPer constraints. Therefore, only {a, b, c, d} are considered as one-sized periodic-frequent patterns. These patterns are sorted in descending order of their support values. Only these patterns are used for the construction of the prefix-tree, and the insertion of a branch is done in the same order as in the PF-list. In the second database scan, the items in the PF-list take part in the construction of the PF-tree using the FP-tree construction technique. The tree construction starts by inserting the first transaction, (1, acdg), according to the PF-list order, as shown in Figure 3.4(b).

Figure 3.4: Construction of PF-tree. (a) PF-list (b) After scanning first transaction (c) After scanning second transaction (d) After scanning all transactions.

Algorithm 1 Construction of PF-list (TDB: transactional database, maxPer: maximum periodicity, minSup: minimum support)
1: for each transaction t_cur ∈ TDB do
2:   for each item i in t_cur do
3:     if t_cur is i’s first occurrence then
4:       Set sup_i = 1, per_i = t_cur and idl_i = t_cur
5:     else
6:       Set sup_i += 1
7:       p_cur = t_cur − idl_i
8:       idl_i = t_cur
9:       if p_cur > per_i then
10:        per_i = p_cur
11:      end if
12:    end if
13:  end for
14: end for

Here, t_cur denotes the tid (timestamp) of the current transaction, while sup_i, per_i and idl_i denote the support, the largest period observed so far and the tid of the last occurrence of item i, respectively.
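A compact Python rendering of this single database scan is sketched below. It is only an illustration of Algorithm 1, assuming the database is given as a list of (tid, items) pairs; the closing of the last period and the final pruning step, which the pseudocode leaves to the surrounding discussion, are included here explicitly.

def build_pf_list(tdb, min_sup, max_per):
    # item -> [sup, per, idl]: support, largest period so far, tid of last occurrence
    pf = {}
    last_tid = 0
    for tid, items in tdb:
        last_tid = tid
        for item in items:
            if item not in pf:
                pf[item] = [1, tid, tid]   # first occurrence: period measured from tid 0
            else:
                sup, per, idl = pf[item]
                pf[item] = [sup + 1, max(per, tid - idl), tid]
    # close the last period up to |TDB| and keep only items satisfying both thresholds
    pf_list = {}
    for item, (sup, per, idl) in pf.items():
        per = max(per, last_tid - idl)
        if sup >= min_sup and per <= max_per:
            pf_list[item] = (sup, per)
    return pf_list

tdb = [(1, "acdg"), (2, "cef"), (3, "acd"), (4, "abcde"), (5, "bf"),
       (6, "abcd"), (7, "acdf"), (8, "abcd"), (9, "acde"), (10, "befg"),
       (11, "acde"), (12, "beg"), (13, "acdg"), (14, "bef"), (15, "acd"),
       (16, "befg"), (17, "abcde"), (18, "beg"), (19, "acdeg"), (20, "bg")]
print(build_pf_list(tdb, 10, 4))
# {'a': (12, 2), 'c': (13, 2), 'd': (12, 2), 'b': (11, 4)}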


Figure 3.5: Mining using PFP-growth algorithm. (a) PF-tree after removing ‘b’ (b) Prefix tree of ‘b’ (c) Conditional Tree of ‘b’.

Figure 3.6: Mining using PFP-growth algorithm. (a) PF-tree after removing ‘d’ (b) Prefix tree of ‘d’ (c) Conditional Tree of ‘d’.

The items that are not present in the PF-list are not considered while inserting a transaction into the prefix-tree. The tail-node ‘d’ carries the transaction-id of the first transaction, ‘d : [1]’. The second transaction, (2, cef), is inserted into the tree with the node ‘c : [2]’ as its tail-node (see Figure 3.4). When a transaction shares the same path, its transaction-id is appended to the existing tail-node. A similar process is repeated for the remaining transactions in the database. The final PF-tree generated after scanning the entire database is shown in Figure 3.4(d).
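The insertion step can be illustrated with the following self-contained Python sketch. The class and function names are hypothetical; the sketch only shows how a filtered, PF-list-ordered transaction is inserted and how the tid is recorded at the tail-node, while node-links and the PF-list itself are omitted for brevity.

class PFNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.children = {}   # item -> PFNode
        self.tids = []       # non-empty only when the node acts as a tail-node

def insert_transaction(root, sorted_items, tid):
    # Walk down the prefix-tree, creating missing nodes, and append the tid
    # at the node of the last item (the tail-node of this transaction).
    node = root
    for item in sorted_items:
        if item not in node.children:
            node.children[item] = PFNode(item, parent=node)
        node = node.children[item]
    node.tids.append(tid)

root = PFNode(None)
insert_transaction(root, ["c", "a", "d"], 1)  # tid 1: acdg -> c, a, d after filtering/sorting
insert_transaction(root, ["c"], 2)            # tid 2: cef  -> c
insert_transaction(root, ["c", "a", "d"], 3)  # tid 3: acd  -> shares the path, tid appended
print(root.children["c"].children["a"].children["d"].tids)  # [1, 3]
print(root.children["c"].tids)                              # [2]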

3.3.3 Mining of PF-tree

To discover periodic-frequent patterns from PF-tree, PFP-growth employs the following steps:

Table 3.3: Patterns generated using the PFP-growth approach for the transactional database in Table 3.2

Pattern  Support  Periodicity        Pattern  Support  Periodicity
b        11       4                  cd       12       2
d        12       2                  a        12       2
ad       12       2                  ca       12       2
cad      12       2                  c        13       2

– Choosing the last item i in the PF-list (Figure 3.4(a)) as the initial suffix item, its prefix-tree (denoted as PTi) is constructed. It consists of the prefix sub-paths of the nodes labeled i.
– For each item j in PTi, we aggregate the transaction-id lists of all of its nodes to derive the transaction-id list of the pattern ij, i.e., TID^ij. Then, we determine whether ij is a periodic-frequent pattern by comparing its support and periodicity against minSup and maxPer, respectively. If ij is a periodic-frequent pattern, then we consider j to be periodic-frequent in PTi.
– Choosing every periodic-frequent item j in PTi, we construct its conditional tree, CTij, and mine it recursively to discover the patterns.
– After finding all periodic-frequent patterns for a suffix item i, we prune i from the original PF-tree and push the corresponding nodes’ transaction-id lists to their parent nodes.

We repeat the above steps until the PF-list becomes NULL. Figures 3.5 and 3.6 show the mining process for the items ‘b’ and ‘d’. Table 3.3 lists the periodic-frequent patterns extracted using PFP-growth.

3.4 Map-Reduce framework

Dean et al. [13] introduced the Map-Reduce framework to enable the processing of large datasets on a cluster of commodity machines. It is a simple, abstracted programming model which can solve a large class of computational problems. This framework is primarily used to process huge amounts of data, in parallel, on large clusters of machines in a shared-nothing environment. Here, the data is initially partitioned into multiple shards on a distributed file system, and each machine processes the partition of data that is assigned to it. Users specify the problem as a sequence of Map-Reduce steps. In both steps, the data is represented in the form of key-value pairs.

• Map: The map function is applied on the input data in the distributed file system and runs in parallel on different partitions. It iterates over a set of input key-value pairs and generates intermediate output key-value pairs. The Map-Reduce framework takes the responsibility of grouping all the intermediate values by key and passing them to the reduce function. Figure 3.7 shows a word-count example using Map-Reduce. The map function takes each word from the partition of input data assigned to it and outputs key-value pairs of the form ⟨word, 1⟩.

• Reduce: This stage is the combination of the shuffle stage and the reduce stage. All the pairs with the same key are processed by the user-defined reduce function at the receiving node. After processing, it produces a new set of output, which is stored in the distributed file system. Continuing the example of Figure 3.7, the shuffle phase groups all the keys and assigns each of them to a machine, and the reduce function counts the number of ones to determine the word count.
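For illustration, the word-count computation of Figure 3.7 can be simulated on a single machine with the following minimal Python sketch of the map, shuffle and reduce steps; the input strings are arbitrary and the function names are not part of any Map-Reduce library.

from collections import defaultdict

def map_phase(partition):
    # Map: emit <word, 1> for every word in the assigned partition.
    return [(word, 1) for line in partition for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the ones to obtain the count of each word.
    return key, sum(values)

partitions = [["Krishna Krishna Kumar", "Krishna Kumar Kumar"],
              ["Anirudh Kumar Krishna", "Anirudh Anirudh Krishna"]]
intermediate = [pair for p in partitions for pair in map_phase(p)]
print([reduce_phase(k, v) for k, v in shuffle(intermediate).items()])
# [('Krishna', 5), ('Kumar', 4), ('Anirudh', 3)]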


Figure 3.7: Map-Reduce framework

Figure 3.8: Map-Reduce execution model using combine function

3.4.1 Combine function

Map-Reduce also provides a combine function, which can be applied after the map phase. It acts as a local reducer and operates on the key-value pairs within a machine. It helps in reducing the amount of data shuffled among the machines. As can be seen in Figure 3.7, the key-value pair ⟨Anirudh, 1⟩ appears three times for machine 1. Instead, we can apply a combine function and send the key-value pair ⟨Anirudh, 3⟩. The same example with the combine function is depicted in Figure 3.8.
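The effect of the combiner can be sketched by applying the same aggregation locally before the shuffle, as in the following illustrative Python fragment (the names are hypothetical).

from collections import Counter

def map_with_combine(partition):
    # Map followed by a local combine: emit <word, local_count> per machine,
    # so repeated <word, 1> pairs are collapsed before being shuffled.
    counts = Counter(word for line in partition for word in line.split())
    return list(counts.items())

print(map_with_combine(["Krishna Krishna Kumar", "Anirudh Krishna"]))
# [('Krishna', 3), ('Kumar', 1), ('Anirudh', 1)]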

map(key1, value1) → list(key2, value2)
reduce(key2, list(value2)) → list(key3, value3)

Encouraged by the power of the Map-Reduce paradigm, researchers are making efforts to propose parallel algorithms under the Map-Reduce framework in different fields of computer science that deal with large amounts of data [25, 47, 14]. However, there is a limitation on the class of algorithms that can be extended to Map-Reduce. For example, most of the standard graph-processing algorithms cannot be extended to Map-Reduce directly. Hence, making an algorithm extendable to Map-Reduce is not a straightforward task.

3.5 Parallel FP-growth using Map-Reduce

Periodic-frequent pattern mining belongs to the family of frequent pattern mining, as both classes of algorithms use a recursive approach to mine patterns from a tree structure. Han et al. [18] proposed the FP-growth algorithm to mine frequent patterns in a memory- and time-efficient manner; however, it uses only a single machine. Li et al. [27] proposed parallel FP-growth to extract frequent patterns using the Map-Reduce framework.

Algorithm 2 FP-listConstruction (TDB)
Procedure: Map(key = null, value = TDB_i)
1: for each transaction t_cur ∈ TDB_i do
2:   for each item it in t_cur do
3:     Output (it, 1)
4:   end for
5: end for
Procedure: Reduce(key = it, value = S(it))
1: Initialize sup = 0
2: for each 1 ∈ S(it) do
3:   Set sup += 1
4: end for
5: Output (it, sup)


Algorithm 3 FP-treeConstructionMining (TDB, FP-list)
Procedure: Map(key = null, value = TDB_i)   // TDB_i is a segment of TDB
1: for each transaction t_cur ∈ TDB_i do
2:   Filter out the items of t_cur that are not in the FP-list and sort the remaining items in FP-list order; initialize H = ∅   // H is the set of partition-ids already output for t_cur
3:   for j = (|t_cur| − 1) down to 0 do
4:     partition-id = getPartition(t_cur[j])
5:     if H does not contain partition-id then
6:       Output (partition-id, (t_cur[0 : j], 1)) and add partition-id to H
7:     end if
8:   end for
9: end for
Procedure: Reduce(key = partition-id, value = transactions)
1: Initialize the local FP-tree, T
2: for each t_cur in transactions do
3:   for each item it in t_cur do
4:     if T does not have a child it then
5:       Create a new child node it and link it with its parent
6:     end if
7:     Traverse to the child it and increment the support of that node
8:   end for
9: end for
Procedure: Map(key = partition-id, value = local FP-tree)   // Parallel mining
1: for each suffix item i in the FP-list do
2:   if the current partition-id is responsible for item i then
3:     Generate PT_i and CT_i, and mine CT_i recursively for patterns with suffix i
4:   end if
5: end for


Parallel FP-growth consists of two Map-Reduce phases. The mapper and reducer functions do the following computation.

i. Map-Reduce Phase 1: The first phase constructs the F-list, which contains the 1-sized frequent patterns (Algorithm 2).
   Map: For each transaction, output key-value pairs of the form ⟨item, 1⟩.
   Reduce: Aggregate all the 1’s of each item to count the corresponding frequencies.
ii. Map-Reduce Phase 2: The second phase constructs independent local FP-trees on different machines (Algorithm 3).
   Map: For each transaction, generate the sub-patterns and output key-value pairs of the form ⟨partition-id, sub-pattern⟩.
   Reduce: Aggregate all the transactions of a partition and construct the local FP-tree. Frequent patterns are then extracted at each worker by mining the local FP-trees recursively.
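The phase-2 map logic can be sketched as follows. This is only an illustration under the assumption of a simple hash-based getPartition over the F-list items; the actual grouping strategy of parallel FP-growth may differ.

def get_partition(item, num_groups):
    # Assign each F-list item to a group (partition) by hashing.
    return hash(item) % num_groups

def phase2_map(transaction, f_list_order, num_groups):
    # Filter and sort the transaction in F-list order, then scan it from the
    # last item to the first and send the prefix ending at that item to the
    # group responsible for it (at most once per group per transaction).
    items = sorted((it for it in transaction if it in f_list_order),
                   key=f_list_order.index)
    emitted, pairs = set(), []
    for j in range(len(items) - 1, -1, -1):
        gid = get_partition(items[j], num_groups)
        if gid not in emitted:
            emitted.add(gid)
            pairs.append((gid, items[:j + 1]))
    return pairs

f_list_order = ["c", "a", "d", "b"]          # items in descending order of support
print(phase2_map("abcd", f_list_order, 2))   # group-dependent prefixes of ['c', 'a', 'd', 'b']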

3.6 Summary

In this chapter, we have explained the existing approaches to mine frequent patterns and periodic-frequent patterns, the background of Map-Reduce, and parallel FP-growth using Map-Reduce. In the next chapter, we introduce the proposed approach based on the notion of period summary to improve the performance of periodic-frequent pattern mining.


Chapter 4 Mining Periodic Frequent Patterns Using Period Summary

In this chapter, we discuss the proposed approach to mine periodic-frequent patterns in an efficient manner using the notion of period summary. It is based on the observation that it is also possible to extract periodic-frequent patterns by maintaining period summaries instead of the list of transaction-ids. The organization of this chapter is as follows. In Section 4.1, we explain the motivation, and in Section 4.2, we explain the basic idea. In Section 4.3, we explain the proposed approach, period summary growth. In Section 4.4, experimental results are reported, showing that the proposed approach is more efficient than the existing approach in terms of memory consumed and time taken. Finally, in Section 4.5, we conclude the chapter with a summary.

4.1 Motivation

The datasets generated by modern e-Commerce and social networking sites are very huge. Mining periodic-frequent patterns from such huge datasets using the existing algorithm [40] faces a major restriction, because the periodic-frequent pattern tree (PF-tree) explicitly stores the transaction-id information in the tail-node of every branch. The approximate space complexity for constructing a PF-tree is O(n + |TDB|), where n represents the total number of nodes generated in the PF-tree and |TDB| represents the total number of transactions in TDB. Here, |TDB| amount of memory is consumed because the transaction-id lists (or tid-lists) are stored at the tail-nodes of the sorted transactions in the PF-tree. The transaction-ids are stored in order to calculate the periodicity. If |TDB| is a very large number, the size of the PF-tree also increases accordingly.

4.2 Basic Idea - Period Summary

While investigating approaches to reduce the memory requirement of the PF-tree, it was observed that when a transactional database is compressed into a PF-tree, the entire tid-list of a pattern gets fragmented into multiple distinct sub-lists.

Figure 4.1: Occurrence timeline for pattern ‘cad’

As an example, consider the dataset in Table 3.2 (presented in the preceding chapter) as the transactional database, with minSup = 10 and maxPer = 4.

Example 3 In Table 3.2, the tid-list of the pattern ‘b’ is TID^b = {4, 5, 6, 8, 10, 12, 14, 16, 17, 18, 20}. When the transactional database shown in Table 3.2 is compressed into the PF-tree, the tid-list of ‘b’ gets divided into the two branches ‘b’ and ‘cadb’ (Figure 3.4(d)), and the corresponding sub-lists at the tail-nodes are {4, 6, 8, 17} and {5, 10, 12, 14, 16, 18, 20}.

It can be observed that, instead of storing transaction-ids, it is sufficient to store the interval information in which the pattern appears periodically. The interval information includes the interval extremes and the periodicity within that interval. Suppose a pattern X appears periodically in the range [a, b] such that the difference between any two of its consecutive occurrences in that range is no more than per (where per should be no more than maxPer). Then, it is enough to store the interval extremes [a, b] and the corresponding per, denoting that the pattern is periodic in that range with periodicity per. Based on our investigation, we propose two concepts: the concept of period summary, which is employed to build the tree in a memory-efficient manner, and the process of merging period summaries, which is used during the mining of periodic-frequent patterns to detect whether a pattern satisfies the minSup and maxPer thresholds. The concept of period summary is defined as follows.

Definition 1 A period summary of a pattern X, ps^X_i, captures the interval in which the pattern has appeared periodically in the data and the periodicity of the pattern within that interval. That is, ps^X_i = ⟨tid^X_j, tid^X_k, per^X_i⟩, where tid^X_j and tid^X_k, 1 ≤ j ≤ k ≤ |TDB|, represent the first and last tids of the interval, respectively, in which the pattern has appeared periodically in a subset of the database, and per^X_i is the periodicity of the pattern within the interval whose tids lie between tid^X_j and tid^X_k. Let tid^X_j be denoted as first and tid^X_k as last in the respective interval. Let PS^X = {ps^X_1, ps^X_2, ..., ps^X_k}, 1 ≤ k ≤ |TDB|, denote the complete set of period summaries of X at any tail-node.

Example 4 Continuing with the previous example, the sub-lists of ‘b’ result in the following period summaries at their respective tail-nodes: PS^cadb = {⟨4, 8, 2⟩, ⟨17, 17, 0⟩} (‘b’ along with ‘c’, ‘a’ and ‘d’) and PS^b = {⟨5, 5, 0⟩, ⟨10, 20, 2⟩} (‘b’ in isolation).



Figure 4.2: Occurrence timeline for pattern ‘cad’ using the notion of period summary

The first element of PS^cadb says that ‘cadb’ has occurred with a periodicity of 2 in the interval whose tids range from 4 to 8. The second element of PS^cadb says that ‘cadb’ has occurred with a periodicity of 0 in the interval whose tids range from 17 to 17. Similarly, the first element of PS^b says that ‘b’ has occurred with a periodicity of 0 in the interval whose tids range from 5 to 5, and the second element of PS^b says that ‘b’ has occurred with a periodicity of 2 in the interval whose tids range from 10 to 20.

Figure 4.1 shows the timeline of the twenty transactions, where each box marks an occurrence of the pattern ‘cad’. Here, the tid-list stored at the tail-node ‘d’ of the branch ‘cad’ is {1, 3, 7, 9, 11, 13, 15, 19}. The existing approach stores all these tids in the tail-node ‘d’. Instead of storing these transaction-ids, we can store the single period summary ⟨1, 19, 4⟩, as shown in Figure 4.2.

In the mining phase, the period summaries at different tail-nodes have to be merged to generate the final period summary of a pattern, in order to determine whether the pattern X is periodic or not. We encounter the following cases while merging two intervals:

i. One interval is a subset of the other interval.
ii. The two intervals overlap.
iii. The two intervals are non-overlapping, and the gap between them is no more than maxPer.
iv. The two intervals are non-overlapping, and the gap between them is more than maxPer.

In the first three cases, we merge the two intervals and store the extended interval in the final period summary. In the fourth case, we store both intervals separately in the final period summary. The periodicity of the merged element is calculated as follows:

per^X_k = max(per^X_i, per^X_j, tid^X_j(f) − tid^X_i(l))

In the above equation, per^X_i and per^X_j are the periodicity values of the i-th and j-th period summaries, while tid^X_i(l) and tid^X_j(f) denote the last tid of the i-th (earlier) summary and the first tid of the j-th (later) summary that are being merged.

Example 5 The merging starts with pointers P1 and P2 pointing to the beginning of PS^cadb and PS^b, respectively. Since ⟨5, 5, 0⟩ of P2 is a subset of ⟨4, 8, 2⟩ (4 < 5 < 8), we increment P2 to point to ⟨10, 20, 2⟩. Now, the intervals ⟨4, 8, 2⟩ and ⟨10, 20, 2⟩ are non-overlapping, and the gap between the first tid of P2’s interval and the last tid of P1’s interval is no more than maxPer (case iii), i.e., 10 − 8 = 2 ≤ maxPer (= 4). Therefore, we merge the intervals to form ⟨4, 20, 2⟩ and add this element to the final period summary. The periodicity of the added element is max(2, 2, 2) = 2. Now, we move the pointer P1 to ⟨17, 17, 0⟩. This interval is a subset of ⟨4, 20, 2⟩. Hence, the final period summary of ‘b’ merges into the single element {⟨4, 20, 2⟩}.

Let PS be the final period summary. To check whether a pattern is periodic or not, we check the following three conditions on PS:

i. PS.size() = 1
ii. PS[0].first ≤ maxPer
iii. (|TDB| − PS[0].last) ≤ maxPer

Example 6 Continuing with the previous example, the final period summary of ‘b’ is {⟨4, 20, 2⟩}. Since it satisfies all three conditions mentioned above (i.e., size = 1, 4 ≤ 4 and (20 − 20) = 0 ≤ 4), we say that the pattern ‘b’ is periodic in the database.

This optimization reduces the memory required for constructing the tree. The benefit is even greater when the original tid-list does not split across multiple branches, as we then store a single element at the tail-node, implying that the pattern is periodic over the entire range of the database. The optimization also reduces the time taken: in the existing approach, to determine whether a candidate pattern is periodic, one has to iterate through its entire tid-list to find its periodicity, which takes O(n) time, where n is the size of the tid-list. In the proposed approach, we can decide whether the pattern is periodic in constant time (O(1)) by checking the three conditions mentioned in the preceding paragraph.
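The merging of period summaries and the final periodicity check can be sketched in Python as follows. The tuples (first, last, per) stand for the period summaries ⟨first, last, per⟩; the function names are illustrative, and the two-pointer merge of the text is replaced here by an equivalent sort-and-merge for brevity.

def merge_summaries(ps1, ps2, max_per):
    # Merge two period-summary lists of the same pattern. Intervals whose gap
    # is at most max_per (cases i-iii) are merged; the periodicity of a merged
    # interval is the maximum of both periodicities and the gap between them.
    # Otherwise (case iv) the intervals are kept separately.
    merged = []
    for first, last, per in sorted(ps1 + ps2):
        if merged and first - merged[-1][1] <= max_per:
            pf, pl, pp = merged[-1]
            merged[-1] = (pf, max(pl, last), max(pp, per, first - pl))
        else:
            merged.append((first, last, per))
    return merged

def is_periodic(ps, db_size, max_per):
    # A pattern is periodic if a single summary covers the database closely
    # enough at both ends: first <= maxPer and |TDB| - last <= maxPer.
    return len(ps) == 1 and ps[0][0] <= max_per and db_size - ps[0][1] <= max_per

ps_cadb = [(4, 8, 2), (17, 17, 0)]
ps_b = [(5, 5, 0), (10, 20, 2)]
final = merge_summaries(ps_cadb, ps_b, 4)
print(final, is_periodic(final, 20, 4))  # [(4, 20, 2)] True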

4.3 Proposed approach - Period summary growth

Based on the notion of period summary, we propose a novel tree structure called the period summary tree (PS-tree), which compresses the transactional database in a more compact manner than the PF-tree. A pattern-growth-based approach called PS-growth is proposed to extract patterns from the PS-tree. The PS-growth algorithm takes a transactional database, minSup and maxPer as inputs and outputs the complete set of periodic-frequent patterns. The algorithm involves the following two steps: (i) construction of the PS-tree and (ii) recursive mining of the PS-tree to discover periodic-frequent patterns. Before discussing these two steps, we describe the structure of the PS-tree.

Figure 4.3: Construction of PS-list. (a) After scanning first transaction (b) After scanning second transaction (c) After scanning all transactions (d) Final sorted list of periodic-frequent items of size 1.

4.3.1 PS-Tree - Structure and Construction

4.3.1.1 Structure of PS-Tree

The period summary tree (PS-tree) contains a PS-list and a summarized prefix-tree. A PS-list consists of three fields: item name (Item), support (sup) and periodicity (per). Two types of nodes are maintained in the PS-tree: ordinary nodes and tail-nodes. The former is similar to the node used in an FP-tree, whereas the latter represents the last item of a sorted transaction. In the PS-tree, we maintain the period summary of the occurrences of a branch’s items only at the tail-node of every transaction. The tail-node structure maintains (i) the summarized tid-list (the period summaries), whose structure is explained in Definition 1, and (ii) the support of that branch’s items.

4.3.1.2 Construction of PS-tree

To construct the PS-tree, we have to scan the database twice: the PS-list is constructed in the first scan and the PS-tree is constructed in the second scan.

1) In the first database scan, PS-growth scans the database contents and constructs the PS-list to discover the periodic-frequent items of size 1. The construction of the PS-list is the same as the construction of the PF-list (Algorithm 1). Figure 4.3(a)-(c) shows the construction of the PS-list. The final PS-list is shown in Figure 4.3(d).

2) In the second database scan, the items in the PS-list take part in the construction of the PS-tree (Algorithm 4). The tree construction starts by inserting the first transaction, (1, acdg), according to the PS-list order, as shown in Figure 4.4(b). All the items in the transaction are inserted in the same order as in the PS-list, except ‘g’, which is not in the PS-list. The tail-node ‘d : [⟨1, 1, 0⟩], 1’ carries the summarized transaction-id list and the support.


Figure 4.4: Construction of PS-Tree. (a) PS-list (b) After scanning first transaction (c) After scanning second transaction and (d) After scanning all transactions.

The support of the tail-node is incremented by one every time that pattern occurs in the database. In a similar fashion, the second transaction, (2, cef), is also inserted into the tree. The tail-node structure for the second transaction is ‘c : [⟨2, 2, 0⟩], 1’ (see Figure 4.4). The interval range keeps growing as more transactions are added. After inserting all the transactions in the database, we get the final PS-tree shown in Figure 4.4(d).

Algorithm 4 PS-Tree (TDB, PS-list, minSup, maxPer)
1: Create the root of a PS-tree, T, and label it “null”.
2: for each transaction t ∈ TDB do
3:   Set the timestamp of the corresponding transaction as t_cur.
4:   Sort and filter the candidate items in t according to the order of the PS-list. Let the sorted candidate item list in t be [p|P], where p is the first item and P is the rest of the list.
5:   Call insert_tree([p|P], t_cur, T, maxPer).
6: end for

Example 7 Consider the tail-node structure of ‘d’ from the branch ‘cad’. The occurrences recorded at this tail-node are {1, 3, 7, 9, 11, 13, 15, 19} and maxPer = 4. For the first transaction, the list is {⟨1, 1, 0⟩}, 1. For the third transaction, the condition 1 + 4 (= maxPer) ≥ 3 is checked. Since it is satisfied, the last element of the list is updated to give {⟨1, 3, 2⟩}, 2; the periodicity within the interval is 2 because the maximum tid difference between adjacent occurrences is 2 (3 − 1). In a similar fashion, the list gets updated for each occurrence of ‘cad’ as {⟨1, 7, 4⟩}, 3; {⟨1, 9, 4⟩}, 4; {⟨1, 11, 4⟩}, 5; {⟨1, 13, 4⟩}, 6; {⟨1, 15, 4⟩}, 7 and {⟨1, 19, 4⟩}, 8. After scanning all the transactions, the tail-node structure of ‘d’ in ‘cad’ is {⟨1, 19, 4⟩}, 8.
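The tail-node update of Example 7 can be traced with the following small Python sketch (hypothetical names; each summary is kept as a [first, last, per] list together with a support counter).

def update_tail_node(summaries, support, tid, max_per):
    # Record one more occurrence of the branch at its tail-node: extend the
    # last interval if the new tid is within max_per of it, otherwise start a
    # new interval <tid, tid, 0>.
    if summaries and summaries[-1][1] + max_per >= tid:
        first, last, per = summaries[-1]
        summaries[-1] = [first, tid, max(per, tid - last)]
    else:
        summaries.append([tid, tid, 0])
    return summaries, support + 1

summaries, support = [], 0
for tid in [1, 3, 7, 9, 11, 13, 15, 19]:   # occurrences recorded at the tail-node of 'cad'
    summaries, support = update_tail_node(summaries, support, tid, max_per=4)
print(summaries, support)  # [[1, 19, 4]] 8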

Algorithm 5 insert_tree ([p|P], t_cur, T, maxPer)
1: while the item list [p|P] is non-empty do
2:   if T has a child N such that N.itemName = p.itemName then
3:     Set T = N
4:   else
5:     Create a new node N with item p, let its parent link point to T, and link it with the nodes having the same itemName via the node-link structure; set T = N
6:   end if
7:   Remove p from the list and let p be the first item of the remaining list P
8: end while
9: Let PS be the period-summary list maintained at the tail-node T, and let len = PS.size()
10: if PS is non-empty and (PS[len − 1].last + maxPer) ≥ t_cur then
11:   PS[len − 1].per = max(PS[len − 1].per, t_cur − PS[len − 1].last)
12:   PS[len − 1].last = t_cur
13: else
14:   PS.append([t_cur, t_cur, 0])
15: end if
16: Increment the support maintained at the tail-node T by one


Figure 4.5: Mining using PS-growth algorithm. (a) PS-Tree after removing ‘b’ (b) Prefix tree of ‘b’ (c) Conditional Tree of ‘b’.

4.3.2 Mining of PS-tree

The PS-tree is recursively mined using a pattern-growth-based approach to extract periodic-frequent patterns (Algorithm 6). PS-growth is similar to PF-growth [40], except that merging operations are employed on the period summaries to check whether a pattern is periodic, whereas merging two transaction-id lists is straightforward in the existing approach. The mining process starts by considering the last item i in the PS-list (the item with the least support) and constructing its prefix tree (PT_i), which contains the prefix sub-paths of the nodes labeled i. PT_i is constructed by pushing the summarized transaction-id list of each node to its parent. Once the prefix tree is constructed, we update the support and periodicity values by merging the summarized tid-lists of each item in the PS-list, traversing the item node pointers that are maintained in the PS-tree structure just as in the FP-tree. The merging of two intervals (summaries) of summarized tid-lists is explained in Algorithm 7. All items in the PS-list whose support is at least minSup and whose periodicity is at most maxPer are used to build the conditional tree CT_i (Figure 4.5(c)). In Figure 4.5(c), we can see that none of the items satisfy the minSup and maxPer thresholds, so the conditional tree for ‘b’ is NULL.


Figure 4.6: Mining using PS-growth algorithm. (a) PS-Tree after removing ‘d’ (b) Prefix tree of ‘d’ (c) Conditional Tree of ‘d’.

Algorithm 6 PS-growth (PS-Tree, α, minSup, maxPer)
1: Select the last element in the PS-list.
2: for each a_i in the header of the PS-tree do
3:   Generate pattern β = a_i ∪ α.
4:   Aggregate all of a_i’s period summaries into PS_β using Algorithm 7.
5:   if PS_β.support ≥ minSup and Check(PS_β) then
6:     Construct β’s conditional PS-tree, Tree_β.
7:     if Tree_β ≠ ∅ then
8:       PS-growth(Tree_β, β, minSup, maxPer)
9:     end if
10:  end if
11:  Remove a_i from the tree.
12:  Push a_i’s period summaries to its parent nodes using Algorithm 7.
13: end for

Algorithm 7 Aggregating Intervals (I1, I2)
1: Initialize the interval vector I3
2: if I1.start > I2.start then
3:   swap(I1, I2)
4: end if
5: if I1.stop > I2.stop then                       // I2 lies inside I1
6:   I3.append(I1.start, I1.stop)
7: else if I1.stop + maxPer ≥ I2.start then        // overlapping, or gap within maxPer
8:   I3.append(I1.start, I2.stop)
9: else                                            // gap larger than maxPer: keep both
10:  I3.append(I1.start, I1.stop)
11:  I3.append(I2.start, I2.stop)
12: end if
13: return I3

The periodicity of a merged interval is updated as described in Section 4.2.


Table 4.1: Patterns generated using the PS-growth approach

Pattern  Support  Periodicity        Pattern  Support  Periodicity
b        11       4                  cd       12       2
d        12       2                  a        12       2
ad       12       2                  ca       12       2
cad      12       2                  c        13       2

For each item j in PT_i, we aggregate all of its nodes’ summarized transaction-id lists to derive the summarized transaction-id list of the pattern ij, i.e., the summarized TID^ij. To merge two summarized transaction-id lists, we iterate over the lists and employ Algorithm 7 for merging two intervals. Next, to check whether the final summarized transaction-id list is periodic or not, we use Algorithm 8. The main condition checked on the final summarized tid-list is whether its size is one; if the size is not one, the pattern is not periodic in some part of the database. If j satisfies Algorithm 8 and its support is at least minSup, the pattern ij is a periodic-frequent pattern, and we consider j as periodic-frequent in PT_i. In Figure 4.6(c), it can be seen that we are able to generate a conditional tree for ‘d’ (denoted by CT_d). Choosing every periodic-frequent item j in PT_i, we construct its conditional tree, CT_ij, and mine it recursively to discover the patterns.

Algorithm 8 Check (I)
1: if I.size() = 1 then
2:   if I[0].first ≤ maxPer and (|TDB| − I[0].last) ≤ maxPer then
3:     return true
4:   end if
5: end if
6: return false
