A Survey of Sequence Patterns in Data Mining Techniques

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 10, Number 1 (2015) pp. 1807-1815 © Research India Publications http://www.ripublication.com

S. Muthuselvan1, Dr. K. Soma Sundaram2

1 Research Scholar (CSE), St. Peter’s University, Chennai, and Assistant Professor Gr.-II, Aarupadai Veedu Institute of Technology, Paiyanoor, Chennai.
2 Professor, Jaya Engineering College, Chennai.
1 [email protected], [email protected]

Abstract

Data mining techniques are used in many areas to retrieve useful knowledge from very large amounts of data. Sequential pattern mining is an important data mining technique with a wide range of applications, including weblog click streams, DNA sequences, sales analysis, telephone calling patterns and stock markets. The methods for sequential pattern mining are categorised into two approaches: the Apriori-based approach and the pattern-growth-based approach. In this paper, a systematic survey of sequential pattern mining algorithms is carried out. The paper examines these algorithms by studying their classification into two broad classes: first, algorithms designed to increase the efficiency of mining, and second, the various extensions of sequential pattern mining designed for particular applications. Finally, a comparative analysis is made on the basis of the key features supported by the various algorithms, and current research challenges in this area of data mining are discussed. [4]

Keywords: Data Mining, Sequence pattern, Association Rule, Pattern Mining.


Introduction

In the knowledge discovery process, data mining techniques are divided into two major categories: descriptive and predictive. Each category has its own approaches. Sequential pattern mining is an important concept in data mining and a further extension of association rule mining [1]. The set of sequences in the given data is called the data-sequences. A customer's list of transactions is a data-sequence, and each transaction is a set of items. Each transaction in the sequence database is associated with a transaction time. Association rule mining and sequential pattern mining are broadly comparable; the difference between them is that in sequential pattern mining the events are linked with time. Sequential pattern mining determines the correlation between different transactions, whereas association rule mining determines the association of items within the same transaction [2].

The rest of this paper is organised as follows: Section II deals with the types of sequential pattern mining models, Section III discusses the limitations of sequential pattern mining algorithms, Section IV discusses the comparative analysis of sequential pattern mining algorithms, and Section V presents the performance-based comparative study. Finally, the conclusions are presented.

Problem definition: Let I = {i1, i2, …, in} be the set of all items. An itemset is a nonempty set of items. A sequence is an ordered list of itemsets. A sequence s is denoted by ⟨s1 s2 … sl⟩, where sj is an itemset, i.e., sj ⊆ I for 1 ≤ j ≤ l. sj is also called an element of the sequence and is denoted as (x1, x2, …, xm), where xk ∈ I for 1 ≤ k ≤ m. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. A sequence a = ⟨a1 a2 … an⟩ is called a subsequence of b = ⟨b1 b2 … bm⟩, and b a supersequence of a, denoted as a ⊆ b, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.

A sequence database D is a set of tuples ⟨sid, s⟩, where sid is a sequence-id and s is a sequence. A tuple ⟨sid, s⟩ is said to contain a sequence a if a is a subsequence of s, i.e., a ⊆ s. The number of tuples in a sequence database D containing sequence a is called the support of a, denoted as sup(a) [22]. Given a sequence database D and a user-specified minimum support min_sup, a sequence a is a sequential pattern in D if sup(a) ≥ min_sup. The sequential pattern mining problem is to find the complete set of sequential patterns with respect to D and min_sup.
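The definitions of subsequence and support can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the names `seq`, `is_subsequence` and `support`, and the four-sequence toy database, are hypothetical and do not come from any of the surveyed papers. Itemsets are modelled as frozensets of single-character items.

```python
from typing import FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[str]
Sequence = Tuple[Itemset, ...]          # an ordered list of itemsets
Database = List[Tuple[int, Sequence]]   # tuples <sid, s>


def seq(*itemsets: Iterable[str]) -> Sequence:
    """Hypothetical helper: build a sequence from iterables of items."""
    return tuple(frozenset(items) for items in itemsets)


def is_subsequence(a: Sequence, b: Sequence) -> bool:
    """Return True if a is a subsequence of b.

    Each element of a must be a subset of a distinct element of b,
    with the matched elements of b in strictly increasing position.
    Matching each element of a at the earliest possible position is enough.
    """
    j = 0
    for element in a:
        while j < len(b) and not element <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element of a must match a later element of b
    return True


def support(a: Sequence, database: Database) -> int:
    """Number of tuples <sid, s> in the database whose sequence contains a."""
    return sum(1 for _sid, s in database if is_subsequence(a, s))


# Toy sequence database (an assumption made for illustration only).
database: Database = [
    (1, seq("a", "abc", "ac", "d", "cf")),
    (2, seq("ad", "c", "bc", "ae")),
    (3, seq("ef", "ab", "df", "c", "b")),
    (4, seq("e", "g", "af", "c", "b", "c")),
]

print(support(seq("a", "bc"), database))  # prints 2
print(support(seq("a", "b"), database))   # prints 4
```

With min_sup = 2, both sequences above would be sequential patterns in this toy database, since their supports are at least 2.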

Categories of sequence pattern mining Techniques

As noted by Yen-Liang Chen and Ya-Han Hu [4], several methods for sequential pattern mining have been proposed in recent years, and these studies cover a wide range of problems. In general, research in sequential pattern mining has two different concerns. The first is to increase the efficiency of the sequential pattern mining process, while the second is to extend the mining of sequential patterns to other time-related patterns.


Based on the research done in the field, sequential pattern mining algorithms differ in two ways [3]. The first is the way in which candidate sequences are generated and stored. The second is the way in which support is counted and candidate sequences are tested for frequency. The main goal of the first is to reduce the number of candidate sequences generated, so that the I/O cost is reduced. The main goal of the second is to eliminate any database or data structure that has to be maintained all the time for support counting purposes only. The main benefits and shortcomings of sequential pattern mining are listed in Table 1.

Sequential Pattern Mining
    Apriori-Based Algorithms (breadth-first search, generate-and-test, multiple scans of the database): GSP, SPIRIT, SPADE, SPAM
    Pattern Growth Algorithms: FREESPAN, WAP-MINE, PREFIXSPAN

Fig. 1 Categories of Sequential Pattern Mining [6]

Fig. 1 shows the broad categories of sequential pattern mining algorithms. The two main types are Apriori-based and pattern-growth. Each of these categories in turn contains several algorithms.

Table 1 Pros and Cons of Sequential Pattern Mining

Type/Techniques | Pros | Cons
----------------|------|-----
Apriori-Based Algorithms [5] | Easy to implement. | Requires more memory and space, and candidate generation takes more time.
Pattern Growth Algorithms [7] | Can be faster on large volumes of data. | Normally more complex to develop, test and maintain.
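To illustrate the pattern-growth category of Fig. 1, the following is a minimal sketch in the spirit of PrefixSpan. It is not the published algorithm: as a simplifying assumption, every element of a sequence is a single item (a list of strings), and the function name `prefixspan` and the toy database are hypothetical. The point is that frequent patterns are grown from a prefix by recursing on a projected database of suffixes, without generating candidate sequences.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prefixspan(database: List[List[str]], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """Return all sequential patterns (single-item elements) with their support."""
    patterns: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count in how many projected suffixes each item occurs.
        counts: Dict[str, int] = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            new_prefix = prefix + (item,)
            patterns[new_prefix] = count
            # Projected database: what follows the first occurrence of item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(new_prefix, new_projected)

    grow((), database)
    return patterns


# Toy database of four sequences (an assumption made for illustration only).
toy_db = [list("abcd"), list("acd"), list("bd"), list("abd")]
result = prefixspan(toy_db, min_sup=3)
for pattern, sup in sorted(result.items()):
    print(pattern, sup)
```

In this sketch each recursive call scans only the projected suffixes, which shrink as the prefix grows. This is the source of the speed advantage listed for pattern-growth algorithms in Table 1, at the cost of a less straightforward implementation.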


Limitations of Sequential Pattern Mining Algorithms

Sequential pattern mining algorithms are typically string-centred and do not focus on discovering sequential patterns under constraints in a given database. Query languages such as SQL (for example, in MySQL) do not permit the use of non-aggregate functions as part of the query compilation [19]. Sequential pattern mining retrieves the relationships among objects in a sequential dataset [18]. The most familiar sequential pattern mining approach is Apriori. This algorithm has drawbacks such as generating too many candidate sets and requiring many passes over the database. Another disadvantage of this algorithm is that it requires a huge amount of memory [20]. The task of discovering all frequent sequences in huge databases is quite challenging, because the search space is extremely large [21].

Table 1 Comparative analysis of algorithm performance [3]. The symbol "-" means an algorithm crashes with the parameters provided, and memory usage could not be measured. [3]

Algorithm: GSP (Apriori), SPAM (Apriori), PrefixSpan (Pattern Growth)
Data Set Size (for each algorithm): Medium (D=200K), Large (D=800K)
Minimum Support (for each data set size): Low (0.1%), Medium (1%)
Execution Time (sec): >3600, 2126, 136, 674, 31, 5, 1958, 798
Memory Usage (MB): 800, 687, 574, 1052, 13, 10, 525, 320
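The two Apriori drawbacks named in this section, many candidate sets and repeated passes over the database, can be seen in a small level-wise generate-and-test sketch. This is a simplified illustration and not GSP itself: it assumes every element of a sequence is a single item, and the names `contains` and `apriori_sequences` and the toy database are hypothetical. The function also reports how many candidates were generated and how many full database scans were needed.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def contains(pattern: Pattern, sequence: List[str]) -> bool:
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def apriori_sequences(database: List[List[str]], min_sup: int):
    """Level-wise generate-and-test; returns (frequent patterns, scans, candidates)."""
    items = sorted({item for s in database for item in s})
    candidates: List[Pattern] = [(i,) for i in items]
    frequent: Dict[Pattern, int] = {}
    scans = 0
    generated = 0
    while candidates:
        generated += len(candidates)
        scans += 1  # one full pass over the database per level
        level: Dict[Pattern, int] = {}
        for c in candidates:
            sup = sum(1 for s in database if contains(c, s))
            if sup >= min_sup:
                level[c] = sup
        frequent.update(level)
        # Join step: p and q combine when p without its first item
        # equals q without its last item.
        candidates = [p + (q[-1],) for p in level for q in level if p[1:] == q[:-1]]
    return frequent, scans, generated


# Toy database of four sequences (an assumption made for illustration only).
toy_db = [list("abcd"), list("acd"), list("bd"), list("abd")]
frequent, scans, generated = apriori_sequences(toy_db, min_sup=3)
print(len(frequent), scans, generated)  # prints 5 2 13
```

Even on this tiny input, 13 candidates are tested to find 5 frequent patterns, and each level costs one more pass over the data. On large databases with low minimum support both numbers grow quickly.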

Comparative Analysis of Sequential Pattern Mining Algorithms

Sequential pattern mining is highly significant because it is the foundation of numerous applications. A sequential pattern mining algorithm should, where possible, discover the complete set of patterns that satisfy the minimum support. When working with big data, scalability is one of the main issues in mining knowledge from huge amounts of data. This issue is addressed with the MapReduce model on the cloud: the SPAM algorithm significantly reduces the mining time on big data and achieves very high scalability [11].

Apriori is an important and familiar mining algorithm, but it cannot find sequential patterns in d-dimensional sequence data. With the PrefixMDSpan algorithm, sequential patterns can be retrieved from d-dimensional data [14]. The generate-and-test approach produces a large number of unpromising candidate subsequences. This is overcome by applying the maximum weighted upper-bound model, which gives good pruning efficiency and also improves performance [17]. Pattern-growth algorithms create a huge number of repeated projected databases when mining data sets. This is overcome by the SMPM algorithm, which avoids the repeated projected databases and avoids physical projection [13].

The greedy algorithm raises issues in sensor network applications because of multiple interleaved patterns. The GAIS method can find sequential patterns in data of low quality [15]. The Frequent Pattern Tree is another way of finding patterns in sequence mining. This approach scans the database many times, which is time consuming compared with other types of algorithm. Yi Sui, FengJing Shao, Rencheng Sun and Jinlong Wang used the STMFP algorithm, which needs to scan the database only once. After this single scan, the tree stores all the sequences from the source data [9].

Association rule mining is an important type of algorithm in the Apriori family of mining methods. The Apriori-based association rule algorithm uses a single minimum support, which cannot exactly discover the interesting patterns. The MSCP-growth algorithm allows multiple minimum supports [10], and using multiple minimum supports helps to produce the interesting patterns. Xilu Wang and Weill Yao used optimum maximum sequential pattern mining to obtain sequential patterns. The advantage of this algorithm is that the sequential patterns acquired are reliable, whereas the existing mathematical models for mining sequential patterns fail on noisy data with the candidate patterns [12].

In the modified Apriori and PrefixSpan algorithms, the multidimensional attributes of the learning material are not all taken into account at the same time. This is overcome using the Learner Preference Tree (LPT) algorithm. The advantage of this algorithm is that the learner's actual learning preferences can be satisfied [16]. Mining patterns from incremental data is difficult to handle. This problem is solved using the Direct Appending (DirApp) algorithm. The improvement offered by this method is that it can easily deal with incremental data as well as a static database [8].
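The difference between a single minimum support and multiple minimum supports can be sketched as follows. The code does not reproduce MSCP-growth [10]. It assumes one common convention for multiple minimum supports, in which each item has its own minimum item support and a pattern's threshold is the lowest of the values of its items. The names `contains`, `support`, `is_frequent_mis` and `mis`, and the toy data, are hypothetical.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def contains(pattern: Pattern, sequence: List[str]) -> bool:
    """True if pattern occurs in sequence in order (single-item elements)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern: Pattern, database: List[List[str]]) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for s in database if contains(pattern, s))


def is_frequent_mis(pattern: Pattern, database: List[List[str]],
                    mis: Dict[str, int]) -> bool:
    """Assumed convention: threshold = lowest minimum item support in the pattern."""
    threshold = min(mis[item] for item in pattern)
    return support(pattern, database) >= threshold


# Toy data (an assumption): x and y are rare but possibly interesting items.
toy_db = [list("abd"), list("abd"), list("acd"), list("ad"), list("bd"), list("xy")]
mis = {"a": 3, "b": 3, "c": 3, "d": 3, "x": 1, "y": 1}

single_min_sup = 3
print(support(("x", "y"), toy_db) >= single_min_sup)   # prints False
print(is_frequent_mis(("x", "y"), toy_db, mis))        # prints True
print(is_frequent_mis(("a", "c"), toy_db, mis))        # prints False
```

Under the single threshold the rare pattern (x, y) is lost, while lowering that threshold for every item would also admit uninteresting patterns such as (a, c). Item-specific thresholds are one way to avoid this trade-off.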

Performance Based Comparative Study

The comparative analysis in Table 1 above compares different algorithms based on their performance in sequential pattern mining. These algorithms are studied using data sets of different sizes. The parameters chosen for the study are minimum support, execution time and memory usage. Execution time is measured in seconds and memory usage in megabytes. The data set size, denoted D, falls into two categories for all the algorithms: medium (D = 200K) and large (D = 800K). For each algorithm and each data set size, the minimum support is categorised as low or medium. The longest execution time, more than 3600 sec with a memory usage of 800 MB, was taken by the GSP (Apriori) algorithm on the medium-sized data set at low minimum support. The PrefixSpan (pattern growth) algorithm has a very short execution time: 5 sec, with the memory usage listed in Table 1, on the large data set at medium minimum support.

Table 2 Comparative Analysis of Sequential Pattern Mining Algorithms

Reference Paper | Author | Methodology Used | Algorithm Used | Existing System | Proposed System
----------------|--------|------------------|----------------|-----------------|----------------
ABCF [16] | Mojtaba Salehi, Isa Nakhai Kamalabadi and Mohammad Bagher Ghaznavi Ghoushchi | Modified Apriori and PrefixSpan algorithms | Learner Preference Tree (LPT) | Multidimensional attributes of materials are not completely measured concurrently. | Learner's actual learning preferences can be satisfied.
OMSPM [12] | Xilu Wang and Weill Yao | Mathematical model | Optimum maximum sequence pattern mining | Noisy data, with fewer candidate patterns. | Acquired sequential patterns are reliable.
MMS [10] | Ya-Han Hu, Fan Wu and Yi-Chun Liao | Apriori-based association rule mining | MSCP-Growth | Single minimum support cannot exactly discover interesting patterns. | Multiple minimum supports possible.
SPMPD [8] | Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen | - | Direct Appending (DirApp) | Dealing with incremental data is difficult. | Can easily deal with a static database or an incremental database as well.
IFPT [9] | Yi Sui, FengJing Shao, Rencheng Sun and Jinlong Wang | Frequent Pattern Tree | STMFP algorithm | Needs to scan the database many times. | After a single scan, the tree can store all the sequences.
GAIS [15] | Ruotsalainen, M, Ala-Kleemola, T and Visa | Greedy algorithm | GAIS method | Issues in sensor network applications are multiple interleaved patterns. | Sequential patterns can be identified from data of low quality.
WSP [17] | Guo-Cheng Lan, Tzung-Pei Hong and Hong-Yu Lee | Generate-and-test | Maximum weighted upper-bound model | Generates a large number of unpromising candidate subsequences. | Good pruning efficiency and performance efficiency.
MRMC [11] | Chun-Chieh Chen, Chi-Yao Tseng and Ming-Syan Chen | MapReduce model on the cloud | SPAM algorithm | Scalability issues while working with big data. | Significantly decreases mining time with big data and attains very high scalability.
MDSD [14] | Chung-Ching Yu and Yen-Liang Chen | Apriori (AprioriMD) | PrefixMDSpan algorithm | Finding sequential patterns from d-dimensional sequence data is not possible. | D-dimensional sequence data is possible where d>2.
SEME [13] | Yong-Gui Zou and Hong Yu | Pattern growth | SMPM | Creates a huge number of repeated projected databases in mining data sets. | Avoids the repeated projected databases and avoids physical projection.

Conclusions

In this paper, we discussed sequential pattern mining and briefly presented its major categories. Several types of algorithm were compared with the help of previously published work. Research on this topic was initially driven by improving the performance of the algorithms through different data structures and representations. We compared the different types of algorithm used for mining sequential patterns and also discussed a comparative analysis of algorithm performance. From the discussion of the pros and cons of sequential pattern mining, the strengths and limitations of the approaches can easily be identified. The comparison based on the different methodologies and their algorithms was discussed in detail.

Reference

[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.
[2] Vishal S. Motegaonkar, Prof. Madhav V. Vaidya, "A Survey on Sequential Pattern Mining Algorithms", International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, 2486-2492.
[3] Nizar R. Mabroukeh and C. I. Ezeife, "A Taxonomy of Sequential Pattern Mining Algorithms", ACM Computing Surveys, Vol. 43, No. 1, Article 3, Publication date: November 2010.
[4] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal and M. C. Hsu, "Mining sequential patterns by pattern-growth: The PrefixSpan approach", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, 2004, pp. 1424-1440.
[5] Hilderman R. J., Hamilton H. J., "Knowledge Discovery and Interest Measures", Kluwer Academic Publishers, Boston, 2002.
[6] V. Chandra Shekhar Rao, P. Sammulal, "Survey on Sequential Pattern Mining Algorithms", International Journal of Computer Applications (0975 - 8887), Volume 76, No. 12, August 2013.
[7] Carl H. Mooney and John F. Roddick, ACM Journal Name, Vol. V, No. N, M 20YY, Pages 1-46.
[8] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen, "A General Model for Sequential Pattern Mining with a Progressive Database", IEEE Trans. Knowledge and Data Eng., Volume 20, Issue 9, pp. 1153-1167, September 2008.
[9] Yi Sui, FengJing Shao, Rencheng Sun and Jinlong Wang, "A Sequential Pattern Mining Algorithm Based on Improved FP-tree", Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2008), 2008, pp. 440-444.
[10] Ya-Han Hu, Fan Wu and Yi-Chun Liao, "Sequential pattern mining with multiple minimum supports: A tree based approach", 2nd International Conference on Software Engineering and Data Mining (SEDM), 2010, pp. 428-433.
[11] Chun-Chieh Chen, Chi-Yao Tseng and Ming-Syan Chen, "Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud", IEEE International Congress on Big Data (Big Data Congress), 2013, pp. 310-317.
[12] Xilu Wang and Weill Yao, "Sequential Pattern Mining: Optimum Maximum Sequential Patterns and Consistent Sequential Patterns", IEEE International Conference on Integration Technology, 2007, pp. 365-368.
[13] Yong-Gui Zou and Hong Yu, "Moving sequential pattern mining based on Spatial Constraints in Mobile Environment", IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2010, pp. 103-107.
[14] Chung-Ching Yu and Yen-Liang Chen, "Mining Sequential Patterns from Multidimensional Sequence Data", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, January 2005.
[15] Ruotsalainen, M., Ala-Kleemola, T. and Visa, A., "GAIS: A Method for Detecting Interleaved Sequential Patterns from Imperfect Data", IEEE Symposium on Computational Intelligence and Data Mining, 2007, pp. 530-534.
[16] Mojtaba Salehi, Isa Nakhai Kamalabadi, Mohammad Bagher Ghaznavi Ghoushchi, "Personalized recommendation of learning material using sequential pattern mining and attribute based collaborative filtering", Education and Information Technologies, December 2014, Volume 19, Issue 4, pp. 713-735.
[17] Guo-Cheng Lan, Tzung-Pei Hong and Hong-Yu Lee, "An efficient approach for finding weighted sequential patterns from sequence databases", Applied Intelligence, September 2014, Volume 41, Issue 2, pp. 439-452.
[18] Thanh-Trung Nguyen, Phi-Khu Nguyen, "A New Approach for Problem of Sequential Pattern Mining", Lecture Notes in Computer Science on Computational Collective Intelligence. Technologies and Applications, Volume 7653, 2012, pp. 51-60.
[19] Vangipuram Radhakrishna, Chintakindi Srinivas and C. V. Guru Rao, "Constraint Based Sequential Pattern Mining in Time Series Databases - A Two Way Approach", AASRI Conference on Intelligent Systems and Control, AASRI Procedia 4 (2013), pp. 313-318.
[20] Shamila Nasreen, Muhammad Awais Azam, Khurram Shehzad, Usman Naeem and Mustansar Ali Ghazanfar, "Frequent Pattern Mining Algorithms for Finding Associated Frequent Patterns for Data Streams: A Survey", The 5th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN-2014), Procedia Computer Science 37 (2014), pp. 109-116.
[21] Mohammed J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences", Kluwer Academic Publishers, Manufactured in The Netherlands, 42, pp. 31-60, 2001.
[22] Ramakrishnan Srikant, Rakesh Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", Advances in Database Technology - EDBT '96, Lecture Notes in Computer Science, Springer, Volume 1057, 1996, pp. 1-17.
