Mining Frequent Patterns in Print Logs with Semantically Alternative Labels

Xin Li1, Lei Zhang1, Enhong Chen1, Yu Zong2, and Guandong Xu3

1 University of Science and Technology of China
2 West Anhui University
3 University of Technology, Sydney
{leexin,stone,cheneh}@ustc.edu.cn, [email protected], [email protected]

Abstract. It is common today for users to print information from webpages, owing to the popularity of printers and the internet. Many web printing tools, such as Smart Print and PrintUI, have therefore been developed for online printing. To improve the users' printing experience, the interaction data between users and these tools are collected to form so-called print log data, where each record is the set of URLs selected for printing by a user within a certain period of time. Mining frequent patterns from print log data can capture user intentions for other applications, such as printing recommendation and behavior targeting. However, mining frequent patterns by directly using the URL as the item representation faces two challenges: data sparsity and pattern interpretability. To tackle these challenges, we leverage the delicious API (a social bookmarking web service) as an external thesaurus to expand the semantics of each URL by selecting tags associated with the domain of each URL. In this setting, frequent pattern mining is applied to the tag representation of each URL rather than to the URL or domain representation. With the semantically alternative tag representation, the semantics of each URL are substantially enriched, which yields useful frequent patterns. To this end, we propose a novel pattern mining problem, namely mining frequent patterns with semantically alternative labels, and an efficient algorithm named PaSAL (Frequent Patterns with Semantically Alternative Labels Mining Algorithm) for this problem. Specifically, we propose a new constraint named conflict matrix to filter out redundant patterns and achieve high efficiency. Finally, we evaluate the proposed algorithm on real print log data.

Keywords: print log data, frequent pattern mining, delicious.com, PaSAL.

1 Introduction

With the wide use of the internet and office automation tools, printing information from web pages is becoming popular. Many web printing tools, such as Smart Print¹ and PrintUI², have been developed for online printing. Through the use of these tools, a large amount of user interaction data is collected with the users' consent. This interaction data is also called print log data, where each record is the set of URLs selected for printing by a user within a certain period of time. Table 1 gives an example of print log data. Compared with search log data [13], print log data is a new kind of user interaction data and is attracting researchers' attention. Since the webpages printed by users may reveal their interests, mining frequent patterns from print log data is an important way to capture users' interests, and it benefits other applications such as printing recommendation and behavior targeting.

¹ www.smartprint.com/
² www.printui.com/

H. Motoda et al. (Eds.): ADMA 2013, Part II, LNAI 8347, pp. 107–119, 2013. © Springer-Verlag Berlin Heidelberg 2013

Table 1. User and Printed URLs

User    Printed URLs
user1   http://maps.google.com/maps?saddr=27400+Old+Trilby+Roa
        http://www.groupon.com/deals?city=houston
        http://www.booking.com/hotel/us/the-huston.en.html
        ......
user2   http://maps.google.com/maps?saddr=2800+League+City+Parkwa
        ......
......

Table 2. URL Domains and Their Tag Representations

Domain            Tag Representations
maps.google.com   maps, google, travel, map, reference, search, directions, tools, clearlake, brian
www.groupon.com   shopping, coupons, deals, social, discount, coupon, business, crowdsourcing, marketing, travel
www.booking.com   travel, hotel, hotels, booking, accommodation, search, viajes, online, hoteles, turismo

However, we find that using the URL directly as the item representation for frequent pattern mining faces two challenges in real print log data, namely data sparsity and pattern interpretability. For example, as shown in Table 1, user1 prints the map of "Old Trilby Roa" while user2 prints the map of "League City Parkwa"; both print their destination maps from Google Maps to take along with them. However, the two printed URLs are completely different and cannot be recognized as the same item by any frequent pattern mining algorithm unless they are labeled manually. In addition, patterns that use URLs as the item representation have poor interpretability.

To that end, we enrich the semantics of each URL by using the semantically alternative tags associated with its domain as the representation of the URL. More precisely, when processing a URL, we extract its domain, submit it to the delicious.com API³, and take the top-10 returned tags as the representation. Table 2 shows the three domains of Table 1 and their returned tags. By replacing the domain of each URL with a set of tags and removing redundant tags within a transaction, the print log of URL transactions is transformed into a tag transaction dataset.
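The URL-to-tag transformation described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the dictionary `DOMAIN_TAGS` is a hypothetical, hard-coded stand-in for the tags returned by the delicious.com API (abridged to the first five tags per domain from Table 2), and the function names `url_to_domain` and `to_tag_transaction` are introduced here for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the tag lookup: a hard-coded dict plays the role
# of the delicious.com API (first five tags per domain, taken from Table 2).
DOMAIN_TAGS = {
    "maps.google.com": ["maps", "google", "travel", "map", "reference"],
    "www.groupon.com": ["shopping", "coupons", "deals", "social", "discount"],
    "www.booking.com": ["travel", "hotel", "hotels", "booking", "accommodation"],
}


def url_to_domain(url):
    """Keep only the host part of a URL, dropping the path and query string."""
    return urlparse(url).netloc


def to_tag_transaction(urls, domain_tags=DOMAIN_TAGS, top_k=10):
    """Replace each URL by the top-k tags of its domain, without duplicates."""
    tags = []
    for url in urls:
        for tag in domain_tags.get(url_to_domain(url), [])[:top_k]:
            if tag not in tags:  # remove redundant tags within a transaction
                tags.append(tag)
    return tags


user1 = [
    "http://maps.google.com/maps?saddr=27400+Old+Trilby+Roa",
    "http://www.groupon.com/deals?city=houston",
    "http://www.booking.com/hotel/us/the-huston.en.html",
]
user2 = ["http://maps.google.com/maps?saddr=2800+League+City+Parkwa"]

# Both map URLs collapse to the same domain, which removes the sparsity
# caused by the differing query strings.
print(url_to_domain(user1[0]), url_to_domain(user2[0]))
print(to_tag_transaction(user1))
print(to_tag_transaction(user2))
```

In this sketch the tag travel appears only once in the transaction of user1, although both maps.google.com and www.booking.com carry it.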
Mining frequent patterns on this new tag transaction data is more practical, since these human-assigned tags are abundant and meaningful. However, if we run a frequent pattern mining algorithm on tag transaction data without any extra constraint, we obtain a massive number of patterns, most of which are redundant and meaningless. For example, if the domain maps.google.com in Table 2 is frequent, then any combination of the tags associated with that domain (e.g., {maps, google}, {maps, google, travel}) is also frequent, resulting in a large number of trivial frequent patterns (2^10). We therefore introduce a conflict matrix to encode these restrict conditions; it is defined in Section 3.

³ Delicious.com API: https://delicious.com/developers

To this end, we propose a novel frequent pattern mining technique for print log data, named PaSAL (Frequent Patterns with Semantically Alternative Labels Mining Algorithm). It enhances the URL representation with semantically alternative tags and incorporates the conflict matrix constraint to further prune the lattice during the mining process. In summary, we make the following contributions.

• We define a novel frequent pattern mining problem for print log data with semantically alternative labels. To address data sparsity and pattern interpretability, we use the tags returned by the delicious.com API as the representations of URLs.
• We devise an efficient algorithm called PaSAL for this problem. We define a new constraint named conflict matrix for pruning meaningless patterns. PaSAL exploits this constraint to further reduce the search space during the mining process, thus achieving high efficiency.
• We conduct several experiments on real print log data to evaluate the proposed algorithm. The experimental results show that our method outperforms the baseline and achieves better pattern interpretability.

The paper is organized as follows. Section 2 describes the related work. Section 3 gives a formal description of the problem. Section 4 presents the basic algorithm and our algorithm PaSAL. Section 5 evaluates the effectiveness and efficiency of PaSAL, and Section 6 concludes the paper.

2 Related Work

Frequent Pattern Mining Algorithm. Association rule mining was first proposed by Agrawal et al. [1]. Since then, many algorithms have been developed to mine frequent patterns, such as Apriori [2] and FP-growth [5]. Instead of the horizontal data format, Zaki [14] proposed the vertical data format. Researchers have also mined maximal frequent patterns [3] and closed frequent patterns [7] in order to avoid producing massive numbers of patterns. Our work builds on these works and uses a bitmap representation of the database.

Semantic Information Extension. Besides inventing new algorithms to mine patterns, Han et al. [4] and Srikant et al. [11] proposed structured models of the items in order to mine patterns at different levels, which can be regarded as introducing external information to help mine patterns. Our work differs from theirs in two aspects. First, the previous works have multi-level structures while we have only one level. Second, items in their structures are in an owner-member relationship, whereas the tags in our restrict conditions are semantically relevant to each other. In addition, Mei et al. [6] interpret frequent patterns by giving them semantic annotations through natural language processing, which is a post-process, while we change the representation of each URL before mining. External information can be generated either by machine inference from history logs or by hand-crafted rules. In this paper, we use delicious tags to expand the semantic information of URLs, which is a new approach.


Constraint Based Pruning. Another line of related work concerns constraints on the pattern mining process, as in the works of Pei et al. [8] [9] and Raedt et al. [10]. In this paper, we propose a new constraint named conflict matrix for further pruning the lattice. In the following, we show that the conflict matrix constraint is an anti-monotone one, which can be deeply exploited in the mining process to achieve high efficiency. Besides, [12] proposed to mine dominant and frequent patterns based on data from the Web printing tool Smart Print. However, the dataset of [12] is very different from ours. Specifically, in [12] each record of the dataset is the set of clips (contents) selected from a certain Webpage, while in this paper each record is the set of URLs printed by one user.

3 Problem Statement

First, we give some preliminaries on frequent pattern mining. Let T be the complete set of transactions and I be the complete set of distinct items. Any non-empty set of items is called an itemset (or pattern). The support of a pattern P is the percentage of transactions that contain P. Frequent pattern mining finds the patterns whose support is above a user-specified threshold.

Note that the primitive transaction database is a URL database, where each transaction is the set of URLs selected by a user. If we use the domain of each URL as its representation, the URL database is transformed into a domain database (denoted D). As mentioned above, when a domain is submitted to the delicious.com API, the server returns the top-10 tags associated with that domain. By replacing each domain with a set of tags {t1, t2, ..., tk} (k equals 10 in most instances), the domain database is transformed into a tag database. Note that if a domain is frequent in the domain database, then all subsets of the tags of this domain are also frequent. To reduce the number of such patterns, we propose to use a conflict matrix CM to filter the frequent pattern mining results. Formally,

Definition 1 (Conflict Matrix). The conflict matrix CM is a matrix with |I| × |D| dimensions, where |I| is the number of tags and |D| is the number of domains. Its entries are defined as

\[
CM_{i,j} =
\begin{cases}
1, & t_i \in T(d_j) \\
0, & t_i \notin T(d_j)
\end{cases}
\tag{1}
\]

where ti is a tag and T(dj) is the set of tags associated with domain dj.

Definition 2 (Conflict Matrix Constraint). Given a pattern P and a conflict matrix CM, P is a valid pattern if no subset of P with two or more tags is contained in the tag set T(dj) of any single domain dj in CM, that is, if no two tags of P are associated with the same domain. To achieve high efficiency, CM can be stored in a bitmap representation.

Example. Based on Tables 1 and 2, and because of limited space, we sample a few of the 10 returned tags of each domain. Let D = {maps.google.com, www.groupon.com, www.booking.com} and I = {maps, travel, search, hotel, coupon}. The resulting CM is shown in Equation 2.

\[
CM =
\begin{array}{l|ccc}
 & \text{google} & \text{groupon} & \text{booking} \\ \hline
\text{maps}   & 1 & 0 & 0 \\
\text{travel} & 1 & 1 & 1 \\
\text{search} & 1 & 0 & 1 \\
\text{hotel}  & 0 & 0 & 1 \\
\text{coupon} & 0 & 1 & 0
\end{array}
\tag{2}
\]
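Definitions 1 and 2 can be made concrete with a small sketch that builds the conflict matrix of the example as one bitmap per tag and tests the constraint. The names `TAGS_OF`, `build_conflict_matrix` and `is_valid` are assumptions introduced for illustration, and reading Definition 2 as "no two tags of the pattern share a domain" is the interpretation assumed here.

```python
# Sampled tag sets per domain, as in the example (columns of Equation 2).
DOMAINS = ["google", "groupon", "booking"]
TAGS_OF = {
    "google": {"maps", "travel", "search"},
    "groupon": {"travel", "coupon"},
    "booking": {"travel", "search", "hotel"},
}
ITEMS = ["maps", "travel", "search", "hotel", "coupon"]


def build_conflict_matrix(items, domains, tags_of):
    """Return {tag: int}; bit j is set iff the tag belongs to domain j."""
    cm = {}
    for tag in items:
        row = 0
        for j, dom in enumerate(domains):
            if tag in tags_of[dom]:
                row |= 1 << j
        cm[tag] = row
    return cm


def is_valid(pattern, cm):
    """Constraint check: no two tags of the pattern share a domain."""
    seen = 0  # union of the domains used by the tags examined so far
    for tag in pattern:
        if seen & cm[tag]:  # an earlier tag already occurs in one of these domains
            return False
        seen |= cm[tag]
    return True


CM = build_conflict_matrix(ITEMS, DOMAINS, TAGS_OF)
for tag in ITEMS:
    # Print each row with the google column first, as in Equation 2.
    print(tag, [(CM[tag] >> j) & 1 for j in range(len(DOMAINS))])

print(is_valid({"maps", "hotel"}, CM))     # True: no shared domain
print(is_valid({"travel", "search"}, CM))  # False: both occur in google and booking
```

Storing each row as an integer keeps the conflict test to a single bitwise-AND per added tag, which is the property the later algorithms rely on.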

Consider the database shown in Table 3(a) and let the user-specified minimum support be 0.3 (that is, at least 2 transactions). The resulting frequent patterns are shown in Table 3(b). Note that travel and search co-occur in both domains google and booking, so {travel, search} is removed from the candidates. The same holds for {maps, search} and {hotel, search}. Finally, two patterns remain, marked with * in Table 3(b); these are the patterns we want. In other words, the tags within a single restrict condition (the tag set of one domain) are mutually exclusive.

Table 3. A Sampled Example

(a) Sampled Database T_trans

Transaction Id   Tags
100              maps, hotel, search
200              maps, hotel, coupon
300              travel, search, hotel, maps
400              coupon, search
500              search, travel, coupon
600              travel, search

(b) Frequent Patterns

Pattern            Support
maps, hotel *      2
coupon, search *   2
maps, search       2
travel, search     3
hotel, search      2

(c) Example

No.   Items
t1    bc
t2    ab
t3    abc
t4    abd
t5    abcd

Problem Statement. Given a set of transactions T, a conflict matrix constraint CM and a user-specified minimum support (min_sup), the problem of mining frequent patterns with semantically alternative labels is to find all frequent patterns that do not violate the conflict matrix constraint. Note that if CM is empty, the problem degenerates into the normal frequent pattern mining problem.

The problem defined here has many applications. One immediate application is in supermarkets, namely how to place goods on shelves reasonably. The display of goods is subject to many strict limits, which play the role of our conflict matrix. For example, when people go to a supermarket to buy food, they may bring home raw beef, vegetables and fried chicken together. However, raw food and cooked food must never be placed together. Hence, although we can mine many frequent patterns from shopping list transactions, we need to think carefully before using them for cross-selling. A useful approach is to take the conflict matrix into consideration.

4 Algorithms

In the rest of this section, we first show the naive approach to the proposed problem and then present our efficient algorithm PaSAL.

4.1 Algorithm Basic

The basic algorithm consists of two steps.

1. First, find all the frequent patterns with any frequent pattern mining algorithm, such as Apriori or FP-growth.
2. Then, check whether each frequent pattern violates the conflict matrix constraint and obtain the final result.

Algorithm 1. BASIC
input : the tag transaction data T, lexicographic ordering I, the conflict matrix CM
output: restrict condition constrained patterns P
Root.head ← empty;
Root.tail ← I;
DFS(Root);
DFS(Node) begin
    for tag ∈ Node.tail do
        p_tmp ← Node.head ∪ tag;
        if p_tmp.frequency > min_sup × |T| then
            Node_tmp.head ← p_tmp;
            Node_tmp.tail ← Node.tail with tag removed;
            Children ← Children ∪ {Node_tmp};
            P_candidate ← P_candidate ∪ {Node_tmp.head};
    for node ∈ Children do
        DFS(node);
for pattern ∈ P_candidate do
    if pattern violates CM then
        remove pattern from P_candidate;
P ← P_candidate;

Algorithm 1 gives an overview of the whole process, using the notation of Section 3. We start with the root of the lattice as the active node, with an empty head and a tail consisting of all tags in T in lexicographic order. We then iteratively visit each tag in the tail. After adding the tag to the head, we compute the cardinality of the bitmap obtained by bitwise-AND and filter out the nodes whose cardinality is below the user-specified threshold (min_sup × |T|). Patterns (nodes) that are not filtered out are added to P_candidate and Children, where P_candidate is used for post-processing and the nodes in Children wait for the recursive call. By filtering out the nodes without minimum support, we prune the lattice by abandoning all their descendants, under the basic property that supersets of infrequent itemsets are also infrequent. Finally, every pattern in P_candidate is checked against the conflict matrix constraint. Specifically, we apply bitwise-AND to the CM tag vectors of the tags in a pattern, pair by pair, and check whether the cardinality of each result is zero. If so, the tags do not overlap on any restrict condition and the pattern is kept; otherwise it is removed. The remaining patterns form the result P.

4.2 Algorithm PaSAL

Instead of the Basic algorithm, we propose a new algorithm called PaSAL. In this algorithm, pruning is done during the traversal rather than in a post-process as in the basic one. This is achieved by exploiting the anti-monotonicity of the conflict matrix constraint and the bitmap data structure used to store the conflict matrix.
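As a point of comparison for the in-traversal pruning just introduced, the post-processing procedure of Algorithm 1 can be sketched as follows, run on the sampled database of Table 3(a) with the conflict rows of Equation 2. This is a minimal sketch and not the authors' implementation: the function names are illustrative, and the tail of a child node is assumed to contain only the tags that follow the added tag in the ordering, so that each itemset is generated once.

```python
# Table 3(a): tag transactions. Equation 2: conflict rows as bitmaps
# (bit 0 = google, bit 1 = groupon, bit 2 = booking).
TRANSACTIONS = [
    {"maps", "hotel", "search"},
    {"maps", "hotel", "coupon"},
    {"travel", "search", "hotel", "maps"},
    {"coupon", "search"},
    {"search", "travel", "coupon"},
    {"travel", "search"},
]
CM = {"maps": 0b001, "travel": 0b111, "search": 0b101,
      "hotel": 0b100, "coupon": 0b010}
MIN_SUP = 0.3


def item_bitmaps(transactions):
    """Vertical layout: bit k of bitmaps[tag] is set iff transaction k has tag."""
    bitmaps = {}
    for k, trans in enumerate(transactions):
        for tag in trans:
            bitmaps[tag] = bitmaps.get(tag, 0) | (1 << k)
    return bitmaps


def violates(pattern, cm):
    """True if two tags of the pattern share a domain in the conflict matrix."""
    seen = 0
    for tag in pattern:
        if seen & cm[tag]:
            return True
        seen |= cm[tag]
    return False


def basic(transactions, cm, min_sup):
    bitmaps = item_bitmaps(transactions)
    order = sorted(bitmaps)  # lexicographic ordering of the tags
    threshold = min_sup * len(transactions)
    candidates = []

    def dfs(head, head_bitmap, tail):
        for pos, tag in enumerate(tail):
            bitmap = head_bitmap & bitmaps[tag]  # transactions containing head + tag
            if bin(bitmap).count("1") > threshold:  # frequent: keep and expand
                pattern = head + (tag,)
                candidates.append(pattern)
                dfs(pattern, bitmap, tail[pos + 1:])

    dfs((), (1 << len(transactions)) - 1, order)
    # Post-process: drop every frequent pattern that violates CM.
    return [p for p in candidates if not violates(p, cm)]


patterns = basic(TRANSACTIONS, CM, MIN_SUP)
print([p for p in patterns if len(p) > 1])
# [('coupon', 'search'), ('hotel', 'maps')]
```

Every frequent pattern is generated before any conflict check takes place, which is the cost that in-traversal pruning avoids.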


Anti-monotonicity. In order to prune, we need to estimate how much each node of the lattice L conflicts with the restrict conditions, i.e., with the conflict matrix. The basic idea is as follows. First, we introduce a label that measures the extent of the conflict of a pattern, denoted σ(P). Suppose that P = {t1, t2, ..., tm} and that each tag ti in P is represented by its conflict vector in the conflict matrix CM. Then σ(P) is the number of domains in which conflicts take place, which can be formalized as

\[
\sigma(P) = \sum_{1 \le i < j \le m} \mathrm{card}(t_i \,\&\, t_j)
\tag{3}
\]

In Equation 3, & is the bitwise-AND operation and card(∗) is the number of 1's in a bit vector ∗. Second, we need the properties of the label σ(P). From the definition above, it follows that σ(P) does not decrease as the pattern grows, as stated in Property 1.

Property 1. For any pattern P and its extension te, the label σ(∗) satisfies

\[
\sigma(P) \le \sigma(P + t_e)
\tag{4}
\]

Proof. Consider the definition of σ(P) in Equation 3, again with P = {t1, t2, ..., tm}, and let te be the same as t_{m+1}. Then

\[
\begin{aligned}
\sigma(P + t_e) &= \sum_{1 \le i < j \le m+1} \mathrm{card}(t_i \,\&\, t_j) \\
&= \sum_{1 \le i < j \le m} \mathrm{card}(t_i \,\&\, t_j) + \sum_{1 \le i \le m} \mathrm{card}(t_i \,\&\, t_{m+1}) \\
&= \sum_{1 \le i < j \le m} \mathrm{card}(t_i \,\&\, t_j) + \sum_{1 \le i \le m} \mathrm{card}(t_i \,\&\, t_e) \\
&\ge \sum_{1 \le i < j \le m} \mathrm{card}(t_i \,\&\, t_j) = \sigma(P)
\end{aligned}
\]

since the sum of card(ti & te) over 1 ≤ i ≤ m is never less than zero.

Last, similar to the support metric, we assign each pattern that does not violate any constraint a value called "traversability", which is defined as follows:

\[
\mathrm{traversability}(P) = 1 / \sigma(P)
\tag{5}
\]

Thus traversability satisfies Property 2.

Property 2. The "traversability" of Equation 5 is anti-monotone.

Proof. Since σ(P) is monotone, the reciprocal of σ(P) is anti-monotone.
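The label σ(P) and Property 1 can be checked numerically on the conflict matrix of Equation 2. The sketch below assumes that the sum in Equation 3 runs over unordered pairs of distinct tags, which is the reading under which the decomposition in the proof of Property 1 holds; the helper names `card` and `sigma` are introduced here for illustration.

```python
from itertools import combinations

# Conflict rows of Equation 2 as bitmaps
# (bit 0 = google, bit 1 = groupon, bit 2 = booking).
CM = {"maps": 0b001, "travel": 0b111, "search": 0b101,
      "hotel": 0b100, "coupon": 0b010}


def card(bits):
    """Number of 1's in a bit vector stored as a non-negative integer."""
    return bin(bits).count("1")


def sigma(pattern, cm=CM):
    """Sum of card(t_i & t_j) over all unordered pairs of tags in the pattern."""
    return sum(card(cm[a] & cm[b]) for a, b in combinations(pattern, 2))


print(sigma(["maps", "hotel"]))             # 0: no shared domain
print(sigma(["maps", "travel"]))            # 1: both occur in google
print(sigma(["maps", "travel", "search"]))  # 4

# Property 1 on every pattern over the five tags: extending a pattern by one
# tag never lowers sigma.
tags = sorted(CM)
for size in range(1, len(tags)):
    for p in combinations(tags, size):
        for t in tags:
            if t not in p:
                assert sigma(p) <= sigma(p + (t,))
```

A pattern is valid in the sense of Definition 2 exactly when its σ value under this reading is zero, so once a node has a positive σ, none of its extensions can become valid again.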

Hence, Property 2 can be used for pruning the lattice. The key point of the algorithm is to keep a flag in each node of the lattice that tells whether adding a tag from the node's tail to its head violates the conflict matrix constraint. The process is shown in Algorithm 2.

Algorithm 2. PaSAL
input : the tag transaction data T, lexicographic ordering I, the conflict matrix CM
output: restrict condition constrained patterns P
 1 Root.head ← empty;
 2 Root.tail ← I;
 3 Root.flag ← (0)_2;
 4 DFS(Root);
 5 DFS(Node) begin
 6     for tag ∈ Node.tail do
 7         p_tmp ← Node.head ∪ tag;
 8         flag_tmp ← (Node.flag & CM_tag)_2;
 9         if p_tmp.frequency > min_sup × |T| and flag_tmp == 0 then
10             Node_tmp.head ← p_tmp;
11             Node_tmp.tail ← Node.tail with tag removed;
12             Node_tmp.flag ← (Node.flag | CM_tag)_2;
13             Children ← Children ∪ {Node_tmp};
14             P_candidate ← P_candidate ∪ {Node_tmp.head};
15     for node ∈ Children do
16         DFS(node);
17 P ← P_candidate;

Bitmap Operations. Since each tag has its own binary vector in CM, the computation reduces to bitwise-AND and bitwise-OR. As described in Equation 1, each row (tag vector) of CM indicates which domains contain the tag: a position is set to one when the corresponding domain contains the tag and to zero otherwise. Equation 2 gives a concrete example. A bitwise-AND first tests whether two tags conflict. In Equation 2, the vector of tag maps is {1, 0, 0} and the vector of tag travel is {1, 1, 1}; their bitwise-AND is {1, 0, 0}, which means that they violate the restrict condition on domain "google". This test corresponds to lines 8-9 of Algorithm 2. If the result is all zeros, the two tags are compatible. We then apply bitwise-OR to merge the two vectors into a more restrictive vector that is checked against the next tag to be added, as shown in line 12. This vector is assigned to the flag of the descendant node, whose head consists of its parent's head plus one tag from the parent's tail. The minimum support is, of course, checked at the same time. Descendants that satisfy both conditions are added to P_candidate, as in Algorithm 1. No extra post-processing is needed to purify the final result P_candidate, that is, P = P_candidate.

4.3 Discussion

Bitmap Representation. A bitmap representation is chosen for the transaction data. In this representation, each transaction corresponds to one bit: if item i appears in transaction k, the k-th position of the bitmap of item i is set to one, otherwise it is set to zero. This representation fits print log data naturally. We use the bitmap representation twice in our method. First, the tag transaction data is transformed into bitmaps, so that the support can be computed easily. Suppose that we have two tags t1 and t2 with bitmap vectors bitmap(t1) and bitmap(t2). Then bitmap(t1 ∪ t2) = bitmap(t1) & bitmap(t2), where & is bitwise-AND. The same holds when an itemset (pattern) meets a tag, which happens whenever we add an element of a node's tail to its head: bitmap(head ∪ tag) = bitmap(head) & bitmap(tag). After that, we only need to compute the cardinality of bitmap(head ∪ tag), where the cardinality is the number of 1's in a bitmap vector. Second, the restrict conditions RC are also stored in a bitmap representation, i.e., the conflict matrix CM. This makes it convenient to decide whether adding an element (tag) from a node's tail to its head violates the stipulated conditions. Let the bitmap of a tag in the conflict matrix be bitmap(CM_tag) and the bitmap of a node's flag be bitmap(node.flag); we compute the bitwise-AND of bitmap(CM_tag) and bitmap(node.flag). If the cardinality of the result equals zero, there is no conflict; otherwise the tags conflict on some transaction (domain) of CM. If no conflict occurs, we compute the bitwise-OR of bitmap(CM_tag) and bitmap(node.flag) and assign the result to the node's descendant.

Pruning Lattice. Figure 1 and Figure 2 show an example lattice under lexicographic order. Each node, with a bubble in its top right corner, denotes an item or an itemset. The number in the bubble is the frequency of the node according to Table 3(c). With the basic method we search 12 nodes in total, while with our method the number of searched nodes decreases to 8. Comparing the two figures, we can see that our method reduces the search space on the lattice. This works because we introduce another constraint during the traversal.
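The effect of pruning during the traversal can be reproduced on the transactions of Table 3(c) with a small sketch. Several points are assumptions made for illustration: the restrict conditions {a,c} and {a,d} of Fig. 2 are encoded as two columns of a conflict matrix, the minimum support is taken as 0.3 (at least 2 of the 5 transactions) because the text does not state the threshold used in the figures, and a visited node is counted as the root or any frequent node that is expanded. The function `mine` and its `in_process` switch are illustrative names, not the authors' code.

```python
# Table 3(c) transactions and the restrict conditions {a,c} and {a,d},
# encoded as conflict rows (bit 0 = condition {a,c}, bit 1 = condition {a,d}).
TRANSACTIONS = ["bc", "ab", "abc", "abd", "abcd"]
CM = {"a": 0b11, "b": 0b00, "c": 0b01, "d": 0b10}
MIN_SUP = 0.3  # assumed threshold, i.e. at least 2 of the 5 transactions


def mine(transactions, cm, min_sup, in_process):
    """DFS over the lattice; returns (valid patterns, number of visited nodes).

    in_process=False: conflicts are only filtered after the traversal (BASIC).
    in_process=True:  conflicting nodes are pruned during the traversal (PaSAL).
    """
    items = sorted(cm)
    bitmaps = {t: sum(1 << k for k, tr in enumerate(transactions) if t in tr)
               for t in items}
    threshold = min_sup * len(transactions)
    kept = []    # (pattern, conflicts?) for every expanded node
    visited = 1  # the root

    def dfs(head, head_bitmap, flag, bad, tail):
        nonlocal visited
        for pos, tag in enumerate(tail):
            bitmap = head_bitmap & bitmaps[tag]
            if bin(bitmap).count("1") <= threshold:
                continue  # infrequent: abandon this branch
            conflict = bad or bool(flag & cm[tag])  # AND test on the node flag
            if in_process and conflict:
                continue  # pruned inside the traversal
            visited += 1
            kept.append((head + tag, conflict))
            # OR merges the tag's conflict row into the flag of the child node.
            dfs(head + tag, bitmap, flag | cm[tag], conflict, tail[pos + 1:])

    dfs("", (1 << len(transactions)) - 1, 0, False, items)
    return [p for p, conflict in kept if not conflict], visited


basic_patterns, basic_nodes = mine(TRANSACTIONS, CM, MIN_SUP, in_process=False)
pasal_patterns, pasal_nodes = mine(TRANSACTIONS, CM, MIN_SUP, in_process=True)
print(basic_nodes, pasal_nodes)          # 12 8
print(basic_patterns == pasal_patterns)  # True
print(pasal_patterns)                    # ['a', 'ab', 'b', 'bc', 'bd', 'c', 'd']
```

Under these assumptions the two modes return the same patterns, while the node counts come out as 12 and 8, in line with the numbers quoted for the two figures.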

Fig. 1. Post Process on Lattice. (Lattice over the itemsets of items a, b, c, d; item order: a < b < c < d.)

Fig. 2. In Process on Lattice. (Same lattice; restrict condition: {a,c} {a,d}; item order: a < b < c < d.)

5 Experiment Evaluation

In this section we evaluate both the effectiveness and the efficiency of our method. Our experiments were conducted on a real-world print log dataset, whose characteristics are summarized in Table 4. All experiments were performed on a personal computer with an AMD Athlon X2 240 2.81GHz CPU and 4G of memory, running the Windows 7 operating system.

Table 4. Characteristics of the Print Log Database

Name      Number of Records   Number of Transactions
URLs      107031              23212
Domains   59416               23211
Tags      328752              16041


The three transaction databases are generated from the same print log data. Databases Domains and Tags are derived from database URLs. By removing the specific information after the slash in a URL, we get a domain. For instance, the URLs printed by the users in Table 1 yield the domains in Table 2. We then use the tag sets returned by the delicious.com API to expand each domain's semantic information, as shown in Table 2 and discussed in Section 1. In our experiment, delicious.com tags cover more than 60% of database Domains. For a fair comparison, we test our method only on the part of the database that has tag representations and ignore the rest, so that the three databases have the same number of transactions.

5.1 Evaluation on Effectiveness

To demonstrate the effect of our method on the problems of data sparsity and pattern interpretability, the experiment is divided into two parts: (1) comparing the number of patterns obtained after expanding the semantic information, and (2) verifying pattern interpretability through a case study. The experiment was conducted on the three databases with the same number of transactions.

Fig. 3. Patterns mined from Three Database. (x-axis: support, ×10^-3; y-axis: number of patterns, logarithmic scale; series: URLs, Domains, Tags.)

Number of Patterns. In this subsection, we compare the number of frequent patterns on the three datasets under different minimum support thresholds min_sup. Any frequent pattern mining algorithm can be used here; we implement the simple frequent pattern mining algorithm of [3]. Figure 3 shows the number of patterns mined from the three databases as a function of the minimum support threshold. The y-axis is on a logarithmic scale. We vary the threshold from 0.1% to 0.6%. The results clearly show that after expanding the semantic information we obtain more patterns than the baselines.

Case Study. The frequent patterns mined from the URLs dataset are all frequent 1-itemsets, such as {maps.google.com/maps?hl=en&tab=wl}, because of data sparsity. On database Domains, we get several patterns with frequency equal to 2, such as {maps.google.com, www.mapquest.com}. From this domain pattern, we can roughly infer that the user may want a map service. For database Tags, we get the frequent pattern {maps, shopping, travel}. Now, besides knowing that the user wants a map service, we can tentatively say that the user is preparing for a tour, since shopping appears in the set. By using tags to expand a URL's semantic information, we not only obtain more explainable information but also understand a user's interests at a finer granularity. More cases can be found in Table 5.

Table 5. Patterns Mined From Three Transaction Database

Database   Patterns
URLs       http://www.facebook.com/
           http://www.foodnetwork.com/food/cda/recipe-print/
           https://www.shopping.hp.com/webapp/shopping/
Domains    en.wikipedia.org, www.ehow.com
           www.amazon.com, www.ehow.com
           maps.google.com, www.bing.com
Tags       food, health, shopping
           facebook, google, networking, social
           google, maps, travel

5.2 Evaluation on Efficiency

Since we use delicious tags as representations, we need to purify the mined results, because the replacement assumes that the tags in a set returned by delicious for a domain are semantically alternative or relevant to each other. To evaluate the efficiency of our proposed algorithm PaSAL, we choose the brute-force method (BASIC) as the baseline. This experiment was conducted only on database Tags. We use the running time and the pruning effect as measures. Figure 5 shows that our algorithm runs more than 3 times faster than the baseline when min_sup = 0.0001. To show the effect of pruning, we also report the number of visited nodes in Figure 4. By embedding the conflict matrix constraint in the traversal, we prune more of the lattice, which leads to high efficiency. The running times of the two methods differ greatly when the minimum support is low. The reason is that the search space is huge when min_sup is low, which is where our algorithm is strongest. PaSAL prunes more effectively on a large lattice; as min_sup increases, the running time of the baseline approaches that of our algorithm. This is because much of the time in PaSAL is spent on retrieving and computing on the bitmap structure.

Fig. 4. Visiting Nodes. (x-axis: support, ×10^-3; y-axis: visiting nodes; series: PaSAL, BASIC.)

Fig. 5. Running Time. (x-axis: support, ×10^-3; y-axis: runtime (ms); series: PaSAL, BASIC.)


6 Conclusion

In this paper, we defined a novel pattern mining problem, namely mining frequent patterns with semantically alternative labels. This problem is motivated by enriching the semantics of each URL in print log data in order to address data sparsity and pattern interpretability. Specifically, we use the semantically alternative tags returned by the delicious.com API as the representation of each URL. We then propose an efficient algorithm named PaSAL for this problem. It relies on a conflict matrix constraint that filters out redundant patterns and is deeply exploited in the mining process to achieve high efficiency. Finally, we show the effectiveness and efficiency of the proposed algorithm on real print log data. It is worth mentioning that PaSAL can be applied to many fields, such as checking cross-selling decisions in supermarkets.

Acknowledgements. The authors Xin Li, Lei Zhang and Enhong Chen were supported by grants from the Natural Science Foundation of China (Grant No. 61073110), the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20113402110024), and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (Grant No. 2012BAH17B03). The author Yu Zong is supported by grants from the Nature Science Foundation of Anhui Education Department of China under Grant No. KJ2012A273 and KJ 2012A274, and the Nature Science Research of Anhui under Grant No. 1208085MF95. Enhong Chen gratefully acknowledges the support of Huawei Technologies Co., Ltd. (Grant No. YBCB2012086).

References

1. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (1993)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB (1994)
3. Burdick, D., Calimlim, M., Gehrke, J.: Mafia: a maximal frequent itemset algorithm for transactional databases. In: International Conference of Data Engineering (2001)
4. Han, J., Fu, Y.: Discovery of multiple-level association rules from large databases. In: Proceedings of the 21st International Conference on Very Large Data Bases (1995)
5. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: SIGMOD (2000)
6. Mei, Q., Xin, D., Cheng, H., Han, J., Zhai, C.: Semantic annotation of frequent patterns. ACM Transactions on Knowledge Discovery from Data, TKDD (2007)
7. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 398–416. Springer, Heidelberg (1998)
8. Pei, J., Han, J.: Can we push more constraints into frequent pattern mining? In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2000)
9. Pei, J., Han, J.: Constrained frequent pattern mining: a pattern-growth view. ACM SIGKDD Explorations Newsletter, 31–39 (2002)
10. Raedt, L.D., Zimmermann, A.: Constraint-based pattern set mining. In: SIAM International Conference on Data Mining (2007)
11. Srikant, R., Agrawal, R.: Mining generalized association rules. In: Proceedings of the 21st International Conference on Very Large Data Bases (1995)
12. Tang, L., Zhang, L., Luo, P., Wang, M.: Incorporating occupancy into frequent pattern mining for high quality pattern recommendation. In: CIKM (2012)
13. Wang, X., Zhai, C.: Learn from web search logs to organize search results. In: SIGIR, pp. 87–94 (2007)
14. Zaki, M.: Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering 12, 372–390 (2000)