An Incremental Algorithm for Mining Privacy-Preserving Frequent Itemsets

Jinlong Wang, Congfu Xu∗, Yunhe Pan
Institute of Artificial Intelligence, Zhejiang University, Hangzhou, 310027, China

Abstract

Privacy preserving data mining is a novel research direction in data mining and statistical databases, in which data mining algorithms are analyzed for their side effects on data privacy. Many studies have addressed the efficient discovery of frequent itemsets in privacy preserving data mining. However, maintaining the discovered frequent itemsets is nontrivial, because the database may be updated and itemsets that were frequent may become infrequent. In this paper, an incremental updating algorithm, IPPFIM, is proposed for the efficient maintenance of discovered frequent itemsets when new transaction data are added to a transaction database under privacy preservation. The algorithm makes use of previous mining results to cut down the cost of finding new frequent itemsets in the updated database. The performance evaluation shows the efficiency of this method.

Keywords: Data mining, privacy-preserving, incremental

1 Introduction

With the development of computer hardware and software and the rapid computerization of business, the amount of data available for analysis has grown exponentially in many areas, such as the retail industry, financial forecasting, decision support and intrusion detection. When the scale of data manipulation, exploration and inference went beyond human capacities, a new technique named data mining emerged. The term data mining refers to the nontrivial extraction of valid, implicit, potentially useful and ultimately understandable information from large databases with the help of ubiquitous modern computing devices [1, 2]. During the past decade, many successful applications of data mining have been reported from varied sectors such as marketing, finance, banking, manufacturing and telecommunication.

∗ Corresponding author: Congfu Xu

While data mining flourishes as a valuable technique, serious concerns have arisen over individual privacy in data collection, processing and mining [3]; as a result, preserving privacy has become a prime concern in the field of data mining. The conventional wisdom held that data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse [4], and Thearling [5] predicted a conflict in the making between data mining and privacy. The reality is that data mining results rarely violate privacy: the objective of data mining is to generalize across populations rather than to reveal information about individuals. The hitch is that data mining works by evaluating individual data that are subject to privacy concerns. So the true problem is not data mining itself, but the way data mining is done.

Agrawal and Srikant [6] and Lindell and Pinkas [7] proposed the concept of privacy preserving data mining (PPDM), aimed at alleviating the conflict between data mining and privacy. As a novel research direction in data mining and statistical databases [8], PPDM has begun to receive attention and to be investigated by many researchers [6, 7, 9, 10, 11, 12, 13, 14, 15]. PPDM is defined as “getting valid data mining results without learning the underlying data values” [9]. It encompasses the dual goal of meeting privacy requirements and providing valid data mining results [10]. As the first introduction of PPDM, the idea of [6] is to perturb individual data values.
Through perturbation, the original data are hidden and only the randomized values are revealed, but the statistical characteristics are preserved, so that accurate models can be built without access to the precise information in individual data records. An alternative approach [7] is to build a data mining model from the local data sets of several participating sites without revealing the individual records of one site to the others. This method is usually based on SMC (Secure Multiparty Computation) [16]. Although the security definition of the randomization model is much weaker than that of the SMC model, the randomization model aims to protect the exact data values and is more efficient than the SMC model, whose performance degrades when the number of participants becomes large. For this reason, randomization is currently widely applied to privacy preserving data mining [12, 13, 14, 15].

Randomization methods address privacy preservation by perturbing the data and reconstructing the distributions at an aggregate level in order to perform mining. The papers [12, 13, 15] consider randomization techniques in privacy preserving frequent itemset mining. These algorithms try to extract the frequent itemsets without directly accessing the original data, and attempt to guarantee that the mining process does not obtain sufficient information to reconstruct the original data.

However, in reality, data change over time. The mined itemsets can reveal development trends: as the database grows dynamically, some new frequent itemsets can appear and some old frequent itemsets can disappear, so incremental mining is of paramount importance. When the database changes, mining the updated database again from scratch will not meet the requirements of real-time response, so an efficient updating algorithm must be devised to update, maintain and manage the mined knowledge. In this paper, an efficient updating technique is applied to privacy preserving frequent itemset mining, and an incremental algorithm called IPPFIM (Incremental Privacy-Preserving Frequent Itemset Mining) is proposed to improve efficiency when the database is updated incrementally.

The remainder of this paper is organized as follows. The related concepts are described in Section 2. In Section 3, the efficient incremental privacy preserving frequent itemset mining algorithm IPPFIM is presented. The performance study of IPPFIM, which shows the efficiency of this method, is reported in Section 4. Finally, Section 5 concludes this paper.

2 Preliminaries

2.1 Frequent Itemset Mining

As a key stage in many data mining applications, including the discovery of association rules, strong rules, correlations, sequential rules, episodes, multidimensional itemsets, and many other important discovery tasks, the frequent itemset mining problem has received a great deal of attention since its introduction in 1993 by Agrawal et al. [17]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of distinct literals, usually called items. Let $D$, the task-relevant data, be a set of database transactions, where each transaction has a unique identifier, called its TID, and contains a set of items. For a set of items $A \subseteq I$, a transaction $T$ is said to contain $A$ if $A \subseteq T$. A set of items is referred to as an itemset. An itemset that contains $k$ items is a $k$-itemset.

Definition 1 Suppose $A \subseteq I$ is an itemset. The set of frequent itemsets in a given database $D$ with respect to a frequency threshold $\mathit{min\_sup}$ is $F(D, \mathit{min\_sup}) = \{A \subseteq I \mid \mathit{sup}(A, D) \geq \mathit{min\_sup}\}$, where $\mathit{sup}(A, D)$, the support of itemset $A$, is the relative frequency of $A$ in the transaction database $D$, and $\mathit{min\_sup} \in (0, 1)$ is the minimum support defined by the user.

The problem of mining frequent itemsets is to find all itemsets whose support is no less than $s$, the user-specified minimum support threshold (that is, $s = \mathit{min\_sup}$). Mining usually exploits the downward closure of frequent itemsets: all subsets of a frequent itemset are frequent, and all supersets of an infrequent itemset are infrequent. These two properties are applied to prune elements of the itemset lattice.
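The definitions above can be illustrated with a small level-wise search. This is a minimal sketch in Python, not the implementation used in this paper: the function names `support` and `frequent_itemsets` and the toy transactions are illustrative assumptions, and the search follows the generic Apriori pattern of joining frequent $k$-itemsets and pruning with downward closure.

```python
from itertools import combinations


def support(itemset, transactions):
    """Relative frequency of `itemset` in `transactions` (a list of sets)."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_sup):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # candidate 1-itemsets
    result = {}
    k = 1
    while level:
        # Keep the candidates whose support meets the threshold.
        kept = {}
        for c in level:
            s = support(c, transactions)
            if s >= min_sup:
                kept[c] = s
        result.update(kept)
        # Join step: unions of two frequent k-itemsets that have k+1 items.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept
                        for sub in combinations(c, k - 1))]
    return result


# Hypothetical toy data, chosen only for illustration.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}]
found = frequent_itemsets(transactions, 0.5)
```

With this toy data and a threshold of 0.5, the candidate {A, B, C} is never counted, because its subset {B, C} is infrequent and the prune step discards it.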

2.2 Privacy Preserving Frequent Itemset Mining

In [13], Rizvi and Haritsa presented a scheme called MASK (Mining Associations with Secrecy Konstraints), based on a simple probabilistic distortion of user data that employs random numbers generated from a pre-defined distribution function. To address the runtime efficiency of MASK, [15] changed both the distortion process and the mining process, yielding a new algorithm, EMASK (Efficient MASK). EMASK achieves its runtime efficiency by generalizing the distortion process to perform symbol-specific distortion, choosing the distortion parameters appropriately, and applying a variety of set-theoretic optimizations in the reconstruction process. By distorting the original database through randomization, frequent itemsets can be mined while preserving privacy. The definition is as follows.

Definition 2 Let $D$ be a transactional database and $D^*$ be a distorted database obtained from $D$ in order to preserve individual privacy in $D$. It is this distorted database $D^*$ that is eventually supplied to the data miner, along with a description of the distortion procedure. The data miner mines $D^*$ and, by virtue of the distortion procedure, estimates the frequent itemsets whose support in the original database $D$ satisfies the minimum support.

Figure 1 illustrates the process.
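To make the distort-then-estimate idea of Definition 2 concrete, the following Python sketch handles the simplest case, the support count of a single item. It rests on assumptions that are not spelled out in this paper: each transaction is a 0/1 row, a 1 is kept with probability $p$ and a 0 is kept with probability $q$ (otherwise the bit is flipped), and the estimate inverts the expected observed count. The function names are hypothetical.

```python
import random


def distort(rows, p, q, rng):
    """Return a distorted copy of 0/1 rows.

    Assumption: a 1 is kept with probability p and a 0 is kept with
    probability q; otherwise the bit is flipped.
    """
    out = []
    for row in rows:
        new_row = []
        for bit in row:
            keep = rng.random() < (p if bit == 1 else q)
            new_row.append(bit if keep else 1 - bit)
        out.append(new_row)
    return out


def estimate_original_count(observed_ones, n, p, q):
    """Estimate the true number of 1s for one item among n rows.

    Under the assumption above, E[observed] = p*c1 + (1 - q)*(n - c1),
    which is solved here for c1. Requires p + q != 1.
    """
    return (observed_ones - (1.0 - q) * n) / (p + q - 1.0)


# Hypothetical example: 1000 rows, 300 of which truly contain the item.
# With p = 0.5 and q = 0.97 the expected observed count is
# 0.5 * 300 + 0.03 * 700 = 171, and the estimate recovers about 300.
estimate = estimate_original_count(171, 1000, 0.5, 0.97)
```

The estimate is exact only in expectation; on a real distorted database the observed count fluctuates around its mean, so the reconstructed support is an approximation.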

Figure 1. Privacy preserving frequent itemset mining process.

Table 1. Incremental Mining Relationship Table

$X \in Fp_d$    $X \in Fp$    $X \in Fp'$
Yes             Yes           Yes
Yes             No            Visit $D$ to compute
No              Yes           Visit $d$ to compute
No              No            No
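Table 1 can be read as a decision rule on a single itemset $X$, where $Fp$, $Fp_d$ and $Fp'$ denote the frequent itemsets of the original database $D$, the increment $d$ and the union $D \cup d$. The Python sketch below is an illustration under stated assumptions: the function names and the returned labels are hypothetical, and the second function works on plain absolute counts rather than on counts reconstructed from distorted data.

```python
def table1_action(in_fp_d, in_fp):
    """Map membership in Fp_d and Fp to the Table 1 outcome for Fp'."""
    if in_fp_d and in_fp:
        return "frequent"      # frequent in both parts, so frequent overall
    if in_fp_d and not in_fp:
        return "count in D"    # support in D is unknown and must be computed
    if in_fp and not in_fp_d:
        return "count in d"    # support in d must be computed
    return "infrequent"        # infrequent in both parts


def frequent_after_update(count_D, size_D, count_d, size_d, s):
    """Check sup(X, D ∪ d) >= s from absolute counts in D and d."""
    return count_D + count_d >= s * (size_D + size_d)


# Hypothetical numbers: X occurs in 40 of 100 old transactions and in
# 8 of 10 new ones; with s = 0.4 it is frequent in the union (48 >= 44).
still_frequent = frequent_after_update(40, 100, 8, 10, 0.4)
```

The two unconditional rows follow from the fact that the support in $D \cup d$ is a weighted average of the supports in $D$ and in $d$, so it cannot fall below $s$ when both are at least $s$, nor reach $s$ when both are below it.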

3 Efficient Incremental Privacy Preserving Frequent Itemset Mining

In reality, data change over time in many areas, including the retail industry and the finance sector. The itemsets mined in these applications can reveal development trends. As the transaction database grows over time, some new frequent itemsets can appear and some old frequent itemsets can disappear, which makes incremental mining of paramount importance. Mining the whole updated database again, composed of the original database and the update, will not meet the requirements of real-time response. In such dynamic databases, the knowledge already acquired can facilitate successive discovery processes, and an efficient updating algorithm must be devised to update, maintain and manage this knowledge.

Definition 3 Let $D$ be the original database, $s$ the minimum support threshold, and $d$ the incremental database in which new transactions or new customers are added to $D$. $D^*$ and $d^*$ denote the corresponding databases after distortion. $Fp$ is the set of frequent itemsets in the original database $D$ and $Fp_k$ its frequent $k$-itemsets; $Fp'$ is the set of frequent itemsets in $D \cup d$ and $Fp'_k$ its frequent $k$-itemsets.

Property 1 Suppose $c \in Fp_k$. Then $c \notin Fp'_k \Leftrightarrow \mathit{sup}(c, D \cup d) < s$.

Property 2 Suppose $c \notin Fp_k$. Then $c \in Fp'_k \Leftrightarrow \mathit{sup}(c, D \cup d) \geq s$.

Table 1 summarizes the relationships among the frequent itemsets in an incremental update environment, where $Fp_d$ denotes the set of frequent itemsets in the incremental database $d$. From Table 1, we can identify the two key problems in incremental mining:

1. For the frequent itemsets $Fp$ in $D$, determine which remain frequent and which do not.
2. For the itemsets that are infrequent in $D$, compute the support of the frequent itemsets in $Fp_d$ by visiting $D^*$.

Property 3 For $c \in Fp$, when $c \in Fp_d$, $c \in Fp'$.

Property 4 For $c \in Fp$, when $c \notin Fp_d$, $c \in Fp' \Leftrightarrow \mathit{sup}(c, D \cup d) \geq s$.

Property 5 For $c \notin Fp$, when $c \notin Fp_d$, $c \notin Fp'$.

Property 6 For $c \notin Fp$, when $c \in Fp_d$, $c \in Fp' \Leftrightarrow \mathit{sup}(c, D \cup d) \geq s$.

In this paper, an efficient incremental privacy preserving frequent itemset mining algorithm is proposed to address the updating problem in PPDM when new transactions are appended. The algorithm uses the distortion technique and some of the optimizations described in [15]. The reconstruction procedure is costly: a $k$-itemset may be distorted into any of $2^k$ combinations, so to reconstruct the support of the $k$-itemset accurately we need the counts of all these $2^k$ combinations in the distorted database. Using the basic formula from set theory, the counts of all combinations of the itemset can be computed efficiently [15].

During mining, the support of a frequent itemset can be read from the file $Fp$, but for an infrequent itemset, which is not in $Fp$, a new scan must be performed to recompute the support, which decreases the efficiency. To overcome this, a minor modification is proposed: when mining, the support of each frequent itemset in the distorted database is registered along with its support in the original database, which facilitates incremental mining and improves the efficiency. For example, suppose itemsets {A} and {B} are frequent but {AB} is infrequent. If incremental mining needs the support of {AB} in the original database $D$, and the support counts of {A} and {B} in the distorted database $D^*$ have been registered, the support of {AB} can be computed quickly through the basic formula from set theory [15] without scanning the database $D^*$. In the following, the efficient incremental privacy preserving frequent itemset mining algorithm IPPFIM is presented.
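The role of the registered counts can be sketched for a 2-itemset. The Python fragment below assumes that the "basic formula from set theory" is inclusion-exclusion, which is an interpretation and not something this paper states explicitly. Given the number of rows in the distorted database, the registered counts of {A} and {B}, and the count of rows containing both items, it derives the counts of all $2^2$ bit patterns without counting each pattern separately. The function name and the numbers are hypothetical.

```python
def pattern_counts(n, count_a, count_b, count_ab):
    """Counts of the patterns 11, 10, 01, 00 for items (A, B) among n rows.

    Assumption: inclusion-exclusion is the set-theoretic formula meant;
    count_ab is the number of rows that contain both A and B.
    """
    return {
        "11": count_ab,
        "10": count_a - count_ab,                 # A present, B absent
        "01": count_b - count_ab,                 # B present, A absent
        "00": n - count_a - count_b + count_ab,   # neither present
    }


# Hypothetical counts in a distorted database of 10 rows.
counts = pattern_counts(10, 6, 5, 3)
```

The four derived counts always sum to the number of rows, which gives a quick consistency check on the registered values.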

In Algorithm 1, the input $Fp_{D^*}$ denotes the frequent itemsets $Fp$ together with their supports in the distorted database $D^*$.

Algorithm 1: An efficient incremental privacy preserving frequent itemset mining algorithm IPPFIM.
The objective of this algorithm is to perform privacy preserving frequent itemset mining in an incremental updating environment.
Input: $D^*$, $d^*$, $Fp$, $Fp_{D^*}$, minimum support $s$, distortion parameters $p$, $q$.
Output: $Fp'$ (the frequent itemsets and their supports in $D \cup d$)
Method: As Fig. 2

Figure 2. An efficient incremental privacy preserving frequent itemset mining.

When computing 1-itemsets, an itemset $X$ is obtained by scanning $d^*$. When computing larger itemsets, $X$ is taken from the candidate set $C_{k+1}$, which is generated from $Fp'_k$. In incremental mining, we use Properties 3-6 above.

1. When $\mathit{sup}(X, d) \geq s$, itemset $X$ is frequent in the incremental database $d$.
   (a) Steps 1 and 2 make use of Property 3 and put the frequent itemset $X$ into $Fp'$.
   (b) Steps 6-10 make use of Property 6: the support of $X$ in the original database is computed and, if the condition is satisfied, $X$ is put into $Fp'$.
2. When $\mathit{sup}(X, d) < s$, itemset $X$ is infrequent in the incremental database $d$.
   (a) By Property 5, $X$ is infrequent in the whole database if $X \notin Fp$.
   (b) Steps 3-5 make use of Property 4: when $\mathit{sup}(X, D \cup d) \geq s$, itemset $X$ is put into $Fp'$.

4 Experimental Results

In order to assess the performance of IPPFIM, experiments are conducted to compare it with EMASK [15], an efficient privacy preserving frequent itemset mining algorithm. Both algorithms are implemented using Cygwin with gcc 2.9.5. Our target platform is a Pentium 4 1.6GHz processor with 384MB of memory and a Maxtor IDE disk (7200 rpm, 80GB). The operating system is Windows XP with Service Pack 2.

Benchmark data sets. The performance tests are performed on a synthetic database benchmark, publicly available from the IBM synthetic market-basket data generator [18]. The data sets produced by the IBM data generator mimic the transactions in a retailing environment. The meanings of the data set parameters are shown in Table 2. In the following experiments, we use the data set T25.I4.D1M.N1K as the original database.

Table 2. Parameters meanings

D    Number of transactions
T    Average size of the transactions
I    Average size of the maximal potentially large itemsets
N    Number of items
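The data set names used in the experiments encode the parameters of Table 2. The following Python sketch decodes such a name; it assumes that the suffixes K and M stand for thousands and millions, which this paper does not state explicitly, and the function name `parse_dataset_name` is hypothetical.

```python
import re

# Assumption: K = thousand, M = million.
SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_dataset_name(name):
    """Split a name such as 'T25.I4.D1M.N1K' into {parameter letter: value}."""
    params = {}
    for part in name.split("."):
        m = re.fullmatch(r"([A-Z])(\d+)([KM]?)", part)
        if m is None:
            raise ValueError(f"unrecognised component: {part!r}")
        letter, digits, suffix = m.groups()
        params[letter] = int(digits) * SUFFIX[suffix]
    return params


original = parse_dataset_name("T25.I4.D1M.N1K")
```

Under this reading, the original database has one million transactions with an average size of 25 over 1000 items.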

Performance Testing. We compare the performance of IPPFIM against EMASK on different data sets, as shown in Fig. 3 and Fig. 4. In the experiments, every database is distorted with parameters p = 0.5 and q = 0.97, as in EMASK [15]. The first experiment is done on the original database T25.I4.D1M.N1K with an update database T25.I4.D25K.N1K. The execution time and the performance ratio (execution time ratio) for various minimum support thresholds are shown in Fig. 3. Fig. 3(a) shows that the execution time of IPPFIM is much less than that of EMASK. When the minimum support is decreased, the execution time of both algorithms increases, but at different rates. As Fig. 3(b) shows, for the same data sets, the smaller the minimum support, the larger the execution time ratio (EMASK/IPPFIM). (The fluctuation in Fig. 3(b) is related to the synthetic database; the same holds for Fig. 4(b).) For small supports, IPPFIM is 10 to 20 times faster than EMASK. For larger supports, it is less costly to re-run the mining algorithm on the updated database, since the number of large itemsets is relatively small. For example, when the support increases from 0.75% to 1.0%, the execution time ratio decreases from 10.24 to 3.

In general, the larger the increment is, the longer the update operations take, and the gain in speedup diminishes. Fig. 4 shows the execution time and ratio of the two methods on the data set T25.I4.D1M.N1K with updates of 5K, 10K, 25K, 50K, 100K and 250K transactions for the same minimum support. As Fig. 4 shows, for the same support IPPFIM improves the efficiency, and the speed-up ratio decreases as the update size increases. For example, when the incremental database grows from 50K to 250K, the execution time ratio decreases from 15.98 to 4.74. Because IPPFIM adopts the incremental updating technique, it only needs to process a fraction of the newly generated candidate itemsets, which saves the time of scanning the original database compared with re-running EMASK on the updated database. Consequently, although the performance ratio decreases as the update grows, the performance of IPPFIM does not fall below that of EMASK.

5 Conclusion

In this paper, we have presented an efficient incremental update algorithm, IPPFIM, for maintaining the frequent itemsets discovered by privacy preserving data mining when a set of new transactions is added to the transaction database. Our algorithm strives to reduce the I/O requirements for updating the frequent itemsets. IPPFIM uses the information available from a previous mining run to reduce the amount of work needed to remove itemsets that are no longer frequent in the updated database, and to add new itemsets that were not frequent in the old transactions but are frequent in the updated database. The experiments on data sets show that IPPFIM achieves better performance than re-running EMASK (an efficient privacy preserving frequent itemset mining algorithm) over the whole set of transactions. Extending IPPFIM with strategies for mining evolving data streams under privacy preservation is the subject of our current study.

Acknowledgements

This paper was supported by the Natural Science Foundation of China (No. 60402010), the Zhejiang Provincial Natural Science Foundation of China (Y105250) and the Science-Technology Program of Zhejiang Province of China (No. 2004C31098).

Figure 3. Performance Comparison with Different Minimum Support. (a) Execution Time. (b) Execution Time Ratio.

Figure 4. Performance Comparison with Different Incremental Database. (a) Execution Time. (b) Execution Time Ratio.

References

[1] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.
[3] The Economist. The end of Privacy. May 1st, 1999. pp. 15.
[4] C. Clifton and D. Marks. Security and privacy implications of data mining. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996. pp. 15-19.
[5] K. Thearling. Data mining and privacy: a conflict in making. DS, November 1998.
[6] R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD 2000. pp. 439-450.
[7] Y. Lindell and B. Pinkas. Privacy preserving data mining. Crypto 2000. pp. 36-54.
[8] N. R. Adam and J. C. Wortmann. Security Control Methods for Statistical Databases: A Comparison Study. ACM Comput. Surv. 21(4), 1989. pp. 515-556.
[9] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining Privacy For Data Mining. Proc. of the National Science Foundation Workshop on Next Generation Data Mining, 2002. pp. 126-133.
[10] S. R. M. Oliveira and O. R. Zaiane. Toward Standardization in Privacy-Preserving Data Mining. DMSSP 2004 (In conjunction with SIGKDD 2004).
[11] D. Agrawal and C. Aggarwal. On the Design and Quantification of Privacy Preserving Data Mining Algorithms. PODS 2001. pp. 247-255.
[12] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. Privacy preserving mining of association rules. SIGKDD 2002. pp. 217-228.
[13] S. Rizvi and J. Haritsa. Maintaining data privacy in association rule mining. VLDB 2002. pp. 682-693.
[14] W. Du and Z. Zhan. Using Randomized Response Techniques for Privacy-Preserving Data Mining. SIGKDD 2003. pp. 505-510.
[15] S. Agrawal and J. Haritsa. On addressing efficiency concerns in privacy-preserving mining. DASFAA 2004. pp. 113-124.
[16] A. C. Yao. Protocols for secure computations (Extended Abstract). FOCS 1982. pp. 160-164.
[17] R. Agrawal, T. Imielinski and A. N. Swami. Mining association rules between sets of items in large databases. SIGMOD 1993. pp. 207-216.
[18] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994. pp. 487-499.
