UNIVERSITATEA DE VEST DIN TIMIŞOARA
FACULTATEA DE MATEMATICĂ ŞI INFORMATICĂ

Intelligent Technologies for Database Systems: Data Mining
Tehnologii inteligente pentru sistemele de baze de date: Data Mining
- REPORT -

PhD Student: Assist. Daniel POP

Scientific Advisor: Prof. Dr. Ştefan MĂRUŞTER

CONTENTS

Abstract
1. Introduction
2. An Overview of KDD phases and DM techniques
3. Parallel and Distributed Data Mining
4. Association Rule Discovery (ARD)
   4.1 Theoretical Foundation of ARD and Algorithm Complexity
   4.2 Sequential Algorithms for Association Rule Discovery
   4.3 Parallel Approaches of ARD Algorithms
       4.3.1 Distributed Memory Systems
       4.3.2 Shared Memory Machines
       4.3.3 Hierarchical Systems
   4.4 Summary of Parallel Algorithms
5. Classification
   5.1 Theoretical Foundation of Decision Trees
   5.2 Sequential Decision-Tree Classifiers
   5.3 Parallel Decision-Tree Classifiers
       5.3.1 Distributed Memory Machines
       5.3.2 Shared Memory Machines
6. Open Problems in Parallel and Distributed Data Mining Systems
7. Conclusions

Abstract

This paper presents a survey of the state of the art in parallel and distributed data mining algorithms. Such a survey is needed given the importance of data mining and the tremendous amount of research it has attracted in recent years. The report provides a taxonomy of the extant data mining methods, characterizing them according to the type of knowledge they mine: association rules, classification, etc. The survey lays out the design space of the parallel and distributed algorithms based on the platform used (distributed or shared memory), the kind of parallelism exploited (task or data), and the load balancing strategy used (static or dynamic). A large number of parallel and distributed data mining methods are reviewed and grouped into related techniques. It is shown that there are a few dominant paradigms, while the other techniques propose optimizations over these base schemes. The main goal of this survey is to serve as a reference for both researchers and practitioners interested in the state of the art in parallel and distributed data mining methods.

1. Introduction

The amount of data being collected in databases today far exceeds our ability to reduce and analyze it without automated analysis techniques. Many scientific and transactional business databases grow at a phenomenal rate. For example, department stores stock thousands of items and collect millions of customer transactions every day. The NASA Earth Observing System (EOS) delivers close to a terabyte of remote sensing data per day, and NASA will have petabytes of archived data in the next few years. Similar scenarios occur in other areas: radiological images generated in hospitals, bioinformatics, medical informatics, scientific data analysis, financial analysis, consumer profiling, credit card fraud, etc. Extracting useful information from such data requires efficient parallel algorithms running on high-performance computing systems with powerful parallel I/O capabilities. Without such systems and techniques, invaluable data and information will remain undiscovered. We also need to quantify the time it takes to retrieve pertinent information for time-critical applications.
Knowledge discovery in databases (KDD) is the field that is evolving to provide automated analysis solutions. Knowledge discovery in databases is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data" [20]. Data mining (DM), the process of extracting trends or patterns from data, is the heart of the KDD process [19]. A clear distinction between data mining and knowledge discovery is not always drawn. The knowledge discovery process takes the raw results from data mining and transforms them into useful and understandable information. This information is not typically retrievable by standard techniques but is uncovered through the use of AI techniques.
In this report, we present a survey of the different parallel and distributed DM algorithms that have been proposed on various hardware platforms. Given the astonishing amount of research in this area, it is necessary to present to interested parties the state of the art in DM and to identify the current open problems. Due to the large number of parallel and distributed approaches that exist in DM, we are forced to focus only on the most significant works.
The next section overviews the phases of Knowledge Discovery in Databases and outlines the differences between KDD and data mining; the main DM tasks are then presented briefly. Section 3 introduces the available parallel architectures. Section 4 presents in detail both sequential and parallel and distributed algorithms for mining association rules: after the big picture for association rule discovery is given, some theoretical background is reviewed, and the next two subsections present sequential and parallel algorithms, respectively. Section 5 presents in detail both sequential and parallel

and distributed algorithms for decision tree construction: after the classification paradigm is presented, some theoretical background of decision trees is reviewed, and the next two subsections present sequential and parallel algorithms, respectively, for decision tree construction. Open problems of parallel approaches are highlighted in Section 6. The last section of this report concludes the data mining story.

2. An Overview of KDD phases and DM techniques

The CRISP-DM (Cross-Industry Standard Process Model for Data Mining) project [15], developed by four large companies (NCR, DaimlerChrysler, Integral Solutions Limited (ISL), part of SPSS since December 1998, and OHRA, a Dutch insurance company), identifies the following phases of a KDD process:
1. Business Understanding: focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
2. Data Understanding: starts with an initial data collection and proceeds with activities to get familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets that form hypotheses for hidden information.
3. Data Preparation: covers all activities needed to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data.
4. Modeling: various modeling techniques are selected and applied, and their parameters are calibrated to optimal values; this is the phase where the various data mining techniques are applied.
5. Evaluation: thoroughly evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives.
6. Deployment: depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.

Figure 1. Phases of the CRISP-DM process

Data mining techniques are represented by machine learning algorithms adapted for very large data sets [19]. Traditionally, machine learning techniques may be supervised or unsupervised. According to [6], learning algorithms are complex and generally considered the hardest part of any data mining technique. There are quantitative approaches, such as the probabilistic and statistical approaches. There are approaches that utilize visualization techniques. There are classification approaches such as Bayesian classification, inductive logic, data cleaning/pattern discovery, and decision tree analysis. Other approaches include

deviation and trend analysis, genetic algorithms, neural networks, and hybrid approaches that combine two or more techniques. In this paper, we categorize data mining techniques based on the type of knowledge to be mined: association rule mining, classification, clustering, sequential pattern mining, OLAP and Web mining.
The task of association rule mining is to find association relationships among a set of items in a database. Its original application is on "market basket data". A rule has the form X => Y, where X and Y are non-intersecting sets of items. Each rule has two measures, support and confidence. Given user-specified minimum support and minimum confidence, the task is to find all rules whose support and confidence exceed these thresholds.
Pattern discovery and data cleaning systematically reduce a large database to a few pertinent and informative records [27]. If redundant and uninteresting data is eliminated, the task of discovering patterns in the data is simplified. Pattern discovery and data cleaning techniques are useful for reducing enormous volumes of application data, such as those encountered when analyzing automated sensor recordings.
Classification is probably the oldest and most widely used of all the KDD approaches [49]. This approach groups data according to similarities or classes. There are many types of classification techniques and numerous automated tools available. In the classification task, each tuple belongs to a class among a pre-defined set of classes. The class of a tuple is indicated by the value of a user-specified goal attribute. Tuples consist of a set of predicting attributes and a goal attribute. The task is to discover some kind of relationship between the predicting attributes and the goal attribute, so that the discovered knowledge can be used to predict the class of new tuples. Classification on continuous data is also called regression.
One of the oldest approaches is the decision tree approach, which uses production rules, builds a directed acyclic graph based on data premises, and classifies data according to its attributes. This method requires that data classes be discrete and predefined [49]. According to [19], the primary use of this approach is for predictive models that may be appropriate for either classification or regression.
Another type of classification is the Bayesian approach, although it uses probabilities and a graphical means of representation. Bayesian networks are typically used when the uncertainty associated with an outcome can be expressed in terms of a probability. This approach relies on encoded domain knowledge and has been used for diagnostic systems. Other pattern recognition applications, including the Hidden Markov Model, can be modeled using a Bayesian approach [3].
The task of clustering is to group tuples with similar attribute values into the same class. The principle is to maximize the intra-class similarity and minimize the inter-class similarity. In the clustering task there is no goal attribute, so whereas classification is supervised by the goal attribute, clustering is an unsupervised classification. Neural networks are often used in clustering. The particular case of Self-Organizing Maps (SOM), a neural network model capable of projecting high-dimensional input data onto low-dimensional arrays, has been successfully applied to clustering.
Genetic algorithms, also used for classification, are similar to neural networks although they are typically considered more powerful.
Sequential pattern mining is the process of analyzing a collection of data to identify trends among them. In sequential pattern mining, the input data is a set of sequences, called data-sequences. Each data-sequence is an ordered list of transactions. A sequential pattern consists of a list of sets of items. The problem is to find all sequential patterns with a user-specified minimum support.
Deviation and trend analysis techniques are normally applied to temporal databases. A good application for this type of KDD is the analysis of traffic on large telecommunications networks.
OLAP (On-Line Analytical Processing) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data. OLAP functionality is characterized by dynamic multi-dimensional analysis of data.
Statistical approaches are used for rule discovery and are based on data relationships. An "inductive learning algorithm can automatically select useful join paths and attributes to construct rules from a database with many relations" [31].

Web mining is a newer area of data mining. Since the web is one of the largest repositories of data, analyzing and exploring regularities in web user behavior using data mining can improve system performance and enhance the quality and delivery of Internet information services to the end user. Forrester Research Inc. reported that 16 percent of large companies expected a more effective use of customer information to help them cut costs in 1999, and predicted that an additional 34 percent were banking on such savings by 2001 [16].
A hybrid approach to KDD combines more than one approach and is also called a multi-paradigmatic approach. Although implementation may be more difficult, hybrid tools are able to combine the strengths of various approaches. Some of the commonly used methods combine visualization techniques, induction, neural networks, and rule-based systems to achieve the desired knowledge discovery. Deductive databases and genetic algorithms have also been used in hybrid approaches.
The need for automated discovery tools has caused an explosion in the number and type of tools available commercially and in the public domain. The KDnuggets™ web site [45] is updated frequently and is intended to be an exhaustive listing of currently available data mining and web mining tools and resources. Available data mining tools can be classified according to various criteria, such as: tasks and methods used, supported data sources, architectural model (client/server, standalone, parallel processing), operating system, or product and legal status [24].

3. Parallel and Distributed Data Mining

High-performance data mining refers to developing algorithms for data mining techniques that lend themselves to parallelism and can map efficiently onto parallel computers. Such algorithms should consider the underlying architecture (such as distributed shared memory or a dedicated cluster of workstations) and the standard issues and challenges in parallel processing such as data locality, load balancing, interprocessor communication, and parallel I/O. Parallel I/O is particularly important here, because massive data analysis requires substantial data access to secondary storage and tape libraries. Issues of data organization, prefetching, and latency hiding by overlapping the mining tasks with I/O soon become even more critical, and many new research challenges and opportunities will materialize.
Two dominant approaches for utilizing multiple processors have emerged: distributed memory, in which each processor has a private memory, and shared memory, in which all processors access common memory. A shared-memory architecture has many desirable properties: each processor has direct and equal access to all the memory in the system, and parallel programs are easy to implement on such a system. A different approach to multiprocessing is to build a system from many units, each containing a processor and memory. In a distributed memory architecture, each processor has its own local memory that can only be accessed directly by that processor. For a processor to have access to data in the local memory of another processor, a copy of the desired data elements must be sent from one processor to the other via message passing. Although a shared memory architecture offers programming simplicity, the finite bandwidth of a common bus can limit scalability. A distributed memory, message-passing architecture cures the scalability problem by eliminating the bus, but at the cost of programming simplicity.
A third paradigm, combining the best of both the distributed and shared memory approaches, has become increasingly popular. These systems include hardware or software distributed shared memory (DSM) systems, where the physical memory is distributed among the nodes but a shared global address space is provided on each processor, and the hardware or software ensures cache coherence, i.e., it makes sure that locally cached data always reflects the latest modification by any processor. Clusters of SMP workstations, or CLUMPS, also fall in this mixed paradigm. CLUMPS also necessitate a hierarchical parallelism approach, where SMP primitives are used within a node and message passing among the SMP nodes.
Parallel versions of most of the well-known data mining techniques have been developed in recent years. However, the expertise and effort currently required to implement, maintain, and performance-tune a parallel data mining application is a severe impediment to the wide use of parallel computers for scalable data mining.

4. Association Rule Discovery (ARD)

Since its introduction, Association Rule Discovery (ARD) [2] has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers and practitioners. ARD is an undirected or unsupervised data mining technique that works on variable-length data and produces clear and understandable results. It has an elegantly simple problem statement: find the set of all subsets of items or attributes that frequently occur in many database records or transactions, and additionally extract the rules telling us how a subset of items influences the presence of another subset. The prototypical application of ARD is market basket analysis, where the items represent products and the records the point-of-sale data at large grocery or department stores. An example rule could be that "90% of customers buying product A also buy product B." Other application domains for ARD include customer segmentation, catalog design, store layout, telecommunication alarm prediction, and so on.
Although ARD has a simple statement, it is a computationally and I/O intensive task. Given m items, there are 2^m subsets that might potentially be frequent. An exhaustive search over this exponential space is infeasible, except for very small values of m. The number of records is also huge. For example, typical department stores stock thousands of items and collect millions of customer transactions every day. Companies like Metro, DHL, UPS, etc., already have data warehouses in the terabyte range. Processing all this data requires a lot of disk I/O. Given that data is increasing both in terms of the dimensions (the number of items) and the size (the number of transactions), one of the desirable characteristics of an ARD algorithm is scalability, i.e., the ability to handle massive data stores. It is clear that sequential algorithms cannot provide this scalability, in terms of the data dimension, size or runtime performance, for such large databases. We must therefore rely on high-performance parallel and distributed computing to fill this role.
The ARD problem we consider in this report is in fact the binary association problem, i.e., we have binary data: an item is either present in or absent from a transaction. We can also consider the general case where the quantity of the items bought matters; this problem is called quantitative association mining. In general, we can have items that take values from a continuous domain, called numeric attributes, or items that take a finite number of non-numeric values, called categorical attributes. Applying ARD to such data typically requires binning or discretizing the continuous attributes into ranges, and then forming new items for each range of the numeric attributes or for each value of the categorical attributes. Another extension of ARD is called generalized association mining. Here the items are at the leaf levels of a hierarchy or taxonomy of items, and the goal is to discover association rules involving concepts at multiple (and mixed) levels, from the primitive item level to the root of the hierarchy. The computational complexity for both these generalizations of binary associations is significantly greater, and thus parallel computing is crucial for obtaining good performance.
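To make the binning step for quantitative attributes concrete, here is a minimal sketch; the attribute names, bin boundaries, and records are hypothetical and purely illustrative. Numeric attributes are mapped to range-items and categorical attributes to value-items, after which any binary ARD algorithm can be applied.

```python
def to_range_item(attribute, value, bins):
    """Map a numeric value to a range-item such as 'age=[30,50)'."""
    for lo, hi in bins:
        if lo <= value < hi:
            return f"{attribute}=[{lo},{hi})"
    return f"{attribute}=out-of-range"

# Hypothetical customer records with a numeric 'age' and a categorical 'city'.
records = [{"age": 23, "city": "Timisoara"}, {"age": 37, "city": "Arad"}]
age_bins = [(0, 30), (30, 50), (50, 120)]

# Each record becomes a binary transaction over the derived items.
transactions = [
    {to_range_item("age", r["age"], age_bins), f"city={r['city']}"}
    for r in records
]
print(transactions)   # two binary transactions over derived range/value items
```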

4.1 Theoretical Foundation of ARD and Algorithm Complexity

The association mining task can be stated as follows. Let I be a set of items, and D a database of transactions, where each transaction has a unique identifier (tid) and contains a set of items. A set of items is also called an itemset, and an itemset with k items is called a k-itemset. The support of an itemset X, denoted s(X), is the number of transactions in which it occurs as a subset. A k-length subset of an itemset is called a k-subset. A frequent itemset is maximal if it is not a subset of any other frequent itemset. An itemset is frequent or large if its support is more than a user-specified minimum support (minsup) value. The set of frequent k-itemsets is denoted Fk.
An association rule is an expression A => B, where A and B are itemsets. The support of the rule is the joint probability of a transaction containing both A and B, and is given as s(A ∪ B). The confidence of the rule is the conditional probability that a transaction contains B given that it contains A, and is given as s(A ∪ B) / s(A). A rule is frequent if its support is greater than minsup, and it is strong if its confidence is more than a user-specified minimum confidence (minconf).

The data mining task is to generate all association rules in the database which have a support greater than minsup, i.e., the rules are frequent, and which also have confidence greater than minconf, i.e., the rules are strong. This task can be broken into two steps:
1. Find all frequent itemsets having minimum support.
2. Generate strong rules having minimum confidence from the frequent itemsets.

a) Transactional Database:

    Transaction   Items
    100           ACTW
    200           CDW
    300           ACTW
    400           ACDW
    500           ACDTW
    600           CDT

b) Frequent Itemsets (minsup = 50%):

    Itemsets                                  Support
    C                                         100% (6)
    W, CW                                     83% (5)
    A, D, T, AC, AW, CD, CT, ACW              67% (4)
    AT, DW, TW, ACT, ATW, CDW, CTW, ACTW      50% (3)

    Maximal Frequent Itemsets: ACTW, CDW

c) Association Rules (minconf = 100%):

    A->C (4/4), A->W (4/4), A->CW (4/4), D->C (4/4), T->C (4/4), W->C (5/5),
    AC->W (4/4), AT->C (3/3), AT->W (3/3), AW->C (4/4), DW->C (3/3), TW->A (3/3),
    TW->C (3/3), AT->CW (3/3), TW->AC (3/3), ACT->W (3/3), ATW->C (3/3), CTW->A (3/3)

Figure 2. a) Transactional Database, b) Frequent Itemsets and c) Strong Rules

Consider the example bookstore sales database shown in Figure 2. There are five different items (names of authors the bookstore carries), i.e., I = {A, C, D, T, W}, and the database consists of six customers who bought books by these authors. Figure 2 shows all the frequent itemsets that are contained in at least three customer transactions, i.e., minsup = 50%. It also shows the set of all association rules with minconf = 100%. The itemsets ACTW and CDW are the maximal frequent itemsets. Since all other frequent itemsets are subsets of one of these two maximal itemsets, we can reduce the frequent itemset search problem to the task of enumerating only the maximal frequent itemsets. On the other hand, for generating all the strong rules, we need the support of all frequent itemsets. This can easily be accomplished once the maximal elements have been identified, by making an additional database pass and gathering the support of all uncounted subsets.
The search space for enumeration of all frequent itemsets is 2^m, which is exponential in m, the number of items. One can prove that the problem of finding a frequent set of a certain size is NP-complete, by reduction from the balanced bipartite clique problem, which is known to be NP-complete. However, if we assume that there is a bound on the transaction length, we can show that ARD is essentially linear in the database size.
Once the frequent itemsets are known, they can be used to obtain rules that describe the relationship between different itemsets. We generate and test the confidence of all rules of the form X\Y => Y, where Y ⊂ X and X is frequent, as shown in Figure 3. For example, the itemset CDW generates the following rules: {CD=>W, CW=>D, DW=>C, C=>DW, D=>CW, W=>CD}. Out of these, only DW=>C is strong at the 100% confidence level. For an itemset of size k there are 2^k − 2 potentially strong rules that can be generated. This follows from the fact that we must consider each subset of the itemset as an antecedent, except for the empty set and the full itemset. The complexity of the rule generation step is thus O(r · 2^l), where r is the number of frequent itemsets and l is the length of the longest frequent itemset. The reader is referred to [60] for a complete mathematical description and algorithm complexity analysis.

RuleGen(F, minconf):
    for all frequent itemsets f ∈ F do
        for all subsets g ⊂ f do
            conf = s(f) / s(g)
            if conf >= minconf then
                output the rule g => (f \ g), with confidence conf

Figure 3. Rule Generation Algorithm
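As an illustration of the two-step task and of the RuleGen procedure in Figure 3, the following brute-force Python sketch reproduces the results of Figure 2 on the bookstore database. It enumerates all 2^m itemsets, which is exactly the exponential search that the algorithms in the next section are designed to avoid.

```python
from itertools import combinations

# Transactional database of Figure 2a (tid -> items).
DB = {100: "ACTW", 200: "CDW", 300: "ACTW",
      400: "ACDW", 500: "ACDTW", 600: "CDT"}
TRANSACTIONS = [frozenset(items) for items in DB.values()]
MINSUP, MINCONF = 0.5, 1.0

def support(itemset):
    """s(X): number of transactions containing X as a subset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)

# Step 1: brute-force enumeration of all frequent itemsets
# (exponential in the number of items; fine for the 5 items of the example).
items = sorted(set().union(*TRANSACTIONS))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in map(frozenset, combinations(items, k)):
        if support(cand) >= MINSUP * len(TRANSACTIONS):
            frequent[cand] = support(cand)

# Step 2: RuleGen (Figure 3): for every frequent f and proper subset g,
# confidence(g => f \ g) = s(f) / s(g).
rules = []
for f, s_f in frequent.items():
    for k in range(1, len(f)):
        for g in map(frozenset, combinations(sorted(f), k)):
            if s_f / frequent[g] >= MINCONF:
                rules.append((sorted(g), sorted(f - g), s_f / frequent[g]))

print(len(frequent), "frequent itemsets")  # 19, as in Figure 2b
print(len(rules), "strong rules")          # 18, as in Figure 2c
```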

4.2 Sequential Algorithms for Association Rule Discovery

It would be fair to say that all parallel ARD algorithms proposed to date have been based on their sequential counterparts, which we briefly review here. Our coverage of sequential ARD algorithms is not intended to be complete; rather, we give specific examples only for those algorithms that have been parallelized. We begin by discussing the characteristics that make the approaches different. The sequential methods fall in a design space composed of the following characteristics.
Bottom-up vs. Hybrid Search. The main observation in ARD is that the subset relation ⊆ defines a partial order on the set of itemsets, also called a specialization relation. If g ⊆ f, we say that g is more general than f, or f is more specific than g. The second observation used is that the relation ⊆ is a monotone specialization relation with respect to the frequency s(f), i.e., if f is a frequent itemset, then all subsets g ⊆ f are also frequent. The different ARD algorithms differ in the manner in which they search the itemset lattice spanned by the subset relation. Figure 4 shows the itemset lattice, the frequent itemsets (grey circles), and the maximal frequent itemsets (black circles) for our example database. Most approaches have used a levelwise or bottom-up search of the lattice to enumerate the frequent itemsets. If long frequent itemsets are expected, a pure top-down approach might be preferred. Some have also proposed a hybrid search, which combines top-down and bottom-up approaches.
Complete vs. Heuristic Candidate Generation. ARD algorithms can also differ in the way new candidates are generated. The dominant approach is that of complete search, whereby we are guaranteed to generate and test all frequent subsets. Note that complete does not mean exhaustive, since we can use pruning to eliminate useless branches in the search space. Heuristic generation sacrifices completeness for the sake of speed: at each step, it only examines a limited number of "good" branches. Random search to locate the maximal frequent itemsets is also possible; typical methods that could be used here include genetic algorithms, simulated annealing, and so on, but these methods have not received much attention in the ARD literature, since a lot of emphasis has been placed on completeness.
All vs. Maximal Frequent Itemset Enumeration. Current ARD algorithms also differ depending on whether they generate all frequent subsets or only the maximal ones. In essence, the identification of the maximal itemsets is the core task, since all other subsets can be generated in an additional database scan. Nevertheless, the majority of algorithms list all frequent itemsets.
Horizontal vs. Vertical Data Layout. Most of the current ARD algorithms assume a horizontal database layout, where we store each customer transaction (tid) along with the items contained in the transaction. Some methods instead use a vertical database layout, where they associate with each item X its tidlist, denoted TidL(X), which is a list of all transaction identifiers (tids) containing the item. Figure 4 contrasts the two data layout schemes.

Horizontal Layout (tid -> items):          Vertical Layout (item -> tidlist):

    1   A C T W                            A   1 3 4 5
    2   C D W                              C   1 2 3 4 5 6
    3   A C T W                            D   2 4 5 6
    4   A C D W                            T   1 3 5 6
    5   A C D T W                          W   1 2 3 4 5
    6   C D T

Figure 4. a) Itemset Lattice, b) Horizontal and Vertical DB Layout

Apriori
The Apriori algorithm by Agrawal et al. [3] has emerged as one of the best current ARD algorithms. It also serves as the base algorithm for the vast majority of parallel algorithms. Apriori uses a complete, bottom-up search with a horizontal layout and enumerates all frequent itemsets. Apriori is an iterative algorithm that counts itemsets of a specific length in a given database pass. The process starts by scanning all transactions in the database and computing the frequent items. Next, a set of potentially frequent candidate 2-itemsets is formed from the frequent items. Another database scan is made to obtain their supports. The frequent 2-itemsets are retained for the next pass, and the process is repeated until all frequent itemsets have been enumerated. There are three main steps in the algorithm:
1. Generate candidates of length k from the frequent (k-1)-length itemsets, by a self-join on Fk-1. For example, for our F2 = {AC, AT, AW, CD, CT, CW, DW, TW}, we get C3 = {ACT, ACW, ATW, CDT, CDW, CTW}.
2. Prune any candidate with at least one infrequent subset. As an example, CDT will be pruned since DT is not frequent.
3. Scan all transactions to obtain the candidate supports. The candidates are stored in a hash tree (shown in Figure 5) for fast support counting.
(A short sketch of the candidate generation and pruning steps follows Figure 5.)

Figure 5: Hash Tree vs. Prefix Tree or Trie (after 3 iterations)
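A minimal Python sketch of steps 1 and 2 above: candidate generation by self-join on a common (k−2)-prefix, followed by subset-based pruning. Step 3's hash-tree support counting is omitted; a plain subset test against the transactions can stand in for it in a toy setting.

```python
from itertools import combinations

def apriori_gen(F_prev, k):
    """Generate C_k from F_{k-1}: join pairs sharing the first k-2 items,
    then prune candidates having any infrequent (k-1)-subset."""
    F_prev = {tuple(sorted(f)) for f in F_prev}
    joined = set()
    for a in F_prev:
        for b in F_prev:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                joined.add(a + (b[k - 2],))            # join step
    return {c for c in joined                          # prune step
            if all(s in F_prev for s in combinations(c, k - 1))}

# F2 from the bookstore example: the join yields the six candidates listed in
# the text, and CDT is then pruned because its subset DT is not frequent.
F2 = {"AC", "AT", "AW", "CD", "CT", "CW", "DW", "TW"}
print(sorted("".join(c) for c in apriori_gen(F2, 3)))
# ['ACT', 'ACW', 'ATW', 'CDW', 'CTW']
```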

Dynamic Hashing and Pruning (DHP)
The DHP algorithm [42] proposed by Park et al. is based on the level-wise Apriori approach. DHP uses a hash table to precompute approximate supports of 2-itemsets while counting 1-itemsets. In the second iteration, while considering all pairs of frequent items as candidates, a check is made to see whether the hash cell corresponding to a candidate has minimum support; if not, the candidate can be discarded. This hash table technique is very useful in reducing the number of candidate 2-itemsets, since without the hash table all pairs are considered as candidates. For the remaining passes DHP defaults to Apriori, since generally only the second-iteration candidates are problematic. DHP also reduces the size of the database in each iteration by removing items that do not appear in any frequent itemset, and by removing transactions that do not contain any frequent itemset. The authors' experiments demonstrate that the hash table approach can significantly reduce the number of candidate 2-itemsets. However, the benefits of database pruning are not clear: only a logical pruning can be performed via filtering, since a physical pruning would involve rewriting the database in each iteration, which is not practical.
Partition
Savasere et al. [51] proposed the two-pass Partition algorithm. It logically divides the horizontal database into a number of non-overlapping partitions. Each partition is read, and vertical tidlists are formed for each item, i.e., lists of all tids in which the item appears. Then all locally frequent itemsets are generated via tidlist intersections. All locally frequent itemsets are merged to form the global candidates for a second pass. At this stage, any candidate found to be frequent in all partitions is eliminated, since its global support is already known, as are candidates whose cumulative support over the partitions in which they are frequent shows that they cannot possibly be globally frequent. After eliminating such candidates, a second pass is made through all the partitions: the database is again converted to the vertical layout and the global counts of all the remaining candidates are obtained via tidlist intersections. The size of a partition is chosen so that it can be accommodated in main memory, so Partition makes only two database scans. The key observation is that a globally frequent itemset must be locally frequent in at least one partition, so all frequent itemsets are guaranteed to be found. For Partition to be effective, the counting of 2-itemsets has to be done using the horizontal format as in Apriori, since the pair-wise intersections of tidlists for the large number of candidate 2-itemsets can be very expensive.
SEAR, SPTID, SPEAR, and SPINC
As part of his Master's thesis, Mueller [39] presented four algorithms based on Apriori and Partition. The Sequential Efficient Association Rules (SEAR) algorithm is identical to Apriori except that SEAR stores candidates in a prefix tree instead of a hash tree (see Figure 5), and uses a pass-bundling optimization: it generates candidates for multiple passes if they fit in memory, i.e., instead of generating only candidates of length k, it also generates those of length k+1, k+2, etc., as long as there is memory to hold them. Then all current candidates can be counted in a single database scan.
This can cut down the number of database scans compared to Apriori. The Sequential Partitioning with TID (SPTID) method is identical to Partition, except that SPTID uses a prefix tree and directly performs tidlist intersections even for candidate 2-itemsets. The SPEAR algorithm is similar to SEAR but uses the Partition technique, i.e., it is the non-tidlist version of Partition: SPEAR uses the horizontal data format, but makes two scans, first to gather potentially frequent itemsets and then to obtain their global support. SPINC further reduces the processing of some partitions during the second scan. Mueller's objective was to evaluate the intrinsic benefits of partitioning irrespective of the data format used. He experimentally concluded that SPTID was too expensive, because the 2-itemset tidlist joins were too costly. He also found that partitioning did not help, because of the high overhead of processing multiple partitions and the many locally frequent but globally infrequent itemsets found by partitioning. SEAR was the winner, since it also performed pass bundling.

Dynamic Itemset Counting (DIC)
The DIC algorithm proposed by Brin et al. [8] is a generalization of Apriori. Instead of counting itemsets of length k in iteration k, DIC counts candidates of multiple lengths in the same pass. The database is divided into p equal-sized partitions such that each partition fits in memory. For partition 1, DIC gathers the supports of single items. Items found to be locally frequent (i.e., frequent in this partition only) are used to generate candidate 2-itemsets. Then partition 2 is read and supports are obtained for all current candidates, i.e., the single items and the candidate 2-itemsets. This process is repeated for the remaining partitions: DIC starts counting candidate k-itemsets while processing partition k in the first database scan. After the last partition p has been processed, the processing wraps around to partition 1 again. The global support of a candidate is known once the processing wraps around the database and reaches the partition where the candidate was first generated. If no new candidates are generated from the current partition, and all previous candidates have been counted, DIC terminates. To store candidates of different lengths, DIC uses a trie instead of a hash tree (see Figure 5). DIC is very effective in reducing the number of database scans if most partitions are homogeneous, i.e., have similar frequent itemset distributions. If the data is not homogeneous, DIC can generate many false positives (itemsets that are locally frequent but not globally frequent) and may scan the database more times than Apriori. The authors propose a random partitioning technique to reduce the data partition skew.
Eclat, MaxEclat, Clique and MaxClique
A completely different design characterizes the equivalence-class based algorithms proposed by Zaki et al. [58], among which the simplest is Eclat, while MaxClique is the best. These methods utilize a vertical database format, complete search, a mix of bottom-up and hybrid search, and generate a mix of maximal and non-maximal frequent itemsets. The main advantage of using a vertical format is that one can determine the support of any k-itemset by simply intersecting the tidlists of any two of its (k-1)-length subsets. In particular, they use the lexicographically first two (k-1)-length subsets that share a common prefix (the generating itemsets) to compute the support of a new k-length itemset. A simple check on the cardinality of the resulting tidlist tells whether the new itemset is frequent or not; no in-memory hash tree needs to be kept. Since there is a limited amount of main memory, and all the intermediate tidlists will not fit in it, they break up the large search space into small, independent, manageable chunks which can be processed in memory, via prefix-based or clique-based equivalence classes. The key observation is that each class is a sub-lattice of the original lattice and can be processed independently. Each class is independent in the sense that it has complete information for generating all frequent itemsets that share the same prefix. For example, if the prefix class [A] has the elements AC and AW, the only possible frequent itemset at the next step is ACW, and no other item, say D, can lead to a frequent itemset with the prefix A unless AD is also in [A]. The prefix-based and clique-based approaches differ in the size of the resulting classes, with the clique approach generating smaller classes at additional expense.
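Before listing the four variants, here is a minimal sketch (not the authors' implementation) of the vertical-format support counting just described, using the tidlists of Figure 4b: the support of a k-itemset is the cardinality of the intersection of the tidlists of its generating (k−1)-subsets.

```python
# Vertical layout of Figure 4b: each item mapped to its tidlist.
tidlists = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

def tidlist(itemset):
    """TidL(X) for an arbitrary itemset, by intersecting its items' tidlists."""
    result = None
    for item in itemset:
        result = tidlists[item] if result is None else result & tidlists[item]
    return result

# Extending the prefix class [A] = {AC, AW}: intersect the generating itemsets.
print(len(tidlist("AC") & tidlist("AW")))  # s(ACW) = 4 -> frequent at minsup = 3
print(len(tidlist("AD")))                  # s(AD) = 2 -> D cannot extend class [A]
```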
Among the four algorithms proposed, Eclat uses prefix-based classes and bottom-up search, MaxEclat uses prefix-based classes and hybrid search, Clique uses clique-based classes and bottom-up search, and MaxClique uses clique-based classes and hybrid search. The best approach was MaxClique, which was shown to outperform Apriori by as much as a factor of 40, Partition by a factor of 20, and Eclat by a factor of 2.5.
BitStream and Sparse-Matrix
The BitStream and Sparse-Matrix algorithms proposed by Wur et al. [56] are based on a Boolean approach. The Boolean algorithm mines association rules in two steps. In the first step, logical OR and AND operations are used to compute frequent itemsets. In the second step, logical AND and XOR operations are applied to derive all interesting association rules based on the computed frequent itemsets. By scanning the database only once and avoiding the generation of candidate itemsets, the Boolean algorithm gains a significant performance improvement over the Apriori algorithm. Two efficient implementations of the Boolean algorithm, the BitStream approach and the Sparse-Matrix approach, are proposed. Both outperform the Apriori algorithm in all database settings; in particular, the Sparse-Matrix approach shows a very significant performance improvement over Apriori.

FP-growth
In [29], Han et al. propose a novel frequent pattern tree (FP-tree) structure, an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans; (2) the FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets; and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. The FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported frequent pattern mining methods.
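To make the FP-tree idea concrete, the sketch below builds the compressed prefix tree (with header node-links) for the bookstore database in the two scans described above; the recursive mining over conditional pattern bases is omitted. This is an illustrative reconstruction, not the authors' implementation.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> FPNode

def build_fptree(transactions, minsup_count):
    """Scan 1 counts items; scan 2 inserts each transaction with its frequent
    items sorted by descending support, so shared prefixes are compressed."""
    freq = defaultdict(int)
    for t in transactions:                 # scan 1
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup_count}

    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:                 # scan 2
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])  # node-link
            node = node.children[item]
            node.count += 1
    return root, header

db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]   # Figure 2a
root, header = build_fptree(db, minsup_count=3)
print([(n.item, n.count) for n in header["D"]])
# three D nodes whose counts sum to s(D) = 4
```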

    Algorithm       DB Layout    Data Structure   Search      Enumeration               No. of DB Scans
    Apriori         Horizontal   Hash tree        Bottom-up   All                       K
    DHP             Horizontal   Hash tree        Bottom-up   All                       K
    Partition       Vertical     None             Bottom-up   All                       2
    SEAR            Horizontal   Prefix tree      Bottom-up   All                       K
    SPTID           Vertical     Prefix tree      Bottom-up   All                       I/2 + K
    SPEAR           Horizontal   Prefix tree      Bottom-up   All                       2
    SPINC           Horizontal   Prefix tree      Bottom-up   All                       ≤ 2
    DIC             Horizontal   Trie             Bottom-up   All                       ≤ K
    Eclat           Vertical     None             Bottom-up   All                       ≥ 3
    MaxEclat        Vertical     None             Hybrid      Maximal & non-maximal     ≥ 3
    Clique          Vertical     None             Bottom-up   All                       ≥ 3
    MaxClique       Vertical     None             Hybrid      Maximal & non-maximal     ≥ 3
    BitStream       Horizontal   Bit stream       Bottom-up   All                       1
    Sparse-Matrix   Horizontal   Sparse matrix    Bottom-up   All                       1
    FP-growth       Horizontal   FP-tree          Bottom-up   All                       2

Table 1: Algorithm Characteristics (K denotes the size of the longest frequent itemset, I denotes the number of frequent items)

Table 1 presents a summary of the major differences among all the algorithms reviewed above. These algorithmic characteristics should aid understanding of the parallel algorithms presented below.

4.3 Parallel Approaches of ARD Algorithms

Parallelism is expected to relieve current ARD methods of the sequential bottleneck, providing the ability to scale to massive datasets and improving the response time [61]. Achieving good performance on today's multiprocessor systems is a non-trivial task. The main challenges include minimizing synchronization and communication, balancing the workload, finding good data layouts and data decompositions, and minimizing disk I/O, which is especially important for ARD. The parallel design space spans three main components: the hardware platform, the kind of parallelism exploited, and the load balancing strategy used.

Distributed Memory Machines (DMM) vs. Shared Memory (SMP) Systems. The objectives change depending on the underlying architecture. In DMMs synchronization is implicit in message passing, so the goal becomes communication optimization. In shared-memory systems, synchronization happens via locks and barriers, and the goal is to minimize these points. Data decomposition is very important for distributed memory, but not for shared memory. While parallel I/O comes for "free" in DMMs, it can be problematic for SMP machines, which typically serialize I/O. The main challenge for obtaining good performance on a DMM is to find a good data decomposition among the nodes and to minimize communication. For SMPs the objectives are to achieve good data locality, i.e., maximize accesses to the local cache, and to avoid or reduce false sharing, i.e., minimize the ping-pong effect where multiple processors try to modify different variables that coincidentally reside on the same cache line. For today's non-uniform memory access (NUMA), hybrid and/or hierarchical machines (e.g., clusters of SMPs), the optimization parameters draw from both the DMM and SMP paradigms.
Task vs. Data Parallelism. These are the two main paradigms for exploiting algorithm parallelism. For ARD, data parallelism corresponds to the case where the database is partitioned among P processors (logically for SMPs, physically for DMMs). Each processor works on its local partition of the database but performs the same computation of counting support for the global candidate itemsets. Task parallelism corresponds to the case where the processors perform different computations independently, such as counting a disjoint set of candidates, but have or need access to the entire database. SMPs have access to the entire data, but for DMMs this can be achieved via selective replication or explicit communication of the local portions. Hybrid parallelism, combining both task and data parallelism, is also possible, and perhaps desirable for exploiting all available parallelism in ARD methods.
Static vs. Dynamic Load Balancing. In static load balancing, work is initially partitioned among the processors using some heuristic cost function, and there is no subsequent data or computation movement to correct load imbalances resulting from the dynamic nature of ARD algorithms. Dynamic load balancing seeks to address this by stealing work from heavily loaded processors and re-assigning it to lightly loaded ones. Computation movement also entails data movement, since the processor responsible for a computational task needs the data associated with that task as well. Dynamic load balancing thus incurs additional costs for work and data movement, and for the mechanism used to detect whether there is an imbalance, but it is beneficial if the load imbalance is large and if the load changes with time. Dynamic load balancing is especially important in multi-user environments with transient loads and in heterogeneous platforms, which have different processor and network speeds. These kinds of environments include parallel servers and heterogeneous, meta- and super-clusters, i.e., the so-called "grid" platforms becoming common today. All extant ARD algorithms use only a static load balancing approach, inherent in the initial partitioning of the database among the available nodes, because they assume a dedicated, homogeneous environment. Figure 6 shows where each parallel ARD method, discussed below in detail, falls in the design space.

Figure 6. Parallel ARD Algorithms
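As a concrete illustration of data parallelism with static partitioning followed by a sum-reduction (the pattern followed by Count Distribution, described below), here is a toy simulation. It uses Python multiprocessing on the example database in place of message passing across nodes; it is a sketch of the pattern, not of any specific published implementation.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

# Example database of Figure 2; in a real DMM setting each partition would sit
# on the local disk of one node and the reduction would be done via messages.
DB = [set("ACTW"), set("CDW"), set("ACTW"),
      set("ACDW"), set("ACDTW"), set("CDT")]
CANDIDATES = [frozenset(c) for c in combinations("ACDTW", 2)]  # replicated C2

def local_counts(partition):
    """Each 'processor' counts the replicated candidates in its own partition."""
    counts = Counter()
    for transaction in partition:
        for cand in CANDIDATES:
            if cand <= transaction:
                counts[cand] += 1
    return counts

if __name__ == "__main__":
    partitions = [DB[0:2], DB[2:4], DB[4:6]]           # static, equal-sized split
    with Pool(len(partitions)) as pool:
        partial = pool.map(local_counts, partitions)   # data-parallel counting
    global_counts = sum(partial, Counter())            # sum-reduction
    print(global_counts[frozenset("AC")])              # 4, as in Figure 2b
```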

4.3.1 Distributed Memory Systems

The main design issues in distributed memory systems concern the minimization of communication and an even distribution of data for good load balancing. We look at several distributed memory ARD algorithms below. These algorithms assume that the database is partitioned among all the processors in equal-sized blocks, which reside on the local disk of each processor.
SEAR/SPEAR-based
Some of the first parallel ARD methods were proposed by Mueller [39], built on top of his sequential methods, which in turn were based on Apriori and Partition. PEAR is the parallel version of SEAR. In each iteration, each processor generates a candidate prefix tree from the global frequent itemsets of the previous pass. This step is replicated on all processors, i.e., each processor has an entire copy of the same candidate set. Each node then gathers local supports, followed by a sum-reduction to obtain global supports on each processor. PPAR is based on SPEAR; in fact PPAR is the parallelization suggested, but not implemented, by Partition's authors, with the exception that PPAR uses the horizontal data format. PPAR works as follows. Each processor gathers the locally frequent itemsets of all sizes in one pass over its local database (which may be partitioned into local chunks as well). All potentially frequent itemsets are then broadcast to the other processors. Each processor then gathers the counts of these global candidates in a second local pass. Finally, a broadcast is performed to obtain the global frequent itemsets. Experiments on a 16-node IBM SP2 distributed memory machine showed that PEAR always outperformed PPAR. This is because PEAR uses pass bundling, while PPAR may generate unnecessarily many candidates that turn out to be infrequent.
DHP-based
The PDM algorithm by Park et al. [43] is based on their DHP algorithm. PDM works as follows. Each processor p generates the local supports of 1-itemsets and approximate counts for the 2-itemsets via a hash table. The global counts for 1-itemsets are obtained by an all-to-all broadcast of local counts. Since the 2-itemset hash table can be very large, directly exchanging the counts by an all-to-all broadcast can be expensive; an optimized method is used that exchanges only those cells that are guaranteed to be frequent. However, this method requires two rounds of communication. For the second pass, local candidates are generated using the global 2-itemset hash table, and for subsequent passes directly from the frequent itemsets of the previous pass. Note that candidates are generated in parallel: each processor generates its own local set, which is exchanged via an all-to-all broadcast to construct the global candidate set. Next, the local counts for all candidates are obtained and exchanged among all processors to determine the globally frequent itemsets. The stage is now set for the next iteration.
There are a number of limitations of PDM. First, PDM parallelizes candidate generation at the cost of an all-to-all broadcast to construct the entire candidate set; the communication costs may render this parallelization ineffective. Second, PDM, like DHP, prunes items and transactions that do not appear in any frequent itemset; physical pruning is not desirable, since any benefits of pruning will be overshadowed by the disk I/O. The authors presented only simulation results on an IBM SP2-type distributed memory machine, so it is difficult to assess the practical impact of their optimizations.
As noted above, the communication and writing costs of PDM are likely to render these optimizations ineffective.
Apriori-based
Count, Data and Candidate Distribution
We begin with a description of three parallel algorithms proposed by Agrawal and Shafer [4], from the group that developed Apriori. Their target machine was a 32-node IBM SP2 distributed memory machine. The Count Distribution algorithm is a simple parallelization of Apriori. All processors generate the entire candidate hash tree from Fk-1. Each processor can thus independently obtain partial supports of the candidates from its local database partition. This is followed by a sum-reduction to obtain the global counts, by exchanging local counts with all other processors. Note that only the partial counts need to be communicated, rather than merging different hash trees, since each processor has a copy of the entire tree. Once the global Fk has been determined, each processor builds the entire candidate set Ck+1 in parallel, and

repeats the process until all frequent itemsets are found. This simple algorithm minimizes communication, since only the counts are exchanged among the processors. However, since the entire hash tree is replicated on each processor, it does not utilize the aggregate memory efficiently.
The Data Distribution algorithm was designed to utilize the total system memory by generating disjoint candidate sets on each processor. However, to generate the global support each processor must scan the entire database (its local partition and all the remote partitions) in all iterations. It thus suffers from high communication overhead and performs poorly when compared to Count Distribution.
The Candidate Distribution algorithm partitions the candidates during iteration i, so that each processor can generate disjoint candidates independently of the other processors. The partitioning uses a heuristic based on support, so that each processor gets an equal amount of work. At the same time, the database is selectively replicated so that a processor can generate global counts independently. The choice of the redistribution pass involves a trade-off between decoupling processor dependence as soon as possible and waiting until sufficient load balance can be achieved; in their experiments the repartitioning was done in the fourth pass. After this, the only dependence a processor has on the other processors is for pruning the candidates. Each processor asynchronously broadcasts the local frequent set to the other processors during each iteration. This pruning information is used if it arrives in time; otherwise it is used in the next iteration. Note that each processor must still scan its local data once per iteration. Even though it uses problem-specific information, Candidate Distribution performs worse than Count Distribution: it pays the cost of redistributing the database and then scans the local database partition repeatedly.
Non-Partitioned, Simply-Partitioned, and Hash-Partitioned Apriori
Independently, Shintani and Kitsuregawa [53] proposed four Apriori-based parallel algorithms, which bear close similarity to the three discussed above. Their target machine was a 64-node Fujitsu AP1000DDV distributed memory machine. Non-Partitioned Apriori (NPA) is essentially the same as Count Distribution, except that the sum-reduction is performed on a single master processor. Simply Partitioned Apriori (SPA) is the same as Data Distribution. Hash Partitioned Apriori (HPA) is similar in spirit to Candidate Distribution. Each processor generates candidates from the previous level's frequent set and applies a hash function to determine a home processor for that candidate. If it is itself the home of a candidate, it inserts the candidate into the local hash tree; otherwise it discards the candidate. For counting, instead of selectively replicating the database as in Candidate Distribution, each processor generates a k-subset for each local transaction, calculates the destination processor, and communicates that subset to it. The home processor is responsible for incrementing the counts using the local database and any messages sent by other processors. They also propose a variant of HPA called HPA-ELD (HPA with Extremely Large Itemsets Duplication). The motivation is that even though one may partition candidates equally among processors, some candidates are more frequent than others, so their home processors will be heavily loaded while others will have a light load.
HPA-ELD addresses this by replicating the extremely frequent itemsets on all processors and processing them using the NPA scheme, i.e., no subsets are communicated for these candidates; local counts are obtained, followed by a sum-reduction for the global support. They experimentally confirmed that HPA-ELD indeed outperforms the other approaches. However, they used SPA, HPA, and HPA-ELD only for the second iteration, and the remaining passes were performed using NPA. This suggests that a hybrid approach, using HPA-ELD as long as the candidates do not fit in memory and then switching to NPA, is the best strategy. This makes sense since NPA and Count Distribution require the least amount of communication. In [54], a parallel algorithm is proposed for mining association rules with a classification hierarchy. In this algorithm, the available memory space is fully utilized by identifying the frequently occurring candidate itemsets and copying them to all the processors, so that these frequent itemsets can be processed locally without any communication. Thus, it can effectively reduce the load skew among the processors.
Intelligent Data Distribution and Hybrid Distribution
Han et al. [28] have proposed two ARD methods based on Data Distribution. Their platform is a 128-node Cray T3D distributed memory machine.

They observe that Data Distribution uses an expensive all-to-all broadcast to send the local database portion to every other processor. Further, while Data Distribution manages to divide the candidates equally among the processors, it fails to divide the work done on each transaction, i.e., one still generates the subsets of the transaction and searches the hash tree for each of them. In Intelligent Data Distribution (IDD), they firstly use a linear-time, ring-based all-to-all broadcast for communication. Secondly, they switch to Count Distribution once the candidates fit in memory. Thirdly, instead of a round-robin partitioning of the candidates, they perform a single-item, prefix-based partitioning. Before processing a transaction, they make sure that it contains the relevant prefixes; if not, the transaction can be discarded. Note that the entire database is still communicated, but a transaction may not be processed if it does not contain relevant items.
Hybrid Distribution (HD) is a combination of Count Distribution and Intelligent Data Distribution. It partitions the P processors into G equal-sized groups. Each of the G groups is considered a super-processor, and Count Distribution is used across the G super-processors. Among the P/G processors within a group, Intelligent Data Distribution is used. The database is horizontally partitioned among the G super-processors, and the candidates are partitioned among the P/G processors in a group. Additionally, the number of groups is decided dynamically for each pass. The advantages of Hybrid Distribution are that it cuts down the database communication costs by a factor of G, and it tries to keep the processors busy, especially during the later iterations. Their experiments showed that while Hybrid Distribution has the same performance as Count Distribution, it can handle much larger databases.
Fast Distributed
Cheung et al. [12] proposed the Fast Distributed Mining (FDM) algorithm for ARD. The main difference between parallel and distributed data mining is the interconnection network latency and bandwidth: in distributed mining we assume that the network is much slower. Apart from this distinction, the difference between the two is becoming blurred today. For a slow network, any variant of Data Distribution, which essentially communicates the entire database in each iteration, is not practical, given the communication costs. Since Count Distribution has the least communication cost, it is an ideal method to improve in a distributed environment. FDM builds on top of Count Distribution and proposes new techniques to reduce the number of candidates considered for counting, which also minimizes communication. FDM assumes that the database is horizontally partitioned among the distributed sites. It uses the observation that any globally frequent itemset must be locally frequent at some site, so the only candidates a site has to consider are the ones generated from the itemsets that are both globally and locally frequent at that site, denoted GLi for site i. For example, out of all the frequent items F1 = {A, B, C, D, E}, let's say that GL1 = {A, B, C} and GL2 = {C, D, E}. Then the first site considers only the candidates CG1 = {AB, AC, BC}, and the other only CG2 = {CD, CE, DE}. Instead of these six candidates, Count Distribution would generate C(5,2) = 10 candidates. FDM also suggests three optimizations: local pruning, global pruning, and count polling. Their algorithm FDM-LP uses local pruning and count polling. It works as follows. Each site generates candidates using the GLi from all sites, and assigns a home site to each candidate.
Then each site computes the local support for all candidates. Next comes the local pruning step: remove any itemset X that is not locally frequent at the current site, since if X is globally frequent it must be locally frequent at some other site, which will account for it. The next step is the count polling optimization. Each home site requests all other sites to send the local counts of the candidates assigned to it, and computes their global support. The home site then broadcasts the global supports to all other sites. At the end, each site has the globally frequent set, and a new iteration may begin. Recall that Count Distribution broadcasts the local counts of all candidates to everyone else, while FDM sends them to only one home site per candidate. Thus FDM requires much less communication, and local pruning cuts it down even more. Another optimization they suggest is global pruning: instead of sending only the global supports of the frequent itemsets, the sites also exchange the local supports in each partition at the end of iteration k-1; these local counts can then be used to bound the global support of candidates in iteration k and prune those that cannot be frequent. FDM was evaluated on a cluster of six workstations connected by a 10Mb Ethernet LAN. Their experiments, which tested only local pruning with count polling, showed a reduction of 75-90% in the candidate set size on each site, and a reduction of 85-90% in the message size.
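To make the interplay of local pruning and count polling more concrete, the following Python sketch simulates one FDM-like counting round over horizontally partitioned sites held in memory; the function and variable names (fdm_round, the hash-based home-site assignment, etc.) are illustrative and not taken from the original paper.

from math import ceil

def fdm_round(partitions, candidates, min_sup_ratio):
    """One FDM-like counting round over horizontally partitioned 'sites'."""
    n_sites = len(partitions)
    total = sum(len(p) for p in partitions)

    def count(site, itemset):
        return sum(1 for t in partitions[site] if itemset <= t)

    # Local pruning: a site only puts forward candidates that are locally frequent there.
    proposed = set()
    for i, part in enumerate(partitions):
        local_min = ceil(min_sup_ratio * len(part))
        proposed |= {c for c in candidates if count(i, c) >= local_min}

    # Count polling: each surviving candidate gets a home site, which collects
    # the local counts from every site and computes the global support.
    assigned = {i: [] for i in range(n_sites)}
    for c in proposed:
        assigned[hash(c) % n_sites].append(c)
    global_support = {}
    for home_site, cands in assigned.items():
        for c in cands:
            global_support[c] = sum(count(i, c) for i in range(n_sites))

    # The home sites then broadcast the globally frequent itemsets to all sites.
    global_min = ceil(min_sup_ratio * total)
    return {c for c, s in global_support.items() if s >= global_min}

# Example with two simulated sites; items are single letters.
sites = [[frozenset("AB"), frozenset("ABC"), frozenset("AC")],
         [frozenset("CD"), frozenset("CDE"), frozenset("AC")]]
print(fdm_round(sites, [frozenset("AC"), frozenset("DE")], min_sup_ratio=0.5))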

Fast Parallel
Cheung and Xiao [13] have proposed a parallel version of FDM, called Fast Parallel Mining (FPM). The problem with FDM's polling mechanism is that it requires two rounds of messages in each iteration: one for computing the global supports, and one for broadcasting the frequent itemsets. This two-round scheme can degrade performance in a parallel setting. FPM generates fewer candidates and retains the local and global pruning steps, but instead of count polling and a subsequent broadcast of frequent itemsets, it simply broadcasts the local supports to all processors. The more interesting aspect of this work is a metric they define for data skewness, i.e., for the distribution of itemsets among the various partitions. For an itemset X, let pX(i) denote the probability that X occurs in partition i. Then the entropy of X is given as:

H(X) = – ∑_{i=1..n} pX(i) log(pX(i))

The entropy measures the distribution of the local support counts of X among all n partitions. The skewness of an itemset X is given as:

S(X) = (Hmax – H(X)) / Hmax

where Hmax = log(n) for n partitions. S(X) takes the value 0 if X has equal support in all partitions, and 1 if X occurs in only one partition. The total data skewness of a database is the sum of the skewness of all itemsets weighted by their supports; in practice, only the skew of the frequent itemsets needs to be considered. Their experiments on a 32-node IBM SP2 indicate that FPM can outperform Count Distribution by a factor of 3 for datasets with very high skew, and by a factor of 1.5 for low-skew data.

4.3.2 Shared Memory Machines

The main design issues in shared memory systems concern the minimization/elimination of false sharing, and the maintenance of good data locality. SMP platforms have not received widespread attention in the parallel ARD literature. However, with the advent of multiprocessor desktops and of clusters of SMPs (CLUMPS; e.g., the nodes of an IBM SP2 can be 8-way SMPs), SMP platforms are becoming increasingly important.

Apriori-based
One of the first algorithms targeting SMP machines was Common Candidate Partitioned Database (CCPD), proposed by Zaki et al. [57]. As the name suggests, CCPD uses a data parallel approach. The database is logically partitioned into equal-sized chunks, and all the processors synchronously process a global or common candidate hash tree. CCPD also parallelizes the candidate generation step: each processor generates a disjoint subset of the candidates, with a good division of the computation. To build the hash tree in parallel, CCPD associates a lock with each leaf node. When a processor wants to insert a candidate into the tree, it starts at the root and successively hashes on the items until it reaches a leaf; it then acquires the lock and inserts the candidate. With this locking mechanism, each processor can insert itemsets into different parts of the hash tree in parallel. For support counting, each processor computes frequencies over its logical partition. They also proposed additional optimizations such as short-circuited join and hash tree balancing. Short-circuited join propagates bit markers up the hash tree so that subtrees already processed earlier are not processed again. For hash tree balancing they use a new hash function, so that the resulting tree is balanced, whereas a simple mod function can lead to skewed trees. A balanced tree speeds up the processing, since it has a shorter height. CCPD's performance evaluation was done on a 12-node SGI Power Challenge.
They were able to obtain reasonable speedup, but the serial I/O was detrimental to performance. They also implemented the Partitioned Candidate Common Database (PCCD) algorithm, where the processors construct disjoint candidate hash trees and scan the entire database to obtain the candidate supports. However, the I/O overhead and disk contention of PCCD were unacceptable, resulting in slow-downs on more than one processor.

In recent work, they have proposed memory placement optimizations for speeding up CCPD. They showed that, due to the nature of hashing, the candidate hash tree has very poor data locality, and further that a common tree can lead to false sharing in the support counting phase. They propose a set of mechanisms and policies for controlling the memory layout of the hash tree based on the access patterns in support counting. Their schemes ensure that the nodes most likely to be accessed in sequence also lie close in physical memory, leading to good locality. They also proposed an effective privatization mechanism for reducing false sharing, in which each processor collects counts in a local array, followed by a sum-reduction. Experiments on a 12-node SGI Challenge showed improvements of 50-60% over the base case.

DIC-based
Cheung et al. [14] have proposed the Asynchronous Parallel Mining (APM) algorithm, which is based on DIC. APM uses the global pruning technique of FDM to reduce the size of the candidate 2-itemsets. This pruning is most effective when there is high data skew among the partitions. However, DIC requires that the partitions be homogeneous, as explained above. APM addresses this problem by treating the first iteration separately. APM logically divides the database into many small equal-sized virtual partitions, say l, where l is independent of the number of processors p, but usually l ≥ p. Let m be the number of items. APM gathers the local counts of the m items in each partition. This forms an l x m dataset, i.e., l item-support vectors in an m-dimensional space. These l vectors are grouped into k clusters, such that the inter-cluster distance is maximized and the intra-cluster distance is minimized. Thus the k clusters or partitions are as skewed as possible, and they are used to generate a small set of candidate 2-itemsets. APM now prepares to apply DIC in parallel. The idea is to divide the database into p homogeneous partitions. Each processor independently applies DIC to its local partition. However, there is a shared trie among all processors, which is built asynchronously. APM stops when all processors have processed all candidates, whether generated by them or by anyone else, and when no new candidates are generated. For each processor to apply DIC to its local partition, it has to divide it into sub-partitions, say r. Further, DIC requires that both the p inter-processor partitions and the r intra-processor partitions be as homogeneous as possible. APM ensures that the p partitions are homogeneous by assigning the virtual partitions from each of the k clusters of the first pass in a round-robin manner among the p processors. Thus each processor gets an equal mix of virtual partitions from separate clusters, resulting in quite homogeneous processor partitions. To get intra-processor partition homogeneity, a secondary k-clustering is performed: the virtual partitions of a processor are grouped into k clusters, and elements from each of the k clusters are again assigned to the r sub-partitions in a round-robin manner. Experiments on a 12-node Sun Enterprise 4000 SMP indicate that APM outperforms a Count Distribution/CCPD-like algorithm by a factor of 4 to 5. An interesting tradeoff in APM is that while data skewness is good for global pruning, it is detrimental to workload balance.

4.3.3 Hierarchical Systems

A hierarchical system has both distributed-memory and shared-memory components, for example a cluster of SMP workstations.
Hierarchical systems are becoming increasingly popular today, especially with the advent of multiprocessor desktops and recent advances in high-speed networks. These clusters provide scalability and performance comparable to those of expensive machines, at an attractive cost. In fact, even distributed memory machines like the IBM SP2 can have 8-way SMP nodes; another example is the SGI Origin NUMA hardware distributed shared memory system. In a hierarchical system one has to optimize inter-node communication and data decomposition, and also intra-node data locality and false sharing within each SMP node.

Eclat-based
Zaki et al. [59] proposed four algorithms, ParEclat, ParMaxEclat, ParClique, and ParMaxClique, targeting hierarchical systems. All four are based on their sequential counterparts. In the discussion below we refer to a p-way SMP node as a host, and we assume that there are n hosts, for a total of np processors in the system. These methods assume that the database is in the vertical format and is partitioned among the hosts in such a manner that each host gets entire tidlists for a subset of the items (a tidlist is never split across hosts), and the total length of the local tidlists is roughly equal on all hosts. Each host further splits its local tidlists into p vertical partitions, so that each processor in the system has its own vertical partition. All four algorithms have a similar parallelization and differ only in the search strategy and the equivalence class decomposition technique used. There are three main phases: the initialization phase, which performs computation and data partitioning, the asynchronous phase, where each processor independently generates frequent itemsets, and the reduction phase, where the final results are aggregated. The initialization phase works as follows: from the frequent 2-itemsets, the master host generates the parent classes using prefix- or clique-based partitioning. These classes are then scheduled among all available processors using a greedy algorithm: each class is assigned a weight based on its cardinality, and the classes are sorted on their weights and assigned, in turn, to the processor with the least total weight. After parent class scheduling, tidlists are selectively replicated on each host, so that the tidlists of all items belonging to some class assigned to one of the host's processors are available on the host's local disk. Only the hosts take part in this communication. In the asynchronous phase, each processor has available the classes assigned to it and the tidlists of all the items occurring in them, so it can independently generate all frequent itemsets from its classes. No communication or synchronization is required. Further, all available memory of the system is used, no in-memory hash or prefix trees are needed, and only simple intersection operations are required for itemset enumeration. The four algorithms differ in the decomposition and search strategy used: ParEclat and ParMaxEclat use prefix-based classes with bottom-up and hybrid search, respectively, while ParClique and ParMaxClique use smaller clique-based classes, with bottom-up and hybrid lattice search, respectively. They experimented on a 32-processor Digital Alpha cluster, with 8 four-way SMP hosts connected by the fast Digital Memory Channel network. Comparisons with a hierarchical implementation of Count Distribution/CCPD showed order-of-magnitude improvements of ParMaxClique over Count Distribution.

4.4 Summary of Parallel Algorithms

Although a plethora of parallel association rule mining methods has been presented above, the larger picture may be hard to see; it is therefore instructive to refer to Table 2. It shows the essential differences among the methods reviewed above and groups related algorithms together. As one can see, there are only a handful of distinct paradigms; the other algorithms propose optimizations over these basic cases. For example, PEAR, PDM, NPA, FDM, FPM and CCPD are all similar to Count Distribution. Similarly, SPA, IDD and PCCD are similar to Data Distribution, while HPA and HPA-ELD are similar to Candidate Distribution. Hybrid Distribution combines the Count and Data Distribution techniques. ParEclat, ParMaxEclat, ParClique and ParMaxClique are all based on their sequential counterparts. Finally, APM is based on the sequential DIC method, while PPAR is based on Partition. These parallel methods thus share the complexity and properties of the sequential algorithms on which they are based (see Table 1).
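As a minimal illustration of the vertical-format computation on which the Eclat family above relies, the short Python sketch below (illustrative names and data only) stores one tidlist per item and obtains the support of a candidate itemset by intersecting tidlists; this simple set intersection is the only operation needed in the asynchronous phase.

# A tiny vertical-format (tidlist) layout: item -> set of transaction ids.
vertical_db = {
    "A": {1, 2, 3, 5},
    "B": {2, 3, 4},
    "C": {1, 2, 3},
}

# Candidate {A, B}: intersect the tidlists of A and B.
tid_ab = vertical_db["A"] & vertical_db["B"]          # {2, 3}
print("support(AB) =", len(tid_ab))

# Longer itemsets reuse the intermediate tidlists, e.g. {A, B, C}.
tid_abc = tid_ab & vertical_db["C"]
print("support(ABC) =", len(tid_abc))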
The ARD problem we have considered in this report is in fact the binary association problem, i.e., the data are binary: an item is either present in or absent from a transaction. We can also consider the general case where the quantity of the items bought is taken into account; this problem is called quantitative association mining. In general, we can have items that take values from a continuous domain, called numeric attributes, or items that take a finite number of non-numeric values, called categorical attributes. Applying ARD to such data typically requires binning or discretizing the continuous attributes into ranges, and then forming a new item for each range of a numeric attribute or for each value of a categorical attribute. Another extension of ARD is called generalized association mining. Here the items are at the leaf level of a hierarchy or taxonomy of items, and the goal is to discover association rules involving concepts at multiple (and mixed) levels, from the primitive item level to the root of the hierarchy. The computational complexity of both these generalizations of binary associations is significantly greater, and thus parallel computing is crucial for obtaining good performance.
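A small, hedged sketch of the binning step mentioned above for quantitative association mining; the attribute names and bin boundaries are arbitrary illustrations, not part of any particular algorithm.

def to_range_item(attribute, value, boundaries):
    """Map a numeric value to a discrete item such as 'age=[30,40)'."""
    for lo, hi in zip(boundaries, boundaries[1:]):
        if lo <= value < hi:
            return f"{attribute}=[{lo},{hi})"
    return f"{attribute}=[{boundaries[-1]},inf)"

# Example: discretize 'age' into ranges and keep categorical values as items.
age_bins = [0, 30, 40, 60]
transaction = {"age": 35, "employment": "Salaried"}
items = {to_range_item("age", transaction["age"], age_bins),
         f"employment={transaction['employment']}"}
print(items)   # {'age=[30,40)', 'employment=Salaried'}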

Algorithm                Characteristics
Count Distribution       Apriori-based
PEAR                     candidate prefix tree
PDM                      hash table for 2-itemsets, parallel candidate generation
NPA                      only master does sum-reduction
FDM                      local and global pruning, count-polling
FPM                      local and global pruning, skewness-handling
CCPD                     shared-memory
Data Distribution        exchange full DB per iteration
SPA                      -
IDD                      ring-based broadcast, item-based candidate partitioning
PCCD                     shared-memory (logical DB exchange)
Hybrid Distribution      combines Count & Data Distribution
Candidate Distribution   selectively-replicated DB, asynchronous
HPA                      no DB replication, exchange itemsets
HPA-ELD                  replicate frequent itemsets
ParEclat                 Eclat-based
ParMaxEclat              MaxEclat-based
ParClique                Clique-based
ParMaxClique             MaxClique-based
APM                      DIC-based, shared-memory
PPAR                     Partition-based, horizontal DB

Table 2: Parallel Algorithm Characteristics

5. Classification

Predictive modeling targets predicting one or more fields in the data by using the rest of the fields. When the variable being predicted is categorical (to approve or reject a loan, for example), the problem is called classification. Classification is a well-recognized data mining operation that has been studied extensively in the statistics, pattern recognition, decision theory, machine learning and neural network literature. When the variable is continuous (such as expected profit or loss), the problem is referred to as regression. The simplest approaches include standard techniques for linear regression, generalized linear regression, and discriminant analysis. Methods popular in data mining include decision trees, rules, neural networks (nonlinear regression), radial basis functions, and many others. Because decision trees (also known as classification trees) are the most popular and well-developed classification approach, the rest of this section addresses the problems related to their construction. Let us review their main advantages over other classification models:
- due to their intuitive representation, the resulting classification model is easy to assimilate by humans [7, 38];
- decision trees do not require any parameter setting from the user and are thus especially suited for exploratory knowledge discovery;
- decision trees can be constructed relatively fast compared to other methods [52, 21];
- the accuracy of decision trees is comparable or superior to that of other classification models [40, 33].
For example, based on debt level, income level and employment type, one can use predictive modeling to predict the credit risk of a given customer. The classification algorithm determines the relationship of these attributes to the risk class in a training data set where the risk is known. Decision trees are a common and useful technique for predictive modeling. Figure 7 shows a) a set of training data that will be used to predict credit risk and b) a decision tree that might be created from this data.

Customer ID   Debt level   Income level   Employment type   Credit risk
1             High         High           Self-employed     Bad
2             High         High           Salaried          Bad
3             High         Low            Salaried          Bad
4             Low          Low            Salaried          Good
5             Low          Low            Self-employed     Bad
6             Low          High           Self-employed     Good
7             Low          High           Salaried          Good

Figure 7. a) Sample Data b) Decision Tree

In this trivial example, a decision tree algorithm might decide that the most significant attribute for predicting credit risk is debt level. The first split in the decision tree is therefore made on debt level. One of the two new nodes (debt level = high) is a leaf node, having three bad credit risks and no good credit risks. In this example, a high debt level is a perfect predictor of a bad credit risk. The other node (debt level = low) is still mixed, having three good credit risks and one bad. The decision tree algorithm then chooses employment type as the next most significant predictor of credit risk. The split on employment type gives two leaf nodes. When the scale of the problem expands beyond this trivial example, it becomes very difficult for a person to extract the rules that identify good and bad credit risks. The classification algorithm, on the other hand, can consider hundreds of attributes and millions of records to come up with the decision tree that describes the rules for credit risk prediction.
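To make the split evaluation in this example tangible, the short Python sketch below (illustrative only) computes the class distributions obtained by splitting the Figure 7 data first on debt level and then, within the low-debt branch, on employment type; the counts reproduce the pure and mixed nodes described above.

from collections import Counter

# Training data from Figure 7a: (debt, income, employment, credit risk).
records = [
    ("High", "High", "Self-employed", "Bad"),
    ("High", "High", "Salaried",      "Bad"),
    ("High", "Low",  "Salaried",      "Bad"),
    ("Low",  "Low",  "Salaried",      "Good"),
    ("Low",  "Low",  "Self-employed", "Bad"),
    ("Low",  "High", "Self-employed", "Good"),
    ("Low",  "High", "Salaried",      "Good"),
]

def class_counts(rows):
    return Counter(r[-1] for r in rows)

high_debt = [r for r in records if r[0] == "High"]
low_debt  = [r for r in records if r[0] == "Low"]
print("debt=high:", class_counts(high_debt))   # pure node: 3 Bad
print("debt=low: ", class_counts(low_debt))    # mixed node: 3 Good, 1 Bad

# Second split, on employment type, inside the low-debt branch.
for emp in ("Salaried", "Self-employed"):
    branch = [r for r in low_debt if r[2] == emp]
    print(f"debt=low, employment={emp}:", class_counts(branch))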

5.1 Theoretical Foundation of Decision Trees

The problem can be formally stated as follows. Let X1, …, Xm, C be random variables where Xi has domain dom(Xi); we assume without loss of generality that dom(C) = {1, 2, …, k}. A classifier is a function d: dom(X1) x … x dom(Xm) → dom(C). Let P(X', C') be a probability distribution on dom(X1) x … x dom(Xm) x dom(C) and let t = <t.X1, …, t.Xm, t.C> be a record randomly drawn from P, i.e., t has probability P(X', C') that <t.X1, …, t.Xm> ∈ X' and t.C ∈ C'. We define the misclassification rate Rd of classifier d to be P(d(t.X1, …, t.Xm) ≠ t.C). In terms of the informal introduction, the training database D is a random sample from P, the Xi correspond to the predictor attributes and C is the class label attribute.

A decision tree is a special type of classifier. It is a directed, acyclic graph T in the form of a tree. The root of the tree does not have any incoming edges; every other node has exactly one incoming edge and may have outgoing edges. A special case are binary decision trees, the most popular class of decision trees, so we assume in the remainder of this section that each node has either zero or two outgoing edges. If a node has no outgoing edges it is called a leaf node, otherwise it is called an internal node. Each leaf node is labeled with one class label; each internal node n is labeled with one predictor attribute Xn, called the splitting attribute. Each internal node n has a predicate qn, called the splitting predicate, associated with it. If Xn is a numerical attribute, qn is of the form Xn ≤ xn, where xn ∈ dom(Xn); xn is called the split point at node n. If Xn is a categorical attribute, qn is of the form Xn ∈ Yn, where Yn ⊂ dom(Xn) is called the splitting subset at node n. The combined information of splitting attribute and splitting predicate at node n is called the splitting criterion of n. If we talk about the splitting criterion of a node n in the final tree that is output by the algorithm, we sometimes refer to it as the final splitting criterion; the terms final splitting attribute, final splitting subset, and final split point are used analogously.

We associate with each node n ∈ T a predicate fn: dom(X1) x … x dom(Xm) → {true, false}, called its node predicate, as follows:

  fn = true, if n is the root node
  fn = fp ∧ qp, if n is the left child of p
  fn = fp ∧ ¬qp, if n is the right child of p

where qp is the splitting predicate of p. Informally, fn is the conjunction of all splitting predicates on the internal nodes on the path from the root node to n. Since each leaf node n ∈ T is labeled with a class label, it encodes a classification rule fn → c, where c is the label of n. Thus the tree T encodes a function T: dom(X1) x … x dom(Xm) → dom(C) and is therefore a classifier, called a decision tree classifier. Let us define the notion of the family of tuples of a node in a decision tree T with respect to a database D. (We will drop the dependency on D from the notation since it is clear from the context.) For a node n ∈ T with parent p, Fn is the set of records in D that follow the path from the root to n when being processed by the tree, formally Fn = {t ∈ D : fn(t) = true}.

We can now formally state the problem of decision tree construction. Decision tree classification problem: given a dataset D = {t1, …, tn} where the ti are independent random samples from a probability distribution P, find a decision tree classifier T such that the misclassification rate RT(P) is minimal.

A decision tree is usually constructed in two phases:
1. the growth phase, in which an overly large decision tree is constructed from the training data;
2. the pruning phase, in which the final size of the tree T is determined, with the goal of minimizing RT.
The growth phase, due to its data-intensive nature, is a very time-consuming part of decision tree construction [38, 52, 21]. All decision tree construction algorithms grow the tree top-down in the following greedy way: at the root node, the database is examined and a splitting criterion is selected; recursively, at a non-root node n, the family of n is examined and a splitting criterion is selected from it. This is the well-known schema for greedy top-down decision tree induction, also known as Hunt's method [49], and it is depicted in Figure 8 for both binary and k-way (general case) decision trees. There are two major issues that have critical performance implications in the tree-growth phase: 1) how to find split points that define node tests, and 2) having chosen a split point, how to partition the data. The pruning phase generalizes the tree and increases the classification accuracy on new examples by removing statistical noise or variations. The tree-growth phase is computationally much more expensive than pruning, since the data is scanned multiple times in this part of the computation. The pruning phase typically takes less than 1% of the total time needed to build a classifier, and it can easily be done offline on a serial processor, as it is computationally inexpensive and requires access only to the decision tree grown in the training phase. In both sequential and parallel tree growth, the primary problems remain finding good split points and partitioning the data using the discovered split points.
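As a small illustration of the node predicate fn and the family Fn defined above, the following Python sketch (the class and helper names are only illustrative) represents a binary tree node through its parent link and splitting predicate, and computes a node's family by filtering the database with the conjunction of the splitting predicates on the path from the root.

from dataclasses import dataclass
from typing import Callable, Optional

Record = dict   # a record maps attribute names to values

@dataclass
class Node:
    parent: Optional["Node"] = None
    is_left_child: bool = True
    split: Optional[Callable[[Record], bool]] = None   # splitting predicate q_n

def node_predicate(n, t):
    """f_n(t): conjunction of the splitting predicates on the path from the root to n."""
    if n.parent is None:
        return True                             # f_root = true
    q_p = n.parent.split(t)                     # parent's splitting predicate q_p
    return node_predicate(n.parent, t) and (q_p if n.is_left_child else not q_p)

def family(n, D):
    """F_n = {t in D : f_n(t) = true}."""
    return [t for t in D if node_predicate(n, t)]

# Example: the root splits on a numerical attribute, "income <= 40".
root = Node(split=lambda t: t["income"] <= 40)
left = Node(parent=root, is_left_child=True)
D = [{"income": 30}, {"income": 55}]
print(family(left, D))    # [{'income': 30}]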
In the remainder of this section we will focus on the tree-growth phase. It is worth mentioning that there are proposals in which the two phases described above are interleaved. PUBLIC [50] is such an algorithm: stopping criteria are calculated during tree growth to inhibit the further construction of parts of the tree when appropriate. The different strategies applied in the growth and pruning phases have led to an impressive number of decision tree construction algorithms in the last two decades; some of them are briefly described in the next section.

TDTree(RootNode n, partition D, split selection method CL)
  Apply CL to D to find the splitting criterion for n;
  if (n splits) then
    Use best split to partition D into D1, D2;
    Create children n1 and n2 of n;
    TDTree(n1, D1, CL);
    TDTree(n2, D2, CL);
  endif

TDTree(RootNode n, partition D, split selection method CL)
  Apply CL to D to find the splitting criterion for n;
  Let k be the number of children of n;
  if (k > 0) then
    Use best split to partition D into D1, …, Dk;
    Create k children n1, …, nk of n;
    for i = 1, …, k do TDTree(ni, Di, CL);
  endif

Figure 8. Top-Down Tree Induction Schema (Output: decision tree for D rooted at n) a) Binary Decision Trees b) k-way Decision Trees
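A hedged, runnable rendering of the binary schema of Figure 8 in Python, assuming numerical attributes and an entropy-based split selection method CL; all names are illustrative, and real classifiers such as SLIQ or SPRINT (discussed next) add pre-sorting, attribute lists and pruning on top of this skeleton.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(D):
    """CL: pick the (attribute, split point) with the highest information gain."""
    labels = [label for _, label in D]
    base, best = entropy(labels), None
    for a in D[0][0]:                                   # candidate attributes
        for x in sorted({rec[a] for rec, _ in D}):      # candidate split points
            left = [l for rec, l in D if rec[a] <= x]
            right = [l for rec, l in D if rec[a] > x]
            if not left or not right:
                continue
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(D)
            if best is None or gain > best[0]:
                best = (gain, a, x)
    return best                                          # None means: do not split

def td_tree(D):
    """Greedy top-down induction of a binary tree; leaves store the majority class."""
    labels = [label for _, label in D]
    split = best_split(D) if len(set(labels)) > 1 else None
    if split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]      # leaf node
    _, a, x = split
    left = [(r, l) for r, l in D if r[a] <= x]
    right = [(r, l) for r, l in D if r[a] > x]
    return {"attr": a, "split": x, "left": td_tree(left), "right": td_tree(right)}

# Tiny example: records are (attribute dict, class label) pairs.
data = [({"debt": 80}, "Bad"), ({"debt": 75}, "Bad"),
        ({"debt": 20}, "Good"), ({"debt": 30}, "Good")]
print(td_tree(data))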

5.2 Sequential Decision-Tree Classifiers

Almost all decision tree construction algorithms (including C4.5 [49], CDP [1], CART [7], CHAID [36], CLOUDS [5], FACT [35], ID3 and its extensions [46, 47, 48, 11, 17], SLIQ and SPRINT [38, 37, 52] and QUEST [34]) access the data using Hunt's schema. We briefly describe some of them below.

ID3, C4/C4.5/C5
The C4.5 [49] algorithm is an improvement over its older ancestors ID3 and C4 [49]. The decision tree is grown using a depth-first strategy. The algorithm considers all the possible tests that can split the data set and selects the test that gives the best information gain. For each discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For each continuous attribute, binary tests involving every distinct value of the attribute are considered. In order to evaluate the entropy gain of all these binary tests efficiently, the training data set belonging to the node under consideration is sorted on the values of the continuous attribute, and the entropy gains of the binary cuts based on each distinct value are calculated in one scan of the sorted data. This process is repeated for each continuous attribute.

SLIQ, SPRINT
The recently proposed classification algorithms SLIQ [38] and SPRINT [52] avoid costly sorting at each node by pre-sorting continuous attributes once in the beginning. SPRINT builds the tree breadth-first and uses a one-time pre-sorting technique to reduce the cost of continuous attribute evaluation. For each attribute, it initially creates a disk-based attribute list consisting of an attribute value, a class label, and a tuple identifier (tid). Initial lists for continuous attributes are sorted by attribute value when first created, while the lists for categorical attributes remain in unsorted order. As the tree is split into subtrees, the attribute lists are also split; by preserving the order of records in the partitioned lists, they do not require resorting. SPRINT uses the gini index [7] as the measure of split quality. The gini index is computed by scanning a node's attribute lists and computing the class distributions on both sides of each candidate split point. The attribute with the minimum gini value, together with the associated winning split point, is used to partition the data. Avoiding multiple attribute lists per node is an optimization that speeds up the execution. In [37] it is shown that SLIQ/SPRINT achieve similar or better classification accuracy and produce smaller trees when compared to other classifiers such as CART and C4.

RainForest Framework (RF-Write, RF-Read, RF-Hybrid, RF-Vertical)
In [23], Gehrke et al. present a unifying framework for decision tree classifiers that separates the scalability aspects of the algorithms from the central features that determine the quality of the tree. This generic framework is easy to instantiate with specific algorithms from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST). In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, the approach also offers performance improvements of over a factor of five over the SPRINT algorithm. The authors propose four new algorithms (RF-Write, RF-Read, RF-Hybrid, RF-Vertical) that exploit the benefits of the AVC (Attribute-Value-Classlabel) set. Rastogi and Shim developed PUBLIC, an MDL-based pruning algorithm for binary trees that is interleaved with the tree growth phase [50].

BOAT
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction), proposed by Gehrke et al. [22], constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction, in which an initial tree is constructed using a small subset of the data and then refined to arrive at the final tree. Any difference with respect to the "real" tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected; the correction step occasionally requires an additional scan over subsets of the data. BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely re-building the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.
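A hedged sketch of the gini-based evaluation that SPRINT performs over a pre-sorted attribute list, using illustrative names: the class histograms below and above the scan position are updated incrementally, so all candidate split points of one continuous attribute are evaluated in a single pass over its list.

from collections import Counter

def gini(counts, n):
    """Gini index of a class histogram with n records in total."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_gini_split(attribute_list):
    """attribute_list: (value, class_label, tid) tuples sorted by value,
    as in a SPRINT attribute list. Returns (weighted gini, split value)."""
    n = len(attribute_list)
    above = Counter(label for _, label, _ in attribute_list)   # not-yet-scanned records
    below = Counter()                                          # records already scanned
    best = None
    for i, (value, label, _) in enumerate(attribute_list[:-1]):
        below[label] += 1
        above[label] -= 1
        # Candidate split "attr <= value"; skip duplicate values.
        if value == attribute_list[i + 1][0]:
            continue
        n_lo, n_hi = i + 1, n - i - 1
        score = (n_lo * gini(below, n_lo) + n_hi * gini(above, n_hi)) / n
        if best is None or score < best[0]:
            best = (score, value)
    return best

# Pre-sorted attribute list for a continuous attribute (value, class, tid).
alist = [(20, "Good", 4), (30, "Good", 3), (75, "Bad", 2), (80, "Bad", 1)]
print(best_gini_split(alist))   # the lowest gini is reached at split value 30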

5.3 Parallel Decision-Tree Classifiers

Several parallel formulations of classification rule learning have been proposed recently. They fall into one of the two general parallel approaches (distributed-memory, respectively shared-memory systems) or try to offer more general solutions covering both. Kufrin [32] proposed a framework for the induction of decision trees suitable for implementation on shared- and distributed-memory multiprocessors or networks of workstations. The approach, called Parallel Decision Trees (PDT), is similar to the DP-rec scheme [10] and to the synchronous tree construction approach discussed later in this section, in that the data sets are partitioned among the processors. The PDT approach designates one processor as the "host" processor and the remaining processors as "worker" processors. The host processor does not hold any data; it only receives frequency statistics or gain calculations from the worker processors, determines the split based on the collected statistics, and notifies the worker processors of the split decision. The worker processors collect the statistics of their local data following the instructions from the host processor. The PDT approach suffers from high communication overhead, just like the DP-rec scheme and the synchronous tree construction approach. It also has an additional communication bottleneck, as every worker processor sends the collected statistics to the host processor at roughly the same time, and the host processor sends out the split decision to all worker processors at the same time.

5.3.1 Distributed Memory Machines

As for association rule discovery, the main design issues in distributed memory systems concern the minimization of communication and an even distribution of data for good load balancing. We look at several distributed memory decision tree construction algorithms below.

Parallel SPRINT
In [52], Shafer et al. proposed the parallel classifier SPRINT. The algorithm assumes a shared-nothing parallel environment where each of N processors has private memory and disks. The processors are connected by a communication network and can communicate only by passing messages. Examples of such parallel machines include GAMMA, Teradata, and IBM's SP2. SPRINT achieves uniform data placement and workload balancing by distributing the attribute lists evenly over the N processors of a shared-nothing machine. This allows each processor to work on only 1/N of the total data. The partitioning is achieved by first distributing the training-set examples equally among all the processors. Each processor then generates its own attribute-list partitions in parallel by projecting out each attribute from the training-set examples it was assigned. Lists for categorical attributes are therefore evenly partitioned and require no further processing. However, continuous attribute lists must be sorted and repartitioned into contiguous sorted sections; for this, a parallel sorting algorithm is used. The result of this sorting operation is that each processor gets a fairly equal-sized sorted section of each attribute list. Finding split points in parallel SPRINT is very similar to the serial algorithm. Processors scan their attribute-list partitions, either evaluating split points for continuous attributes or collecting distribution counts for categorical attributes. In the parallel algorithm, no extra work or communication is required while each processor is scanning its attribute-list partitions, so full advantage is taken of having N processors simultaneously and independently processing 1/N of the total data. Having determined the winning split points, each processor splits its own attribute-list partitions. The parallel implementation of SPRINT [52] and ScalParC [30] use a method for partitioning the work that is identical to the one used in the synchronous tree construction approach described below. Serial SPRINT [52] sorts the continuous attributes only once in the beginning and keeps a separate attribute list with record identifiers; the splitting phase of a decision tree node maintains this sorted order without requiring the records to be sorted again. In order to split the attribute lists according to the splitting decision, SPRINT creates a hash table that records, for each record identifier, the child node to which the record goes based on the splitting decision. In the parallel implementation of SPRINT, the attribute lists are split evenly among the processors and the split point for a node of the decision tree is found in parallel. However, in order to split the attribute lists, the full-size hash table is required on all the processors. To construct the hash table, an all-to-all broadcast is performed, which makes this algorithm unscalable with respect to runtime and memory requirements: each processor requires O(N) memory to store the hash table and O(N) communication overhead for the all-to-all broadcast, where N is the number of records in the data set.

ScalParC (Scalable Parallel Classifier)
The more recently proposed ScalParC [30] improves upon SPRINT by employing a distributed hash table to efficiently implement the splitting phase. In ScalParC, the hash table is split among the processors, and an efficient personalized communication is used to update it, making the algorithm scalable with respect to memory and runtime requirements. Experimental results, classifying up to 6.4 million records on 128 processors of a Cray T3D in just over a minute, demonstrate the scalable behavior of ScalParC.

Synchronous, Partitioned and Hybrid parallel tree construction approaches
Srivastava et al. [55] proposed two basic parallel formulations for classification decision tree construction, as well as a hybrid scheme that combines the good features of both. It is assumed that the N training cases are initially distributed randomly to the P processors, such that each processor has N/P cases. In the Synchronous Tree Construction approach, all processors construct the decision tree synchronously by sending and receiving class distribution information about their local data: at each node, every processor computes the class-distribution statistics of its local records, the statistics are exchanged among all processors, and the same selected split is then applied by everyone. In the Partitioned Tree Construction approach, each leaf node n of the frontier of the decision tree is handled by a distinct subset of processors P(n). Once the node n is expanded into child nodes n1, n2, …, nk, the processor group P(n) is also partitioned into k parts, P1, P2, …, Pk, such that Pi handles node ni. All the data items are shuffled such that the processors in group Pi have only the data items that belong to the leaf ni. The whole decision tree is constructed by combining the subtrees built by the processors. The advantage of this approach is that once a processor becomes solely responsible for a node, it can develop the corresponding subtree of the decision tree independently, without any communication overhead. There are, however, a number of disadvantages. The first is that it requires data movement after each node expansion until one processor becomes responsible for an entire subtree; the communication cost is particularly expensive in the expansion of the upper part of the decision tree. The second disadvantage concerns load balancing. The Synchronous Tree Construction approach incurs a high communication overhead as the frontier gets larger, whereas the Partitioned Tree Construction approach incurs the cost of load balancing after each step. The Hybrid scheme continues with the first approach as long as the communication cost it incurs is not too high; once this cost becomes high, the processors as well as the current frontier of the classification tree are partitioned into two parts.
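As a small illustration of the hash-table-based splitting step described above for SPRINT, the sketch below (illustrative names and data; in the real algorithm the lists are disk-resident, and in the parallel versions the table is either replicated or, in ScalParC, distributed) maps record identifiers to child nodes and uses that mapping to split another attribute list while preserving its sorted order.

# Attribute lists: (value, class label, record id), the winning one sorted by value.
debt_list   = [(20, "Good", 4), (30, "Good", 3), (75, "Bad", 2), (80, "Bad", 1)]
income_list = [(10, "Bad", 2), (25, "Good", 3), (40, "Good", 4), (90, "Bad", 1)]

# Suppose the winning split is "debt <= 30"; build the rid -> child-node hash table.
rid_to_child = {rid: ("left" if value <= 30 else "right")
                for value, _, rid in debt_list}

# Every other attribute list is split by probing the hash table, which
# preserves the original (sorted) order within each child's list.
def split_list(attribute_list, rid_to_child):
    children = {"left": [], "right": []}
    for entry in attribute_list:
        children[rid_to_child[entry[2]]].append(entry)
    return children

print(split_list(income_list, rid_to_child))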

5.3.2 Shared Memory Machines

The main design issues in shared memory systems concern the minimization/elimination of false sharing, and the maintenance of good data locality. Pearson presented an approach that combines node-based decomposition and attribute-based decomposition [44]. It is shown that node-based decomposition (task parallelism) alone has several problems. One problem is that only a few processors are utilized in the beginning, due to the small number of expanded tree nodes. Another problem is that many processors become idle in the later stages, due to load imbalance. Attribute-based decomposition is used to remedy the first problem: when the number of expanded nodes is smaller than the available number of processors, multiple processors are assigned to a node and the attributes are distributed among these processors.

ID3/C4.5 based
Narlikar [41] exploits two levels of divide-and-conquer parallelism in the tree builder: at the outer level across the tree nodes, and at the inner level within each tree node. He parallelized the original C4.5 algorithm using a lightweight, user-level implementation of POSIX standard threads, the Lightweight Pthreads, which are used to express this highly irregular and dynamic parallelism in a natural manner. The task of scheduling the threads and balancing the load is left to a space-efficient Pthreads scheduler. Experimental results on large datasets, using an 8-processor Sun Enterprise 5000 with 2 GB of main memory, indicate that the space and time performance of the tree builder scales well with both the data size and the number of processors. In [10], a few general approaches for parallelizing C4.5 are discussed. In the Dynamic Task Distribution (DTD) scheme, a master processor allocates a subtree of the decision tree to an idle slave processor. This scheme does not require communication among processors, but suffers from load imbalance. The DP-rec scheme distributes the data set evenly and builds the decision tree one node at a time; this scheme suffers from high communication overhead. The DP-att scheme distributes the attributes; it has the advantages of being load-balanced and of requiring minimal communication, but it does not scale well with an increasing number of processors. The results show that the effectiveness of the different parallelization schemes varies significantly with the data sets being used.

SPRINT based
Starting from parallel SPRINT, Zaki et al. [62] proposed two approaches to building a tree classifier in parallel: a data parallel approach and a task parallel approach. In the data parallel approach three algorithms are proposed: the BASIC algorithm, the Fixed-Window-K (FWK) algorithm and the Moving-Window-K (MWK) algorithm, the most sophisticated of the three. The P processors work on distinct portions of the dataset and synchronously construct the global decision tree. These approaches exploit intra-node parallelism, i.e., the parallelism available within a decision tree node. Two forms of data parallelism are considered: attribute data parallelism, where the attributes are divided equally among the processors so that each processor is responsible for roughly 1/P of the attributes, and record data parallelism, where each processor is responsible for roughly a 1/P fraction of each attribute list. Record parallelism is not well suited to SMP systems, since it is likely to cause excessive synchronization and replication of data structures.
The task parallel algorithm (SUBTREE) exploits inter-node parallelism: different portions of the decision tree are built in parallel among the processors. A master processor is responsible for partitioning the processor set, whilst each processor executes the BASIC algorithm on its subtree. Experiments performed on a 4-processor SMP machine with 112 MHz PowerPC-604 processors showed that MWK's performance is mostly comparable to or better than that of SUBTREE.

Concatenated Parallelism
Goil et al. [25] proposed the Concatenated Parallelism strategy for the efficient parallel solution of divide-and-conquer problems. In this strategy, a mix of data parallelism and task parallelism is used: data parallelism is applied until enough sub-tasks have been generated, and then task parallelism takes over, i.e., each processor works on independent subtasks. This strategy is similar in principle to the partitioned tree construction approach discussed above. The Concatenated Parallelism strategy is useful for problems where the workload can be determined from the size of the subtasks once task parallelism is employed. However, in decision tree classification the workload cannot be determined from the size of the data at a particular node of the tree; hence, the one-time load balancing used in this strategy is not well suited to this particular divide-and-conquer problem. They implemented the algorithm on a CM-5 and obtained promising experimental results.
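A rough, hedged sketch of the attribute data parallelism discussed above, using Python's multiprocessing pool as a stand-in for SMP threads; the data, the toy split-quality measure and all helper names are illustrative. Each worker evaluates the candidate splits of its share of the attributes, and a simple reduction at the master picks the overall winner.

from multiprocessing import Pool

# Records of the current node: attribute dict plus class label.
NODE_DATA = [
    ({"debt": 80, "income": 10}, "Bad"),
    ({"debt": 20, "income": 25}, "Good"),
    ({"debt": 30, "income": 40}, "Good"),
    ({"debt": 75, "income": 90}, "Bad"),
]

def best_split_of(attribute):
    """Toy split quality for one attribute: errors of the best 'attr <= x' split."""
    best = None
    for x in {rec[attribute] for rec, _ in NODE_DATA}:
        left = [lbl for rec, lbl in NODE_DATA if rec[attribute] <= x]
        right = [lbl for rec, lbl in NODE_DATA if rec[attribute] > x]
        errors = sum(min(side.count("Good"), side.count("Bad")) for side in (left, right))
        if best is None or errors < best[0]:
            best = (errors, attribute, x)
    return best

if __name__ == "__main__":
    attributes = ["debt", "income"]
    # Each worker handles roughly 1/P of the attributes; the master then reduces.
    with Pool(processes=2) as pool:
        candidates = pool.map(best_split_of, attributes)
    print(min(candidates))      # overall best split across all attributes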

6. Open Problems in Parallel and Distributed Data Mining Systems

Having reviewed the extant methods, we can conclude that, despite recent advances in high performance DM algorithms, there are a number of open problems that need serious and immediate attention. These include:

a) High Dimensionality. Current DM methods are only able to handle a few thousand dimensions or items. In the case of ARD the problem is that the second iteration, which counts the frequency of all 2-itemsets, essentially has quadratic complexity, since one has to consider all pairs of items and no pruning is possible at this stage. Some possible solutions include methods that only enumerate maximal patterns, those that use hash-based pruning to reduce the candidate itemsets, or those that use global pruning.

b) Large Size. Databases continue to increase in size. Current methods are able to handle data in the tens-of-gigabytes range, but it seems that current DM algorithms will not be suitable for the terabyte range. Even a single scan over such databases is considered expensive. It is an open problem to mine all frequent itemsets in a single pass.

c) Data Location. Today's large-scale data sets are usually logically and physically distributed, and organizations that are geographically distributed need a decentralized approach to DM. The issues concerning modern organizations are not just the size of the data to be mined, but also its distributed nature. Most current work has only dealt with the horizontal partitioning approach.

d) Data Skew. One of the problems adversely affecting load balancing in DM algorithms is sensitivity to data skew. Most methods partition the database horizontally into equal-sized blocks. Randomizing the blocks is one solution, which is still not adequate, given the dynamic and interactive nature of DM.

e) Dynamic Load Balancing. All extant algorithms use only a static load balancing scheme based on the initial data decomposition, and they assume a homogeneous, dedicated environment. This is far from reality. A typical parallel database server has multiple users and transient loads. This calls for an investigation of dynamic load balancing schemes for DM. Dynamic load balancing is also crucial in a heterogeneous environment, which can be composed of meta- and super-clusters, with machines ranging from ordinary workstations to supercomputers.

f) Multi-table DM, Data Layout and Indexing Schemes. Almost no work has been done on mining over multiple tables or over distributed databases that have different schemas. Traditional mining over these multiple tables would first require the creation of a single large table that is the join of all the tables; the joined table also contains tremendous amounts of redundancy. Thus we need better methods for processing such multiple tables, without having to materialize a single large view. Also, little work has been done on optimal or near-optimal data layout or indexing schemes for fast processing by parallel DM algorithms.

g) Incremental Methods. Every day new data is collected, and existing data stores are updated with new data or purged of old data. To date there have been no parallel or distributed algorithms that are incremental in nature, i.e., that can handle updates and deletions without having to recompute associations over the entire database. This applies to both the frequent itemsets and the rules.

h) Interactive Rule Management and Visualization. A complete DM system is yet to be designed. Such a system would manage the raw tabular data, the frequent itemsets (as well as the association rules), and the decision trees. Operations that need to be supported include interactively modifying the minimum support and confidence, integrating constraints on the rule antecedent or consequent, updating both the rules and the frequent itemsets based on the current needs of the user, and visualizing association rules, which has proven to be one of the major difficulties in ARD, since it is very difficult to visualize rules of length more than two. Parallel and distributed processing will be an intrinsic component of any DM system.

i) Parallel DBMS/File Systems. To date, all reported results have hand-partitioned the database, mainly horizontally, on different processors. There has been no study of using a parallel file system for managing the partitioned database and the accompanying striping and layout issues. Recently there has been increasing emphasis on tight database integration of DM, but it has been confined to sequential approaches.

7. Conclusions

KDD/DM provides the capability to discover new and meaningful information from existing data. The amount of data that requires processing and analysis in a large database quickly exceeds human capabilities, and the difficulty of accurately transforming raw data into knowledge surpasses the limits of traditional databases. Therefore, the full utilization of stored data depends on the use of knowledge discovery techniques. The usefulness of future applications of KDD is far-reaching. KDD may be used as a means of information retrieval, in the same manner that intelligent agents perform information retrieval on the web, and new patterns or trends in data may be discovered using these techniques. DM may also be used as a basis for the intelligent interfaces of tomorrow, by adding a knowledge discovery component to a database engine or by integrating DM with spreadsheets and visualizations. It is anticipated that commercial database systems of the future will include DM capabilities in the form of intelligent database interfaces, and some types of information retrieval may benefit from the use of DM techniques. Due to the potential applicability of knowledge discovery in so many diverse areas, there are growing research opportunities in this field. Many of these opportunities are discussed in [45], a newsletter with regular contributions from many of the best-known authors of the DM literature. A fairly comprehensive list of references and applicable websites is also available from the KDnuggets site; these resources are updated very frequently and have the most current information available. An international conference on KDD and DM is held annually, and its proceedings provide additional sources of current and relevant information on the growing field of Knowledge Discovery in Databases. Although parallel approaches to DM tasks increase speed and scalability, the problems presented in the previous section demonstrate that parallel/distributed data mining systems are still in their infancy, and a lot of exciting work remains to be done in system design, implementation, and deployment.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, 1994.
[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI Press, Menlo Park, CA, 1996.
[4] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962-969, December 1996.
[5] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: Classification for large or out-of-core datasets. http://www.cise.ufl.edu/~ranka/dm.html, 1998.
[6] R.J. Brachman and T. Anand. The process of knowledge discovery in databases: A human-centered approach. In Advances in Knowledge Discovery and Data Mining, eds. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press/The MIT Press, Menlo Park, CA, 1996, pp. 37-57.
[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.

[8] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In ACM SIGMOD Conf. Management of Data, May 1997.
[9] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8(2):195-210, April 1996.
[10] J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H.W. To, and D. Yang. Large scale data mining: Challenges and responses. In Proc. of the Third Int'l Conference on Knowledge Discovery and Data Mining, 1997.
[11] J. Cheng, U.M. Fayyad, K.B. Irani, and Z. Qian. Improved decision trees: A generalized version of ID3. In Proc. of Machine Learning, 1988.
[12] D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In 4th Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
[13] D. Cheung and Y. Xiao. Effect of data skewness in parallel mining of association rules. In 2nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 1998.
[14] D. Cheung, K. Hu, and S. Xia. Asynchronous parallel algorithm for mining association rules on shared-memory multi-processors. In 10th ACM Symp. Parallel Algorithms and Architectures, June 1998.
[15] CRISP-DM (Cross-Industry Standard Process Model for Data Mining), discussion paper, March 1999.
[16] P. Doherty. Mining the Web structure, Web mining: The e-tailer's Holy Grail. In DM Direct, January 2000.
[17] U.M. Fayyad. On the induction of decision trees for multiple concept learning. PhD thesis, EECS Department, The University of Michigan, 1991.
[18] U.M. Fayyad. Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 11(5):20-25, October 1996.
[19] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, eds. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press/The MIT Press, Menlo Park, CA, 1996, pp. 1-34.
[20] W.J. Frawley, G. Piatetsky-Shapiro, and C. Matheus. Knowledge discovery in databases: An overview. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. J. Frawley, AAAI Press/MIT Press, Cambridge, MA, 1991, pp. 1-30.
[21] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest - a framework for fast decision tree construction of large datasets. In Proc. of the 24th VLDB Conference, 1998.
[22] J. Gehrke, R. Ramakrishnan, V. Ganti, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In ACM SIGMOD Conf. Management of Data, 1999.
[23] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proceedings of the 24th VLDB Conference, New York, 1998.
[24] M. Goebel and L. Gruenwald. A survey of data mining and knowledge discovery software tools. In SIGKDD Explorations, June 1999.
[25] S. Goil, S. Aluru, and S. Ranka. Concatenated parallelism: A technique for efficient parallel divide and conquer. In Proc. of the Symposium on Parallel and Distributed Processing (SPDP'96), 1996.
[27] I. Guyon, N. Matic, and V. Vapnik. Discovering informative patterns and data cleaning. In Advances in Knowledge Discovery and Data Mining, eds. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press/The MIT Press, Menlo Park, CA, 1996, pp. 181-203.
[28] E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conf. Management of Data, May 1997.
[29] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conf. Management of Data, 2000.
[30] M.V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proc. of the International Parallel Processing Symposium, 1998.

[31] C.N. Hsu and C.A. Knoblock. Using inductive learning to generate rules for semantic query optimization. In Advances in Knowledge Discovery and Data Mining, eds. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press/The MIT Press, Menlo Park, CA, 1996, pp. 425-445.
[32] R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C.B. Suttner, editors, Parallel Processing for Artificial Intelligence 3. Elsevier Science, 1997.
[33] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. An empirical comparison of decision trees and other classification methods. Technical Report 979, Department of Statistics, University of Wisconsin, Madison, June 1997.
[34] W.-Y. Loh and Y.-S. Shih. Split selection methods for classification trees. Statistica Sinica, 7(4), October 1997.
[35] W.-Y. Loh and N. Vanichsetakul. Tree-structured classification via generalized discriminant analysis (with discussion). Journal of the American Statistical Association, 83:715-728, 1988.
[36] J. Magidson. The CHAID approach to segmentation modeling. In Handbook of Marketing Research, 1993.
[37] M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Proc. of KDD, 1995.
[38] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[39] A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical Report CS-TR-3515, University of Maryland, College Park, August 1995.
[40] S. K. Murthy. On growing better decision trees from data. PhD thesis, Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, 1995.
[41] G.J. Narlikar. A parallel, multithreaded decision tree builder. Technical Report CMU-CS-98-194, School of Computer Science, Carnegie Mellon University, December 1998.
[42] J. S. Park, M. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In ACM SIGMOD Intl. Conf. Management of Data, May 1995.
[43] J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf. Information and Knowledge Management, November 1995.
[44] R.A. Pearson. A coarse grained parallel induction heuristic. In H. Kitano, V. Kumar, and C.B. Suttner, editors, Parallel Processing for Artificial Intelligence 2, pages 207-226. Elsevier Science, 1994.
[45] G. Piatetsky-Shapiro and M. Beddows. Knowledge Discovery Mine - data mining and knowledge discovery resources. World Wide Web URL: http://www.kdnuggets.com.
[46] J.R. Quinlan. Discovering rules by induction from large collections of examples. In Expert Systems in the Micro Electronic Age, 1979.
[47] J.R. Quinlan. Learning efficient classification procedures. In Machine Learning: An Artificial Intelligence Approach, 1983.
[48] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
[49] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[50] R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates pruning and building. In Proc. of VLDB, 1998.
[51] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In 21st VLDB Conf., 1995.
[52] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.
[53] T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
[54] T. Shintani and M. Kitsuregawa. Parallel mining algorithms for generalized association rules with classification hierarchy. In ACM SIGMOD Conf. Management of Data, Seattle, WA, 1998.
[55] A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 3(3), September 1999.

[56] S.-Y. Wur and Y. Leu. An effective Boolean algorithm for mining association rules in large databases. In Proceedings of the 6th International Conference on Database Systems for Advanced Applications, 1998.
[57] M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. In Supercomputing'96, November 1996.
[58] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
[59] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343-373, December 1997.
[60] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. Technical report, Computer Science Department, University of Rochester.
[61] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, October-December 1999, pp. 14-25.
[62] M. Zaki, C. Ho, and R. Agrawal. Parallel classifier for data mining in shared-memory multiprocessors.