Tuple Graph Synopses for Relational Data Sets

Joshua Spiegel

Neoklis Polyzotis

Univ. of California Santa Cruz

Univ. of California Santa Cruz

[email protected]

[email protected]

ABSTRACT This paper introduces the Tuple Graph (TuG) synopses, a new class of data summaries that enable accurate selectivity estimates for complex relational queries. The proposed summarization framework adopts a “semi-structured” view of the relational database, modeling a relational data set as a graph of tuples and join queries as graph traversals respectively. The key idea is to approximate the structure of the induced data graph in a concise synopsis, and to estimate the selectivity of a query by performing the corresponding traversal over the summarized graph. We detail the TuG synopsis model that is based on this novel approach, and we describe an efficient and scalable construction algorithm for building accurate TuGs within a specific storage budget. We validate the performance of TuGs with an extensive experimental study on real-life and synthetic data sets. Our results verify the effectiveness of TuGs in generating accurate selectivity estimates for complex join queries, and demonstrate their benefits over existing summarization techniques.

1. INTRODUCTION

Consider a relational query optimizer, faced with the challenging task of optimizing a complex join query. In order to approximate effectively the cost factors of a candidate plan, the optimizer needs to obtain accurate estimates on the sizes of results that are generated at different operators, or equivalently, accurate selectivity estimates for the corresponding query expressions. These estimates are typically provided by data synopses (commonly referred to as "data statistics") that approximate the underlying data distribution and can thus estimate the number of results generated by a query. The accuracy of these data synopses is therefore crucial for the effectiveness of the optimization process.

(* A modified version of this paper has appeared in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data.)




The accuracy of a data synopsis depends heavily on its capacity to capture in limited space the main correlations that exist among the distributions of joins and values in the data. Consider, for instance, a simple database with information on movies that consists of three tables, namely, Movies, Actors, and Cast. Cast has foreign keys to the other two tables, and essentially records in which movie each actor appeared along with their wages. In this simple database, an example correlation across joins might be that movies in the last decade typically have higher wages for their actors. In other words, there is a dependency between the distribution of different values through the chain of joins. Another example may be that movies released in the 90s tend to have more actors than movies made in the 50s. Here, the number of join results is affected by a selection on the value of one of the participating tables.

Table-level synopses, such as histograms [15] or wavelets [13], are typically ineffective in capturing these complex join-based correlations as they focus on the summarization of a single table at a time. This has led to the introduction of schema-level synopses, e.g., Join Synopses [2] and Probabilistic Relational Models [11], that enable accurate estimates by summarizing the combined join and value distribution across several tables. The proposed techniques, however, are not applicable to the class of relational schemata that contain many-to-many relationships. Our toy movie database is one such schema: a single movie is associated with multiple actors and vice versa. (Observe that this many-to-many relationship is encoded with several one-to-many key/foreign-key joins.) This type of join relationship is very common in real-world applications, and hence providing effective summarization techniques for such schemata is an important and practical problem; at the same time, however, the presence of many-to-many joins greatly complicates the potential statistical correlations, introducing significant new challenges to the summarization problem.

Motivated by these observations, we initiate the study of schema-level data synopses for data sets with arbitrary join relationships. More concretely, we introduce the class of Tuple Graph synopses (TuGs for short) that rely on graph-based models in order to summarize the combined join and value information of a relational database with such complex joins. The inspiration for our proposed solution comes from an unlikely source: the field of XML summarization. Conceptually, we adopt a "semi-structured" view of a relational database, where tuples and joins become nodes and edges respectively in a data graph, so that join queries now correspond to graph traversals. With this view in mind, we propose to summarize the structure of this data graph

in order to derive accurate selectivity estimates for complex queries. To address the challenges of relational summarization, however, our work relies on novel techniques that form a clear departure from previously proposed XML models. As we show in this paper, our TuG synopses enable accurate selectivity estimates for a large class of practical join queries: queries with arbitrary join graphs (e.g., containing cycles and many-to-many relationships) and several selection predicates on different tables. To the best of our knowledge, the TuG model is the first schema-level technique to enable the combined summarization of joins and values for schemata of arbitrary join relationships. More concretely, the contributions of our work can be summarized as follows:

• TuG Synopsis Model. We introduce the TuG synopses for summarizing the combined join- and value-distribution of a relational database. Our novel TuG model promotes joins and values to an equal status and thus allows a uniform treatment of their potential correlations in the underlying data distributions. We demonstrate several key properties of the TuG model and develop a systematic estimation framework for approximating the selectivity of queries with arbitrary join graphs.

• TuG Construction. We introduce an efficient construction algorithm, termed TuGBuild, for building accurate TuGs given a specific storage budget. Our algorithm takes advantage of the unique properties of our TuG model, thus achieving the effective compression of statistical information while preserving the accuracy of estimation. A key feature is the use of disk-based processing techniques that enable TuGBuild to scale to large data sets under limited memory constraints. These ideas are of general interest, as they can be transferred to the XML domain, enabling the scalable summarization of large XML data sets.

• Experimental Evaluation of TuGs. We conduct an extensive empirical study to validate the effectiveness of our proposed TuG model. Our results on synthetic and real-life data sets verify the effectiveness of TuG summaries as schema-level data synopses. Moreover, our study demonstrates the scalability of our approach and its many advantages over previously proposed summarization techniques.

The remainder of the paper is organized as follows. Section 2 provides a formal definition of the summarization problem and introduces some necessary notation. Section 3 presents in detail the proposed TuG model and its properties, while Sections 4 and 5 cover the TuG construction process. Section 6 presents the results of our experimental study for evaluating the effectiveness of TuG summaries. We cover related work in Section 7 and conclude with Section 8.

2. PRELIMINARIES Data Model. Our work focuses on the summarization of a relational database D with a set of relation names R = {R1 , . . . , Rn } and a global set of attributes A = {A1 , . . . , Am }. (The attributes of different relations form subsets of A.) Without loss of generality, we assume that each attribute has a unique value domain and overload Aj to denote the underlying domain as well. To capture the join relationships between tables, we define

the schema graph GS(R ∪ A, ES) of the database as follows: (a) each node in GS represents a relation Ri or an attribute Aj, (b) an edge (Ri, Rj), i ≠ j, denotes the join between the two relations on a subset of their attributes, and (c) an edge (Ri, Aj) denotes that the schema of Ri contains attribute Aj. We focus on equi-joins since they are prevalent in real-world applications, and we do not place other restrictions on the number of joins per relation or the type of each join, e.g., one-to-one, one-to-many, or many-to-many. Our assumption is that GS is part of the meta-data of D and can be either specified by the DBA or derived automatically, e.g., by interpreting the constraints of the schema and analyzing a sample workload.

Similar to the schema graph, we define the data graph GD(TD ∪ VD, ED) as the representation of the database instance. A node τ ∈ TD represents a tuple of some relation Ri, while a node ν ∈ VD represents a value of some attribute Aj. Given a tuple τ′ of another relation Ri′, the data graph includes the edge (τ, τ′) if the edge (Ri, Ri′) is in the schema graph and the two tuples have the same values for the corresponding join attributes. Similarly, the data graph includes the edge (τ, ν) if (Ri, Aj) is present in the schema graph and tuple τ has the specified value for attribute Aj.

Example 2.1.: Figure 1(a) shows the schema graph for a data set that records information on movies. In this example, the Movie relation joins with two other relations (Directedby and Cast), and has two value attributes (year and genre). (The arrows in the schema graph point towards the relation containing the primary key.) Figure 1(b) shows a sample data graph for the same schema, where each tuple node is named by a matching lower case letter and a numerical identifier. (The importance of the shapes of nodes will become apparent later.)

Overall, the data graph GD is a semi-structured interpretation of the flat relational data set based on the properties of the schema graph. As we discuss next, this formulation enables the interpretation of join queries as traversals over the data graph, and essentially forms the cornerstone of our proposed summarization framework. It should be stressed that the data graph is a purely conceptual tool that we use in the development of our technique, i.e., we do not assume that it is materialized in some form.

Query Model. We focus on a general class of queries with arbitrary join graphs and table-level selection predicates. Similar to the schema and the data graph, we represent a query Q as a graph GQ(TQ, EQ). Each node qi ∈ TQ (1 ≤ i ≤ |TQ|) represents a query variable that is bound to a specific relation table(qi), while an edge (qi, qj) ∈ EQ denotes a join predicate between the corresponding relations. We assume that join predicates are placed according to the schema graph GS and hence each query edge (qi, qj) must correspond to a schema edge (table(qi), table(qj)). Moreover, each query variable qi may be annotated with a conjunction of selections on the attributes of table(qi). Figure 3(a) shows an example query graph over the sample movie database, corresponding to the relational algebra expression σ_year=2005(M) ⋈ C ⋈ σ_sex='female'(A). We note that our query model does not place any restrictions on the shape of the join graph and can thus capture a large class of queries that are common in practice (including queries with cyclic join predicates).

Figure 1: (a) Schema Graph (b) Instance (c) Data Graph

Based on the semi-structured interpretation of a relational data set, we can define the evaluation of a query Q as a complex traversal over the data graph GD. More precisely, we define an embedding of Q as a mapping from query variables to tuple nodes of GD, where each variable qi is mapped to a tuple of table(qi) that satisfies the selection predicates, and each edge (qi, qj) is mapped to a join edge between the corresponding tuples. It is straightforward to show that each embedding corresponds to a result-tuple of the query, and hence the set of unique embeddings determines the answer of Q.

Problem Definition. We define the selectivity sel(Q) of a query Q as the cardinality of its result set. The problem that we tackle in this paper can thus be summarized as follows: Given a database D and a space budget B, construct a data synopsis of size at most B that can estimate the selectivity sel(Q) of any complex query Q. As we have mentioned earlier, this problem is key in the optimization of relational queries and is therefore of high practical value.

Clearly, the needed selectivity sel(Q) can be expressed as the number of the embeddings of Q, which in turn depends on the structural characteristics of the data graph GD. This observation forms the basis of our proposed TuG solution. In short, the key idea is to approximate the structure of GD in a concise synopsis, and to estimate the number of embeddings by matching Q over the summarized graph. We discuss the specifics of this approach in the following section, where we describe in detail the TuG summarization model.
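For concreteness, the following minimal Python sketch (all data and names are illustrative, not from the paper) evaluates a query by brute-force enumeration of its embeddings over a tiny data graph for the movie example; sel(Q) is exactly the number of embeddings found.

from itertools import product

# Hypothetical toy data graph: tuple nodes keyed by (relation, id),
# undirected join edges, and per-tuple attribute values.
tuples = {
    ("Movie", "m1"): {"year": 2005}, ("Movie", "m2"): {"year": 2000},
    ("Actor", "a1"): {"sex": "female"}, ("Actor", "a2"): {"sex": "male"},
    ("Cast", "c1"): {}, ("Cast", "c2"): {}, ("Cast", "c3"): {},
}
join_edges = {
    (("Cast", "c1"), ("Movie", "m1")), (("Cast", "c2"), ("Movie", "m1")),
    (("Cast", "c3"), ("Movie", "m2")),
    (("Cast", "c1"), ("Actor", "a1")), (("Cast", "c2"), ("Actor", "a2")),
    (("Cast", "c3"), ("Actor", "a1")),
}
adj = set()
for u, v in join_edges:
    adj.add((u, v)); adj.add((v, u))

def selectivity(query_vars, query_edges):
    """Count embeddings of the query graph in the data graph.

    query_vars: {var: (relation, predicate)}, predicate filters a tuple's values.
    query_edges: list of (var_i, var_j) join predicates.
    """
    names = list(query_vars)
    domains = []
    for q in names:
        rel, pred = query_vars[q]
        domains.append([t for t in tuples if t[0] == rel and pred(tuples[t])])
    count = 0
    for assignment in product(*domains):
        m = dict(zip(names, assignment))
        if all((m[qi], m[qj]) in adj for qi, qj in query_edges):
            count += 1          # each embedding is one result tuple
    return count

# sigma_{year=2005}(Movie) join Cast join sigma_{sex='female'}(Actor)
q_vars = {
    "M": ("Movie", lambda v: v["year"] == 2005),
    "C": ("Cast", lambda v: True),
    "A": ("Actor", lambda v: v["sex"] == "female"),
}
q_edges = [("M", "C"), ("C", "A")]
print(selectivity(q_vars, q_edges))   # brute-force sel(Q) on the toy graph -> 1

Such exhaustive matching is of course only feasible on toy inputs; the TuG synopsis described next replaces the data graph with a compressed summary over which the same matching idea is applied approximately.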

3. TuG SYNOPSES In this section, we introduce the proposed TuG synopses. We first present a formal definition of the TuG model, and then develop a systematic framework for approximating the selectivity of complex join queries over concise TuG synopses.

3.1 Model Definition At a high level, our proposed TuG framework employs an abstract partitioning model that groups the nodes of the data graph GD (TD ∪ VD , ED ) in disjoint partitions. The key idea is to store aggregate statistical information on a per partition basis and thus generate a concise summary that approximates the structure of GD (equivalently, the join and value distribution of the underlying database). More precisely, let P be a partitioning of TD ∪ VD . We say that P is a valid partitioning if the following property holds for every partition r ∈ P: either r contains tuples from the same relation Ri (denoted as table(r)), or it contains a single value ν ∈ VD of an attribute Aj (in this case, we assume that table(r) = Aj ). We refer to r as a tuple partition or a value partition respectively. To simplify our notation, we

will use r to refer both to the partition and the set of data graph nodes that it contains. (Hence, |r| is the size of the partition.) Consider two tuple partitions r and s. We say that r and s are linked if there are tuples τr ∈ r and τs ∈ s such that the edge (τr, τs) appears in the data graph, i.e., τr joins with τs. Similarly, a tuple partition r is linked to a value partition s if there is a tuple τr ∈ r such that the edge (τr, ν) exists in the data graph, i.e., τr carries the value ν in attribute Aj ≡ table(s). In general, we will use data-edges(r, s) to denote the set of edges in the data graph that exist between two linked partitions.

A TuG synopsis T G is a graph-based representation of a specific partitioning P, augmented with aggregate statistical information on linked partitions. More formally, a TuG is defined as follows:

Definition 3.1. A TuG synopsis T G of the data graph GD(TD ∪ VD, ED) is a graph GT G(P, ES) such that (1) P is a valid partitioning of TD ∪ VD, (2) (r, s) appears in ES if r and s are linked, (3) each node r ∈ P records the common relation name table(r) and a counter tcount(r) = |r|, (4) each edge (r, s) ∈ ES records a counter jcount(r, s) = |data-edges(r, s)|.

Essentially, a TuG encodes a partitioning of the data graph nodes along with aggregate information on their join- and value-based characteristics. For each partition, the synopsis records the common relation name and the size of the extent (the contents of the extent are not stored), while for each synopsis edge it records the number of data edges between the endpoints. Given the unified treatment of tuples and values in our model, an edge (r, s) in T G may record the join between two tuple-sets r and s, or the appearance of a value s in tuple-set r. Accordingly, jcount(r, s) may be the result size of r ⋈ s, or the number of tuples in r with value s. This uniform handling of joins and values is an important characteristic of our framework, as it facilitates the identification of correlations between and across the join and value distribution of the underlying data.

Example 3.1.: Figure 2 shows an example TuG for the data set of Figure 1, where the correspondence between partitions and tuples is denoted by the enclosing shapes. In this example, node mα represents two Movie tuples, while node cα represents three Cast tuples. Their join is denoted by the linking edge and it includes jcount(mα, cα) = 3 results. Moreover, mα is linked with the value partitions of the year and genre attributes, and the corresponding join counts denote the frequency of individual values in the tuples of mα.

Figure 2: (a) TuG Summary S, (b) S after merging aα and aβ
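To make Definition 3.1 concrete, here is a minimal Python sketch of the synopsis structure (field and variable names are ours, not the paper's), populated with the mα/cα fragment of the example above.

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TuGNode:
    table: str         # common relation name (or attribute name for a value partition)
    tcount: int        # |r|: number of data-graph nodes in the partition
    is_value: bool = False

@dataclass
class TuG:
    nodes: Dict[str, TuGNode] = field(default_factory=dict)
    # jcount(r, s) = |data-edges(r, s)|, keyed by an (r, s) pair of node ids
    jcount: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def add_edge(self, r: str, s: str, count: int) -> None:
        self.jcount[(r, s)] = self.jcount[(s, r)] = count

    def jratio(self, r: str, s: str) -> float:
        # average number of s-neighbors (or value frequency) per tuple of r
        return self.jcount[(r, s)] / self.nodes[r].tcount

# The TuG of Figure 2(a), restricted to m_alpha and c_alpha (values omitted):
tg = TuG()
tg.nodes["m_alpha"] = TuGNode("Movie", tcount=2)
tg.nodes["c_alpha"] = TuGNode("Cast", tcount=3)
tg.add_edge("m_alpha", "c_alpha", 3)      # jcount(m_alpha, c_alpha) = 3
print(tg.jratio("m_alpha", "c_alpha"))    # 1.5 joining Cast tuples per Movie tuple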

We now discuss in more detail the properties of the proposed TuG model. Consider a specific TuG synopsis and a tuple partition r. Let s be a tuple partition linked to r and define jratio(r, s) = jcount(r, s)/tcount(r) as the join ratio from r to s. In this case, each tuple in r is assumed to join with jratio(r, s) tuples in s, independent of links to other partitions. Similarly, if s is a value partition, then jratio(r, s) is interpreted as the empirical probability of a tuple in r having value s in the corresponding attribute (again, independently of links to other partitions). The basic idea therefore is to equate all the tuples in r with a centroid, defined by the ratios jratio(r, s1), jratio(r, s2), . . . , jratio(r, sk) to the linked partitions s1, s2, . . . , sk. This approach essentially assumes statistical independence across the join and value distributions of tuples in r, coupled with a uniformity assumption on the number of joining neighbors. Typically, such assumptions are ill-suited for the purpose of accurate selectivity estimation. The key point, however, is that they become relatively valid if r represents tuples with similar join- and value-based characteristics, or equivalently, when the centroid of r is a good representative for the tuples in r. As we discuss later, this observation forms the main idea behind our algorithm for constructing accurate TuGs.

Up to this point, we have assumed that a TuG summary stores exact information on the value distribution of attributes. To see this, observe that each tuple partition r is linked to individual values for a specific attribute Aj, and the corresponding edges carry the exact frequency of each value inside r. Since this approach can quickly become impractical, we replace value partitions and their corresponding edges with single-dimensional value summaries. More concretely, for each tuple partition r and attribute Aj that is related to table(r), the synopsis records a single-dimensional value summary vsum(Aj, r) that approximates the value distribution of Aj in the tuples of r. Returning to the example of Figure 2, node mα will record histograms vsum(year, mα) and vsum(genre, mα) that summarize the distribution of year and genre values respectively in the corresponding Movie tuples. Our framework does not make any assumptions on the summarization techniques that implement these value summaries, except that they are appropriate for the type of the corresponding value domain. In this paper, we focus on the summarization of numerical and categorical values and rely on conventional histogramming techniques for their approximation. On a final note, we observe that the use of single-dimensional histograms corresponds again to the same independence assumption that we have mentioned earlier. The key idea, however, remains the same: by grouping together tuples of similar statistical characteristics, our TuG construction algorithm promotes the validity of this assumption, thus enabling the accurate summarization of the underlying data.
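As an illustration of the value summaries discussed above, the sketch below implements a simple stand-in for vsum(Aj, r) as an equi-width histogram; the paper relies on max-diff histograms, so the equi-width variant is only meant to keep the example short, and the data is made up.

import bisect

class EquiWidthVSum:
    """Minimal stand-in for vsum(Aj, r): an equi-width histogram over a numeric
    attribute of the tuples in partition r."""
    def __init__(self, values, buckets=4):
        lo, hi = min(values), max(values)
        self.edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]
        self.counts = [0] * buckets
        self.distinct = [set() for _ in range(buckets)]
        for v in values:
            b = bisect.bisect_right(self.edges, v)
            self.counts[b] += 1
            self.distinct[b].add(v)

    def est_frequency(self, value):
        # uniform-frequency assumption within a bucket
        b = bisect.bisect_right(self.edges, value)
        return self.counts[b] / max(1, len(self.distinct[b]))

years = [2005, 2004, 2000, 2000]      # illustrative year values for one tuple partition
vs = EquiWidthVSum(years)
print(vs.est_frequency(2000))         # ~2.0 tuples estimated to have year = 2000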

3.2 Estimation Framework

Our estimation framework approximates the selectivity of a complex query Q by matching the query graph on the graph of the TuG synopsis. We formalize the matching process in terms of a query embedding h, defined as a mapping from query variables to summary nodes that respects the join and selection predicates in Q. Figure 3(b) shows an example with a simple query graph and its embedding on the TuG of Figure 2(b). Overall, a query embedding represents a sub-graph of the summary on which Q has a potentially non-empty result. We can thus derive the selectivity of Q by aggregating the selectivities of its individual embeddings. To compute the selectivity of a single embedding, our framework traverses the selected sub-graph and uses the recorded join ratios to estimate the selectivity of different predicates. The following example illustrates this idea.

Example 3.2.: Consider the embedding of Figure 3(b) and in particular node aα. Given that tcount(aα) = 3 and that jratio(aα, sfemale) = 0.33, we can estimate that an average of tcount(aα) · jratio(aα, sfemale) = 1 tuples satisfy the first selection predicate. Each tuple in aα is assumed to join with an average of jratio(aα, cα) = 3/3 = 1 tuples in cα, and hence the part of the embedding that corresponds to σ(A) ⋈ C will generate tcount(aα) · jratio(aα, sfemale) · jratio(aα, cα) = 1 tuples. Similarly, the final estimate for the embedding is computed as sel(h) = tcount(aα) · jratio(aα, sfemale) · jratio(aα, cα) · jratio(cα, mα) · jratio(mα, y2005).

Figure 3: (a) Query Graph Q (b) Embedding of Q on the TuG of Figure 2(a)

A natural question is whether the computed selectivity estimate depends on the traversal order of the query embedding. A key feature of the TuG model is its guarantee that any traversal of the query embedding results in the same estimate (with the same accuracy of course). More formally, let h be an embedding and deg(r) be the number of incident edges (in the embedding) for each summary node r ∈ h. The following proposition establishes the closed-form expression for the selectivity sel(h) that results from any traversal of h:

Proposition 3.1. For any traversal of an embedding h, the computed selectivity estimate can be expressed as follows:

    sel(h) = ∏_{(r,s)∈h} jcount(r, s) / ∏_{r∈h} tcount(r)^(deg(r)−1)    (1)

While we have presented an example of Equation 1 for tree-join graphs, an interesting property is that it extends unmodified to the case of embeddings for arbitrary join graphs (i.e., queries with cyclic join predicates). In the interest of space, we defer the complete details of this extension to the full version of this paper [16]. Overall, our proposed estimation framework relies on the independence assumptions of the TuG model in order to combine statistical information across different summary edges. The accuracy of a TuG summary is therefore linked to the validity of independence, which is in turn tied to the statistical similarity of tuples within each partition.
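The following sketch (illustrative names and counters, using a one-directional encoding of jcount) evaluates Equation 1 for the embedding of Example 3.2; it yields the same estimate as the step-by-step traversal.

def estimate_embedding(tcount, jcount, embedding_edges, degree):
    """Equation (1): product of jcount(r, s) over the embedding edges, divided by
    the product of tcount(r)^(deg(r) - 1) over the embedding nodes."""
    num = 1.0
    for r, s in embedding_edges:
        num *= jcount[(r, s)]
    den = 1.0
    for r, deg in degree.items():
        den *= tcount[r] ** (deg - 1)
    return num / den

# The embedding of Figure 3(b) (counts as in Example 3.2):
tcount = {"m_alpha": 2, "c_alpha": 3, "a_alpha": 3, "y2005": 1, "s_female": 1}
jcount = {("m_alpha", "y2005"): 1, ("m_alpha", "c_alpha"): 3,
          ("c_alpha", "a_alpha"): 3, ("a_alpha", "s_female"): 1}
edges = list(jcount)
degree = {"m_alpha": 2, "c_alpha": 2, "a_alpha": 2, "y2005": 1, "s_female": 1}
print(estimate_embedding(tcount, jcount, edges, degree))   # 0.5, as in Example 3.2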

4. COMPRESSING TuG SYNOPSES Having described our TuG model, we shift our attention to the equally important problem of constructing accurate synopses. In this section, we introduce a basic operation that compresses the information stored in a TuG summary, and thus forms the basis of the TuGBuild construction algorithm (Section 5). We first describe the operation and its semantics, and then discuss a key result related to the lossless compression of TuG synopses.

4.1 Node-Merge Operation

Our basic compression operation, termed merge, reduces the storage of a TuG T G by merging partitions in the underlying partitioning P. More formally, let S be a set of summary nodes in T G that correspond to tuple partitions of the same relation name Ri. The operation merge(S) substitutes the nodes in S with a single new node s that represents the union of the individual tuple partitions. Hence, the statistics of the new node are defined as follows: tcount(s) = ∑_{r∈S} tcount(r), and jcount(s, t) = ∑_{r∈S} jcount(r, t) for every neighbor t of the nodes in S. Figure 2(b) illustrates the result of a merge operation on the synopsis of Figure 2(a), where the merged set S involves the two actor partitions.

An important issue is the effect of this localized compression on the accuracy of the computed selectivity estimates. Conceptually, merge(S) results in a coarser tuple partitioning, where the tuples of each original node r ∈ S are now represented with the new centroid of s. Hence, the difference between the centroids of r and s provides a good indication of the inaccuracy introduced by the merge for the tuples in r. More formally, let Ri be the relation name of nodes in S and let neighbors(Ri) be its neighbors (relations and attributes) in the schema graph. We will use Ni to denote the total number of summary nodes that correspond to the neighbors in neighbors(Ri). We map every node r′ of S ∪ {s} to an Ni-dimensional point, where each dimension corresponds to a unique summary node t of some neighbor in neighbors(Ri), and the coordinate is equal to jratio(r′, t). In other words, the multi-dimensional point of r′ indicates to which nodes r′ is linked and the corresponding average ratios. We quantify the inaccuracy of a merge in terms of the similarity in this multi-dimensional space between the

nodes in S and the new node s. In this paper, we employ the well-known radius metric [17], defined as follows:

    radius(merge(S)) = √( ∑_{r∈S} (r − s)² / |S| )

Here, (r − s)² denotes the Euclidean distance between the points that represent r and s in the multi-dimensional space of join ratios. Essentially, the radius metric measures the "tightness" of a cluster that comprises the points in S and has s as its representative. A low radius, therefore, indicates that the merged nodes in S have similar join relationships and value distributions, which in turn implies that the aggregate new node s is a good approximation for every node r ∈ S. In effect, merge(S) results in a summary that generates similar estimates to the original synopsis.
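A minimal sketch of the merge operation and its radius follows, under the same illustrative one-directional encoding of tcount/jcount used earlier; the counts below are made up.

import math

def merge_stats(tcount, jcount, S):
    """Statistics of the node produced by merge(S):
    tcount(s) = sum of tcount(r), jcount(s, t) = sum of jcount(r, t)."""
    new_tcount = sum(tcount[r] for r in S)
    new_jcount = {}
    for r in S:
        for (a, t), c in jcount.items():
            if a == r:
                new_jcount[t] = new_jcount.get(t, 0) + c
    return new_tcount, new_jcount

def radius(tcount, jcount, S):
    """radius(merge(S)) = sqrt( sum_{r in S} ||point(r) - point(s)||^2 / |S| ),
    where point(r) is the vector of jratio(r, t) over all neighbors t."""
    s_tcount, s_jcount = merge_stats(tcount, jcount, S)
    dims = sorted(s_jcount)                        # neighbors of the merged node
    def point(tc, jc):
        return [jc.get(t, 0) / tc for t in dims]
    s_point = point(s_tcount, s_jcount)
    sq = 0.0
    for r in S:
        r_jc = {t: c for (a, t), c in jcount.items() if a == r}
        r_point = point(tcount[r], r_jc)
        sq += sum((x - y) ** 2 for x, y in zip(r_point, s_point))
    return math.sqrt(sq / len(S))

# Merging two Actor partitions (counts are illustrative):
tcount = {"a_alpha": 3, "a_beta": 1}
jcount = {("a_alpha", "c_alpha"): 3, ("a_alpha", "s_female"): 1, ("a_alpha", "s_male"): 2,
          ("a_beta", "c_beta"): 1, ("a_beta", "s_female"): 1}
print(radius(tcount, jcount, ["a_alpha", "a_beta"]))   # 0 would mean identical centroids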

4.2 Lossless Merge Operations

In this section, we examine lossless merge operations, i.e., operations that do not affect the accuracy of the resulting synopsis. As we discuss in Section 5, our construction algorithm relies on such operations in order to efficiently compress the data graph to a much smaller accurate summary. Intuitively, we expect a merge operation to be lossless if radius(merge(S)) = 0, i.e., nodes in S have the exact same join ratios to every other node in the synopsis. In our work, we prove a somewhat surprising result: merge(S) may be lossless even if the nodes in S do not match on every dimension of the underlying point space! This type of restricted similarity is captured by our novel concept of all-but-one similarity that we define next.

Let T ∈ neighbors(Ri) be a schema neighbor of Ri and let r and r′ be nodes of Ri. We say that r and r′ are similar with respect to T if they have the same join ratios to every summary node that corresponds to T. Intuitively, r and r′ cannot be distinguished based solely on their neighbor nodes of T and are thus characterized by some redundancy. We define all-but-one similarity by extending this property to specific subsets of neighbors(Ri). More precisely, r and r′ are called all-but-one similar if they are similar with respect to |neighbors(Ri)| − 1 neighbors of Ri. Hence, r and r′ are all-but-one similar either if they have different join relationships to a single relation, or different frequency distributions for a single attribute only. This definition of similarity is therefore quite liberal.

As our earlier discussion hinted, all-but-one similarity captures the redundancy of statistical information in a TuG synopsis. We formalize this property as follows:

Theorem 4.1. Consider a TuG T G and two all-but-one similar nodes r and r′. Let T G′ be the summary that results from merge({r, r′}). Then, synopses T G and T G′ yield the same selectivity estimate for every complex join query Q.

To prove the theorem, we show that the embeddings of a query Q over T G can be mapped to a set of equivalent embeddings over T G′. The intuition behind Theorem 4.1 can be described as follows. Consider a relation Ri that is linked to another relation Rk and an attribute Aj in the schema graph. Let r and r′ be two tuple partitions of Ri that have the same joins to Rk but different value distributions for Aj. (Hence, r and r′ are all-but-one similar.)

Figure 4: (a) Starting summary T G, (b) T G after merging A with respect to C, (c) T G after merging A with respect to B, and then with respect to C.

Essentially, the different value distributions for r and r′ are not correlated to their joins to Rk, which are identical. Hence, a merge({r, r′}) will not result in the loss of a join-to-value correlation when the new node is linked to the union of neighboring nodes for r and r′. In that sense, the new synopsis remains equivalent to the original summary.

Example 4.1.: Figure 4 shows the application of Theorem 4.1 on a sample synopsis T G. Looking at Figure 4(a), we observe that nodes a2 and a6 are all-but-one similar with respect to neighbor C, as they join to the same C nodes with the same ratios. A similar observation holds for nodes a4 and a5. We can thus perform two merge operations, as shown in Figure 4(b), without compromising the accuracy of the summary. After these operations, no nodes are all-but-one similar. Similarly, we observe that nodes a1 and a2 of the original summary T G are all-but-one similar with respect to neighbor B, and the same holds for a3 and a4, and for a5 and a6. By performing the three lossless merges, we obtain the first summary of Figure 4(c). At this point, the merged nodes a12 and a34 have become all-but-one similar with respect to C. It is possible, therefore, to perform another lossless merge and derive the second synopsis of Figure 4(c).

The previous example demonstrates clearly that it is possible to derive a variable number of lossless merges depending on the order in which neighbors are considered for all-but-one similarity. The reason is that merge operations affect the implied multi-dimensional space of join ratios and thus impact the existence of all-but-one similarity. This is an interesting property of our framework and at the same time a significant challenge, as it becomes necessary to select carefully the order in which all-but-one neighbors are considered. We revisit this issue in Section 5.1, where we introduce a heuristic for determining an effective sequence of merge operations.

Overall, all-but-one similarity is a novel feature of our model that provides a unified (in terms of joins and values) characterization of statistical redundancy in a TuG synopsis. It is interesting to note that XML summarization models have also employed similar definitions of similarity, albeit with stricter conditions. In TreeSketches [14], for instance, similarity is defined in terms of all the neighbors of a node in the data graph, which is considerably more restrictive compared to the except-one subset of schema neighbors used here. As we show in our experimental study, the greater flexibility of all-but-one similarity achieves a more aggressive compression of statistical information, and is key in the scalable construction of accurate TuG summaries. Transferring our results to the XML domain is an interesting direction for future work.
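As a small illustration of the definition, the following sketch (with illustrative counts) tests whether two nodes of the same relation are all-but-one similar by comparing their join ratios grouped per schema neighbor.

def jratios_by_neighbor(tcount, jcount, node, table_of):
    """Group node's join ratios by the schema neighbor (relation/attribute) of each edge."""
    out = {}
    for (a, t), c in jcount.items():
        if a == node:
            out.setdefault(table_of[t], {})[t] = c / tcount[node]
    return out

def all_but_one_similar(tcount, jcount, r1, r2, table_of, neighbors):
    """True if r1 and r2 have identical join ratios with respect to all but at
    most one schema neighbor of their relation."""
    p1 = jratios_by_neighbor(tcount, jcount, r1, table_of)
    p2 = jratios_by_neighbor(tcount, jcount, r2, table_of)
    mismatched = sum(1 for T in neighbors if p1.get(T, {}) != p2.get(T, {}))
    return mismatched <= 1

# In the spirit of nodes a2 and a6 of Figure 4(a): same ratios to C, different to B
# (all counts here are illustrative).
table_of = {"c1": "C", "c2": "C", "b1": "B", "b2": "B"}
tcount = {"a2": 2, "a6": 2}
jcount = {("a2", "c1"): 2, ("a2", "b1"): 4,
          ("a6", "c1"): 2, ("a6", "b2"): 6}
print(all_but_one_similar(tcount, jcount, "a2", "a6", table_of, ["B", "C"]))   # True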

5. TuG CONSTRUCTION

In this section, we introduce the TuGBuild algorithm for constructing an accurate TuG synopsis within a specific storage budget. The proposed algorithm operates in three stages: in the first stage, it merges all-but-one similar nodes (Section 4.2) in order to compress the data graph to a much smaller lossless synopsis; in the second stage, it uses merge operations of low radius in order to further compress the summary while not sacrificing accuracy; and in the third stage, it substitutes the detailed value information with compressed single-dimensional histograms. Throughout this process, TuGBuild relies on efficient disk-based structures and scalable algorithms in order to handle large data sets under limited memory resources. Our experimental results (Section 6) verify that our techniques scale well in practice and are hence of interest for other graph-based summarization methods, e.g., in the XML domain, that operate entirely in main memory. The following sections describe the construction algorithm in more detail. We first examine the key problem of identifying effective merge operations in a scalable fashion, and then discuss the specifics of the TuGBuild algorithm.

5.1 Scalable TuG Compression

As noted earlier, our TuGBuild algorithm derives an effective summary by applying merge operations of low radius to an initial accurate synopsis. An important problem therefore is identifying such operations in a scalable fashion, in order to maintain the efficiency of the construction process. We employ a solution that is based on the clustering of tuple partitions according to their join ratios to neighboring nodes. More concretely, our algorithm considers one relation Ri at a time, and computes a clustering of the multi-dimensional points that correspond to the join ratios of its nodes. Each identified cluster corresponds to a set of nodes with similar join fractions, and thus represents a candidate merge operation of low radius. After performing the implied merge operations for Ri, the algorithm considers another relation and repeats the same process. An important observation is that the clustering can be performed with well-known scalable techniques, e.g., BIRCH [17] or subspace clustering [3], thus enabling our solution to scale gracefully to large data sets.

A natural question concerns the order in which relations are considered in this iterative process. As hinted by our example of Figure 4 (Section 4.2), the order of merge operations can significantly affect the ability to find tight clusters. (Our original discussion concerned merges of all-but-one similar nodes, but it can be extended to the case of general merges.) In our work, we adopt an intuitive heuristic that examines relations in decreasing order of their clustering potential, that is, the ability to form few and tight clusters for the corresponding summary nodes. The goal is to enable a high compression of the TuG summary (few clusters), while keeping the error of approximation low (tight clusters).

We now describe a metric CV(Ri) that quantifies the

clustering potential for a specific relation Ri. As mentioned previously, the goal is to rank relations based on the CV metric and thus determine the order in which they are considered for clustering. The proposed CV(Ri) metric is based on the following intuitive observation: the nodes of Ri will cluster well if there is a tight clustering with respect to every schema neighbor in neighbors(Ri). (This idea is similar to subspace clustering [3], where clusters in low-dimensional spaces are used to compute clusters in higher dimensions.) Our approach is to estimate the quality of clustering Ri with respect to each schema neighbor, and to compute CV(Ri) by combining the individual estimates. To quantify the effectiveness of clustering Ri on a specific schema neighbor T ∈ neighbors(Ri), we consider the variability of join ratios from nodes of Ri to nodes of T. More concretely, let t ∈ T G be a node corresponding to T and define Nt to be its neighbors that correspond to Ri. We consider a conceptual merge operation merge(Nt) resulting in a single node rt, and use the aggregate node rt in order to measure the variance of join ratios from Nt to t. More formally, we introduce a localized CV(Ri, t) metric as follows:

    CV(Ri, t) = (|Nt| − 1) / ∑_{r∈Nt} (jratio(r, t) − jratio(rt, t))²

The numerator of this expression is indicative of the “savings” in storage space, since |Nt | nodes are substituted with the single node rt . Similarly, the denominator is indicative of the error of approximation, as each centroid in Nt is now substituted with the centroid of rt . The ratio captures the space saved per unit of introduced error, and is thus suggestive of the effectiveness of clustering Ri with respect to the single node t. We define the overall metric CV(Ri ) as the average of localized metrics CV(Ri , t) across all nodes t of T and across all neighbors T ∈ neighbors(Ri ). Intuitively, we expect a high CV(Ri ) value if the nodes of Ri have similar join ratios to every schema neighbor T ∈ neighbors(Ri ), implying that an actual clustering of Ri will yield few clusters that are tight. Clearly, CV(Ri ) is only a heuristic metric that may overestimate the potential of Ri . Note, however, that the construction algorithm uses CV(Ri ) only to guide the selection of relations for clustering and not to compute the actual clusters. We revisit this point in Section 6, where we verify experimentally the effectiveness of the CV(Ri ) metric in guiding the construction process.
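For illustration, here is a minimal sketch of the localized CV(Ri, t) metric and the averaged CV(Ri); the counts are made up, and a higher score suggests better clustering potential.

def cv_local(tcount, jcount, t, Nt):
    """CV(Ri, t) = (|Nt| - 1) / sum_{r in Nt} (jratio(r, t) - jratio(r_t, t))^2,
    where r_t is the conceptual merge of all nodes in Nt."""
    merged_tcount = sum(tcount[r] for r in Nt)
    merged_jcount = sum(jcount[(r, t)] for r in Nt)
    merged_ratio = merged_jcount / merged_tcount
    err = sum((jcount[(r, t)] / tcount[r] - merged_ratio) ** 2 for r in Nt)
    if err == 0.0:
        return float("inf")        # identical ratios: a perfect clustering opportunity
    return (len(Nt) - 1) / err

def cv(tcount, jcount, Ri_nodes, neighbor_nodes):
    """CV(Ri): average of CV(Ri, t) over all summary nodes t of all schema neighbors."""
    scores = []
    for t in neighbor_nodes:
        Nt = [r for r in Ri_nodes if (r, t) in jcount]
        if len(Nt) > 1:
            scores.append(cv_local(tcount, jcount, t, Nt))
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative counts: three Actor partitions joining one Cast partition c1.
tcount = {"a1": 2, "a2": 2, "a3": 4}
jcount = {("a1", "c1"): 2, ("a2", "c1"): 2, ("a3", "c1"): 6}
print(cv(tcount, jcount, ["a1", "a2", "a3"], ["c1"]))   # ~10.7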

5.2 Construction Algorithm

We now proceed to describe the specifics of the TuGBuild construction algorithm, shown in Figure 5. TuGBuild receives as input the relational database D, a storage budget BS, and a value budget BV, and returns a TuG summary T G that uses BS space units for its graph structure and BV value units for the compressed value summaries vsum(Aj, r). As we have discussed earlier, the algorithm operates in three stages: the first stage (lines 1–7) computes a reference summary that maintains the key statistical correlations of D while being much smaller in size than the data graph GD; the second stage (lines 9–18) compresses this summary by applying a sequence of carefully chosen merge operations; finally, the third stage (lines 19–31) substitutes value-related information with compressed histograms. The following sections discuss each stage in detail.

Building a Reference TuG. The outcome of the first stage (lines 1–7) is a TuG that is much smaller than the original data D, and at the same time preserves the key statistical correlations between and across the distribution of joins and values. This is a key step in making the construction algorithm scalable, as it reduces the size of the input for the subsequent stages of the build process. Initially, TuGBuild transforms each numerical value domain to a small collection of tight numerical ranges (line 1). The intuition is that the values in each range are close in the underlying domain and can thus be treated as equivalent in the summarization process. (In our experiments, we identify 100 ranges for each domain using the max-diff heuristic [15]; a simplified sketch of this type of bucketing appears below.) This compression is essential in identifying all-but-one similar nodes, as it increases the similarity of tuples in terms of value information. The trade-off, of course, is the introduction of a small error in the approximation of values. After this step, TuGBuild compresses the resulting data graph by identifying and merging all-but-one similar nodes, using an iterative process (lines 3–7) similar to the one described in Section 5.1. In each iteration, TuGBuild considers each relation Ri and each except-one subset N of neighbors(Ri), and ranks the (Ri, N) pairs according to a CV(Ri, N) metric. (CV(Ri, N) is computed exactly as CV(Ri) except that it only considers the schema neighbors in N.) The top pair (Rm, Nm) is selected and a perfect clustering is computed for the nodes of Rm and for the dimensions that correspond to Nm. Clearly, each identified cluster comprises nodes that are all-but-one similar with respect to neighbors Nm. The algorithm performs one merge operation for each such cluster, and repeats the process until every relation has been examined at least once and it is not possible to identify new clusters. Note that TuGBuild may select the same pair (Ri, N) multiple times, as merges on the neighbors of Ri may increase the overlap of Ri nodes in terms of their joining nodes and thus render them all-but-one similar.

Compressing Join Information. The goal of the second stage (lines 9–18) is to further compress the TuG summary with additional node-merge operations. Since the algorithm has exhausted lossless merges in the first stage, TuGBuild resorts to lossy merge operations, using the radius of the merge in order to control the loss of accuracy (Section 4). As shown in Figure 5, the selection of merge operations proceeds in two nested loops. The outer loop (lines 9–18) controls a clustering threshold ctJ that determines the maximum radius metric of the selected merge operations (Section 4.1). The idea is to start with a low threshold for the initial merges, and to gradually increase it if it is necessary to compress further. The inner loop (lines 11–16) generates and applies the node-merge operations using the approach outlined in Section 5.1. More precisely, the algorithm selects the relation Rm with the highest CV(Rm) value, and computes a clustering of its nodes such that each cluster has radius less than ctJ. The algorithm performs one merge operation for each identified cluster, and continues the loop until no merge operations can be identified for the current threshold. At that point, the outer loop resumes and the threshold is increased. In our experiments, we have used αJ = 0.05, i.e., the clustering threshold increases by 5% on every iteration.
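The sketch below shows a simplified max-diff-style bucketing for the range-collapsing step of line 1: it cuts the sorted domain at the largest frequency differences between adjacent distinct values. The real max-diff heuristic of [15] also weighs value spreads, so this is only an approximation of the idea, and the data is illustrative.

from collections import Counter

def maxdiff_ranges(values, num_ranges):
    """Max-diff-style bucketing: place range boundaries at the largest differences
    in frequency between adjacent distinct values (a simplified variant of [15])."""
    freq = Counter(values)
    domain = sorted(freq)
    if len(domain) <= num_ranges:
        return [(v, v) for v in domain]
    # frequency difference between each pair of adjacent distinct values
    diffs = [(abs(freq[domain[i + 1]] - freq[domain[i]]), i) for i in range(len(domain) - 1)]
    cuts = sorted(i for _, i in sorted(diffs, reverse=True)[: num_ranges - 1])
    ranges, start = [], 0
    for i in cuts:
        ranges.append((domain[start], domain[i]))
        start = i + 1
    ranges.append((domain[start], domain[-1]))
    return ranges

years = [1950] * 5 + [1951] * 6 + [1990] * 40 + [1991] * 42 + [2005] * 80
print(maxdiff_ranges(years, 3))   # [(1950, 1951), (1990, 1991), (2005, 2005)]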

Procedure TuGBuild(D, BS, BV)
Input: Database D; storage budget BS; value budget BV
Output: TuG T G
begin
   /** Construct reference synopsis **/
   1.  Merge value nodes in D in tight ranges
   2.  Initialize T G with one node per tuple
   3.  while change do
   4.      Compute CV(Ri, N), Ri ∈ D, N ⊂ neighbors(Ri) ∧ |N| = |neighbors(Ri)| − 1
   5.      (Rm, Nm) := argmax_{Ri,N}(CV(Ri, N))
   6.      cluster(Rm, Nm, 0)               /** Clusters of radius 0 **/
   7.      Perform a merge for each identified cluster
   8.  end
   /** Compress Join Information **/
   9.  ctJ := ct0J
   10. while (join space of T G) > BS do
   11.     while change do
   12.         Compute CV(Ri), Ri ∈ D
   13.         Rm := argmax_{Ri}(CV(Ri))
   14.         cluster(Rm, neighbors(Rm), ctJ)
   15.         Perform a merge for each identified cluster
   16.     done
   17.     ctJ := (1 + αJ) · ctJ             /** Increase clustering threshold **/
   18. done
   /** Generate compressed value summaries **/
   19. ctV := ct0V
   20. preverr := inf; curerr := inf
   21. loop
   22.     preverr := curerr
   23.     for each relation Ri and related attribute Aj do
   24.         cluster(Ri, Aj, ctV)
   25.         Create a distinct histogram for each cluster
   26.     done
   27.     Allocate BV units of space among created histograms
   28.     curerr := compute_value_err()
   29.     ctV := (1 + αV) · ctV
   30. while curerr < preverr
   31. Substitute value partitions and edges with best histograms
   32. return T G
end

Figure 5: Algorithm TuGBuild

To perform the clustering that identifies node-merge operations (line 14), TuGBuild employs a modified version of the well-known BIRCH algorithm [17]. We have chosen BIRCH as it can generate an effective clustering in a single pass over the data, thus enabling our algorithm to scale to large data sets. To handle the high dimensionality of our point space, we have modified BIRCH to use fixed-size randomized sketches in order to approximate the large coordinate vectors of the internal CF-Tree nodes [17]. In particular, our modified BIRCH algorithm uses count-min sketches [7] to approximate the linear sum component of a cluster's statistics and to essentially estimate the distance metric for the insertion of new points. Leaf clusters, on the other hand, typically have a lower number of non-zero coordinates and are stored exactly. As always, the complete details can be found in the full version of this paper [16].

Compressing Value Information. The goal of the third and final stage of the TuGBuild algorithm (lines 19–31) is to substitute the detailed value distributions in the synopsis with compressed value summaries in each node. To make effective use of the allotted space budget BV, TuGBuild takes advantage of the redundancy that may exist in the data and creates a distinct value summary per group of nodes that have similar value characteristics.

More concretely, consider a relation Ri and a related attribute Aj. TuGBuild computes a clustering of Ri based solely on the coordinates that correspond to Aj, thus identifying clusters of nodes that have similar value distributions for Aj. For each such cluster g, the algorithm creates a single summary vsum(Aj, g) that summarizes the union of the corresponding value distributions, and shares it among all the nodes in the cluster. The selection of these clusters is performed adaptively, using a threshold ctV that controls the value-based similarity of nodes within each cluster. The selection of ctV presents an interesting trade-off: a low threshold implies higher similarity and hence more clusters, but also more summaries that share the fixed space budget BV. TuGBuild explores this trade-off with an iterative strategy (lines 19–30), which initializes ctV to a low value ct0V and increases it gradually in order to discover more effective solutions. For a specific threshold, TuGBuild computes the clustering for each relation Ri and related attribute Aj, using ctV to control the radius of the generated clusters (line 24). After distributing the fixed space budget BV to the initialized summaries (line 27), TuGBuild computes an error metric (function compute_value_err) that characterizes the accuracy of the overall value approximation. If the current error constitutes an improvement over the previous iteration, then the threshold is increased and another iteration is performed. Overall, the goal is to achieve an effective partitioning of summary nodes so that the individual value summaries vsum(Aj, g) are allocated enough storage while summarizing value distributions of adequate similarity.

Several details of this stage depend on the approximation methods that are used to implement the value summaries. Two key issues are the distribution of the allotted space budget BV to the value summaries that correspond to a specific threshold ctV, and the computation of the approximation error. In this paper, we rely on histograms for value summarization and use well-known relational techniques [8] for determining the number of buckets for each histogram. For the error of approximation, we adopt the squared error of point frequencies. As always, the complete details can be found in the full version of this paper [16].
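Returning to the BIRCH modification described before Figure 5, the following is a generic count-min sketch [7] of the kind that could approximate the linear-sum component of a cluster feature in fixed space; the widths, hashing scheme, and keys below are our own illustrative choices, not the paper's.

import random

class CountMinSketch:
    """Minimal count-min sketch: approximates a non-negative vector (here, one
    linear-sum component per dimension of a BIRCH cluster feature) in fixed space,
    with one-sided overestimation error."""
    def __init__(self, width=256, depth=4, seed=0):
        rnd = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rnd.getrandbits(64) for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]

    def _cell(self, row, key):
        # Python's built-in hash is per-process salted; consistent within one run
        return hash((self.salts[row], key)) % self.width

    def add(self, key, value=1.0):
        for row in range(self.depth):
            self.table[row][self._cell(row, key)] += value

    def query(self, key):
        return min(self.table[row][self._cell(row, key)] for row in range(self.depth))

    def merge(self, other):
        # merging two clusters adds their linear sums component-wise
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]

cm = CountMinSketch()
cm.add("dim_c_alpha", 0.75)      # one coordinate of a cluster's linear sum
cm.add("dim_c_alpha", 0.25)
print(cm.query("dim_c_alpha"))   # ~1.0 (possibly overestimated)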

6. EXPERIMENTAL STUDY

In this section, we present the results of an extensive empirical study that we have conducted in order to evaluate the performance of the proposed TuG framework. Overall, our results on real-life and synthetic data sets verify the effectiveness of TuG summaries and demonstrate their benefits over previously proposed (table- and schema-level) summarization techniques.

6.1 Experimental Methodology

This section describes the techniques that we have used in our study, the data sets and related workloads, and the metrics for evaluating the performance of summarization techniques.

Techniques. We base our experimental study on the following summarization techniques:

- TuGs. We have completed a prototype implementation of the TuG framework that we introduce in this paper.

                                        TPC-H        IMDB
Number of relations                     8            8
Tuples in largest relation              6,000,000    2,702,114
Tuples in smallest relation             5            68,975
Size of text files                      1GB          139 MB

Table 1: Data Set Characteristics

                                        TPC-H        IMDB
Avg. result size of positive queries    634,557      49,590
Number of join predicates               4-8          4-6
Number of selection predicates          1-7          1-5

Table 2: Workload characteristics

For TuG construction, our prototype uses a custom implementation of the BIRCH algorithm with the modifications described in Section 5. Unless otherwise noted, we set the initial clustering thresholds to ct0J = ct0V = 1.5 and increase them by αJ = αV = 5% on every iteration.

- Join Synopses. We compare the performance of TuGs against the Join Synopses technique of Acharya et al. [2]. A Join Synopsis (also known as a Join Sample) is essentially a uniform random sample of the join of a specific table along key/foreign-key dependencies. As we discuss in Section 7, however, this technique can only be used on schemata that do not contain many-to-many relationships and hence we use it only on a subset of our experiments.

- Multi-dimensional Wavelets. For schemata where Join Synopses cannot be applied, we compare the performance of TuGs against the wavelet-based techniques of Chakrabarti et al. [5]. (We are grateful to the authors of the paper for providing us with their source code.) The key idea is to create one multi-dimensional wavelet summary per table, and to evaluate the relational operators of a query (e.g., select, project, join) entirely in the wavelet domain. Wavelet summaries are table-level synopses and can thus be applied to any type of database schema.

- Histograms. We use single-dimensional histograms as the baseline summarization technique of our study. Instead of using a custom histogram implementation, we have opted to use the selectivity estimates that are generated by the histogram-based statistics of a commercial system (referred to as system X). We instruct the system to create detailed statistics on the join and value attributes of all tables and then use the EXPLAIN command to obtain the optimizer's estimates on query selectivity. To the best of our knowledge, system X employs the max-diff [15] heuristic for the creation of each histogram.

Overall, our study uses two schema-level (TuGs, Join Synopses) and two table-level (Wavelets, Histograms) techniques. Histograms serve as our baseline technique and are not expected to perform well, as they do not capture any correlations across value distributions. Wavelets and Join Synopses, on the other hand, are essentially "multi-dimensional" techniques that capture the combined distribution of joins and values. They are, therefore, the main competitors to our proposed TuG synopses.

Data Sets. We use two data sets in our study: (a) IMDB, a real-life data set that contains information on movies, and (b) a skewed version of the well-known TPC-H data set. (We have used Vivek Narasayya's TPC-H generator with z=1 for generating the data.) The characteristics of the two data sets are shown in Table 1. IMDB contains many-to-many relationships between its central movies table and the remaining tables (actors, movie genres, and producers), while TPC-H contains only many-to-one relationships that originate in the central LineItem table. It should be noted that both data sets use key/foreign-key joins to encode these relationships. Based on these characteristics, we apply TuGs and histograms on both data sets, Join Synopses on TPC-H only, and Wavelets on IMDB only.

Query Loads. We evaluate the accuracy of each summarization technique using synthetic query loads. We generate positive workloads of tree-join queries by selecting connected subgraphs of the schema graph and attaching random selection predicates on the selected attributes. We ensure that each query has at least 3 join predicates and at least one selection predicate, and generate 350 queries in each case. The characteristics of the positive workload for the two data sets are listed in Table 2. We use the same methodology to generate negative workloads with similar characteristics. While we focus on tree-join queries in our presentation, we have also experimented with workloads of cyclic join queries. Our results have been qualitatively the same and we omit them in the interest of space.

Evaluation Metric. Given a true selectivity sel and an estimate ŝel, we use the absolute error AE(sel, ŝel) = |ŝel| for negative queries (sel = 0), and the absolute relative error ARE(sel, ŝel) = |sel − ŝel| / max(sel, sn) for positive queries. Parameter sn is a sanity bound that avoids the artificially high errors of low-count queries. Following previous studies in summarization, we set sn to the 10-th percentile of true query counts.

To compare a specific error metric on two summaries, we employ the Cumulative Frequency Distribution (CFD) of the metric over the queries of the workload. A point (x, y) in the CFD indicates that y% of the queries in the workload have an error that is less than or equal to x. (The interpretation of x depends of course on the choice of AE or ARE.) Using this comparator, a technique A is more effective than a technique B if the CFD of A dominates the CFD of B, i.e., a larger percentage of queries has a lower error. We have found this approach to yield more interesting comparisons than using a single statistic (such as an average).
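The error metrics and the CFD comparison can be sketched directly; the workload numbers below are illustrative, and the sanity-bound computation is only a rough 10th-percentile stand-in.

def are(sel, est, sanity):
    """Absolute relative error for positive queries, with a sanity bound."""
    return abs(sel - est) / max(sel, sanity)

def cfd(errors, thresholds):
    """Cumulative frequency distribution: for each threshold x, the percentage of
    workload queries with error <= x."""
    n = len(errors)
    return [(x, 100.0 * sum(e <= x for e in errors) / n) for x in thresholds]

true_counts = [10, 200, 50, 4000, 7]           # illustrative true selectivities
estimates   = [12, 150, 80, 3000, 30]
sanity = sorted(true_counts)[max(0, len(true_counts) // 10 - 1)]   # ~10th percentile
errors = [are(s, e, sanity) for s, e in zip(true_counts, estimates)]
print(cfd(errors, [0.1, 0.3, 0.5, 1.0]))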

6.2 Experimental Results

We now present some of the experiments that we have conducted in our study. The specific subset captures the main traits of the wide range of results we obtained by varying the data, the query workloads, and the model parameters.

6.2.1 Effectiveness of Construction Algorithm

In the first set of experiments, we evaluate the effectiveness of the TuGBuild algorithm. We focus on three important aspects of the proposed algorithm: the use of the clusterability metric CV, all-but-one similarity, and total construction time.

CV Metric. As described in Section 5, TuGBuild processes relations in decreasing order of their CV metric in order to maximize the compression of the summary while maintaining its accuracy. To evaluate the effectiveness of this heuristic, we compare it against a strategy that processes relations in a random order.

We focus on the first stage of the algorithm and measure the effectiveness of the two strategies in terms of the size (in number of nodes) of the generated reference summary. For the random strategy, we execute several experiments using different seeds for the random number generator. This results in a large number of distinct reference summaries, each one corresponding to a different sequence of choices for the clustered relation. (The sequence that is derived by CV is among the random sequences that have been generated.)

        Random Orderings Percentiles
        min       25%       50%       75%       max       CV
        747       1,090     1,236     1,489     1,825     780

Table 3: Size distribution for random reference summaries.

Table 3 shows percentile information for the sizes of 1326 random reference TuGs for a scaled-down version of our IMDB data set. The size of the CV-based reference TuG is shown in the last column. As our results indicate, the sequence of clustered relations affects significantly the size of the reference TuG. The largest random TuG, for instance, is more than twice as big as the smallest random TuG (1,825 nodes vs. 747 nodes). We observe that the CV metric is very effective in guiding the clustering process, and results in a small reference synopsis that is very close to the minimum size achieved by random orderings (780 nodes vs. 747 nodes for the minimum).

All But One Similarity. As we have discussed earlier, TuGBuild relies on the novel concept of all-but-one similarity in order to perform lossless merges on the data graph GD. Here, we evaluate the impact of this choice by comparing it against the use of complete similarity for the generation of the reference synopsis.

Table 4 shows the size of the reference summary for all-but-one and complete similarity on our two data sets. To put the measurements in context, we also list the size of the starting data graph. Our results demonstrate that all-but-one similarity enables a substantially higher compression rate, outperforming full similarity by two orders of magnitude in both cases. In the IMDB data set, for instance, all-but-one similarity generates a reference summary of 33 thousand nodes, compared to 4.4 million nodes for full similarity. This level of compression is a key factor for the efficiency of the TuGBuild algorithm, as it reduces significantly the size of the input for the second stage of the build process.

TuG Construction Time. In this experiment, we evaluate the efficiency of the complete TuGBuild algorithm. We consider all three stages of the build process and measure the execution time of TuGBuild in order to compress the original data graph down to a small synopsis of 5KB in storage.

TPC-H IMDB

Data Graph 8M 4.7M

Full Sim. 4.4M 4.5M

All-but-one Sim. 33K 65K

Table 4: Size of reference summary (in number of nodes) for full and all-but-one similarity.

All measurements are made on an otherwise unloaded 3GHz Pentium 4 machine. In this setting, TuGBuild requires 55 minutes for TPC-H and 85 minutes for the more complex IMDB data set. Given the relatively large size of our data sets, these results validate the efficiency of TuGBuild and the applicability of our techniques in practice.
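For concreteness, the following minimal Python sketch (not part of the paper) illustrates how the random-ordering baseline behind Table 3 could be produced and summarized; build_reference is a hypothetical stand-in for the first stage of TuGBuild that returns the node count of the reference summary obtained under a given relation ordering.

```python
import numpy as np

def summarize_random_orderings(relations, build_reference, num_trials=1326, seed=0):
    """Run the (hypothetical) first stage of TuGBuild under many random relation
    orderings and report percentiles of the resulting reference-summary sizes."""
    rng = np.random.default_rng(seed)
    sizes = []
    for _ in range(num_trials):
        order = list(rng.permutation(relations))      # one random clustering sequence
        sizes.append(build_reference(order))          # size (#nodes) of reference TuG
    return {p: float(np.percentile(sizes, p)) for p in (0, 25, 50, 75, 100)}
```

The CV-driven ordering would then be evaluated with a single additional call to build_reference and compared against these percentiles, as in Table 3.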

6.2.2 TuGs vs. Existing Techniques

In this set of experiments, we compare our proposed TuG synopses against previously proposed techniques for relational selectivity estimation. Based on the applicability of the different techniques, we split our experiments in two groups: for TPC-H, we compare TuGs against Histograms and Join Synopses, while for IMDB, where Join Synopses cannot be applied, we compare TuGs against Histograms and multi-dimensional Wavelets. In all cases, the available storage budget is set equal to the size of the histograms created by System X (approximately 30KB for TPC-H and 20KB for IMDB).

TPC-H. Figure 6(a) shows the relative error CFD for Histograms, Join Synopses, and TuGs for a workload of positive queries over the TPC-H data set. Our results clearly demonstrate the effectiveness of TuGs as accurate data synopses. Given the small space budget of 30KB, TuGs enable an estimation error of less than 30% for half of the queries in the workload. This level of accuracy is notable if one considers the limited space budget (less than 0.1% of the original data size) and the complexity of the queries in our workloads: 4-way to 8-way joins with up to 7 selection predicates. Compared to Join Synopses, TuGs enable accurate estimates for a larger part of the workload. As an example, TuGs yield an estimation error of less than 30% for 181 queries in the workload (55% of the total number of queries), compared to 128 queries (39% of the total) for Join Synopses at the same level of accuracy. (Histograms perform consistently worse than both TuGs and Join Synopses.) Essentially, Join Synopses are susceptible to high estimation errors when the sampling rate is low compared to the complexity of the data distribution. TuGs, on the other hand, summarize the structure of the complete data graph and can thus capture effectively, in limited space, the key statistical traits of the data. It is interesting to note that Join Synopses perform better than TuGs for errors greater than 100%. This is an artifact of the estimation algorithm of Join Synopses, which returns an estimate of 0 tuples (100% error) when the sample does not yield any results for the query.

Table 5(a) shows the absolute error percentiles for the three methods, for a workload of negative queries over TPC-H. The clear winner in this case is Join Synopses, which by definition compute a perfect estimate of 0 tuples for any negative query. Our TuG synopses continue to perform well, yielding an absolute error of less than 9 tuples for 50% of the tested queries. We observe that TuGs may generate estimates of high absolute error (up to 238K tuples), but these are restricted to a few outlier queries in our workload.

IMDB. For our IMDB data set, we compare the performance of TuGs against Histograms and multi-dimensional Wavelets. Recall that we do not apply Join Synopses on IMDB, as they cannot model the many-to-many relationships that are found in the schema. We note that Wavelet construction timed out on the full IMDB data set; we therefore also present a separate set of results on a scaled-down version of IMDB.

[Figure 6: Performance of TuGs vs. existing techniques: (a) TPC-H, (b) IMDB (Full), (c) IMDB (Scaled-Down). Each panel plots the cumulative frequency distribution of the estimation error, with Avg. Rel. Error (%) on the x-axis and Queries in Workload (%) on the y-axis. Panel (a) compares TuG, Histograms, and Join Synopses; panel (b) compares TuG and Histograms; panel (c) compares a 20KB TuG against a 5.5MB Wavelet summary.]

(a) TPC-H
  Method           0%        25%       50%        75%         100%
  TuG               0        0.1       8.1      104.9      238,766
  Histograms      2.0    1,493.8    14,412    120,491    5,866,480
  Join Synop.       0          0         0          0            0

(b) IMDB
  Full IMDB
  Method           0%      25%        50%        75%        100%
  TuG               0      0.1        0.8        5.8       382.2
  Histograms        3    258.2    1,081.4    4,583.1    20,218.2

  Scaled-Down IMDB
  Method           0%      25%      50%      75%      100%
  TuG               0      0.1      0.3      1.1      12.7
  Wavelets          0      0.1      0.4      2.0      23.7

Table 5: Absolute Estimation Error (in tuples) for Negative Queries: (a) TPC-H, (b) IMDB.

For the smaller data set, we retained 20% of the wavelet coefficients for each table, resulting in a Wavelet summary of 5.5MB in size. We experimented with Wavelet summaries of smaller size, but found that the accuracy of estimation deteriorated quickly below this large space budget.

Figure 6(b) shows the CFD of relative estimation error for Histograms and TuGs, for positive queries over the IMDB data set and a space budget of 20KB. Figure 6(c) shows the same metric for Wavelets and TuGs on the scaled-down version of IMDB. In both cases, TuGs outperform the competing techniques by a large margin. For the full IMDB data set, TuGs enable an estimation error of less than 30% for half of the tested queries, whereas the error of the Histogram-based estimates is more than 140%. A similar picture appears in the scaled-down IMDB data set, where TuGs consistently yield more accurate estimates than multi-dimensional Wavelets. As an example, TuG estimates have less than 40% estimation error for 90% of the workload, while Wavelets provide this level of accuracy for only 25% of the tested queries. As we discuss in Section 7, the reason for the poor performance of Wavelet summaries (and, to some extent, of Histograms) is the presence of key/foreign-key joins in the underlying schema. This type of join essentially results in frequency matrices that include the keys of relations, which are difficult to summarize effectively due to their sparseness.

Table 5(b) shows the absolute estimation error for the three techniques, for a workload of negative queries. TuGs continue to outperform the other two techniques, yielding low-count estimates that are close to the true selectivity. We observe that Wavelets produce estimates competitive with TuGs, while Histogram-based estimates are again very inaccurate for negative workloads.

Overall, our results demonstrate the benefits of our proposed TuG model over existing techniques. As shown, TuGs enable more accurate estimates in almost all cases, while being applicable to a large class of relational data sets.

7. RELATED WORK

Relational data synopses have been investigated since the early years of database development, and previous studies have introduced a host of relevant techniques. Our review of prior work on this topic uses an abstract classification of the developed techniques into table-level synopses and schema-level synopses.

Table-level synopses, such as, histograms [1, 8, 15], wavelets [5, 13], sketches [4, 9], and table samples [10, 12], approximate the joint frequency distribution of values that appear in a single table. (Hence, the optimizer estimates the selectivity of a join query by combining information from several synopses.) Histograms and wavelets summarize the frequency matrix of the relation, and are most effective when the matrix contains relatively few contiguous regions of similar frequencies. These techniques therefore do not perform well for key/foreign-key joins, where the inclusion of a key attribute results in a frequency matrix with widely dispersed non-zero frequencies. Similar arguments can be made for independent table samples, which have been shown to be ineffective for key/foreign-key join queries [6]. Sketch-based techniques [4, 9] were originally developed for approximate query answering over streaming data, and can conceptually be applied as single-pass summaries over finite data sets. These works, however, assume continuous queries with fixed selection predicates, and it is not clear whether they can effectively support the ad-hoc queries that are common over finite data sets.

Schema-level synopses, such as, Join Synopses [2] (also known as Join Samples) and Probabilistic Relational Models (PRMs) [11], approximate the joint distribution of both joins and values over a subset of tables. As a result, selectivity estimation is based on information from a single schema-level synopsis that covers the query. Join Synopses [2] store uniform random samples of table joins, and compute estimates by scaling up the selectivity of the query over the sample. PRMs [11], on the other hand, rely on a Bayesian network that approximates the joint frequency distribution across multiple tables and enables the inference of selectivity estimates. The two techniques can only be applied to join queries that do not cover many-to-many relationships, i.e., queries that contain a well-defined "central" relation with emanating foreign-key dependencies. In the opposite case, the use of these techniques would require the application of generalized independence assumptions that are typically unrealistic and thus limit the accuracy of the generated estimates. Our proposed TuG synopses fall in the second category of schema-level summaries. Contrary to existing schema-level techniques, TuGs can model data sets with complex join relationships (including many-to-many relationships and cycles) and thus cover a larger class of real-world schemata and query loads.
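To make the contrast concrete, the following highly simplified Python sketch illustrates the sample-and-scale-up estimation idea behind sample-based schema-level synopses; the function names and the single-predicate interface are illustrative assumptions, not the actual algorithm of [2], which maintains samples along key/foreign-key join paths.

```python
import random

def build_join_sample(joined_rows, sample_size, seed=0):
    """Uniform random sample of a (pre-computed) full join; joined_rows is a list of dicts."""
    rng = random.Random(seed)
    return rng.sample(joined_rows, min(sample_size, len(joined_rows)))

def estimate_selectivity(sample, full_join_size, predicate):
    """Scale up the fraction of sample rows that satisfy the query predicate.
    If no sample row qualifies, the estimate degenerates to 0 tuples, which is
    exact for negative queries but corresponds to a 100% relative error for
    positive ones (the artifact observed in Section 6.2.2)."""
    if not sample:
        return 0.0
    matches = sum(1 for row in sample if predicate(row))
    return full_join_size * matches / len(sample)
```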

8. CONCLUSIONS

In this paper, we initiate the study of schema-level synopses for relational data sets with complex join relationships. We introduce a new class of summaries, termed TuG synopses, that rely on a graph-based model in order to capture accurately the key statistical traits of the underlying data distribution. We detail the TuG model and its properties, and describe an efficient algorithm for constructing accurate TuGs within a limited space budget. Our experimental results verify the effectiveness of our approach and demonstrate its benefits over existing summarization techniques.

9. REFERENCES

[1] A. Aboulnaga and S. Chaudhuri. Self-Tuning Histograms: Building Histograms Without Looking at Data. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1999.
[2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1999.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1998.
[4] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, 1996.
[5] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate Query Processing Using Wavelets. In Proceedings of the 26th Intl. Conf. on Very Large Data Bases, 2000.
[6] S. Chaudhuri, R. Motwani, and V. Narasayya. On Random Sampling over Joins. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1999.
[7] G. Cormode and S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. J. Algorithms, 55(1), 2005.
[8] A. Deshpande, M. Garofalakis, and R. Rastogi. Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 2001.
[9] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 2002.
[10] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz. Bifocal Sampling for Skew-Resistant Join Size Estimation. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1996.
[11] L. Getoor, B. Taskar, and D. Koller. Selectivity Estimation using Probabilistic Models. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 2001.
[12] R. J. Lipton, J. F. Naughton, D. A. Schneider, and S. Seshadri. Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science, 116(1 & 2), 1993.
[13] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1998.
[14] N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate XML Query Answers. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 2004.
[15] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1996.
[16] J. Spiegel and N. Polyzotis. Graph-Based Synopses for Relational Selectivity Estimation. Technical report, Univ. of California Santa Cruz, 2006.
[17] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, 1996.