Kotagiri Ramamohanarao

Technical Report 93/12

Department of Computer Science
The University of Melbourne
Parkville, Victoria 3052, Australia

Abstract

The join operation is one of the most frequently used and expensive query processing operations in relational database systems. One method of joining two relations is to use a hash-based join algorithm. Hash-based join algorithms typically have two phases, a partitioning phase and a partition joining phase. We describe how an optimal multi-attribute hash (MAH) indexing scheme can be used to reduce the average cost of the partitioning phase of any hash-based join algorithm, by eliminating the partitioning phase entirely for many of the most common join queries. We demonstrate that the technique can be extended to include multiple copies of the data file, each with a different organization of the MAH indexing scheme, and that this further reduces the average cost of performing the partitioning phase of the hash join algorithm. We describe a relatively inexpensive method for determining a good MAH indexing scheme. Our experiments show that the schemes found using this method are usually optimal and perform the partitioning phase of the hash join algorithm at least three times faster than the standard approach. We show that a significant change in the query pattern is required for a reorganization of the data file to be necessary, and show that reorganizing the data file is an inexpensive operation.

1 Introduction Much work has been done in constructing indexing schemes to cluster relations to increase the performance of partial match queries. The indexing technique we will consider is multi-attribute hash (MAH) indexing. Examples of this research include the work of Aho and Ullman [1], Lloyd [19], Moran [23] and Ramamohanarao, Shepherd and Sacks-Davis [28]. MAH indexing has been used in preference to indexing schemes such as B-trees because these schemes are primary key indexing schemes and do not perform well when multiple, non-primary, keys are required in an operation. Several other non-primary key methods of performing partial match queries have been studied. In addition to multi-attribute hashing, effective solutions include the grid file of Nievergelt, Hinterberger and Sevcik [25], the BANG file of Freeston [8], the multidimensional binary search tree (k-d tree) of Bentley [2], the multilevel grid file of Whang and Krishnamurthy [33] and the k-d-B-tree of Robinson [30]. The technique we will describe can be applied to many of these schemes. Only a small proportion of the retrieval-based operations of a database will be partial match queries. Other common operations include projection, join, division, intersection, union and difference. The join is one of the most common of these expensive operations. It typically involves reading and writing two relations more than once, the exact number of times depending on the join method used and the nature of the relations themselves. In this paper we concentrate on the join operation. However, in Section 9 we briefly discuss other operations, such as intersection and union, and how selection affects our scheme. The multi-attribute hash indexing scheme operates by forming a hash key for a record from hash values generated from the attributes of the record. Hash functions are applied to each attribute of a record, each returning a bit string.
Several bits are taken from each bit string to form the hash key, such that the total number of bits taken is equal to the number of bits required to index the hash file. For example, consider a relation with eight attributes and a file size of 2^16 pages. The hash key is composed of 16 bits taken from amongst the eight attributes. Two bits from each attribute could be used to make up the hash key. Conversely, all 16 bits could be allocated from one attribute. It is desirable that a data file be highly dynamic and thus be able to change size by orders of magnitude without significantly affecting the performance of the database system. This important property is easily met by a MAH indexing scheme, which has excellent dynamic properties, as described by Ramamohanarao and Lloyd [27]. As the cost of mass storage devices decreases, it is becoming feasible to store multiple copies of data, each with a different clustering organization. We will show that multiple copies of a data file can be used to increase the performance of the average join operation. This is similar to the use of multiple files to improve the performance of partial match retrieval presented by Ramamohanarao, Shepherd and Sacks-Davis [28]. In the next section we briefly describe multi-attribute hashing. In Section 3 we describe the join operation and the hash join algorithm. In Section 4 we analyse the cost of the partition phase of the hash join algorithm. In Section 5 we compare the performance of using an optimal indexing scheme with the hash join algorithm against the standard hash join algorithm. In Section 6 we describe various approaches to bit allocation, in Section 7 we present computational results which show how these techniques perform, and in Section 8 we determine how our method performs when the probability distribution of the joins changes.
In Section 9 we discuss how skew data distributions, selections, other database operations and data file reorganizations may be handled, and in the final section we present our conclusions. Appendix A describes the notation used in this paper.


2 Multi-attribute hashing A record consists of a number of fields, each of which may contain some data. An attribute is a field which may be specified in a query. One method used to index the records in a hash file is multi-attribute hash indexing, as described by Rivest [29]. In multi-attribute hashing a single hash key is calculated for each record in a data file from a number of attributes. If a record has n attributes, A_1, …, A_n, there are n corresponding hash functions, h_1, …, h_n, such that each hash function maps a value for an attribute A_i to a single bit string of length l_i in the form b^i_1 … b^i_{l_i}. The lengths of the bit strings of different attributes of a relation may be different. For example, if an attribute represented a binary value, such as a person's sex, it would have only one bit. Conversely, if an attribute was a character string, the hash value for that attribute may return a bit string with many bits in it. We define the constraining number of bits of an attribute to be the length of the bit string generated by the hash function for that attribute. A single hash key for a record is constructed by combining bits from the n bit strings. The hash key is used to define the page in the file in which the record is stored. The number of bits taken from the bit string of each attribute, d_i, must be between zero and its constraining number of bits. If the constraining number of bits of an attribute is used to make up the hash key for a record, we say that the attribute has been maximally allocated. The total number of bits taken from all bit strings must be equal to the length of the hash key required to identify each page in the data file. If the size of the data file is 2^d pages, each value of d_i must satisfy the inequality 0 ≤ d_i ≤ l_i, and the sum of the d_i's must satisfy the equation Σ^n_{i=1} d_i = d. The number of bits from the hash function of each attribute used to make up the hash key (the set of d_i's) is called the bit allocation.
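The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the per-attribute hash function and the sample bit allocation are placeholder assumptions.

```python
# Sketch of multi-attribute hash key construction. The per-attribute hash
# function below is a toy stand-in for the h_i of the paper.

def attribute_hash(value, length):
    # Reduce an attribute value to a bit string of `length` bits,
    # represented here as an integer.
    return hash(value) & ((1 << length) - 1)

def mah_key(record, lengths, allocation):
    # record:     tuple of attribute values
    # lengths:    constraining number of bits l_i for each attribute
    # allocation: bits d_i taken from each attribute, with sum equal to d
    key = 0
    for value, l_i, d_i in zip(record, lengths, allocation):
        bits = attribute_hash(value, l_i) & ((1 << d_i) - 1)  # d_i low-order bits
        key = (key << d_i) | bits
    return key  # page number in a data file of 2^d pages

# The example from the text: eight attributes, a file of 2^16 pages,
# two bits taken from each attribute.
key = mah_key(("a", "b", "c", "d", "e", "f", "g", "h"), [32] * 8, [2] * 8)
assert 0 <= key < 2 ** 16
```

A dynamic file would interleave the bits from the attributes rather than concatenating them as done here; the concatenation merely keeps the sketch short.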
The bit allocation, and the location of the bits within the hash key, determine various properties of the data file. If the file is required to be dynamic, the bits from each attribute are usually interleaved so that the optimal performance is achieved for each file size. This has been discussed by Lloyd and Ramamohanarao [20,27]. An advantage of the multi-attribute hash indexing scheme is that it is based upon a dynamic data structure whose maintenance costs are the same as those of the primary key hashing algorithm upon which it is based. Primary key hashing schemes which may be used include the linear hashing schemes of Larson [17] and Litwin [18], the extendible hashing scheme of Fagin, Nievergelt and Strong [7], the adaptive hashing of Hsiao and Tharp [12], and variations of these schemes, such as those suggested by Ramamohanarao and Lloyd [20,27]. A problem with MAH indexing occurs when attributes are correlated. All bits composing the hash key should be independent; if bits from the bit strings of correlated attributes are used to form the hash key for records, bits in the hash key will be correlated. This can result in an uneven distribution of records to pages in the data file, which we term a skew data distribution. Since records are stored in a hash file, skew data distributions can increase the access time for records if the hash functions do not compensate for them. The effect that skew data distributions have on our approach is discussed in Section 9.

3 Join operations and the hash join algorithm The join operation is the most frequently executed non-linear operation in relational database systems. Three classes of join algorithm have been described and analysed in the literature. These are the nested loop, sort-merge and hash join algorithms. Mishra and Eich have surveyed join processing [22]. Several different join algorithms based on hashing have been proposed. The first was the GRACE hash join algorithm of Kitsuregawa, Tanaka and Moto-oka [16]. DeWitt et al. have described a hybrid-hash join algorithm [6]. Both of these algorithms were shown to perform better than the sort-merge join algorithm. Since these two papers a number of variations on these algorithms have been proposed, such as the superjoin of Thom, Ramamohanarao and Naish [32]. Some of the problems of unequal bucket sizes have been addressed by Kitsuregawa, Nakayama

and Takagi [15,24]. In this paper we use the GRACE hash join algorithm as our standard hash join algorithm; however, the hybrid-hash join algorithm, or any of the other hash join algorithms, may equally be used on the resulting data files. The GRACE hash join algorithm works by partitioning the two relations being joined into a number of partitions, each of which can then be joined in memory. The partitioning of a relation is accomplished by making several passes over the data file. Initially, the data file consists of a single partition, the relation itself. During each pass a hash function is applied to the join attributes, splitting each input partition into multiple output partitions. If the resulting output partitions do not fit into main memory then the process is repeated on the output partitions. If large amounts of main memory are available, one pass is usually sufficient. If the size of main memory is B pages, files up to B^2 pages in size may be partitioned in only one pass. Once both relations have been partitioned, the matching partitions of each relation are joined. One partition from one relation is read entirely into memory and each record is inserted into a hash table. The matching partition from the other relation is then read in, one page at a time, and each record is joined with matching records in the hash table. The same hash function must be used on both relations. The purpose of a MAH index is to partition a data file. If two relations being joined use the same hash functions and the join attribute of each relation contributes bits to the hash key of its relation, then the partitioning implicit in the MAH index can be used as an implicit partitioning operation on each file. This can eliminate the need for the partitioning pass of the hash join algorithm. Consider the join operation P(A, B) ⋈_{B=C} Q(C, D). A simple join algorithm using MAH indexing to join the relations is given in Figure 1.
We assume that the number of bits allocated to the join attribute of relations P and Q are d_P and d_Q, and that the remaining bits in the indexes of P and Q are given by k_P and k_Q. The function partition partitions the relation on the attribute provided, increasing the number of bits provided to that attribute to the number given by the third parameter. The function input returns the pages from the specified relation matching the specified hash key. The hash key is composed of values for the hash function of each attribute concatenated by a dot. A `*' indicates that all values of the hash value for that attribute should be retrieved, and therefore multiple pages should be returned. The function memoryjoin joins sets of pages of memory and writes out the result. For the remainder of the paper we will use PSHJ to denote the partitioning phase of the standard hash join algorithm. We will also use PIHJ to denote the partitioning phase of our hash join algorithm, which uses a MAH index to reduce the amount of work done during the partitioning phase of the algorithm. The process of using a data structure to partition the data, producing a more efficient join algorithm, is a feature of the partitioned join described by Ozkarahan and Ouksel [26] and the superjoin described by Thom, Ramamohanarao and Naish [32]. The partitioned join is applied to a multidimensional data structure not unlike a multilevel version of the grid file of Nievergelt, Hinterberger and Sevcik [25]. The superjoin is applied to a multikey hash file. While similar to our approach, neither of these two methods attempts to optimize the performance of the average join query. We will show that a significant improvement in performance can be achieved by using an index which optimizes the performance of the average join query when compared with the standard method of performing a join query.
The costs described in the next section represent the cost of the partitioning phase of the hash join algorithm for one relation in terms of the number of disk pages read and written. We ignore the cost of the partition joining phase because the cost of this phase is fixed, depending only on the size of the input and output relations. In this phase each relation must be read once and the output of the join written once.
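As a concrete, if simplified, illustration of the two phases, the following Python sketch models disk partitions as in-memory lists; `nparts` plays the role of B − 1, and the sample records are invented for the example.

```python
# Simplified sketch of the GRACE hash join: partition both relations on the
# join attribute, then join matching partitions with an in-memory hash table.
# Lists stand in for disk partitions; `nparts` plays the role of B - 1.

def partition(relation, key, nparts):
    parts = [[] for _ in range(nparts)]
    for record in relation:
        parts[hash(key(record)) % nparts].append(record)
    return parts

def grace_join(r, s, rkey, skey, nparts=4):
    result = []
    for rpart, spart in zip(partition(r, rkey, nparts), partition(s, skey, nparts)):
        table = {}  # build phase: hash one partition in memory
        for rec in rpart:
            table.setdefault(rkey(rec), []).append(rec)
        for rec in spart:  # probe phase: join against the hash table
            for match in table.get(skey(rec), []):
                result.append(match + rec)
    return result

P = [(1, "a"), (2, "b"), (3, "c")]
Q = [(2, "x"), (3, "y"), (3, "z")]
out = grace_join(P, Q, rkey=lambda t: t[0], skey=lambda t: t[0])
assert sorted(out) == [(2, "b", 2, "x"), (3, "c", 3, "y"), (3, "c", 3, "z")]
```

The point of the MAH-indexed variant in Figure 1 is that, when the index already partitions on the join attribute, the `partition` calls above can be skipped or shortened.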


# An example implementation of the hash join algorithm, which uses a MAH index
# to reduce partitioning costs.
# For simplicity, we assume that P is the outer relation and Q is the inner relation
# and that d_P ≤ d_Q.
procedure join(P, Q, P.B, Q.C)
    if k_P ≤ ⌊lg(B − 2)⌋ then
        # Each initial partition of P fits into memory.
        d ← d_P
        P′ ← P
        Q′ ← Q
    else if k_P > ⌊lg(B − 2)⌋ and k_P + d_Q − d_P ≤ ⌊lg(B − 2)⌋ then
        # The partitions of P have to be extended.
        d ← d_Q
        P′ ← partition(P, P.B, d)
        Q′ ← Q
    else
        # The partitions of both relations have to be extended.
        d ← d_P + k_P − ⌊lg(B − 2)⌋
        P′ ← partition(P, P.B, d)
        Q′ ← partition(Q, Q.C, d)
    end if
    for index1 ← 0 to 2^d − 1 do                    # For each partition of P′.
        buffer ← input(P′, index1.*)
        for index2 ← 0 to 2^(k_Q + d_Q − d) − 1 do  # For each block of Q′ and partition of P′.
            block ← input(Q′, index1.index2)
            memoryjoin(buffer, block)
        end
    end
end procedure

Figure 1: A simple hash join algorithm implementation using a multi-attribute hash index.

4 Cost of the partition phase In this section we describe the cost of the partitioning phase of the hash join algorithm which uses a MAH index to reduce the cost of partitioning. We first describe the cost of the partitioning phase of the standard hash join algorithm for a data file, assuming the data file has no index. If B is the number of pages in the main memory buffer, and 2^d is the number of pages in the file being partitioned, the cost of partitioning a data file into files of size B − 2 (during the partition joining phase one page is reserved for input and one for output) is C = 2 · 2^d · ⌈log_{B−1}(2^d/(B − 2))⌉ disk pages transferred. That is, on each pass there will be two disk page transfers (one read and one write) for each of the 2^d pages in the data file. On each pass there can only be B − 1 output files because one page is reserved as an input buffer. Thus, there will be ⌈log_{B−1}(2^d/(B − 2))⌉ passes. The maximum size of the file which results in one pass is (B − 1)(B − 2), so that, if 2^d ≤ B − 2, then no partitioning is necessary. If the data file has a MAH index, then all of the bits in the index contributed by the attributes in the join operation can be used to partition the data. For example, if a relation P(A, B) is involved in a join based on attribute A, then there can be a separate partitioning phase for each distinct index value of A. If there are 2^{d_A} distinct index values for attribute A there will be 2^{d_A} partitions of size 2^{d−d_A}. A data file partitioned during the partitioning phase of the hash join algorithm must be compatible with a data file partitioned using a MAH index. During the partitioning phase, we try to partition the data so that the resulting partitions are the same size. We assume that the records are evenly distributed. Thus, the number of partitions constructed must be a power of 2. Therefore, instead of forming B − 1 partitions during each pass, we form 2^{⌊lg(B−1)⌋} partitions. Similarly, the final size of each partition will be 2^{⌊lg(B−2)⌋} pages instead of B − 2 pages. It follows that the ideal size for the main memory buffer is B = 2^i + 2, where i is an integer.

We will assume that this is the case. However, if more memory is available, it need not be wasted. For example, if the output of this join operation is to be used as the input for another join, the additional memory can be used to partition the output relation by having multiple output pages. Conversely, by using the split pointer approach of linear hash indexing, the number of partitions created need not be a power of two and all pages can be utilized to form different partitions. However, using this approach, the partitions would not be the same size and the cost function given below would need to be slightly modified. Let Q be the set of attributes used in a join operation q. Let β_i = ⌊lg(B − i)⌋ and B_i = 2^{β_i}, for i = 1, 2. The cost of partitioning a data file for a join operation q is

    C(q) = 2 · 2^d · ⌈ log_{B_1}( (∏_{i∉Q} 2^{d_i}) / B_2 ) ⌉
         = 2^{d+1} · ⌈ ( Σ_{i∉Q} d_i − β_2 ) / β_1 ⌉

If ⌊lg(B − 1)⌋ = ⌊lg(B − 2)⌋, then β_1 = β_2 and

    C(q) = 2^{d+1} · ⌈ (Σ_{i∉Q} d_i) / β_2 − 1 ⌉                    (1)
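Equation 1 can be evaluated directly. The sketch below assumes, as in the text, that ⌊lg(B − 1)⌋ = ⌊lg(B − 2)⌋ (which the recommended buffer sizes B = 2^i + 2 satisfy); the bit allocation and buffer size are invented for illustration.

```python
import math

# Evaluation of Equation 1, assuming floor(lg(B-1)) == floor(lg(B-2))
# so that beta_1 == beta_2.

def partition_cost(d, allocation, join_attrs, B):
    beta2 = math.floor(math.log2(B - 2))
    # Bits contributed by attributes *not* in the join determine the
    # residual partition size that still has to be split.
    residual = sum(di for i, di in enumerate(allocation) if i not in join_attrs)
    passes = max(0, math.ceil(residual / beta2 - 1))
    return 2 ** (d + 1) * passes  # disk pages read and written

# A 2^13-page file over five attributes with B - 2 = 64 buffer pages.
# Joining on attribute 0 (7 index bits) needs no partitioning at all;
# joining on attribute 1 (2 index bits) needs a single pass.
cost0 = partition_cost(d=13, allocation=[7, 2, 2, 1, 1], join_attrs={0}, B=66)
cost1 = partition_cost(d=13, allocation=[7, 2, 2, 1, 1], join_attrs={1}, B=66)
assert cost0 == 0 and cost1 == 2 ** 14
```

The `max(0, …)` guard covers the case where the residual partitions already fit into memory, so that no partitioning pass is required.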

The average cost of partitioning over all join operations is

    C_P = Σ_{i∈P} p_i C(i)                    (2)

where P is the set of all possible join operations involving the relation, and p_i is the probability of a join query being join i. To minimize the average partitioning cost we wish to find the values of d_i which minimize Equation 2 while maintaining the constraint that Σ_i d_i = d. The results obtained by doing this are presented in Section 5. In Section 7 we present the results of testing a number of methods which attempt to find optimal bit allocations in a reasonable period of time. If it is important that join operations are as fast as possible, one method which can be used to increase the performance of the partitioning phase of the join is to have multiple copies of the data file and index each copy with a different bit allocation scheme. During the partition phase, the copy which provides the lowest cost for that particular partitioning operation should be used. For the purposes of calculating the cost for operations on one relation it is not important whether the copies are all stored on one disk, or are spread over many disks. However, in practice, spreading all the data over many disks will improve the performance of the database system. In practice, having multiple copies of a data file does have the disadvantage that all insertion and update operations are duplicated across multiple files. However, if the performance of query operations, such as the join operation, is more important than that of insertion and update operations, applications may find this technique valuable, particularly if the improvement in query performance is great. If there are multiple copies of a data file, and the copy which will provide the best performance for each join operation is used for that join, the average cost of the partition phase of a join becomes

    C_P = Σ_{i∈P} p_i · min_{1≤j≤m} C_j(i)                    (3)

where m is the number of copies of the data file and C_j(i) is the cost of the partitioning phase for the i-th join operation using the j-th file copy, as given by Equation 1.
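Equations 2 and 3 are then a straightforward expectation over the query distribution. In the sketch below the query probabilities and per-copy costs are invented; `average_cost` stands in for C_P.

```python
# Average partitioning cost over a join-query distribution (Equations 2 and 3):
# with several file copies, each query uses the cheapest copy.

def average_cost(queries, copies):
    # queries: list of (probability, query_id) pairs
    # copies:  list of dicts mapping query_id -> partitioning cost C_j(i)
    return sum(p * min(copy[q] for copy in copies) for p, q in queries)

# Invented example: two copies whose bit allocations favour different
# join attributes. One copy leaves joins on A expensive; the other
# leaves joins on B expensive; together they cover everything.
queries = [(0.5, "AB"), (0.3, "A"), (0.2, "B")]
copy1 = {"AB": 0, "A": 16384, "B": 0}
copy2 = {"AB": 0, "A": 0, "B": 16384}
assert average_cost(queries, [copy1]) == 0.3 * 16384
assert average_cost(queries, [copy1, copy2]) == 0.0
```

With a single copy (Equation 2), the inner `min` degenerates to that copy's cost; adding the second copy drives the average partitioning cost to zero in this example, which mirrors the multi-copy results of Section 5.3.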

[Figure 2 consists of two bar charts, one for Distribution 1 and one for Distribution 2, comparing the partitioning cost (in disk pages, up to about 15000) of the minimization methods NONE and OPT; number of attributes = 5, B − 2 = 64.]

Figure 2: Optimal partitioning costs using the standard hash join algorithm versus the optimal bit allocation.

5 Results In this section we compare the performance of PIHJ, which utilizes an optimal MAH index, with the performance of PSHJ for the average join query. This will enable us to determine whether using an optimal MAH index to support join queries is worthwhile. In Section 6 we will describe techniques to find the optimal MAH index.

5.1 Computational results.

To produce the results, we generated a set of random probability distributions for join operations involving a relation. The number of attributes of each relation involved in join operations was varied between 2 and 7. That is, each possible combination of attributes was randomly assigned a probability. Some distributions were close to uniform; some were biased in favour of various attributes. A constraining number of bits was assigned to each attribute. Equation 2 was used to determine the average cost of the partition phase of all joins involving the relation. We assumed that the page size is 4 kbyte and that the size of each relation is 32 Mbyte, thus d was 13. The number of unreserved pages in the buffer, B − 2, ranged between 32 and 256. The costs are measured as the number of disk pages read or written. Twenty sets of results were generated. Figure 2 provides examples of the effectiveness of our method. OPT denotes the cost of the partitioning phase of the average join operation using the MAH index with the optimal bit allocation. NONE denotes the cost of the partitioning phase of the average join operation using the standard hash join algorithm, which does not use an index. The bit allocation which produces the optimal value of Equation 2 was determined by an exhaustive search of all possible bit allocations. In Section 6 we discuss how optimal, or near-optimal, bit allocations may be found without the need to examine each possible bit allocation. Distribution 1 in Figure 2 provides a typical example of the improvement achieved using the optimal bit allocation and MAH index to aid in partitioning the data file. That is, using our method we expect the partitioning phase of the average join to often be at least 10 times faster than using the standard hash join algorithm. Distribution 2 in Figure 2 provides an example of the smallest improvement achieved using the optimal bit allocation and MAH index.
That is, using our method we expect the partitioning phase of the average join to always be at least 3 times faster than using the standard hash join algorithm. The results also showed that the average cost of partitioning using the optimal bit allocation decreases as the amount of memory available for the main memory buffer increases. This is shown in Figure 3, in which only the cost of the partitioning phase for the optimal bit allocation is shown for various buffer sizes. The cost using the standard method also decreases as the buffer size increases. However, the cost of partitioning for each join is the same, therefore the average cost is

[Figure 3 plots the optimal partitioning cost (up to about 800 disk pages) against lg(B − 2), for lg(B − 2) between 5 and 8, with seven attributes.]

Figure 3: Optimal costs when the buffer size varies (Distribution 3).

            Distribution 1 : Cost Ratio      Distribution 2 : Cost Ratio
  Method    Calculated    Experimental       Calculated    Experimental
  OPT       1             1                  1             1
  NONE      38.1          39.2               10.6          11.4

Table 1: Experimental results.

fixed for a given buffer size. It will only decrease when the number of passes required to partition all joins decreases. Therefore, the average cost does not decrease for each increase in buffer size.

5.2 Experimental results.

We have seen that by using MAH and an optimal bit allocation we ought to be able to reduce the cost of the partition phase of the average join. We would now like to know if these improvements are possible in practice. We generated two distributions and timed both the PSHJ operation and the PIHJ operation, which used the optimal MAH index to reduce the amount of partitioning involved. The experiments were performed on an unloaded Sun Sparc2 IPX with 28 Mbyte of main memory. Elapsed time was recorded. The disk page size used was 56 kbyte because the Sun extent-based file system uses this as the unit of transfer to the disk [21]. The size of each file was 28 Mbyte, so d was 9, and 64 pages (3.5 Mbyte) of main memory were used. Table 1 shows the results for both relations. The cost ratio is calculated as the ratio of the time taken by the average join using each method to the time taken by the average join using the optimal bit allocation. The experimentally obtained times are presented with the ratio of the expected costs, which were calculated using Equation 2. These results show that the ratios of the expected costs do correspond to the ratios of the costs achieved in practice. In conclusion, both the calculated and experimental results have shown that using the optimal bit allocation can result in a significant increase in performance. Figure 2 and Table 1 have shown that an increase in performance of at least ten times over the standard hash join algorithm is achievable. Thus, from a performance perspective, we believe it is worthwhile to use our method.

[Figure 4 plots the optimized partitioning cost (up to about 1500 disk pages) against the number of file copies, from one to four; number of attributes = 5, B − 2 = 32.]

Figure 4: Optimized costs when the number of file copies varies (Dist. 1).

5.3 Computational results using multiple copies of a data file

In Section 4 we described how multiple copies of data files may be used to reduce the cost of the average join query. We now describe computational results which help determine the degree of improvement possible using multiple file copies. When multiple copies of a data file are involved, it is impractical to exhaustively search all possible bit allocations to find the optimal bit allocation. In Sections 6 and 7 we describe and compare several algorithms for finding a good bit allocation which is optimal or near-optimal. The results for multiple copies of the data file were produced in the same way as the results in Section 5.1, except that the bit allocation used was the best one found by these algorithms and may not have been optimal. Figure 4 provides an example of the improvement in performance which can be achieved using multiple file copies. The results show that using a second file results in a significant increase in performance. The smallest improvement in performance we observed was a factor of about three. When more files or a larger main memory buffer is used, a greater improvement in performance is achieved. When a large main memory buffer and more than two files are used, the cost of partitioning becomes effectively zero, even when seven attributes are involved. For our tests, the improvement in cost between three and four copies was so small that using the fourth copy was not cost effective. However, if a very large number of attributes were involved, using four or more copies may become cost effective.

6 Finding optimal bit allocations for MAH indexes If the probability distribution of join queries involving a relation is stable, an optimal bit allocation will only need to be found once for each relation, when its data file is being created. In Section 8 we discuss the effect of a changing probability distribution. As the bit allocation only needs to be found once, finding an optimal bit allocation is more important than the time taken to find it. However, it is still important to find the bit allocation as quickly as possible. It is easy to see that for a reasonably sized file with a reasonable number of attributes, there are a large number of possible indexes which could be created. For large databases it is computationally infeasible to calculate the average cost of all of the possible combinations in a reasonable amount

of time. We would like a procedure which enables us to determine a good index in a reasonable amount of time, without attempting to find the average cost of a large proportion of the potential indexes. This index should be as close to optimal as possible. To this end, we tested a number of approaches for finding a good bit allocation. The simplest algorithms derive bit allocations directly from a probability distribution or the number of attributes in a relation and are very fast. The more complex methods have been successful in the past in a number of similar problem domains.

6.1 Naive algorithms

The following naive algorithms were tested to see if they generate an optimal bit allocation. While we did not expect that any of these would find the optimal bit allocation for all distributions, they are fast to calculate and could potentially provide an adequate level of optimization. Additionally, some of these are the standard algorithms used by schemes proposed in other papers.

EVEN: An equal number of bits are allocated to all attributes, with excess bits going to the first attributes occurring in a relation. This is the simplest method and does not depend on the query probability distribution. It is also the standard method suggested, and used, to divide a data space between attributes for many different data structures. These include the grid file of Nievergelt, Hinterberger and Sevcik [25], and the multidimensional binary search trees (k-d trees) of Bentley [2,3], in addition to files with multi-attribute hash indexes.

HIGH: All bits in the hash key are allocated to the attribute with the highest probability of appearing in a join operation. If this number of bits is greater than the constraining number of bits for that attribute, then the constraining number of bits are allocated to that attribute and the remaining bits are allocated to the attribute with the second highest probability of appearing in a join operation.

PROB1: Up to d − ⌊lg(B − 2)⌋ bits (or its constraining number of bits, whichever is lower) are allocated to the attribute with the highest probability of appearing in a join operation. The remaining bits are then allocated to the next most probable attribute, up to a maximum of d − ⌊lg(B − 2)⌋ bits (or its constraining number of bits), then the next attribute, and so on. This is a simple extension of the HIGH algorithm which utilizes the observation that no partitioning is required for joins involving a given attribute if d − ⌊lg(B − 2)⌋ bits are allocated from that attribute. Therefore, allocating more than d − ⌊lg(B − 2)⌋ bits is wasteful.
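PROB1 can be sketched as a single greedy pass over the attributes in decreasing order of join probability. The probabilities and constraining bits below are invented for illustration.

```python
import math

# Sketch of the PROB1 heuristic: give each attribute, in decreasing order of
# join probability, up to min(constraint, d - floor(lg(B-2))) bits, until the
# d key bits are exhausted.

def prob1(probs, constraints, d, B):
    cap = d - math.floor(math.log2(B - 2))  # more bits than this is wasteful
    alloc = [0] * len(probs)
    remaining = d
    for i in sorted(range(len(probs)), key=lambda i: -probs[i]):
        alloc[i] = min(cap, constraints[i], remaining)
        remaining -= alloc[i]
        if remaining == 0:
            break
    return alloc

# d = 13 key bits and B - 2 = 64 buffer pages, so at most 13 - 6 = 7 bits
# are useful per attribute; the most probable attribute gets 7, the next 6.
alloc = prob1(probs=[0.5, 0.3, 0.2], constraints=[32, 32, 32], d=13, B=66)
assert alloc == [7, 6, 0]
```

PROB2 differs only in that, after each attribute is served, the join queries it covers are removed before the probabilities are re-ranked; the loop structure is the same.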

Theorem 1 A bit allocation produced by PROB1 will be the optimal allocation if all joins are on one attribute only, each join requires at most one pass to partition the data, and the constraining number of bits for each attribute is at least d − ⌊lg(B − 2)⌋.

Proof. Consider a bit allocation, satisfying PROB1, for a probability distribution matching the constraints specified above. Let query i be the join query containing attribute i. As each join requires at most one pass to partition the data, the cost C(i) may be given by

    C(i) = 2^{d+1}   if d_i < d − ⌊lg(B − 2)⌋
         = 0         if d_i ≥ d − ⌊lg(B − 2)⌋

Let A be the set of attributes for which C(i) = 0 for each attribute i ∈ A. As d − ⌊lg(B − 2)⌋ bits are required to be given to attribute i for C(i) = 0, the maximum number of attributes for which C(i) may be zero is |A|. Therefore,

    |A| = min(n, ⌊d / (d − ⌊lg(B − 2)⌋)⌋)   if ⌊lg(B − 2)⌋ < d
        = n                                  if ⌊lg(B − 2)⌋ ≥ d

Assume that there is another bit allocation, with a set A′ of attributes with C(i) = 0, which has a lower cost than the bit allocation generated by PROB1, where |A′| ≤ |A|. The new average partition cost may be given by

    C′_P = C_P + Σ_{i∈A} p_i 2^{d+1} − Σ_{i∈A′} p_i 2^{d+1}
         = C_P + 2^{d+1} ( Σ_{i∈A} p_i − Σ_{i∈A′} p_i ).

For the new bit allocation to have a lower cost than that provided by PROB1, we need Σ_{i∈A′} p_i > Σ_{i∈A} p_i. However, the definition of PROB1 guarantees that Σ_{i∈A} p_i ≥ Σ_{i∈A′} p_i if |A′| ≤ |A|. Therefore, the bit allocation derived from PROB1 is optimal.

PROB2: Like PROB1, up to d − ⌊lg(B − 2)⌋ bits (or its constraining number of bits, whichever

is lower) are allocated to the attribute with the highest probability of appearing in a join operation. The join queries involving this attribute are now eliminated. Up to d ?blg(B ? 2)c bits are now allocated to the attribute with the highest probability of appearing in the remaining join operations. This process is repeated until all bits have been allocated. If all joins are based on one attribute only this method is identical to PROB1. Thus, the result of this method will be the optimal allocation if the conditions of Theorem 1 are satis ed.

6.2 Minimal marginal increase

Minimal marginal increase (MMI) was used successfully by Lloyd and Ramamohanarao [20] to determine the optimal bit allocation for partial match queries when the attributes are not independent. It has subsequently been used for similar problems; for example, it was used independently by both Chen, Chang and Lee [5] and Harris and Ramamohanarao [11] for determining the optimal bit allocation for partial match range queries.

The method of minimal marginal increase commences with no bits allocated to any of the attributes. It works in d steps, where d is the number of bits in the hash key of the relation. At each step a bit is allocated to a single attribute. To determine which attribute should receive the bit, the bit is tentatively allocated to each attribute in turn and the value of the cost function to be minimized, Equation 2, is calculated. The bit is permanently allocated to the attribute which resulted in the smallest value of the cost function. This process can be extended to include multiple relations or multiple files: at each step the bit is added to the attribute, across all relations or files, which minimizes the value of the cost function.

Previous results using the MMI algorithm have been mixed. Lloyd and Ramamohanarao [20] showed that in their domain MMI would find the optimal solution if the probabilities were independent, and would give very good, usually optimal, solutions for arbitrary probabilities. In other domains, such as determining the optimal bit allocation for partial match queries using multiple files of Ramamohanarao, Shepherd and Sacks-Davis [28], MMI has not found near-optimal solutions.
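The greedy step can be sketched as follows. The cost function here is a stand-in for Equation 2 (any callable mapping an allocation to its average partitioning cost would do); the toy cost and its parameters are illustrative only.

```python
def mmi(d, n, cost):
    """Minimal marginal increase: starting from no bits, allocate the d
    hash-key bits one at a time, each time to the attribute whose extra
    bit gives the smallest value of the cost function."""
    alloc = [0] * n
    for _ in range(d):
        best = min(range(n),
                   key=lambda i: cost(alloc[:i] + [alloc[i] + 1] + alloc[i+1:]))
        alloc[best] += 1
    return alloc

# Illustrative stand-in cost: queries on attribute i (probability p_i) get
# cheaper as attribute i receives more bits.
probs = [0.7, 0.3]
cost = lambda a: sum(p * 2.0 ** -b for p, b in zip(probs, a))
print(mmi(4, 2, cost))  # -> [3, 1]
```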

6.3 Simulated annealing

Simulated annealing (SA) has been used in, or examined for use in, a number of similar problem domains. These include query optimization, by Ioannidis and Wong [14]; join query optimization, by Swami [31] and Ioannidis and Kang [13]; bit allocation for partial match range queries, by Harris and Ramamohanarao [11]; and bit allocation for partial match queries using multiple files, by Ramamohanarao, Shepherd and Sacks-Davis [28].

The method of simulated annealing works by performing T trials and returning the bit allocation which results in the minimal value for the cost function from amongst these trials. Each trial operates as follows. Initially, a random bit allocation is generated and the value of the cost function, Equation 2, is calculated for that bit allocation. The next phase involves S iterations in which bits are perturbed, commencing with the initial random bit allocation. In each iteration the bit allocation is perturbed by decrementing the number of bits allocated to one randomly selected attribute and incrementing the number of bits allocated to another randomly selected attribute. The value of the cost function for this new bit allocation is then calculated. If this value is less than the best value of the cost function for this trial, or if the value of a cooling function is true, this bit allocation is used as the basis of the next iteration; otherwise, the previous bit allocation is used as the basis of the next iteration. In addition to the maximum number of iterations, S, the number of iterations is constrained by a limit, L, which specifies the maximum number of iterations permitted without finding a bit allocation with a better value for the cost function. That is, there may be at most L consecutive iterations using the same bit allocation as a basis. If this number is exceeded the trial ceases, returning the best bit allocation found during the trial. The cooling function tests whether a randomly generated number is less than the value of an inverse exponential function, with the iteration number as a parameter. The purpose of the cooling function is to allow the algorithm to accept bit allocations with worse costs in early iterations, but only accept bit allocations with better costs in later iterations. This allows the simulated annealing algorithm to escape from local minima early on and continue searching for the global minimum.

Like MMI, this algorithm can be extended to include multiple relations or multiple files. At the beginning of each trial a random bit allocation is generated for each relation or file. During each trial a random relation or file is chosen prior to choosing the two attributes whose bits are perturbed; both attributes must be from the same relation or file. Simulated annealing, although more expensive, is a more robust algorithm for finding optimal solutions than MMI. Since determining an optimal solution is rarely required (only when the file is first created), it is often an appropriate method to use.

One of the disadvantages of the simulated annealing algorithm is that the bit allocation it finds is only optimized for a given file size (number of bits). If the file size were to double (increase by one bit), a new bit allocation determined by simulated annealing might allocate a different number of bits to every attribute, which would mean that the file would have to be reorganized. While it may be acceptable to do this periodically, it would be better not to. Instead of calculating a new bit allocation and reorganizing the data file, MMI can be used to allocate additional bits on top of an initial bit allocation determined using simulated annealing. By working incrementally, the MMI algorithm provides an optimized bit allocation for all file sizes from 2^1 to 2^d; the simulated annealing algorithm does not have this property, instead determining a bit allocation for one file size. If the file is required to grow above 2^d pages, MMI can be used to determine which attribute the next bit should be allocated to. By applying the MMI algorithm in reverse (that is, maximal marginal decrease), starting with the optimized allocation of d bits and successively removing a bit, an optimized bit allocation for files of size 2^d down to 2^1 can be achieved.
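The trial structure described above can be sketched as follows. This is a simplified sketch: the cooling-schedule constant and the exact acceptance bookkeeping are placeholders, not the settings used in the experiments.

```python
import math
import random

def simulated_annealing(d, n, cost, T, S, L):
    """T trials; each starts from a random allocation of the d hash-key bits
    over n attributes and perturbs it up to S times (moving one bit between
    two randomly chosen attributes), accepting worse allocations early via
    an inverse-exponential cooling test. A trial also stops after L
    consecutive iterations without accepting a new allocation."""
    best, best_cost = None, float("inf")
    for _ in range(T):
        alloc = [0] * n
        for _ in range(d):                  # random initial allocation
            alloc[random.randrange(n)] += 1
        cur, cur_cost = alloc, cost(alloc)
        stale = 0
        for s in range(S):
            i, j = random.randrange(n), random.randrange(n)
            if i == j or cur[i] == 0:
                continue
            cand = cur[:]
            cand[i] -= 1                    # perturb: move one bit from i to j
            cand[j] += 1
            cand_cost = cost(cand)
            # accept if better, or if the cooling test fires (more likely early)
            if cand_cost < cur_cost or random.random() < math.exp(-s / 50.0):
                cur, cur_cost = cand, cand_cost
                stale = 0
            else:
                stale += 1
                if stale > L:
                    break
        if cur_cost < best_cost:
            best, best_cost = cur[:], cur_cost
    return best, best_cost
```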

6.4 Multiple copies of a data le: naive algorithms

While the simulated annealing algorithm described in the last section may be used, with minimal changes, to try to find an optimal bit allocation for multiple file copies, the naive algorithms of Section 6.1 cannot be applied without change. The naive algorithms we tested with multiple file copies operate as follows.

PROB1: This works in a similar way to the single copy version of PROB1. The attribute with the highest probability of appearing in a join operation has d − ⌊lg(B − 2)⌋ bits (or its constraining number of bits, whichever is lower) allocated to it in the first copy. The attribute with the second highest probability of appearing in a join operation has d − ⌊lg(B − 2)⌋

Algorithm   T     S     L
SA1         10    1000  100
SA2         100   100   100
SA3         500   100   100
SA4         50    1000  100
SA5         100   500   100
SEED        1     1000  100

Table 2: Values of simulated annealing constants.

bits allocated to it in the second copy. This process continues until the m most probable attributes have d − ⌊lg(B − 2)⌋ bits allocated to them, one for each of the m copies. The remaining bits, up to d − ⌊lg(B − 2)⌋ bits per attribute, of each copy are then allocated to the next most probable attributes in turn. This process continues until all bits are allocated in all copies.

PROB2: This works in a similar way to the single copy version of PROB2. It operates in the same way as the multiple copy PROB1, except that the joins involving attributes with bits already allocated to them in one copy are removed when calculating the most probable attribute.

EVEN1: One copy contains an equal number of bits allocated to each attribute. The remaining copies have bits allocated using exactly the same process as PROB1.

EVEN2: One copy contains an equal number of bits allocated to each attribute. The remaining copies have bits allocated using exactly the same process as PROB2.

We know that if d − ⌊lg(B − 2)⌋ bits are allocated to each attribute, the cost of the partitioning phase of every join query will be zero; this is an optimal bit allocation. As the number of copies of the data file with different bit allocations increases, the likelihood of this arrangement being possible increases. Thus, as the number of files approaches the number of attributes, the cost of the optimal bit allocation approaches zero.
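A sketch of the multiple-copy PROB1 rule follows. It is ours, not the exact implementation tested: it ignores constraining bit counts and assumes d > ⌊lg(B − 2)⌋, and it distributes the per-attribute blocks over the copies in round-robin fashion.

```python
import math

def prob1_copies(d, probs, m, B):
    """Multiple-copy PROB1 (sketch): visit attributes in decreasing order of
    join probability, placing a block of up to d - floor(lg(B-2)) bits for
    each attribute in the copies round-robin until every copy is full."""
    cap = d - math.floor(math.log2(B - 2))
    n = len(probs)
    order = sorted(range(n), key=lambda i: -probs[i])
    copies = [[0] * n for _ in range(m)]
    free = [d] * m
    c, k = 0, 0
    while any(free):
        while free[c] == 0:          # skip copies that are already full
            c = (c + 1) % m
        i = order[k % n]
        take = min(cap, free[c])
        copies[c][i] += take
        free[c] -= take
        c, k = (c + 1) % m, k + 1
    return copies

# d = 12, B - 2 = 64 gives 6-bit blocks; two copies cover four attributes,
# so every single-attribute join needs no partitioning in some copy.
print(prob1_copies(12, [0.4, 0.3, 0.2, 0.1], 2, 66))
# -> [[6, 0, 6, 0], [0, 6, 0, 6]]
```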

7 Computational results

We produced a set of computational results in an attempt to answer the following questions:

1. Which bit allocation method provides the best bit allocation?
2. Which bit allocation method provides the most cost effective bit allocation? That is, which provides the best bit allocation when the time taken to determine it is taken into account?
3. Is the best bit allocation the optimal bit allocation?

7.1 One data file

To produce the results, the same set of random probability distributions was used under the same conditions as in Section 5.1. Three versions of the simulated annealing algorithm were tested; each had different values of T, S and L, and they appear as SA1, SA2 and SA3 in Table 2. The sets of constants SA4 and SA5 from Table 2 were only required when multiple copies of the data file were tested; those results are presented in the next section. The first three sets of constants are the same as those used in the past by Ramamohanarao, Shepherd and Sacks-Davis [28] and Harris and Ramamohanarao [11]. In addition, a hybrid method, SEED, was tested. SEED is a single simulated annealing trial with the initial bit allocation set to the best of the bit allocations returned by HIGH, PROB1 and EVEN.

[Figure 5: Performance of bit allocation algorithms (Dist. 1). Number of attributes = 5, B − 2 = 128.]

The twenty sets of results were generated on a Silicon Graphics 4D/340 and the costs are measured as the number of disk pages read or written. Figures 5 and 6 contain examples of the results. The results show that the simulated annealing algorithms SA1, SA2, SA3 and SEED usually find the optimal bit allocation, while the other algorithms often do not. The algorithms SA1 and SEED failed to find the optimal bit allocation on one and two occasions (out of twenty) respectively; Figure 5 shows one of these. However, on these three occasions the bit allocation was nearly optimal. None of the MMI, HIGH, EVEN, PROB1 or PROB2 algorithms consistently found the optimal bit allocation.

The relative performance of HIGH, PROB1 and PROB2 was consistent. Algorithm PROB1 always had a lower cost than HIGH, occasionally significantly lower; both of the figures demonstrate this. The cost of the bit allocation generated by algorithm PROB2 was usually the same as that of PROB1. On two occasions PROB2 had a lower cost than PROB1, and on two other occasions PROB2 had a higher cost. The occasions on which it had a higher cost occurred when the constraining number of bits of the most probable attribute was less than d − ⌊lg(B − 2)⌋. This meant that the queries involving that attribute were incorrectly assumed to cost zero when the second most probable attribute was chosen. The relative performance of EVEN and PROB1 varied. There were occasions on which one was clearly superior to the other, as in Figure 5. However, there are distributions for which both EVEN and PROB1 are significantly worse than optimal, as in Figure 6. From this we conclude that we cannot use these naive algorithms alone to determine an optimal bit allocation.

We have now answered the first and third questions posed at the start of this section. To answer the second question we must examine the times taken by each algorithm.
Figure 7 shows the time taken by the algorithms for a distribution with seven attributes. It shows that the time taken to maximise the chances of finding the optimal bit allocation, by using SA2 or SA3, is high. For example, for seven attributes it is faster to exhaustively search all possible bit allocations than to use SA3. Similarly, for five attributes an exhaustive search is faster than both SA2 and SA3, but not SA1. This is because it is possible for the multiple simulated annealing trials to test the same bit allocations a number of times while trying to find an optimal solution.

[Figure 6: Performance of bit allocation algorithms (Dist. 2). Number of attributes = 5, B − 2 = 64.]

[Figure 7: Time taken by bit allocation algorithms (Dist. 3). Number of attributes = 7, B − 2 = 64.]

[Figure 8: Performance of bit allocation algorithms (Dist. 3). Number of attributes = 7, copies = 2, B − 2 = 64.]

As the number of attributes increases, the time taken to exhaustively search all bit allocations increases dramatically, while the time taken by the simulated annealing algorithms does not increase by the same degree. The time taken by the MMI and PROB2 algorithms is much less than that of the three simulated annealing algorithms, SA1, SA2 and SA3; however, the bit allocations they find often have a higher cost than those of the simulated annealing algorithms.

As we stated in Section 6, an optimal bit allocation only needs to be found once. Therefore, allowing a large amount of time to find an optimal bit allocation, by using SA2 or SA3, may be acceptable. The most cost effective algorithm to use is SEED: simulated annealing seeded with the better of the two allocations PROB1 and EVEN. Even when the time taken to calculate PROB1 is taken into account, SEED takes very little time to calculate and determines a bit allocation which is optimal or near-optimal.

7.2 Multiple copies of data files

We produced a set of computational results for multiple files to help answer the first of the two questions above. As we described in Section 5, it is impractical to calculate the optimal bit allocation using an exhaustive search for multiple file copies. Hence, we cannot determine whether the best bit allocation algorithm finds the optimal bit allocation. The results were produced in the same way as the results of the single file version in the previous section. In addition to the four naive algorithms described in Section 6, the simulated annealing algorithm was tested with five different sets of the parameters T, S and L; these are shown as SA1 to SA5 in Table 2. A multiple copy version of the SEED algorithm, which uses the best of the four naive algorithms as its first bit allocation, was also tested.

Figures 8 and 9 provide examples of the performance of the bit allocation algorithms. How close the simulated annealing algorithms came to an optimal bit allocation depended on the complexity of the problem (the number of copies and the number of attributes involved) and the size of the main memory buffer. In general, the greater the number of attributes and the greater the number of copies, the higher the cost produced by the faster algorithms (for example, SA1 and SA4). Similarly, as the size of the main memory buffer increased, the percentage difference between the best and worst bit allocations produced by the simulated annealing algorithms increased. The percentage difference between the average cost of each bit allocation and the best one varied tremendously: from 1% to 1400%, and even to not finding an optimal bit allocation whose average cost was zero, as in Figure 9.

[Figure 9: Performance of bit allocation algorithms (Dist. 2). Number of attributes = 5, copies = 3, B − 2 = 128.]

The performance of the naive algorithms also varied tremendously. The two PROB algorithms typically achieved a lower cost than the EVEN algorithms, often several times lower, as in Figure 9. However, there were distributions for which both EVEN algorithms had lower costs than the PROB algorithms, as in Figure 8. For distributions in which the optimal cost was zero, both PROB algorithms always found an optimal bit allocation, as in Figure 9. For the distributions in which the optimal cost was non-zero, PROB2 always had a cost at least as low as PROB1, as in Figure 8; on several occasions the cost was much lower. On five occasions (out of eighty), PROB2 found a bit allocation with a lower cost than the best of the five simulated annealing algorithms. On other occasions, the simulated annealing algorithms produced bit allocations with a significantly lower cost than all of the naive algorithms, including the seeded simulated annealing algorithm. The seeded simulated annealing algorithm did not usually find a lower cost than the algorithms it was seeded with; on only two occasions did it find a better bit allocation than the one it was seeded with, one of which is shown in Figure 8.

The time taken by each of the simulated annealing algorithms is shown in Figures 10 and 11. Figure 10 demonstrates the relative time taken by each of the algorithms in comparison with each other. Note that the time taken by the seeded simulated annealing algorithm includes the time taken by all of the naive algorithms.
Figure 11 demonstrates that the time taken by each simulated annealing algorithm increases linearly as the number of file copies increases.

7.2.1 Bit allocation design strategy

To maximise the chances of finding an optimal bit allocation for multiple files, we recommend using a simulated annealing trial seeded with the better of the PROB2 and EVEN2 naive algorithms, in conjunction with another simulated annealing algorithm. Distributions which have a zero cost are likely to be found very quickly by the naive algorithms. We recommend choosing one of the simulated annealing algorithms which tests many bit allocations, such as SA3, as the second algorithm. However, if this is not deemed cost effective, one of the faster algorithms could be used.

The times taken by the algorithms demonstrate that a near-optimal bit allocation may be found in a feasible amount of time for multiple file copies, remembering that the bit allocation is only found once, when the data file is built. As the performance improvement is very high, reducing the partitioning cost to almost nothing in some cases, we believe that using multiple copies of data files is worthwhile in situations in which the additional cost of maintaining the multiple copies is acceptable.

[Figure 10: Time taken by bit allocation algorithms for multiple file copies (Distribution 1). Number of attributes = 5, copies = 2, B − 2 = 32.]

[Figure 11: Time taken by bit allocation algorithms as the number of file copies varies (Distribution 1). Number of attributes = 5, B − 2 = 32.]

8 Change of join query probability distributions

In the previous sections of this paper we have assumed that the probability of each combination of attributes being in a join is known. This is necessary to determine the bit allocation which minimizes the cost of the partitioning phase of the join algorithm. In practice, the probability will not be known exactly, or it will change over time. To determine how our method performs when the probability distribution changes, we obtained another set of computational results in which the optimal bit allocations already obtained were tested using versions of the original distributions with the probabilities changed. The probabilities were changed from their original values using the formula

    p′_i = p_i ((1 − s/100) + (2s/100) · random())

where s is the percentage change. For example, an original probability of 0.1 and a change of 40% would be randomly changed to a value between 0.06 and 0.14. All the probabilities are then normalised so that their sum is one. In order to study the robustness of the original optimal solution, computational results were obtained for values of s of 10, 20, 40 and 80.

Assume that we have an original probability distribution P with an optimal bit allocation A. We create a changed probability distribution P′ using the technique described above; the changed probability distribution has an optimal bit allocation A′. Let C(P, A) be the average partitioning cost using bit allocation A with probability distribution P. Our results report the cost ratio C(P′, A)/C(P′, A′). The cost C(P′, A′) is the cost of the optimal bit allocation of the changed probability distribution. The cost C(P′, A) is the cost of the optimal bit allocation of the original probability distribution applied to the changed probability distribution, and may not be optimal for the changed distribution. Examples of the results are shown in Figures 12 and 13.
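The distortion step can be written directly from the formula. This is a sketch; `random.random()` plays the role of random() in the text.

```python
import random

def perturb(probs, s):
    """Move each probability to a random point within +/- s percent of its
    original value, then renormalise so the probabilities sum to one."""
    changed = [p * ((1 - s / 100) + (2 * s / 100) * random.random())
               for p in probs]
    total = sum(changed)
    return [p / total for p in changed]

# With s = 40, an original probability of 0.1 first moves to somewhere
# between 0.06 and 0.14, before the normalisation step.
```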
Each of the cost ratios was determined using optimal bit allocations. The results indicate that for distributions distorted by up to 80%, the optimal bit allocation for the original distribution performs as well, or nearly as well, as the optimal bit allocation for the changed distribution. When the variation in the distribution was up to 20%, the cost difference varied between 0% and 0.5%, with most distributions having no cost difference. When the variation in the distribution was up to 80%, the cost difference varied between 0% and 16.9%, again with the majority of distributions having no difference. As the buffer size increases, the maximum cost difference decreases for any given skew value. As the number of attributes decreases, the maximum cost difference also decreases for a given skew value. Figure 12 is typical of the situation in which the buffer size is large with a small number of attributes; Figure 13 is typical when the buffer size is small with a larger number of attributes.

The fact that there is no significant increase in the costs, even when the probability distribution is changed by up to 80%, might imply that the optimal bit allocation does not depend on the probability distribution. To disprove this, we tested the original optimal bit allocations with a new random probability distribution in which the probability of a query was inversely proportional to the number of attributes involved in that query. This random distribution is shown as RANDOM in Figures 12 and 13. When the distribution is random the cost ratio can be large, as in Figure 12, or small, as in Figure 13. Thus, the original

[Figure 12: Cost ratio for changed distributions (Distribution 1). Number of attributes = 5, B − 2 = 256.]

[Figure 13: Cost ratio for changed distributions (Distribution 3). Number of attributes = 7, B − 2 = 32.]

optimal bit allocations do not provide optimal bit allocations for all distributions. We have shown that once an optimal index has been determined for a data file, a reorganization of the file is rarely required: it is necessary only when the probability distribution changes significantly.

9 Other issues

Some of the questions which may be asked when considering whether or not to use this method are:

1. How can skewed data distributions be handled in a hash indexed environment?
2. What effect do selections have on the performance of a join using the optimal index?
3. Can this method be applied to other relational operations?
4. If required, how expensive is a data file reorganization?

9.1 Skewed data distributions

One of the problems encountered when using hash functions in indexing is that of a skewed distribution of records to pages. In analysing the costs in the previous sections, we have assumed that there is a one-to-one relationship between the pages specified by a multi-attribute index and the number of pages containing records. We now examine what happens when this assumption does not hold. While most hash join methods assume that the distribution of records to buffers is even, and thus do not perform as well under a skewed distribution, a couple of methods have been proposed which do not make this assumption. These include the work of Kitsuregawa, Nakayama and Takagi [15,24], who show that their methods perform better than the hybrid hash join of DeWitt, et al. [6] for skewed distributions. Any of these methods may be used in place of the GRACE hash join strategy described above without affecting our method of determining the optimal bit allocation.

Our method attempts to determine an optimal bit allocation for supporting the hash join method. It does this by attempting to ensure that the partitioning phase is not required for many of the most probable join operations. Instead, each index-partitioned part of the relation is joined using another join algorithm, such as the nested loop, so the cost of the partitioning phase is zero. This does not preclude using any other join method which may be faster than the nested loop for the data in each prepartitioned relation. Several authors, including Bratbergsengen [4] and Nakayama, Kitsuregawa and Takagi [24], have reported that the hash-based variant of the nested loop method is the fastest possible for small multiples of the memory buffer size (typically up to four or five times). However, if the distribution of data is so skewed that more than that amount is in the partition to be joined, one of the other join methods may be used, possibly only in a single partition.

In calculating the cost of the hash join operation we have used an approximation of the cost on one relation. This is because we cannot provide the exact cost without knowing the details of the other relation. For example, if the whole of one relation can be contained in the memory buffer, we do not need to partition the other relation at all; a single pass over both is all that is required. Thus, the approximation provides an upper bound on the cost of performing each join operation. While skewed data distributions will affect the magnitude of our results, the relationships between them will not be altered. That is, the average cost using an optimal bit allocation will remain lower than that of the standard hash join algorithm.

9.2 Select-join operations

A common sequence of relational operations is to join using the result of a selection or, equivalently, to select on the result of a join. In practice, these operations can be executed together as a single operation.

In this situation, the multi-attribute hash index can be used in the same way that it is for partial match retrieval queries. That is, the index can be used to reduce the number of pages which are examined when performing the join operation, provided the attributes selected on have some bits in the index. The selection can then take place within the matching pages the first time they are read, which will usually be during the partitioning phase.

Equation 1 does not represent the cost of performing the select-join operation. Thus, an optimal bit allocation determined using Equation 1 may not be optimal once selection operations are taken into account. A solution to this problem is to extend Equation 1 to include the selection operation. For a data file with a MAH index and a select-join query, all of the bits in the hash key contributed by the attributes associated with the selection may be used to reduce the size of the data file which must be joined. The attributes associated with the join can then be used to partition the remaining file. For example, if the relation P(A, B, C) is involved in a selection on attribute B and a join on attribute A, then the partitioning phase should be performed for each separate index value of A such that the index value of B matches the selection. If there are 2^{d_A} and 2^{d_B} distinct values in the index for attributes A and B respectively, there will be 2^{d_A} partitions of size 2^{d − d_A − d_B} involved in the join. The cost of selecting and partitioning a data file for a join operation q becomes

    C(q) = 2 (2^d / ∏_{i∈S} 2^{d_i}) ⌈ lg((∏_{i∉Q∪S} 2^{d_i}) / 2^{β₁}) / β₂ ⌉
         = 2^{1 + d − Σ_{i∈S} d_i} ⌈ (Σ_{i∉Q∪S} d_i − β₁) / β₂ ⌉

If ⌊lg(B − 1)⌋ = ⌊lg(B − 2)⌋, then β₁ = β₂ and

    C(q) = 2^{1 + d − Σ_{i∈S} d_i} ⌈ (Σ_{i∉Q∪S} d_i) / β₂ − 1 ⌉        (4)

where β₁ = ⌊lg(B − 1)⌋, β₂ = ⌊lg(B − 2)⌋, Q is the set containing the attributes used in the join operation, and S is the set containing the attributes used in the selection operation. This can be used as the cost in Equation 2, and we can then try to find an optimal bit allocation using the techniques we described in Section 6. Using a MAH index to organise the data file contributes no additional cost to performing the select-join operation: the cost cannot be greater than that of the standard method of performing the select-join operation, because the MAH index can be ignored and the data file treated as having no index.
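A sketch of how the simplified select-join cost of Equation 4 might be computed, assuming ⌊lg(B − 1)⌋ = ⌊lg(B − 2)⌋ and treating a non-positive pass count as zero cost (our assumption); the function name and parameter values are illustrative.

```python
import math

def select_join_cost(d, d_bits, Q, S, B):
    """Partitioning cost of a select-join (sketch of the simplified
    Equation 4): the selection on S shrinks the file to
    2^(d - sum of S's bits) pages, and the bits of attributes outside
    Q and S determine how many partitioning passes remain."""
    beta2 = math.floor(math.log2(B - 2))
    rest = sum(d_bits[i] for i in range(len(d_bits))
               if i not in Q and i not in S)
    passes = math.ceil(rest / beta2 - 1)
    if passes <= 0:
        return 0                       # remaining bits fit in the buffer
    return 2 ** (1 + d - sum(d_bits[i] for i in S)) * passes

# Plain join (empty S): 6 leftover bits fit in a 64-page buffer, cost 0.
print(select_join_cost(12, [6, 2, 2, 2], {0}, set(), 66))   # -> 0
# Join on attribute 0 with a selection on attribute 1: one pass remains.
print(select_join_cost(12, [1, 2, 9], {0}, {1}, 66))        # -> 2048
```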

9.3 Other relational operations

The average cost of performing other database operations, such as intersection, union, difference and division, can be improved using indexes constructed in a similar manner to that of the join. Like the join, each of these operations involves finding records in which attributes in two relations are the same, and the performance of this aspect of each operation may be improved by utilizing an index and the partitioning operation. For example, the intersection of two relations can be implemented as a partitioning phase, followed by a comparison phase in which the attributes in the relevant partitions are tested for equality. This is the same process as the hash join operation described above. Similarly, the union operation can use a partitioning phase to reduce the amount of work needed to eliminate duplicates, because each duplicate attribute must be in the same partition. We have discussed these other operations in more detail in another paper [10].
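The partition-then-compare idea for intersection can be sketched in a few lines (an in-memory toy, not a disk-based implementation): tuples hashed to different partitions can never be equal, so equality need only be tested within each partition.

```python
from collections import defaultdict

def intersect(r, s, num_partitions):
    """Intersection via the same partition-then-compare strategy as the
    hash join: build per-partition sets from r, then probe them with s."""
    parts = defaultdict(set)
    for t in r:
        parts[hash(t) % num_partitions].add(t)
    out = []
    for t in s:
        if t in parts[hash(t) % num_partitions]:
            out.append(t)
    return out

print(intersect([(1, 'a'), (2, 'b')], [(2, 'b'), (3, 'c')], 4))  # -> [(2, 'b')]
```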

The technique of determining the best attribute to allocate bits to can also be used by other data structures. For example, the optimal dimension to split on in the multidimensional binary search tree of Bentley [2,3], or in the grid file of Nievergelt, Hinterberger and Sevcik [25], could be determined in a similar way, once the cost functions are modified to reflect the cost of using these data structures. Additionally, the technique would be compatible with join algorithms based on these structures, such as those of Harada, et al. [9].

9.4 Data file reorganization

Although we have shown, in Section 8, that an optimal bit allocation performs very well even after the query probabilities are changed by up to 80%, if the query distribution changes substantially the data file should be reorganized to achieve the best performance. Fortunately, this is, in general, an inexpensive operation. The cost of reorganizing the data file is the same as that of partitioning the data file during the partitioning phase of the hash join algorithm. Thus, under normal circumstances, a single pass over the data file is all that is required. On a single pass, ⌊lg(B − 1)⌋ bits of the hash key can be changed. Thus, if c bits must be changed to transform the old index into the new index, ⌈c/⌊lg(B − 1)⌋⌉ passes are required. Consider an index with n attributes and d_i bits allocated to the ith attribute, which must be reorganized so that the ith attribute has d′_i bits allocated to it. The number of bits which must change, that is, be taken from one attribute and given to another, is given by

    c = (1/2) Σ_{i=1}^{n} |d′_i − d_i|.

This requires only one pass if c ≤ ⌊lg(B − 1)⌋. For example, consider a data file with 2^14 pages. Assume that it has an index in which d_1 = 3, d_2 = 2, d_3 = 4 and d_4 = 5, and that we wish to reorganize the index so that d'_1 = 7, d'_2 = 6, d'_3 = 1 and d'_4 = 0; thus c = 8. Provided that the main memory buffer contains more than 2^8 pages, this requires only one pass over the data file. If each page is 8 Kbyte in size, only 2 Mbyte of memory is required to reorganize the 128 Mbyte data file in one pass.
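The reorganization cost above can be captured in a few lines. The following sketch (ours, reproducing the paper's worked example) computes c and the number of passes required for a given buffer size B.

```python
from math import ceil, floor, log2

def bits_to_change(d_old, d_new):
    """c = (1/2) * sum |d'_i - d_i|: each moved bit is counted once as taken
    from one attribute and once as given to another, hence the factor 1/2."""
    assert sum(d_old) == sum(d_new)  # total index size d is unchanged
    return sum(abs(a - b) for a, b in zip(d_old, d_new)) // 2

def passes_required(c, buffer_pages):
    """One pass over the data file can change floor(lg(B - 1)) hash-key bits,
    so ceil(c / floor(lg(B - 1))) passes are required."""
    bits_per_pass = floor(log2(buffer_pages - 1))
    return ceil(c / bits_per_pass)

# The paper's example: reorganize the allocation (3, 2, 4, 5) into (7, 6, 1, 0).
c = bits_to_change([3, 2, 4, 5], [7, 6, 1, 0])
# With a buffer of more than 2^8 pages, a single pass suffices.
print(c, passes_required(c, buffer_pages=2**8 + 1))  # prints: 8 1
```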

10 Conclusion

We have described a hash join algorithm which takes advantage of a multi-attribute hash index on a data file to reduce the cost of the partitioning phase of the join algorithm. We have also described how to find a clustering scheme which minimises the average cost of the partitioning phase of the join algorithm, and how our method may be applied to other common database operations, such as intersection and union. Additionally, we have suggested that the same technique can be used to create better indexes for other data structures, such as the grid file.

The results in this paper may be summarised as follows.

- Using an optimal bit allocation results in a significant increase in performance over the standard hash join algorithm. An optimal bit allocation can usually be found quickly, so using this method is cost effective and should always be used.
- The join query probability distribution must change significantly for an optimal bit allocation to no longer be near-optimal. Reorganization of the data file usually takes only one pass over the data file, so for relatively stable join query patterns a reorganization will rarely be required.
- Using multiple copies of data files, each with a different bit allocation, results in a significant increase in performance, even over an optimal bit allocation for one data file. This technique should be used when the improvement gained outweighs the added costs of the insertion and update operations.
- An optimal bit allocation can usually be determined by using the better of two naive techniques, EVEN and PROB1, as a seed for a trial of the simulated annealing algorithm.

An optimal bit allocation typically improves the average time taken by the partitioning phase of the join by an order of magnitude compared with the standard join algorithm; when many attributes are involved and appropriate buffer sizes are used, the improvement can be several orders of magnitude. For example, in Figure 2 the average cost of the partitioning phase for Distribution 1 was over 10 times lower than with no index. This improvement is achieved because the index eliminates the need for the partitioning phase of some joins, thus reducing the average join cost; by determining an optimal bit allocation, the partitioning phase of the joins with the greatest probability is eliminated. When multiple copies of the data file are used, each with a different bit allocation, performance increases further. For example, in Figure 4 the average cost of the partitioning phase of the hash join algorithm was almost 5 times lower with two file copies than with one, and almost 20 times lower with three file copies.

We have shown that an optimal bit allocation is robust with respect to changes in the probability of each join operation. Even when each probability is randomly changed by up to 80% of its original value, the optimal bit allocation for the original distribution is as good, or nearly as good, as an optimal bit allocation for the changed distribution. Therefore, reorganization of a data file is rarely required once an optimal bit allocation has been determined. We have also shown that when reorganization is required, it is a relatively inexpensive operation and usually requires only one pass over the data file.


Appendix A: Notation

Multi-attribute Hashing
    A_i     ith attribute
    h_i     hash function corresponding to attribute A_i
    b_ij    jth bit of the bit string generated by hash function h_i
    n       number of attributes in a relation

Cost Formulae
    d       total number of bits in the MAH index of a file
    d_i     number of bits allocated to attribute A_i
    B       number of pages in the main memory buffer
    p_i     probability of query i

Simulated Annealing
    T       number of trials
    S       number of iterations
    L       maximum number of unsuccessful iterations

Multiple Files
    C_j(i)  cost of partitioning the ith join using the jth file copy
    m       number of copies of a data file for each relation

Changing Distributions
    s       maximum percentage change to each probability

Data File Reorganization
    c       number of bits which must be changed
Acknowledgements

The authors would like to thank Justin Zobel for his comments on earlier drafts of this paper.

References

[1] A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Transactions on Database Systems, 4(2):168–179, June 1979.
[2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, September 1975.
[3] J. L. Bentley. Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, SE-5(4):333–340, July 1979.
[4] K. Bratbergsengen. Hashing methods and relational algebra operations. In U. Dayal, G. Schlageter, and L. H. Seng, editors, Proceedings of the Tenth International Conference on Very Large Data Bases, pages 323–333, Singapore, August 1984.
[5] C. Y. Chen, C. C. Chang, and R. C. T. Lee. Optimal MMI file systems for orthogonal range queries. Information Systems, 18(1):37–54, 1993.
[6] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. Wood. Implementation techniques for main memory database systems. In B. Yormark, editor, Proceedings of the 1984 ACM SIGMOD International Conference on the Management of Data, pages 1–8, Boston, MA, USA, June 1984.
[7] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong. Extendible hashing: a fast access method for dynamic files. ACM Transactions on Database Systems, 4(3):315–344, September 1979.
[8] M. Freeston. Grid files for efficient Prolog clause access. In P. M. D. Gray and R. J. Lucas, editors, Prolog and Databases: Implementations and Future Directions, chapter 12, pages 188–211. Ellis Horwood, 1988.
[9] L. Harada, M. Nakano, M. Kitsuregawa, and M. Takagi. Query processing method for multi-attribute clustered relations. In D. McLeod, R. Sacks-Davis, and H. Schek, editors, Proceedings of the Sixteenth International Conference on Very Large Data Bases, pages 59–70, Brisbane, Australia, August 1990.
[10] E. P. Harris and K. Ramamohanarao. Storage of relations using optimal multi-attribute hash indexes. In preparation.
[11] E. P. Harris and K. Ramamohanarao. Optimal dynamic multi-attribute hashing for range queries. BIT, to appear.
[12] Y. Hsiao and A. L. Tharp. Adaptive hashing. Information Systems, 13(1):111–127, 1988.
[13] Y. E. Ioannidis and Y. C. Kang. Randomized algorithms for optimizing large join queries. In H. Garcia-Molina and H. V. Jagadish, editors, Proceedings of the 1990 ACM SIGMOD International Conference on the Management of Data, pages 312–321, Atlantic City, NJ, USA, May 1990.
[14] Y. E. Ioannidis and E. Wong. Query optimization by simulated annealing. In U. Dayal and I. Traiger, editors, Proceedings of the 1987 ACM SIGMOD International Conference on the Management of Data, pages 9–22, San Francisco, California, USA, May 1987.
[15] M. Kitsuregawa, M. Nakayama, and M. Takagi. The effect of bucket size tuning in the dynamic hybrid GRACE hash join method. In P. M. G. Apers and G. Wiederhold, editors, Proceedings of the Fifteenth International Conference on Very Large Data Bases, pages 257–266, Amsterdam, The Netherlands, August 1989.
[16] M. Kitsuregawa, H. Tanaka, and T. Moto-oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1):66–74, 1983.
[17] P.-Å. Larson. Linear hashing with partial expansions. In Proceedings of the Sixth International Conference on Very Large Data Bases, pages 224–232, Montreal, Canada, October 1980.
[18] W. Litwin. Linear hashing: a new tool for file and table addressing. In Proceedings of the Sixth International Conference on Very Large Data Bases, pages 212–223, Montreal, Canada, October 1980.
[19] J. W. Lloyd. Optimal partial-match retrieval. BIT, 20:406–413, 1980.
[20] J. W. Lloyd and K. Ramamohanarao. Partial-match retrieval for dynamic files. BIT, 22:150–168, 1982.
[21] L. W. McVoy and S. R. Kleiman. Extent-like performance from a UNIX file system. In Proceedings of the USENIX 1991 Winter Conference, pages 33–43, Dallas, Texas, USA, January 1991.
[22] P. Mishra and M. H. Eich. Join processing in relational databases. ACM Computing Surveys, 24(1):63–113, March 1992.
[23] S. Moran. On the complexity of designing optimal partial-match retrieval systems. ACM Transactions on Database Systems, 8(4):543–551, December 1983.
[24] M. Nakayama, M. Kitsuregawa, and M. Takagi. Hash-partitioned join method using dynamic destaging strategy. In F. Bancilhon and D. J. DeWitt, editors, Proceedings of the Fourteenth International Conference on Very Large Data Bases, pages 468–478, Los Angeles, CA, USA, August 1988.
[25] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, March 1984.
[26] E. A. Ozkarahan and M. Ouksel. Dynamic and order preserving data partitioning for database machines. In Proceedings of the Eleventh International Conference on Very Large Data Bases, pages 358–368, Stockholm, Sweden, 1985.
[27] K. Ramamohanarao and J. W. Lloyd. Dynamic hashing schemes. The Computer Journal, 25:478–485, 1982.
[28] K. Ramamohanarao, J. Shepherd, and R. Sacks-Davis. Multi-attribute hashing with multiple file copies for high performance partial-match retrieval. BIT, 30:404–423, 1990.
[29] R. L. Rivest. Partial-match retrieval algorithms. SIAM Journal on Computing, 5(1):19–50, March 1976.
[30] J. T. Robinson. The k-d-B-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the 1981 ACM SIGMOD Conference, pages 10–18, Ann Arbor, MI, USA, April 1981.
[31] A. Swami. Optimization of large join queries: combining heuristic and combinatorial techniques. In J. Clifford, B. Lindsay, and D. Maier, editors, Proceedings of the 1989 ACM SIGMOD International Conference on the Management of Data, pages 367–376, Portland, Oregon, USA, June 1989.
[32] J. A. Thom, K. Ramamohanarao, and L. Naish. A superjoin algorithm for deductive databases. In Y. Kambayashi, editor, Proceedings of the Twelfth International Conference on Very Large Data Bases, pages 189–196, Kyoto, Japan, August 1986.
[33] K.-Y. Whang and R. Krishnamurthy. The multilevel grid file: a dynamic hierarchical multidimensional file structure. In International Symposium on Database Systems for Advanced Applications, pages 449–459, Tokyo, Japan, April 1991.
