Performance evaluation of hash joins on chip multiprocessors

Spyros Blanas

ABSTRACT

The join operation is computationally the most expensive operation the relational model supports. It has therefore received significant research attention, and various algorithms have been proposed. Joining via hashing was introduced over two decades ago and has been shown to possess interesting properties which allow it to parallelize efficiently in parallel database systems. With the advent of chip multiprocessors, which feature multiple cores per chip, there has been significant research interest in making databases run faster by exploiting the parallelism in common database operations. Surprisingly, proposals to enhance database performance on modern CPUs have focused on minimizing instruction and level-2 cache misses, typically via prefetching; the scalability of existing algorithms on multiple cores has been largely overlooked. To answer this important question and gain additional insight into the factors limiting scalability, I evaluate the performance of three hash join algorithms on three radically different architectures: an 8-core, 4-way multithreaded Sun UltraSPARC T1; a dual quad-core Intel Xeon system; and a commodity HyperThreaded Intel Pentium 4.

1. INTRODUCTION

Until recently, higher clock rates and increased cache sizes made programs run faster with each new generation of CPUs. This speedup could be obtained without modifying the program's source code or even recompiling it. However, higher clock rates have become problematic due to increased heat dissipation, and instruction-level parallelism cannot easily be exploited further. Designers have therefore shifted their attention to chip multiprocessors, which feature many processing cores per chip. These cores are usually simpler and slower than traditional uniprocessors. Unfortunately, this means that existing programs will run slower on these processors unless they are rewritten to take advantage of multiple cores.

The most important metric for database system performance has always been maximum transactional throughput. In order to maximize throughput, the database designer has to exploit the inherent parallelism of database operations, and there are a number of ways to do so. A ubiquitous database operation is producing the inner join of two or more relations on attributes with equal values. Producing the inner join is a costly operation, since it involves scanning all tuples of the two tables and matching the tuples whose join attributes are equal. Although there are numerous algorithms for joining two relations on a uniprocessor system, they can be summarized by the following three ideas (the first and third are sketched in the example after this list):

1. The nested-loops implementation simply iterates over all tuples in a nested loop and joins the tuples which match. Since this involves multiple I/O operations for each tuple, it is usually prohibitively expensive.

2. The sort-merge algorithm first sorts each table on the join attribute and then merges the results together, omitting entries whose keys do not match. This algorithm bulk loads parts of the table into memory and therefore involves less I/O than nested loops.

3. Hash-based algorithms apply a hash function to the join attribute of both tables and create buckets of tuples. Hashing guarantees that matching tuples end up in the same bucket, so only two buckets have to be searched at a time. Hash-based algorithms typically outperform the sort-merge algorithm, but can only be used with equality joins.

In this project, I implement and analyze the hash join algorithm for relational databases on chip multiprocessors.
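To make the first and third ideas concrete, the following is a minimal sketch of a nested-loops join and a hash join over two in-memory tables (sort-merge is omitted for brevity). The Tuple layout, key type, and function names are illustrative assumptions, not code from the prototype described later.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { int64_t key; /* payload columns omitted */ };

// Idea 1: nested loops compares every pair of tuples, O(|R| * |S|).
std::vector<std::pair<Tuple, Tuple>>
nested_loops_join(const std::vector<Tuple>& r, const std::vector<Tuple>& s) {
    std::vector<std::pair<Tuple, Tuple>> out;
    for (const Tuple& rt : r)
        for (const Tuple& st : s)
            if (rt.key == st.key) out.emplace_back(rt, st);
    return out;
}

// Idea 3: hashing builds buckets on one input and probes with the other;
// matching tuples are guaranteed to land in the same bucket.
std::vector<std::pair<Tuple, Tuple>>
hash_join(const std::vector<Tuple>& r, const std::vector<Tuple>& s) {
    std::unordered_multimap<int64_t, Tuple> buckets;   // build phase
    for (const Tuple& rt : r) buckets.emplace(rt.key, rt);

    std::vector<std::pair<Tuple, Tuple>> out;          // probe phase
    for (const Tuple& st : s) {
        auto range = buckets.equal_range(st.key);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(it->second, st);
    }
    return out;
}
```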

2. RELATED WORK

2.1 Exploiting parallelism

Attempts to exploit parallelism in database management systems can be summarized as follows (a sketch of the partitioning approach follows this list):

Operation parallelism: Each processing unit is assigned a different task. The granularity of the task can vary, ranging from all transactions executed over a database connection, to different transactions, or even to SQL statements without dependencies within a transaction.

Pipelining: Each processing unit does a different part of the work of each query execution, and data flows from one stage to the next. For example, input parsing, query optimization and execution are done on different processing units: the SQL command is the input to the parser stage, and the execution stage outputs the resulting tuples.

Partitioning: The input set is partitioned into segments and each segment is assigned to a processing unit. The result is obtained by merging the outputs of all processing units. For example, a relation is partitioned into k equal chunks and each chunk is assigned to one processing element, which outputs its chunk in sorted order. The final output is obtained by merging all sorted chunks together.
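As an illustration of the partitioning approach, here is a hedged sketch that sorts a relation of integer keys by splitting it into k equal chunks, sorting each chunk on its own thread, and merging the sorted runs. All names are illustrative; the sketch is not drawn from any of the systems cited below.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Partition the input into k chunks, sort each chunk on its own thread,
// then merge the sorted runs to obtain the final result.
std::vector<int64_t> partitioned_sort(std::vector<int64_t> input, unsigned k) {
    const size_t chunk = (input.size() + k - 1) / k;  // ceil(n / k)
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < k; ++i) {
        size_t lo = std::min(size_t(i) * chunk, input.size());
        size_t hi = std::min(lo + chunk, input.size());
        // Each processing element sorts its own disjoint segment.
        workers.emplace_back([&input, lo, hi] {
            std::sort(input.begin() + lo, input.begin() + hi);
        });
    }
    for (std::thread& w : workers) w.join();
    // Merge the k sorted runs pairwise, doubling the run width each pass.
    for (size_t width = chunk; width < input.size(); width *= 2)
        for (size_t lo = 0; lo + width < input.size(); lo += 2 * width)
            std::inplace_merge(
                input.begin() + lo, input.begin() + lo + width,
                input.begin() + std::min(lo + 2 * width, input.size()));
    return input;
}
```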

Ailamaki et al. [1] conclude that a modern processor does useful work roughly 50% of the time when running a database management system; the other half of the time is mostly spent waiting for data from memory. Chen et al. [3] argue that a standard GRACE-like hash join algorithm spends nearly 75% of its CPU time on cache misses, and experiment with two prefetching algorithms for speeding up hash joins. Hardavellas et al. [9] discuss the limitations and opportunities that the new chip multiprocessors present to database designers. Zhou et al. [15] investigate three different algorithms for improving the data cache miss rate on simultaneously multithreaded (SMT) machines. They argue that their work-ahead set data structure, which aggressively preloads data into the cache, yields better results than operation parallelism and is easier to implement than data partitioning approaches. Cieslewicz and Ross [4] experiment with different aggregation algorithms on a chip multiprocessor and provide an explanation of the factors affecting aggregation performance. Colohan et al. [5] propose intra-transaction parallelism to improve overall transaction performance on chip multiprocessors. Harizopoulos et al. [11] propose simultaneous pipelining for improving the overall throughput of the database management system; this design achieves better resource utilization and permits work reuse. However, Johnson et al. [12] argue that similar approaches to work sharing are harmful to performance on modern chip multiprocessors.

2.2 Hash join algorithm

Hash join algorithms for main-memory databases on uniprocessors have been studied extensively [7], and a plethora of hash join variations exist and are in widespread use. Parallel computation of hash joins was first studied in the context of parallel database systems, for both shared-nothing architectures [6] and shared-memory symmetric multiprocessors [14]. Gedik et al. [8] propose a join algorithm on streams of data using the Cell BE processor, a heterogeneous multi-core architecture which has limited and fully software-managed local memory at each co-processor. They dynamically partition the input into column-oriented join windows which are assigned to each co-processor. The performance of their solution can be partially attributed to the user-managed local memory, which allows memory transfers to overlap with computation, at the expense of increased programming complexity.

3. IMPLEMENTATION DETAILS

3.1 Prototype

For the performance evaluation, I created a prototype of about 3,000 lines of C++ code. The prototype reads a configuration file which specifies the data to be joined, the join attributes, and other algorithm-specific parameters. The input files are flat files in comma-separated-values format. During the initialization phase, the input tables are read into memory and organized in row-store order, in groups of tuples called chunks. The chunks are chained together, forming a singly-linked list. I chose this representation because it is more realistic when joining a file read from disk: a sequential scan would bring pages of tuples into memory, starting from the head of the file and traversing to the end.

At the end of the initialization phase, the locks and the buckets for the hash table are allocated, the processing threads are spawned, and the benchmarking starts. Each spawned thread acquires the lock protecting the shared input file, reads the first chunk of available data, advances the shared pointer to the next chunk, and releases the lock; a sketch of this scheme follows below. It then proceeds to execute the build phase of the algorithm specified in the configuration file. All threads synchronize at a barrier before continuing with the probe phase. The probe phase reads chunks from the input table in a similar fashion. All resulting tuples are fully materialized in thread-specific local storage. After the join algorithm has finished, the threads are destroyed and the benchmarking stops. The results from the probe phase are formatted and copied to an output file for validation.
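The following is a minimal sketch of the chunk-claiming protocol described above, under simplified assumptions about the chunk layout; it uses std::mutex where the prototype uses the pthreads API.

```cpp
#include <mutex>
#include <vector>

// Row-store tuples are grouped into chunks that form a singly-linked list.
struct Chunk {
    std::vector<char> tuples;  // raw tuple bytes, as read from the flat file
    Chunk* next = nullptr;
};

// The shared input: all worker threads pull chunks from this list.
struct SharedInput {
    Chunk* head = nullptr;  // next unclaimed chunk
    std::mutex lock;        // protects head

    // Each worker calls this in a loop until it returns nullptr, then
    // waits at the barrier before the probe phase begins.
    Chunk* next_chunk() {
        std::lock_guard<std::mutex> guard(lock);
        Chunk* c = head;
        if (c != nullptr) head = c->next;  // advance the shared pointer
        return c;
    }
};
```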

3.2 Algorithms

The hash join algorithm has two distinct phases: the build phase and the probe phase. The build phase works on the smaller input table and constructs the hash table, as follows: for each tuple in the table, the hash function is applied to the tuple's join attribute and returns the bucket where the tuple will be stored. The tuple is then copied into the bucket; if there is not enough space, a new bucket is created and chained after this one. The probe phase works on the larger input table. For each tuple, the hash function returns the bucket where all the matching tuples lie. The tuple's join attribute is then compared with the join attribute of every tuple in the bucket chain. If a match is found, the tuples are joined together.

Hash join algorithms are known to parallelize well, primarily due to the lack of dependencies: in the relational model, each tuple can be processed independently of all the others. Also, the probe phase works only on a single bucket chain and requires a linear scan of all the data in the bucket. These properties of the hash join algorithm have made it the standard join algorithm in parallel database systems [6]. In my prototype, each bucket chain is associated with a pthread mutex. When a thread wants to append data to a bucket chain, it first locks the mutex, then appends the data, and finally releases the mutex. In this way, I avoid race conditions between threads contending to append data to the same bucket. During the probe phase, the structure is read-only, so no locks are required; a sketch of this locking scheme appears below.

I also experiment with a different version of the basic hash join algorithm described above. In this version, named buffered hash join, I join tuples in bulk in order to lower the amortized processing cost of the join operation. The idea is that by batching many joins together, only the first access will miss in the cache, and the rest of the tuples can be joined at a fraction of the normal cost. The motivation for this hash join variant is similar to the idea behind STEPS [10], which schedules similar tasks to run concurrently in order to avoid multiple main-memory access requests.

Finally, I also experiment with a column-oriented hash join. In this algorithm, each tuple is represented by a pair containing its join attribute and a pointer to the rest of the data. These generated pairs are smaller than the input tuples and increase the utilization of the data brought into the cache. The column-oriented join also avoids excessive memory copying between the original table and the destination bucket.
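To make the locking scheme concrete, below is a minimal sketch of a hash table with one pthread mutex per bucket chain: the build phase locks only the chain it appends to, and the probe phase reads without locks. The bucket layout, the modulo hash, and the emit callback are simplifying assumptions; the prototype's chained fixed-size buckets differ in detail.

```cpp
#include <pthread.h>
#include <cstdint>
#include <vector>

struct Tuple { int64_t key; /* payload columns omitted */ };

// One bucket chain, modeled here as a growable vector, plus its mutex.
struct Bucket {
    std::vector<Tuple> tuples;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
};

struct HashTable {
    std::vector<Bucket> buckets;
    explicit HashTable(size_t n) : buckets(n) {}
    size_t hash(int64_t key) const { return size_t(key) % buckets.size(); }

    // Build phase: lock only the destination chain while appending.
    void insert(const Tuple& t) {
        Bucket& b = buckets[hash(t.key)];
        pthread_mutex_lock(&b.lock);
        b.tuples.push_back(t);
        pthread_mutex_unlock(&b.lock);
    }

    // Probe phase: the table is read-only by now, so no locks are taken.
    template <typename Emit>
    void probe(const Tuple& t, Emit emit) const {
        const Bucket& b = buckets[hash(t.key)];
        for (const Tuple& match : b.tuples)
            if (match.key == t.key) emit(match, t);  // join matching tuples
    }
};
```

In the column-oriented variant, the chain would hold pairs of the join attribute and a pointer to the full tuple instead of entire tuples, so more keys fit in each cache line during the probe scan.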

3.3 Limitations

The C++ prototype I developed for this evaluation has a number of shortcomings. Firstly, there is no support for variable-width types, which are typically used for storing data more complicated than simple numbers; for example, SQL supports variable-width text fields in the form of the VARCHAR type. Another major shortcoming is that no pipelining between joins is supported: the current prototype operates on exactly two input relations and produces one output. Supporting pipelining would require significant modifications to the prototype, including redesigning the configuration file syntax. However, implementing pipelined joins would allow us to measure the performance of the hash join algorithm on more complex queries, similar to the ones appearing in the TPC-H benchmark. Moreover, in the current version of the prototype I do not attempt any architecture-specific improvements; one example would be minimizing the cache-coherence traffic on Intel Clovertown machines. Also, the spawned threads currently run completely independently. If threads were working synergistically, they could take advantage of the shared level-2 cache on UltraSPARC T1 machines and increase overall performance.

In general, the prototype may exhibit suboptimal performance because it does not use a real hash function. The simplistic hash function I programmed just maps key ranges of equal size to buckets. That is, if there are k buckets and the key values lie in the range [min, max], the i-th bucket stores all tuples with keys from (max - min)/k * (i - 1), inclusive, to (max - min)/k * i, exclusive. This rudimentary hash function works well because the tuples generated by the TPC-H benchmark have uniformly distributed keys for all the queries in our workload.

Finally, my prototype currently supports only single-attribute joins. Lifting this restriction is conceptually simple, but will require significant changes in the code: the modifications will mainly involve changing the representation of the join attribute from a single number to an array of join attributes, which in turn means that all join functions accepting a single integer must be changed to accept a pointer to the beginning of the array and a length.
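For concreteness, here is a minimal sketch of the range-partitioning function described above, assuming 64-bit integer keys; the struct and member names are hypothetical, not the prototype's.

```cpp
#include <cstdint>

// Maps keys in [min_key, max_key] to buckets 0 .. k-1 by splitting the key
// range into k equal-width segments, as described in Section 3.3.
struct RangePartitioner {
    int64_t min_key;  // smallest join-attribute value in the input
    int64_t max_key;  // largest join-attribute value in the input
    int64_t k;        // number of buckets

    int64_t bucket(int64_t key) const {
        // +1 keeps max_key inside the last bucket after integer division.
        int64_t width = (max_key - min_key) / k + 1;
        return (key - min_key) / width;
    }
};
```

With uniformly distributed TPC-H keys, each bucket receives roughly the same number of tuples, which is why this simplistic scheme suffices for the workload.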

4. DATASETS

TPC-H is a decision support benchmark, created and maintained by the Transaction Processing Performance Council. It consists of a suite of business-oriented ad-hoc queries. The benchmark is modeled after decision support systems that examine large volumes of data and execute queries with a high degree of complexity. The TPC-H benchmark does not attempt to measure the performance of the components of the database execution engine; it just reports the peak throughput for a specific query.

In order to generate input datasets, I chose to work with query 5, which is a join-heavy query. PostgreSQL produced the execution plan shown in Figure 1, where the line widths have been modified to reflect the number of tuples processed in each stage. The exact characteristics of the queries are shown in Table 1.

Query   Type    Size                Tuple width
Q1      inner   2,036 tuples        36 bytes each
Q1      outer   6,001,125 tuples    24 bytes each
Q2      inner   227,597 tuples      8 bytes each
Q2      outer   1,222,276 tuples    50 bytes each

Table 1: Input dataset characteristics.

The two queries have very different characteristics: Q1 is very asymmetric, with a very short build phase but a long probe phase. Q2 is more balanced, having a "narrow" inner table of only 8 bytes per tuple.

5. RESULTS

5.1 Sensitivity to parameters

Figure 2 shows the effect of chunk size on performance. Small chunk sizes (
