CUSTOM QUERY Co-PROCESSING on FPGAs


B.Tech Project submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology in Electronics and Communication Engineering

Jairaj Bhattacharya (200630012) [email protected]
Gaurav Shah (200630011) [email protected]
Mukund R (200630022) [email protected]
Sagar Gupta (200630030) [email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
MONTH YEAR


Copyright © Jairaj Bhattacharya, Mukund R, Gaurav Shah, Sagar Gupta, 2009. All Rights Reserved.


International Institute of Information Technology Hyderabad, India

CERTIFICATE

It is certified that the work contained in this project, titled "Custom Query Co-Processing on FPGAs", by Jairaj Bhattacharya, R Mukund, Gaurav Shah and Sagar Gupta, has been carried out under my supervision in partial fulfilment of their degree.

Signature_______________________________________________

Dr. R Govindarajulu Dr. Suresh Purini Dr. Chidamber Kulkarni


Dedicated to our dear Parents and friends


Acknowledgements

I am very thankful to my advisors, Dr. R Govindarajulu and Dr. Suresh Purini, for giving me complete freedom and flexibility to work on this topic. Both of them have been very encouraging and motivating, and the intensity of their encouragement has only increased with time. I am greatly indebted to them. I owe a huge debt of gratitude to Dr. Chidamber Kulkarni of Xilinx (India) for his invaluable help and guidance in the project. I am also very thankful to Bharghava and Abinesh of the CVEST (Centre for VLSI and Embedded Systems) department; those small discussions were very helpful and provided me with really valuable information.


Abstract

The rapid increase in data volumes over the past few decades has intensified the need for high processing power for database and data mining applications. Researchers have actively sought to design and develop new architectures to improve performance. Recent research shows that performance can be significantly improved using either (a) effective utilization of the architectural features and memory hierarchies of conventional processors, or (b) the high computational power and memory bandwidth of commodity hardware such as network processing units (NPUs) and graphics processing units (GPUs). Our project focused on extensive research into how a custom query co-processor can be designed with data management in mind, to help build heterogeneous chips for query processing applications. The project highlights the intra- and inter-query parallelism that can be exploited in dedicated architectures. A basic model has been implemented and analysed in detail for simple operators to check the correctness of the designed architectures.


List of Contents

1) Introduction
2) Preliminary Work
2.1) Investigative Research
2.2) QPipe
2.3) Database Servers on Chip Multiprocessors
2.4) A Case for Staged Database Systems
3.1) Implementation of a High-Level Model
3.2) Analysing of High-Level Model
4.0) Hardware Implementation
4.1) System Architecture
5.0) Implementation and Results
5.1) RTL Schematics
6.0) Conclusion
7.0) References


List of Figures

1.0) Break Down of Execution Time
2.0) Flow Graph of Project
3.0) Soft Explorer Data-Flow
4.0) Data Re-partition
5.0) Power Table Graph
5.1) Power Fetch Graph
6.0) Pre-Processor Model
7.0) Query Plan Screen Shot
7.0) System Architecture
7.1) From Packet
7.2) Where Packet
7.3) Select Packet
7.4) Data-Flow
8.0) Engine Architecture Flow
8.1) Engine Architecture Graph
8.2) Simulation of Encoder-Decoder
8.3) Simulation of Operator Module
8.4) Synthesis of From Decoder
8.5) Synthesis of From and Join Engine


1. Introduction


Query processing is a fundamental operation for most modern computing applications, especially those that rely on the Internet to reach the end user. Current state-of-the-art solutions rely on a cluster of computers to perform concurrent processing of individual queries, so building efficient query processing architectures is central to building efficient search engines. Databases are used in nearly every field of computing today, ranging from network applications to embedded applications. The two most important factors by which any application is judged are its speed and its power consumption, and there is constant work to improve both parameters in the field of query processing. Google serves results to millions of queries every second; to most, this seems like a solved problem. However, many challenging problems remain in query processing: for example, how to index and search through hours and hours of audio or video. There are multiple approaches to tackling this problem. We propose to investigate the architectural aspects of implementing efficient search engines by going down into the hardware to develop enhanced query processing capability. Our goal is to design and implement a custom query processing engine, acting as an accelerator to the traditional processor, that improves the efficiency of query processing on current systems. We propose to do this through the use of FPGAs. In addition to performance-related efficiency, we also focus on decreasing the price-performance ratio of server-based query processing systems.

2.0 Preliminary Work

We proposed to use FPGAs as the key building block for testing and performance analysis of our query processing architecture. A typical FPGA consists of arrays of CLBs (Configurable Logic Blocks) and routing channels, with Block RAMs for memory. In a typical workload, two queries are almost independent of each other; hence more than one query can be executed at a time in parallel. Multiple queries can be run in parallel at the hardware level using a multi-core architecture, providing huge performance benefits. In addition to using multiple query processors to exploit query-level parallelism, one can also exploit the fact that each query can be factored into multiple stages of operations, and exploit the parallelism obtained by pipelining each query through those stages (e.g., SIMD: single instruction, multiple data), which we call intra-query parallelism. Thus, we have two degrees of parallelism to exploit: first, query-level parallelism, by having multiple query processors in a single FPGA; and second, intra-query parallelism, by pipelining each query through multiple stages of operations. Together these techniques should result in enhanced figures of merit.

2.1 Investigative Research

Although no direct work has been done on hardware-based query co-processors, there is a lot of material in which the underlying ideas and techniques are discussed. We have studied these papers in detail and present a short summary of each in the following paragraphs.

2.2 QPipe: A Simultaneously Pipelined Relational Query Engine

This paper [1] was authored by Prof. Stavros Harizopoulos and Prof. Anastassia Ailamaki of the Computer Science Department of Carnegie Mellon University and Prof. Vladislav Shkapenyuk of the Computer Science Department of Rutgers University. They introduce two concepts: on-demand simultaneous pipelining (OSP) and QPipe. OSP maximizes data and work sharing across parallel query executions by pipelining an operator's output simultaneously to multiple parent nodes, thereby enabling dynamic operator sharing. QPipe is a relational engine which supports OSP. Each relational operator is encapsulated in a micro-engine serving query tasks from a queue, naturally exploiting all data and work sharing opportunities.


QPipe exhibits a speedup of 2x on a 4-disk Pentium 4 server system, and its throughput increases as more clients are added. The reason for this improved performance is that the memory pages brought in by one query are shared and reused by all the queries currently in execution. In the ideal case this would be done for all relational operators, not just table scans, and this capability is termed OSP. The challenge is to implement OSP with as little overhead as possible. Implementing OSP on a standard query engine raises several challenges: A) efficient detection of overlapping operators and of their eligibility for sharing; B) coping with variations in the input and output speeds of the queries; C) relaxing restrictions on query execution order to maximize sharing opportunities. If such sharing is attempted within the conventional "one query, many operators" model, the overhead incurred negates any performance benefits. A buffer pool manager keeps track of the memory pages being read and brought in, decides when to write pages back to memory, and replaces old pages using the LRU policy. Materialized view selection is typically applied to workloads known in advance, in order to speed up queries that contain common sub-expressions. The results of processed queries are stored in a cache, which enables faster responses to queries that reuse that data. Multiple-query optimization identifies common sub-expressions in query plans during optimization and produces globally optimal plans. The paper then discusses the challenges of simultaneous pipelining in detail. For the run-time detection of overlapping queries, the engine must track every query currently being executed and compare the relational operators of the input query against the set of operators currently in execution. This run-time detection should be as efficient as possible and scale well with the number of active queries. In the case of multiple scan consumers, when new scan requests for the same table arrive repeatedly and dynamically share a single scan, a large number of partial scans will be active on the same relation. Sometimes the query plan is generated such that the order of the scanned tuples matters and cannot be modified. There is also the risk of the engine getting caught in deadlocks during parallel query processing.


The current query processing model can be termed "query-centric", as engine designs conform to the "many operators, one query" standard. The system consists of an optimizer which takes the input query and outputs an optimal query plan for the query processing. The query plan is constructed as a tree whose leaves are the input points and whose nodes are the relational operators. Each query is assumed to be independent of the others and is executed in parallel by the engine; the data common across query executions is provided by a buffer pool of memory, which maintains this information through a replacement policy. In the QPipe implementation, the data to be processed is introduced to the system in the form of packets and distributed to various µEngines, each of which handles a specific relational operator. The µEngine queues the incoming packets, which contain all information about the data to be processed and the location where the output data is to be stored, and processes them one by one. The work of the µEngines is linked by producer-consumer relationships, as the output of one µEngine acts as the input of another. Support for OSP is implemented as follows: whenever a new packet enters the queue of a µEngine, the whole queue is checked for overlapping operators. Two elements are common to all µEngines: the OSP Coordinator and the Deadlock Detector. The OSP Coordinator does the groundwork for integrating an incoming query into an "in-progress" query when their operators overlap, while the Deadlock Detector ensures deadlock-free execution of simultaneously pipelined schedules. In conclusion, the authors provide techniques and concepts that facilitate the efficient parallelization of query processing through data reusability and simultaneous pipelining. We were strongly influenced by these ideas and incorporate them in our design.
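To make the micro-engine idea concrete, the following minimal C sketch is our own illustration of the concept, not QPipe's actual code; the packet fields and the cap on sharers are assumptions. It shows a µEngine queue that checks an incoming packet against in-progress work and attaches the new query as an extra consumer when operator and relation overlap:

#include <stdbool.h>
#include <string.h>

#define MAX_SHARERS 8   /* assumed cap on queries sharing one packet */

/* A packet: one relational operator's pending work for one query. */
typedef struct Packet {
    int query_id;
    char op[16];                  /* e.g. "TSCAN" */
    char relation[32];            /* table the operator reads */
    int sharers[MAX_SHARERS];     /* query ids reusing this packet's output */
    int n_sharers;
    struct Packet *next;
} Packet;

typedef struct { Packet *head; } MicroEngine;

/* OSP-style admission: if an in-progress packet runs the same operator
 * on the same relation, piggyback the new query on it instead of
 * queueing duplicate work. Returns true when sharing happened. */
bool enqueue_with_sharing(MicroEngine *e, Packet *p)
{
    for (Packet *q = e->head; q; q = q->next) {
        if (strcmp(q->op, p->op) == 0 &&
            strcmp(q->relation, p->relation) == 0 &&
            q->n_sharers < MAX_SHARERS) {
            q->sharers[q->n_sharers++] = p->query_id;
            return true;          /* new query rides the existing work */
        }
    }
    p->next = e->head;            /* no overlap: queue independently */
    e->head = p;
    return false;
}

In QPipe proper, this admission check is the OSP Coordinator's job, and it must stay cheap because it runs for every incoming packet.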

2.3 Database Servers on Chip Multiprocessors

This paper [2] was authored by Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju G. Mancheril, Anastassia Ailamaki and Babak Falsafi of the Database Group and Computer Architecture Lab (CALCM) at Carnegie Mellon University.


According to their findings, the main bottleneck for database systems until now has been off-chip data stalls, so the major effort has been directed towards bringing more and more data into the caches. At the same time, because of the ever-increasing level of integration in processor chips, newer processors have larger (but slower) data caches. They find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution-time component in some cases. With recent developments in semiconductor technology, the size of on-chip data caches has increased exponentially, to 16MB in the Dual-Core Intel Xeon 7100 and 24MB in the Dual-Core Intel Itanium 2. This trend shows no sign of changing, as computer designers assume that larger caches bring proportional increases in performance. The disadvantage of larger caches is their higher latency: on-chip L2 latency has increased more than three-fold during the past decade, e.g., from 4 cycles in the Intel Pentium III (1995) to 14 cycles in the IBM Power5 (2004). They recognize that chip multiprocessor designs can be split into two families or "camps". The fat camp (FC) has wide-issue out-of-order processors, which address data stalls by exploiting instruction-level parallelism (ILP), e.g., Intel Core Duo, IBM Power5. The lean camp (LC) has heavily multithreaded in-order processors, which hide data stalls by overlapping data access latencies with useful computation across threads, e.g., Sun UltraSPARC T1. Database workloads are divided into saturated workloads, in which idle processors always find an available thread to run, and unsaturated workloads, in which processors may not always find threads to run, thereby exposing data access latencies. The software used is FLEXUS, a cycle-accurate full-system simulator that simulates both user-level and operating-system code, giving accurate simulations of chip multiprocessors and symmetric multiprocessors running unmodified commercial database workloads. They characterize the performance of database workloads on an LC and an FC configuration with the UltraSPARC III ISA running the Solaris 8 operating system. The LC configuration employs four 2-issue superscalar in-order cores; the FC configuration employs four aggressive out-of-order cores. Cache times are assessed with Cacti 4.2, an integrated model of cache access time, cycle time, area, leakage, and dynamic power. The workloads consist of OLTP (TPC-C) and DSS (TPC-H) benchmarks running on a commercial DBMS.

Figure 5 of the paper shows that in six out of eight combinations of hardware and workload, data stalls dominate execution time. The authors analyze the components of data cache stalls: large on-chip L2 caches yield higher hit rates that shift stalls from memory to L2 hits, increasing the relative importance of L2 hit stalls. Integrating multiple processing cores on a single chip converts data accesses that would incur long-latency coherence misses in an SMP system into L2 hits in the shared L2 cache of the CMP, so the L2 hit stall component increases by a factor of 7, which distinguishes their results from prior research findings. Chip multiprocessors integrate multiple cores on a single chip, which promotes data sharing through the common L2 cache, but contention for the shared hardware resources negates the positive effects of on-chip data sharing. They then propose a few measures to improve the performance of DBMS systems on CMPs. The first is to increase parallelism, an easy task since databases are inherently parallel: queries can be broken up and worked on concurrently. All of this conforms to the discussion of paper [1] explained earlier and gives added impetus to develop our engine along similar lines. They also propose improving data locality; for example, threads bearing a producer-consumer relationship execute faster if bound to the same core. The third is to construct a staged database system, in which work is carried out in "stages" rather than as monolithic requests: the query is decomposed into "packets" which are then sent to different stages according to its requirements. Such a system can also improve data locality by explicit scheduling of the packets.


2.4 A Case for Staged Database Systems

This paper [3], authored by the same group from CMU, introduces a staged design for high-performance, evolvable DBMSs that are easy to maintain. The authors propose to break the database system into modules and encapsulate them into self-contained stages connected to each other through queues. The demand for executing time-critical operations over a base of millions of users keeps increasing, and to cope with it modern database systems need a work-centric, multi-threaded and multi-process execution model for parallelism. In the proposed design, every stage exclusively owns a data structure, allocates hardware resources, and makes its own scheduling decisions. Frequent switching between threads of the same program interleaves unrelated memory accesses, thereby reducing locality; the staged design addresses memory performance from a single application's point of view, improving locality across its threads. All stages are independent servers, each with its own queue, thread support, and resource management, that communicate and interact with other stages through a well-defined interface. Stages accept packets, each carrying a query. Two levels of CPU scheduling are adopted: local thread scheduling within a stage, and global scheduling across stages. At the top level, the actions a database server performs are divided into five query-execution stages: connect, parse, optimize, execute, and disconnect. The execute stage typically represents the largest part of a query's lifetime and is further decomposed into several stages. Stages provide two basic operations, enqueue and dequeue, on a queue of incoming packets, and the system works through the exchange of packets between stages. A packet represents work that the server must perform for a specific query at a given stage. Each stage has a pool of worker threads (the stage threads) that continuously call dequeue on the stage's queue, plus one thread reserved for scheduling purposes. Having more than one thread per stage helps mask I/O events while execution stays in the same stage (when there is more than one packet in the queue). If all threads happen to be suspended on I/O, or the stage has used up its time quantum, a stage-level scheduling policy specifies the next stage to execute. Operators that use only a small portion of the common or shared data and code are grouped together, while operators that access a large common code base or common data are kept separate. The execution-engine implementation maps the different operators onto five distinct stages: file scan ("fscan") and index scan ("iscan"), for accessing stored data sequentially or through an index, respectively; sort; join, which includes three join algorithms; and a fifth stage that includes the aggregate operators (min-max, average, etc.). The "fscan" and "iscan" stages are replicated and separately attached to the database tables, so that queries accessing the same tables can take advantage of relations that already lie in the higher levels of the memory hierarchy.
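As an illustration of the enqueue/dequeue interface described above, the following C sketch (our own, under assumed types, not the paper's implementation) models a stage as a thread-safe packet queue on which a pool of stage threads blocks:

#include <pthread.h>
#include <stddef.h>

/* A packet: work the server must perform for one query at this stage. */
typedef struct StagePacket {
    int query_id;
    void *work;                   /* stage-specific payload, assumed */
    struct StagePacket *next;
} StagePacket;

typedef struct Stage {
    StagePacket *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} Stage;

void stage_enqueue(Stage *s, StagePacket *p)
{
    pthread_mutex_lock(&s->lock);
    p->next = NULL;
    if (s->tail) s->tail->next = p; else s->head = p;
    s->tail = p;
    pthread_cond_signal(&s->nonempty);   /* wake one stage thread */
    pthread_mutex_unlock(&s->lock);
}

StagePacket *stage_dequeue(Stage *s)
{
    pthread_mutex_lock(&s->lock);
    while (!s->head)                     /* block until work arrives */
        pthread_cond_wait(&s->nonempty, &s->lock);
    StagePacket *p = s->head;
    s->head = p->next;
    if (!s->head) s->tail = NULL;
    pthread_mutex_unlock(&s->lock);
    return p;
}

Each stage thread then simply loops on stage_dequeue, while a separate scheduling thread decides when the stage has used up its time quantum.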

3.1 Implementation of a High-Level Model

Besides studying research papers on related work, we worked in parallel on the implementation of our design. Our first task was to understand the present query processing architecture. To do so, we built a high-level model of the present architecture and then correlated it with a benchmarked database management system. We wrote a query-processor code which implemented searching data from tables using simple engines such as FROM, WHERE and SELECT. This helped us design a framework for the existing sequential architecture. Once the implementation was finished, we started testing our design on the SAM (SPARC Architecture Model) simulator on OpenSPARC T2. We then profiled our code to analyse the overheads and latencies in the existing code. This helped us understand the intricate details of query processing and the places where searching can be parallelized on multi-core, multi-threaded architectures.

The following flow chart gives a brief overview of the work done in the implementation of the high-level model:


(Figure 2.0: Flow Graph of Project)

1. Study database benchmarks like TPC to model our sequential architecture.
2. Frame the high-level sequential architecture for implementation.
3. C code implementation:
   (a) code a parsing engine for query search, with parsing ability for FROM, WHERE and SELECT;
   (b) code the cache models by accessing the schema and hashing tables;
   (c) implement a query-plan analyser using the WHERE engine, and a filtering cache using the FROM and SELECT engines.
4. Analyse, test and profile the code in SAS (SPARC Architecture Simulator), identifying parallel implementations for the architecture based on the related work studied.
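To make the modelled flow concrete, here is a minimal C sketch of the sequential FROM -> WHERE -> SELECT evaluation; the table, schema and query are illustrative stand-ins, and the real model additionally covered the schema cache and hashing tables:

#include <stdio.h>
#include <string.h>

#define ROWS 4
#define COLS 3

static const char *schema[COLS] = {"id", "name", "age"};
static const char *table[ROWS][COLS] = {
    {"1", "ann", "34"}, {"2", "bob", "19"},
    {"3", "cat", "27"}, {"4", "dan", "42"},
};

static int col_index(const char *attr)          /* schema lookup */
{
    for (int c = 0; c < COLS; c++)
        if (strcmp(schema[c], attr) == 0) return c;
    return -1;
}

/* SELECT sel FROM table WHERE attr == val */
static void run_query(const char *sel, const char *attr, const char *val)
{
    int wc = col_index(attr), sc = col_index(sel);
    if (wc < 0 || sc < 0) return;               /* unknown attribute */
    for (int r = 0; r < ROWS; r++)              /* FROM: scan rows   */
        if (strcmp(table[r][wc], val) == 0)     /* WHERE: filter     */
            printf("%s\n", table[r][sc]);       /* SELECT: project   */
}

int main(void)
{
    run_query("name", "id", "3");               /* prints "cat" */
    return 0;
}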


3.2 Analysing our High-Level Model

The high-level module, written in C, was then analysed for its power consumption using the Soft Explorer analysis tool. Soft Explorer is an automatic power/energy-consumption estimation tool. Its input can be either the assembly code generated by the compiler (or written directly by the programmer) or the C code itself. The tool reports the global and local (per-loop) consumption, as well as how the consumption is distributed. It can also report the energy consumption of the external memory and how the energy consumption is split between the processor and the external memory.

The power consumption of the design can be split into four major components:
1. DMA power
2. Processing-unit power
3. Fetch power
4. Clock power


We simulated the sequential C code on the CS5510 DSP processor to view the power/energy consumption of a serial query processor, giving us a baseline against which to compare the pipelined Verilog implementation, which we show to perform better. The power analysis of the sequential C code is displayed in the figure below:

This analysis shows that the clock and the fetch unit account for 52 percent and 41 percent of the power consumption respectively, making them the dominant factors. The next simulation was carried out to determine how the total power consumption and the fetch power consumption are shared among the different data structures in the code.


The following illustrates the power-consumption breakdown across the data structures used in the C code, giving a hint of how the data elements are used in the high-level model. The fetch power per data structure can be seen in the next simulation graph.


4.0 Hardware Implementation

Our implementation has narrowed down to the data-flow diagram shown below. The distinction between the pre-processor and the co-processor, identifying the point at which data handling shifts into hardware, is shown here. In this part of the report we summarise the implementation scheme as well as the architecture used for the implementation.

The following picture shows how any SQL query can be compiled to generate its optimized query plan without executing the query. Once this is done, we proceed to the generation of the packet for communication with the FPGA. The packet generation is done as a shell script. The packet is structured roughly as follows:


Packets are formed after this. The packets for a query can be either dependent or independent. Here is an example of how the query packets for communication are formed:


Depending on the query plan, the packets are modified by the encoder and decoder modules.

4.1 System Architecture


FROM ENGINE:


The primary function of the FROM engine is to fetch table data from the common memory into the local memory of the co-processor. This engine accepts instructions from the pre-processor in the following packet format:

QI, or Query ID, is unique to a specific query and is assigned by the pre-processor. It is used by the co-processor to maintain coherency between different instruction segments of the same query.
EI, or Engine ID, is unique to each engine and is "0000" for the FROM engine. The IFU uses this ID to ensure that all instructions are routed to their respective engines.
OP ID, or Operator ID, has a constant value for the FROM engine, as it consists of a single operator.
Table Add(ress) is a 32-bit field which specifies the starting address in common memory of the table data to be fetched into local memory.
Table Size is an x-bit field composed of two sub-fields: Length, of y bits, and Width, of z bits. These values give the size of the table to be retrieved, as a number of rows and columns.
J specifies whether or not a join operation has to be performed on the table.
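Read as a data structure, the FROM instruction could look like the following C sketch. The field widths are assumptions made purely for illustration: the 32-bit table address is fixed above, but the Table Size width (x, y, z) is left open, so 16-bit row and column counts are placeholders:

#include <stdint.h>

/* Illustrative layout of a FROM-engine instruction packet. */
typedef struct {
    uint8_t  qi;          /* Query ID, assigned by the pre-processor  */
    uint8_t  ei;          /* Engine ID: "0000" for the FROM engine    */
    uint8_t  op_id;       /* constant for FROM (single operator)      */
    uint32_t table_addr;  /* start of table data in common memory     */
    uint16_t rows;        /* Table Size: Length sub-field (assumed)   */
    uint16_t cols;        /* Table Size: Width sub-field (assumed)    */
    uint8_t  j;           /* 1 if a join is to follow, else 0         */
} FromPacket;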

After the data is processed, this engine generates another intermediate instruction, in the format shown above, for consumption by the WHERE or JOIN engine.


Table Add(ress) is a 32-bit field which specifies the starting address of the table data in local memory. Table ID is a unique ID attached to each table stored in local memory. All other fields are as in the previous case.

JOIN ENGINE:

The join engine's task is to take two SQL tables and apply a merge join to them, producing the single resultant table required by the other engines for further processing of the query.

It then transmits the information about the new table through an intermediate instruction in the format specified above. Temp Table Address is the temporary address of the newly formed table in local memory. The other fields have been explained earlier.
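Merge join itself is straightforward once both inputs are sorted on the join key, which is the precondition the engine relies on. A minimal C sketch with single integer keys (unique keys assumed; the engine's tables of course carry full tuples):

#include <stdio.h>

/* Walk two sorted key columns in lockstep, emitting matches. */
static void merge_join(const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      i++;           /* advance smaller side */
        else if (a[i] > b[j]) j++;
        else {                               /* keys match: emit row */
            printf("join key %d\n", a[i]);
            i++; j++;                        /* unique keys assumed  */
        }
    }
}

int main(void)
{
    int r[] = {1, 3, 5, 7}, s[] = {2, 3, 5, 8};
    merge_join(r, 4, s, 4);                  /* prints keys 3 and 5 */
    return 0;
}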


WHERE ENGINE:


The WHERE engine receives a temporary table, which serves as its working data set, and a set of conditions, each specified in a separate instruction, to be applied to that data set. The pre-processor parses multiple WHERE conditions in the input query and splits each one into a separate instruction passed to the WHERE engine.

The OP ID for the WHERE engine specifies the operation to be performed on the attributes, which ultimately filters the useful tuples out of the temporary table.

ATR1 and ATR2 are the fields on which the operation is performed; the result gives the constraints which, when applied to the temporary table, yield the set of tuples stored in memory as a new temporary table. The J bit tells whether or not the input temporary table is the result of a join operation. The other fields have been explained earlier.
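In software terms, the WHERE step amounts to evaluating the OP ID-selected comparison per tuple and compacting the survivors into the new temporary table. The following C sketch is illustrative only; the opcode encoding and the fixed four-column tuple are assumptions:

#include <stdbool.h>

typedef enum { OP_EQ, OP_NEQ, OP_LT, OP_GT } WhereOp;

static bool where_filter(WhereOp op, int atr1, int atr2)
{
    switch (op) {
    case OP_EQ:  return atr1 == atr2;
    case OP_NEQ: return atr1 != atr2;
    case OP_LT:  return atr1 <  atr2;
    case OP_GT:  return atr1 >  atr2;
    }
    return false;
}

/* Keep only tuples whose column col satisfies the condition; the
 * survivors are compacted in place and their count returned. */
int apply_where(int table[][4], int rows, int col, WhereOp op, int constant)
{
    int kept = 0;
    for (int r = 0; r < rows; r++) {
        if (where_filter(op, table[r][col], constant)) {
            for (int c = 0; c < 4; c++)
                table[kept][c] = table[r][c];
            kept++;                          /* tuple survives */
        }
    }
    return kept;                             /* rows of the new table */
}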


This engine outputs a new temporary table. The data in this table is a subset of the main database table and satisfies all conditions set forth in the query. The information about the new table is encapsulated in an intermediate instruction of the format above and passed on to the SELECT engine.

SELECT ENGINE:

This engine is fed by the intermediate instructions formed by the WHERE engine. Its task is to retrieve from the new table those data fields which have been specified in the query. This information is extracted by the pre-processor and conveyed to this engine through an instruction of the format:

ATTRIBUTE FIELD specifies the attributes to be displayed as output; the SELECT engine filters these attributes out of the temporary table whose whereabouts are sent to it in an intermediate packet (WHERE -> SELECT).
AN specifies the number of attributes to be selected.
NO OP is no operation.
The other fields have been explained earlier.

These four engines are the crux of the architecture defined here: since they are all separate units, pipelining can be exploited even while solving a single query, which is what we defined as intra-query parallelism. When multiple queries are solved at once, the architecture advances to inter-query parallelism.
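The SELECT step is then a plain projection: for each surviving tuple, pick out the AN attributes named in the ATTRIBUTE FIELD. A short C sketch (column indices and integer attributes assumed for illustration):

#include <stdio.h>

/* Print, for every tuple, the an attributes whose column indices are
 * listed in attrs[] (the ATTRIBUTE FIELD of the instruction). */
void select_project(const int table[][4], int rows, const int attrs[], int an)
{
    for (int r = 0; r < rows; r++)
        for (int k = 0; k < an; k++)
            printf("%d%c", table[r][attrs[k]], k + 1 < an ? '\t' : '\n');
}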

Dataflow:

In this section we shall see how a query is processed by the custom query processor. The IFU (Instruction Fetch Unit) sends the packets one by one to the common bus whenever the common bus is free, i.e., when the common bus gives a ready-to-receive signal on being requested by the IFU. The query has been segmented into parts such that each engine can perform its specified task.

The encoder and decoder units are common to every engine architecture. The decoder decodes the packets and sends the information in them to the operation unit, which performs the specified task and then sends the result to the encoder unit; the encoder encodes the output as a packet and forwards it to the common bus so that it can reach its destination engine, or else simply displays the output.

The first task of every query is to retrieve the tables on which the manipulations have to be done. This task is performed by the FROM engine, which retrieves the data from the main memory and stores it in the local memory of the co-processor. The address of this local memory is forwarded in the form of packets to the JOIN engine if it is required; otherwise the packets are forwarded to the WHERE engine. According to the query, the WHERE engine receives, in the intermediate packets from the FROM or JOIN engine, the address of the temporary table on which it has to do its manipulations. The JOIN engine comes into play when the data manipulation has to be done on two or more tables. The manipulations to be performed are also sent in packets by the pre-processor and forwarded to the engines by the IFU through the common bus. The decoder in the WHERE engine decodes these packets and sends the information to the operation unit. The resultant tuples are stored in another temporary table in memory, and the address of this temporary table is forwarded to the SELECT engine in intermediate packets. In the SELECT engine, the decoder has to decode two kinds of packets: the intermediate packets formed by the WHERE engine, and the packets sent by the pre-processor which specify the attributes to be selected from the resultant tuples produced by the WHERE engine, after which the output is displayed.

The temporary tables are stored in the local memory of the co-processor. This enables speedup when data has to be fetched, with an almost 100 percent hit rate. The input buffer and the output buffer are simply parts of the local SRAM. The memory operations in the co-processor are synchronized by the memory controller, which is responsible for data integrity.


Memory Controller: This is the memory interface of the engine. It talks to the external memory. Its job is to fetch data into the data buffer and then store data back to memory. It also receives the address of the written output block in memory, to send on to the encoder module.

Data Buffer: This is the local memory provided to each engine to hide the memory access latency. There are two buffers inside, an input buffer and an output buffer, for storing input and output data respectively.

Operator: This is the core of the engine and performs the desired operation. It performs all the architectural operations, such as sequence scan, index scan, and projection.


Encoder and Decoder Module: The decoder buffer stores the queries coming from the IFU or from an operator, according to their dependent or independent nature; this is discussed in detail later in the document. The decoder decodes the query packets, extracts the ENGINE CODE, query_id, table address, table size, etc., and sends them to the respective modules. The encoder builds and stores the new packet to send it on to other engines.
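In essence, the decoder performs field extraction on the packet words arriving over the 32-bit common bus. The following C sketch shows the idea; the bit positions are assumptions, since the exact packing is not pinned down here:

#include <stdint.h>
#include <stdio.h>

/* Unpack one 32-bit packet word into its fields (illustrative layout). */
void decode_word(uint32_t w)
{
    unsigned qi      = (w >> 28) & 0xF;      /* Query ID            */
    unsigned ei      = (w >> 24) & 0xF;      /* Engine ID           */
    unsigned op_id   = (w >> 20) & 0xF;      /* operator selector   */
    unsigned payload =  w        & 0xFFFFF;  /* address/size/flags  */
    printf("QI=%u EI=%u OP=%u payload=0x%05X\n", qi, ei, op_id, payload);
}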

The engine-architecture figure shows how a query is executed by the query engines. The signals are as follows. The IFU receives queries in the form of packets for further processing by the query processor and puts the packets on the common bus. The common bus is read by all the engines; if the destination address field in a packet matches the engine code, that packet is stored in the decode queue of that engine. There are two decoding queues, namely the dependent and the independent packet queue, and a packet is stored in the respective queue depending on its nature. If the memory controller is free, it asserts the 'tx_status' signal and waits to receive data from the Encoder/Decoder unit. The Encoder/Decoder unit sends the required data to the memory controller and simultaneously sends the metadata of the table to the operator. Once the data buffer starts receiving the table data, the operator commences its operation and writes the data back into the output buffer.

1. Common Bus: A 32-bit bus that transfers packets from the IFU to an operator and from operator to operator. It is a shared bus with handshaking, as discussed below.

Handshaking signals interfacing with the IFU:

2. bus_rd_rq (active LOW): Initially all the engines read the bus, so 'bus_rd_rq' is HIGH. On seeing a packet designated for itself, the corresponding operator de-asserts the signal. It is asserted again when EOP (End Of Packet) is received.

3. bus_rd_grant (active HIGH): Seeing 'bus_rd_rq' LOW from a particular engine, the IFU asserts 'bus_rd_grant' HIGH for that engine and LOW for all the others. It is de-asserted when 'bus_rd_rq' of that engine goes HIGH again.

Handshaking signals interfacing with the external memory controller:

4. mem_rq (active HIGH): The engine's memory controller asserts this when it has a table address and a table size to send to memory, and then waits for the 'mem_grant' signal to go HIGH. It is de-asserted once all the data has been sent.

5. mem_grant (active HIGH): Seeing 'mem_rq' HIGH from an engine, the external memory controller asserts this signal based on the availability of memory. It is de-asserted once 'mem_rq' is de-asserted.

6. mem_add_rq (active HIGH): When 'op_block_done' is HIGH, the engine asserts this signal to get from the memory controller the address and offset of the output block written to memory. It is de-asserted once 'mem_add_grant' is HIGH.

7. mem_add_grant (active HIGH): The memory controller, seeing 'mem_add_rq' HIGH, asserts this signal and sends the address and offset of the stored block. It is de-asserted again once 'mem_add_rq' is LOW.

8. mem_data_valid (active HIGH): The memory controller asserts this to tell the data buffer of an engine that it is sending valid data. It is de-asserted temporarily if 'buffer_full' is HIGH, and permanently when 'ip_table_done' is asserted.

9. buffer_full (active HIGH): The data buffer asserts this to indicate that the input data buffer is full. Seeing it HIGH, the memory controller de-asserts the 'mem_data_valid' signal. It is de-asserted when the input side of the data buffer drops below a threshold level.

10. mem_wr_rq (active HIGH): The data buffer asserts this signal when its output data buffer has more entries than a certain threshold. It is de-asserted when the data writing is finished.

11. mem_wr_grant (active HIGH): Seeing 'mem_wr_rq' HIGH, the memory controller asserts 'mem_wr_grant', based on availability, to grant the data buffer access to write the output data block. It is de-asserted once 'mem_wr_rq' is de-asserted.

12. ip_table_done (active HIGH): This is asserted to signal to the data buffer that the table fetch is finished. It is de-asserted when 'mem_data_valid' is asserted.
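As a reference for the bus-read handshake (signals 2 and 3 above), the following behavioural C model is our sketch of the protocol rules, not the Verilog itself; one call to each step function stands for one clock edge:

#include <stdbool.h>

typedef struct {
    bool bus_rd_rq;      /* driven by the engine, active LOW        */
    bool bus_rd_grant;   /* driven by the IFU, active HIGH          */
    bool packet_for_me;  /* destination field matches engine code   */
    bool eop;            /* End Of Packet seen on the bus           */
} BusIf;

void engine_step(BusIf *s)
{
    if (s->packet_for_me) s->bus_rd_rq = false;  /* claim the packet  */
    if (s->eop)           s->bus_rd_rq = true;   /* release after EOP */
}

void ifu_step(BusIf *s)
{
    s->bus_rd_grant = !s->bus_rd_rq;  /* grant follows the active-LOW request */
}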

5.0 Implementation and Results

The architecture described above was implemented in Verilog HDL. The implementation was carried out in two parts:

a) IFU-Encoder/Decoder system. This consisted of all the engines along with their encoder-decoder queues and memory controllers. The engines were then interfaced with the common IFU unit to complete this section of the design.

b) Operator architecture. This part consisted of designing the core of each engine. The operator design required extra care so that latencies could be kept in check.

To verify the functional correctness of the design blocks, we generated an input file which fed the input ports with a bit-stream. By running the simulation tool on this dataset, we checked the data bits at various stages of processing. The outputs at each encoder and decoder were also verified. The following shows the functional simulation results of the IFU-encoder-decoder system in GTKWave:


The module above interfaces the IFU with the FROM, WHERE and SELECT encoder and decoder units. The FROM decoder unit is implemented with a serial FIFO, whereas the WHERE and SELECT decoders are implemented with flexible FIFO systems which can break their links based on the query ID of the packet placed in the FIFO (see the sketch below). The system is connected via the IFU and tested over clock cycles, giving correct simulation outputs.
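The flexible FIFO can be pictured as a linked queue that can unlink an entry anywhere in the chain by its query ID, unlike the FROM decoder's strictly serial FIFO. A C sketch of the idea (the real unit is Verilog, and the node contents are assumptions):

#include <stddef.h>

typedef struct Node {
    int query_id;
    unsigned packet;          /* packet word, illustrative */
    struct Node *next;
} Node;

/* Remove and return the first entry matching qid, or NULL if none. */
Node *take_by_qid(Node **head, int qid)
{
    for (Node **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->query_id == qid) {
            Node *n = *pp;
            *pp = n->next;    /* break the link around the entry */
            n->next = NULL;
            return n;
        }
    }
    return NULL;
}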

SIMULATION of IFU-ENCODER-DECODER UNIT

The operator module was then implemented along with its interfaces to the SRAM, the memory controller and the clock. The module engines have been tested in logic simulation and give correct results. The engines are built as clock-based state machines in which the control signals change on clock edges. A screenshot of the simulations of this module is shown here:


SIMULATION of OPERATOR MODULE

The RTL synthesis of both modules was carried out in Synopsys design tools and Xilinx ISE. The calculated results are shown below.

FINAL RESULTS: This report gives information about the logic and device utilization of our design on the Xilinx Virtex-2 Pro FPGA.

RTL Top Level Output File Name : fromdecoder.ngr
Top Level Output File Name     : fromdecoder
Output Format                  : NGC
Optimization Goal              : Speed
Keep Hierarchy                 : NO

Design Statistics
# IOs                          : 133

Cell Usage:
# BELS                         : 637
#   BUF                        : 2
#   INV                        : 1
#   LUT3                       : 4
#   LUT3_D                     : 1
#   LUT3_L                     : 3
#   LUT4                       : 589
#   LUT4_D                     : 29
#   LUT4_L                     : 8
# FlipFlops/Latches            : 716
#   FDE                        : 76
#   FDRE                       : 640
# Clock Buffers                : 1
#   BUFGP                      : 1
# IO Buffers                   : 132
#   IBUF                       : 67
#   OBUF                       : 65
=========================================================================

The design was synthesized for package 2vp2fg256-7, and the following reports show the utilization summary:

Device utilization summary:
---------------------------
Selected Device : 2vp2fg256-7

 Number of Slices:             435  out of   1408    30%
 Number of Slice Flip Flops:   716  out of   2816    25%
 Number of 4 input LUTs:       635  out of   2816    22%
 Number of IOs:                133
 Number of bonded IOBs:        133  out of    140    95%
 Number of GCLKs:                1  out of     16     6%
---------------------------

Timing Analysis:

Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 2.831ns (frequency: 353.280MHz)


Total number of paths / destination ports: 6057 / 1368
-------------------------------------------------------------------------
Delay:               2.831ns (Levels of Logic = 2)
Source:              ins1/wrptr_2_1 (FF)
Destination:         ins1/mem_7_63 (FF)
Source Clock:        clk rising
Destination Clock:   clk rising

Data Path: ins1/wrptr_2_1 to ins1/mem_7_63
                                Gate     Net
   Cell:in->out      fanout     Delay    Delay
   -------------------------------------------
   FDE:C->Q               6     0.370    0.581
   LUT3_D:I0->O           1     0.275    0.349
   LUT4:I2->O            64     0.275    0.717
   FDRE:CE                      0.263
   -------------------------------------------
   Total                        2.831ns (1.183ns logic, 1.648ns route)
                                        (41.8% logic, 58.2% route)
=========================================================================

Mapping Analysis:

Mapping Report
--------------
Logic Utilization:
 Number of Slice Flip Flops:        716 out of  2,816   25%
 Number of 4 input LUTs:            636 out of  2,816   22%
Logic Distribution:
 Number of occupied Slices:         401 out of  1,408   28%
 Number of Slices containing
   only related logic:              401 out of    401  100%
 Number of Slices containing
   unrelated logic:                   0 out of    401    0%
 Total Number of 4 input LUTs:      636 out of  2,816   22%
 Number of bonded IOBs:             133 out of    140   95%
 Number of BUFGMUXs:                  1 out of     16    6%

5.1 RTL SCHEMATICS OF SYNTHESIS: These screenshots were generated by the Xilinx Synthesis Tool.

a) The IFU to FROM DECODER unit:


b) RTL One Level Down:


c) FROM AND JOIN OPERATOR schematic:


6.0 Conclusion

This work has given a direction towards implementing an architecture for accelerating query operations. It also provides definite results to support our case and probes into the details of customizing indexing, merging and hashing operations with hardware SM-based operators. Further work in this direction will aim at making the hardware scalable, interfacing it with PostgreSQL, and running TPC benchmarks on the designed co-processor.

7.0 References:

[1] Stavros Harizopoulos, Vladislav Shkapenyuk, Anastassia Ailamaki. QPipe: A Simultaneously Pipelined Relational Query Engine. SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA.
[2] Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju G. Mancheril, Anastassia Ailamaki, Babak Falsafi. Database Servers on Chip Multiprocessors: Limitations and Opportunities. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.
[3] S. Harizopoulos and A. Ailamaki. A Case for Staged Database Systems. In Proceedings of the First Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, January 2003.
[4] Hector Garcia-Molina. Database System Implementation, 2nd edition.
[5] A. Silberschatz, H. Korth, S. Sudarshan. Database System Concepts.
[6] L. D. Shapiro. Join processing in database systems with large main memories. ACM Trans. Database Syst., 11(3):239-264, 1986.
[7] K. Stocker, D. Kossmann, R. Braumandl, and A. Kemper. Integrating semijoin reducers into state of the art query processors. In ICDE, pages 575-584, 2001.
[8] K. A. Ross. Modeling the performance of algorithms on flash memory devices. In DaMoN, pages 11-16, 2008.
