GPU Processing in Database Systems


Max Heimel Nikolaj Leischner Volker Markl Michael Säcker

Fachgebiet Datenbanksysteme und Informationsmanagement Technische Universität Berlin 24.06.2011

http://www.dima.tu-berlin.de/ DIMA – TU Berlin

1

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU/CPU Hybrid DBMS Systems

Database Query

■ SQL is declarative: specifies what data is needed, not how to get it

SELECT DISTINCT o.name, a.driver
FROM   owner o, car c, demographics d, accidents a
WHERE  c.ownerid = o.id AND o.id = d.ownerid AND c.id = a.id
  AND  c.make = 'Mazda' AND c.model = '323'
  AND  o.country3 = 'EG' AND o.city = 'Cairo' AND d.age < 30;

Find the owner and driver of Mazda 323s that have been involved in accidents in Cairo, Egypt, where the driver is younger than 30.

Relational Database Operators

owner:
  ownerid   city
  1         Berlin
  2         Potsdam
  3         Berlin

cars:
  make      model     ownerid
  Honda     Accord    1
  Toyota    Camry     1
  Mazda     323       2

Selection/Filtering:  σ city='Berlin' (owner)
Projection:           π city (owner)
Join:                 cars ⋈ owner

Query Execution Plan

[Figure: query execution plan as an operator tree: an NLJN over SCAN/FETCH and SORT/ISCAN branches, with HSJNs over the ACCIDENTS, DEMOGRAPHICS (DEMO.), CAR and OWNER tables; operators annotated with estimated cardinalities such as 162015, 605999, 14422 and 1000]

■ Bottom-up data flow graph
■ Strings together operators
■ Many QEPs for a query
  □ physical operators
  □ operator order
  □ pipelining vs. bulk transfer
■ Query optimization
  □ determines the "best" QEP
■ Cost model
  □ for each operator
  □ for operator combination
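The semantics of these operators can be written down in a few lines of plain Python over the example tables from the preceding slide. This is a sequential sketch of what the operators mean, not how a DBMS (on CPU or GPU) executes them:

```python
# The example relations from the slide, as plain Python lists of dicts.
owner = [
    {"ownerid": 1, "city": "Berlin"},
    {"ownerid": 2, "city": "Potsdam"},
    {"ownerid": 3, "city": "Berlin"},
]
cars = [
    {"make": "Honda",  "model": "Accord", "ownerid": 1},
    {"make": "Toyota", "model": "Camry",  "ownerid": 1},
    {"make": "Mazda",  "model": "323",    "ownerid": 2},
]

# Selection/Filtering: keep the rows satisfying a predicate.
berlin = [row for row in owner if row["city"] == "Berlin"]

# Projection: keep only the requested columns.
cities = [{"city": row["city"]} for row in owner]

# Join: combine matching rows from both relations on ownerid.
joined = [dict(c, **o) for c in cars for o in owner
          if c["ownerid"] == o["ownerid"]]
```

Here `berlin` has two rows, and `joined` pairs each car with its owner, e.g. the Mazda 323 with the Potsdam owner.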

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU/CPU Hybrid DBMS Systems

A Case for GPU architectures

■ Challenge in processor architecture
  □ Moore's law is hitting a wall
    − power wall
    − memory wall
    − ILP wall
    − …
  □ Solution: parallel processing

■ GPGPUs offer massively parallel processing
  □ Many more cores
  □ High GFLOPS and memory bandwidth
  □ Lower cost per core
  □ Can be used to build and test massively parallel systems

GFLOPS & GB/s

■ GFLOPS & GB/s for different CPU & GPU chips
■ Beware: theoretical peak numbers

[Figure: two charts for 2006-2012: memory bandwidth in GB/s (up to ~160) and GFLOP/s (up to ~600) of GPUs (Tesla c870, c1060, c2050; FireStream 9270, 9370) and CPUs (Xeon E5320, X7460, W3540, X5680, E8870; Opteron 2360SE, 2435, 6180SE; Power7 4.04 GHz; SPARC64 VIIIfx); the GPUs lead on both metrics]

Sources: Wikipedia, Intel, AMD, Nvidia, IBM

What is the difference? Look at a modern CPU

[Figure: die photo of an AMD K8L CPU (source: www.chip-architect.com)]

■ Most die space devoted to control logic & caches
■ Maximize performance for arbitrary, sequential programs

And at a GPU

[Figure: die photo of an ATI RV770 GPU (source: ixbtlabs.com)]

■ Little control logic, a lot of execution logic
■ Maximize parallel computational throughput

GPU architecture

■ SIMD cores with small local memories (16-48 kB)
■ Shared high-bandwidth but high-latency DRAM (1-6 GB)
■ Multi-threading hides memory latency
■ Bulk-synchronous execution model

[Figure: block diagram: an x86 host attached to a GPU consisting of several SIMD cores, each a group of ALUs with a local memory, all connected to shared DRAM]

GPUs & databases: opportunities

■ Most database operations are well-suited for massively parallel processing
■ Efficient algorithms for many database primitives exist
  □ Scatter, gather/reduce, prefix sum, sort, …
■ Find ways to utilize the higher bandwidth & arithmetic throughput

GPUs & databases: challenges & constraints

■ Find a way to live with architectural constraints
  □ GPU-to-CPU bandwidth (PCIe) smaller than CPU-to-RAM bandwidth
  □ Fast GPU memory smaller than CPU RAM

■ Technical hurdles
  □ GPU is a co-processor:
    − needs the CPU to orchestrate work
    − and to get data from storage devices
    − etc.
  □ GPU programming models (e.g., OpenCL, CUDA) are low level
    − need architecture-specific tuning
    − lack database primitives
  □ Limited support for multi-tasking
    − extra work required, like aggregating operations into bulks
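The PCIe constraint can be made concrete with a back-of-the-envelope model. All numbers below (bandwidth, times, speedup) are illustrative assumptions for this sketch, not measurements from the slides:

```python
# Back-of-the-envelope check of when GPU offloading pays off despite the
# PCIe transfer. The 8 GB/s default is an assumed effective PCIe rate.

def offload_pays_off(data_bytes, cpu_time_s, gpu_speedup, pcie_bps=8e9):
    """True if PCIe transfer plus GPU compute beats CPU-only time."""
    transfer = 2 * data_bytes / pcie_bps      # copy to device and back
    gpu_time = cpu_time_s / gpu_speedup
    return transfer + gpu_time < cpu_time_s

# A 1 GB operation taking 0.5 s on the CPU, 5x faster on the GPU:
# the two PCIe copies alone cost ~0.25 s, so the win is small.
print(offload_pays_off(1e9, 0.5, 5.0))   # True, but barely
```

With a slightly faster CPU baseline (0.3 s), the same transfer cost already makes offloading a loss; this is the trade-off a hybrid optimizer has to model.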

References & Further Reading

■ J. Owens: GPU Architecture Overview. In ACM SIGGRAPH 2007 Courses (SIGGRAPH '07). ACM, New York, NY, USA, Article 2.
■ H. Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005. http://www.gotw.ca/publications/concurrencyddj.htm (visited May 2011)
■ V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey: Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. SIGARCH Comput. Archit. News 38, 3 (June 2010), 451-460.

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Hybrid CPU+GPU system

■ 1 or several (interconnected) multicore CPUs
■ 1 or several GPUs

[Figure: system diagram: two multicore CPUs, each with RAM banks and an HDD, attached via I/O hubs to two GPUs]

Bottlenecks

■ PCIe bandwidth & latency
■ GPU memory size
■ GPU has no direct access to storage & network

[Figure: the same system diagram, with the PCIe links between I/O hubs and GPUs as the bottleneck]

GDB: GPU query processing

■ Fully-fledged GPU query processor
  □ GPU hash indices, B+ trees, hash join, sort-merge join
  □ Handles data sizes larger than GPU memory, supports non-numeric data types

[Figure: single-CPU system with RAM and HDD, attached via an I/O hub to one GPU]

Relational Query Coprocessing on Graphics Processors: Bingsheng He, Mian Lu, Ke Yang, Rui Fang, Naga K. Govindaraju, Qiong Luo, Pedro V. Sander

GDB: results

■ 2x-7x speedup for compute-intense operators if the data fits inside GPU memory
■ Mixed-mode operators provide no speedup
■ No speedup if the data does not fit into GPU memory

[Figure: bar chart: performance of SQL queries in seconds (0-50), CPU only vs. GPU only, for non-equijoin (NEJ) and equijoin (hash join) queries]

GPUTx: GPU OLTP query processing

■ Offload large bulks of OLTP queries to the GPU
■ The CPU generates a processing plan based on a transaction dependency graph; the GPU does the actual query processing
  □ System with 1 CPU + 1 GPU
  □ Data must fit inside GPU memory
  □ Only supports stored procedures

High-Throughput Transaction Executions on Graphics Processors: Bingsheng He, Jeffrey Xu Yu

GPUTx: results

■ Implementation based on H-Store

[Figure: two bar charts of normalized throughput, CPU (4 cores) vs. GPUTx: TPC-B at scale factors 128, 256, 512, 1024 and 2048, and TPC-C at scale factors 20, 40, 60 and 80]

References & Further Reading

■ B. He and J. Xu Yu: High-Throughput Transaction Executions on Graphics Processors. Proc. VLDB Endow. 4, 5 (February 2011), 314-325.
■ B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander: Relational Query Coprocessing on Graphics Processors. ACM Trans. Database Syst. 34, 4, Article 21 (December 2009), 39 pages.

Challenges of GPU Programming

■ No dynamic memory allocation
■ Divergence
  □ Threads taking different code paths → serialization
■ Limited synchronization possibilities
  □ Only on block level
  □ Limited block size

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
  □ Index Operations
  □ Compression
  □ Sorting
  □ Relational Operators
  □ Further Operations
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Opportunities and Challenges

► GPGPU architectures offer massively data-parallel processing
► Fruit-fly architecture: code may run on future multicore CPUs
► Bottlenecks (PCIe, local memory, disk and network access)
► Ensuring correctness while fully utilizing data parallelism is hard
► Synchronization mechanisms for read/write conflicts are limited
► Cost modeling is hard: some hardware details are unknown or change quickly

Primitives for DBMS Operator Implementations

■ Second-order functions on arrays
  □ Map
  □ Scatter (indexed write)
  □ Gather (indexed read)
  □ Prefix-Scan (combines/reduces entries)
  □ Split (partitioning, e.g., hash or range)
  □ Sort (e.g., bitonic sort/merge sort, radix sort, quicksort)

■ Library of primitives (building blocks) for GPGPU query processing

Relational Joins on Graphics Processors: Bingsheng He, Mian Lu, Ke Yang, Rui Fang, Naga K. Govindaraju, Qiong Luo, Pedro V. Sander
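As a reference for their semantics, the array primitives above can be sketched sequentially in Python. The function names are illustrative; real GPU implementations execute each primitive data-parallel:

```python
def gather(values, idx):
    """Indexed read: out[j] = values[idx[j]]."""
    return [values[i] for i in idx]

def scatter(values, idx, out_size):
    """Indexed write: out[idx[j]] = values[j]."""
    out = [None] * out_size
    for v, i in zip(values, idx):
        out[i] = v
    return out

def prefix_scan(values):
    """Exclusive prefix sum: out[j] = sum of values[0..j-1]."""
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out

def split(values, part):
    """Partition values by part(v), e.g. a hash or range function."""
    buckets = [[] for _ in range(1 + max(part(v) for v in values))]
    for v in values:
        buckets[part(v)].append(v)
    return buckets
```

For example, `prefix_scan([3, 1, 2])` yields `[0, 3, 4]`; radix sort and split use exactly such scan outputs as write offsets for the scatter step.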

Indexing

■ Classic: B+ tree
  □ Key directory; payload in the leaves

[Figure: B+ tree with a key directory over leaf nodes that hold the payload]

■ The directory accelerates access, exploiting the fact that the data is sorted and organized hierarchically
■ Compression to map variable-length data to fixed size
  □ Dictionary encoding
■ CSS-Tree: encode directory and payload in an array, cache-aware
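Dictionary encoding, mentioned above, can be sketched in a few lines. This is a minimal illustration; the function names are made up for this example:

```python
def dict_encode(column):
    """Map variable-length values to fixed-size integer codes.
    A sorted dictionary makes the codes order-preserving, so range
    predicates can be evaluated on the codes directly."""
    dictionary = sorted(set(column))
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in column]

def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

dictionary, codes = dict_encode(["Berlin", "Potsdam", "Berlin"])
# dictionary == ["Berlin", "Potsdam"], codes == [0, 1, 0]
```

The codes are small fixed-width integers, exactly what an array-based directory such as a CSS-Tree needs.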

Prefix Tree Index

Example: the INT key 7939 (decimal) has the binary representation 0001 1111 0000 0011, which splits into the 4-bit sub-keys 1, 15, 0, 3 (decimal).

■ The path of a key within the tree is defined by its absolute value
■ Split the key into equally sized sub-keys
■ Tree depth depends on key length
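The key-splitting step can be reproduced exactly for the slide's example (16-bit key, 4-bit sub-keys). `sub_keys` is a hypothetical helper written for this illustration:

```python
def sub_keys(key, key_bits=16, sub_bits=4):
    """Split an unsigned key into equally sized sub-keys,
    most-significant first; each sub-key indexes one tree level."""
    mask = (1 << sub_bits) - 1
    shifts = range(key_bits - sub_bits, -1, -sub_bits)
    return [(key >> s) & mask for s in shifts]

print(sub_keys(7939))   # [1, 15, 0, 3]
```

The tree depth is `key_bits / sub_bits`, here 4 levels, independent of how many keys are stored.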

Speculative Hierarchical Index Traversal

■ Idea: group processing to exploit the faster local memory
■ Group computational trees into blocks
■ A tree block is executed by a GPU thread block
  □ use shared memory for intermediate results
■ Generate the global result by traversing the internal result list

Database Compression

GPUs have
  □ very limited memory capacity
  □ high computation capabilities and bandwidth
⇒ Compression reduces the overhead of data transfer
⇒ Compression allows GPUs to address bigger problems

→ Usually: combine compression schemes for higher compression ratios
→ E.g., an auxiliary scheme (delta encoding) with a main scheme (zero suppression):

   1000, 1003, 1005 → 1000, +0003, +0002 → 1000, +3, +2

Goal: carry out operations (e.g., filtering) on the compressed representation of the data

W. Fang, B. He, Q. Luo: Database Compression on Graphics Processors, Proc. VLDB Endow. 3, 1-2 (September 2010), 670-680
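The example chain above (delta-encode, then suppress the now-redundant leading zeros) can be sketched as follows. The functions are illustrative, not the cited paper's API:

```python
def delta_encode(values):
    """Store the first value plus differences to the predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

deltas = delta_encode([1000, 1003, 1005])   # [1000, 3, 2]
# Zero suppression then stores each small delta in far fewer bits
# (e.g. one byte instead of four) than the original values needed.
```

A filter such as `value > 1002` can even be rewritten to run over the compressed stream while decoding on the fly, which is the slide's stated goal.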

Sorting with GPUs

■ Numerous GPU algorithms published
  □ Bitonic sort, bucket sort, merge sort, quicksort, radix sort, sample sort, …
■ Existing work: focus on in-memory sorting with 1 GPU
■ State of the art: merge sort, radix sort
■ GPUs outperform multicore CPUs when the data transfer over PCIe is not considered

Radix Sort @ GPGPU

[Figure: data flow of one pass of GPU radix sort across cores 0-3. Bottom-level reduction: each core bins its keys by digit and reduces them to a per-core histogram. Top-level scan: the per-core histograms are combined by a prefix scan. Bottom-level scan: each core re-bins its keys, scans locally, and scatters the keys to their output positions.]

Revisiting Sorting for GPGPU Stream Architectures: Duane Merrill, Andrew Grimshaw
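The binning/scan/scatter phases in the figure correspond to one digit pass of a least-significant-digit radix sort. A sequential Python sketch of what the GPU computes in parallel (assuming unsigned keys below 2^16):

```python
def radix_pass(keys, shift, bits=4):
    """One stable pass over a 'bits'-wide digit at position 'shift'."""
    nbins = 1 << bits
    # Binning + reduce: histogram of digit occurrences.
    hist = [0] * nbins
    for k in keys:
        hist[(k >> shift) & (nbins - 1)] += 1
    # Scan: exclusive prefix sum -> first output offset of each bin.
    offsets, acc = [], 0
    for h in hist:
        offsets.append(acc)
        acc += h
    # Scatter: write each key to its bin's next free slot.
    out = [0] * len(keys)
    for k in keys:
        d = (k >> shift) & (nbins - 1)
        out[offsets[d]] = k
        offsets[d] += 1
    return out

def radix_sort(keys, key_bits=16, bits=4):
    for shift in range(0, key_bits, bits):
        keys = radix_pass(keys, shift, bits)
    return keys
```

On the GPU, the histogram and scatter loops are distributed over cores, and the single scan above becomes the two-level (per-core, then global) scan shown in the figure.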

GPU sorting performance

■ For GPU radix sort, the PCIe transfer dominates the running time

[Figure: bar chart: sorting throughput in million items per second (0-1200) for 32-bit keys and 32-bit key/value pairs, comparing GPU radix sort with PCIe transfer, GPU radix sort alone, and CPU radix sort]

Revisiting Sorting for GPGPU Stream Architectures: Duane Merrill, Andrew Grimshaw
Faster Radix Sort via Virtual Memory and Write-Combining: Jan Wassenberg, Peter Sanders

Example: Non-indexed Nested-Loop Join

[Figure: relation R split into blocks R' and S into blocks S'; each block pair (i, j) is processed by thread group i,j with threads 1 … T]

1. Split the relations into blocks
2. Join the smaller blocks in parallel

Aka "fragment and replicate join" (symmetric/asymmetric)
There is more: indexed NLJ, hash join, sort-merge join, shuffling rings, etc.
Problem: to allocate memory for the result, the join has to be performed twice!
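The "join twice" problem can be illustrated directly: pass one only counts matches per block pair, a prefix scan turns the counts into exact write offsets, and pass two writes the results into a pre-allocated output. A sequential sketch (one loop iteration standing in for one thread group; tuples join on their first field):

```python
def count_matches(r_block, s_block):
    # Pass 1: each "thread group" counts its matches, no output written.
    return sum(1 for r in r_block for s in s_block if r[0] == s[0])

def write_matches(r_block, s_block, out, offset):
    # Pass 2: re-run the same comparisons, now writing at known offsets.
    for r in r_block:
        for s in s_block:
            if r[0] == s[0]:
                out[offset] = (r, s)
                offset += 1

def block_nlj(R, S, block=2):
    r_blocks = [R[i:i + block] for i in range(0, len(R), block)]
    s_blocks = [S[i:i + block] for i in range(0, len(S), block)]
    pairs = [(rb, sb) for rb in r_blocks for sb in s_blocks]
    counts = [count_matches(rb, sb) for rb, sb in pairs]
    offsets, acc = [], 0               # exclusive prefix scan of counts
    for c in counts:
        offsets.append(acc)
        acc += c
    out = [None] * acc                 # exact allocation, no dynamic growth
    for (rb, sb), off in zip(pairs, offsets):
        write_matches(rb, sb, out, off)
    return out
```

This two-pass structure is exactly what the GPU's lack of dynamic memory allocation forces: the comparison work is done twice so that every thread group knows its disjoint output region in advance.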

Example: Aggregation

[Figure: Brent-Kung reduction circuit over 16 inputs x0 … x15]

■ Brent-Kung circuit strategy
■ Only the upsweep phase is necessary because only the final result is needed
■ Permutation of elements to minimize memory bank conflicts
■ Separate thread group to combine the results of blocks
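The upsweep-only reduction can be sketched in a few lines; each inner loop below would be one parallel step on the GPU. The input length is assumed to be a power of two, and the bank-conflict permutation from the slide is omitted:

```python
def upsweep_sum(xs):
    """Brent-Kung upsweep used as a parallel sum: after log2(n) rounds
    the total sits in the last slot; the downsweep of a full scan is
    skipped because only the final result is needed."""
    xs = list(xs)
    n = len(xs)            # assumed to be a power of two
    stride = 1
    while stride < n:
        # All updates in this round are independent -> one parallel step.
        for i in range(2 * stride - 1, n, 2 * stride):
            xs[i] += xs[i - stride]
        stride *= 2
    return xs[-1]

print(upsweep_sum(range(16)))   # 0 + 1 + ... + 15 = 120
```

Each GPU thread block would reduce one chunk this way in shared memory; a separate block then combines the per-block partial sums, as the slide describes.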

There is more...

■ Map/Reduce
■ Regular expression matching
■ K-means clustering
■ Apriori (frequent itemset mining)
■ Exact string matching
■ …

References & Further Reading: Indexing

■ J. Rao, K. A. Ross: Cache Conscious Indexing for Decision-Support in Main Memory. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 78-89.
■ C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. Brandt, P. Dubey: FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. ACM SIGMOD '10, 339-350, 2010.
■ P. B. Volk, D. Habich, W. Lehner: GPU-Based Speculative Query Processing for Database Operations. First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, in conjunction with VLDB 2010.

References & Further Reading: Sorting

■ D. G. Merrill and A. S. Grimshaw: Revisiting Sorting for GPGPU Stream Architectures. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10). ACM, New York, NY, USA, 545-546.
■ J. Wassenberg, P. Sanders: Faster Radix Sort via Virtual Memory and Write-Combining. CoRR 2010. http://arxiv.org/abs/1008.2849 (visited May 2011)
■ N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey: Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 351-362.

References & Further Reading: Rel. Operators

■ B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, P. Sander: Relational Joins on Graphics Processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 511-524.
■ D. Merrill, A. Grimshaw: Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia, December 2009.

References & Further Reading: Further Ops

■ N. Cascarano, P. Rolando, F. Risso, R. Sisto: iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Comput. Commun. Rev. 40, 5, 20-26.
■ M. C. Schatz, C. Trapnell: Fast Exact String Matching on the GPU. Technical Report.
■ W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, P. Y. Yang, B. He, Q. Luo, P. V. Sander, K. Yang: Parallel Data Mining on Graphics Processors. Technical Report HKUST-CS08-07, Oct 2008.
■ B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang: Mars: A MapReduce Framework on Graphics Processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). ACM, New York, NY, USA, 260-269.
■ many more …

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

A blast from the past: DIRECT

■ Specialized database processors already in 1978
■ Back then, they could not keep up with the rise of commodity CPUs
■ Today is different: memory wall, ILP wall, power wall, …

[Figure: DIRECT architecture: a host and a controller in front of shared memory modules, several query processors, and mass storage]

DIRECT - A Multiprocessor Organization for Supporting Relational Data Base Management Systems: David J. DeWitt

FPGAs

■ Massively parallel & reconfigurable processors
■ Lower clock speeds, but configuration for specific tasks makes up for this
■ Very difficult to program
■ Used in real-world appliances: Kickfire/Teradata, Netezza/IBM, …

[Figure: FPGA sitting between network, CPU, RAM and HDD]

FPGA: What's in It for a Database? Jens Teubner and Rene Mueller

CPU/GPU hybrid (AMD Fusion)

■ CPU & GPU integrated on one die, shared memory
■ Uses the GPU programming model (OpenCL)
■ No PCIe bottleneck, but only CPU-class memory bandwidth

[Figure: block diagram: x86 cores and GPU SIMD cores (ALU groups with local memories) sharing one memory controller and RAM]

AMD Whitepaper: AMD Fusion Family of APUs

There is more...

■ Network processors
  □ Many-core architectures
  □ Associative memory (constant-time search)

■ CELL
  □ Similar to a CPU/GPU hybrid: CPU core + several wide SIMD cores with local scratchpad memories

Common goals, common problems

■ Similar approaches
  □ Many-core, massively parallel
  □ Distributed on-chip memory
  □ Hope for better perf/$ and perf/watt than traditional CPUs

■ Similar difficulties
  □ Memory bandwidth is always a bottleneck
  □ Hard to program
    − parallel
    − synchronization
    − low-level & architecture-specific

References & Further Reading

■ D. J. DeWitt: DIRECT - A Multiprocessor Organization for Supporting Relational Data Base Management Systems. In Proceedings of the 5th Annual Symposium on Computer Architecture (ISCA '78). ACM, New York, NY, USA, 182-189.
■ R. Mueller and J. Teubner: FPGA: What's in It for a Database? In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD '09). ACM, New York, NY, USA, 999-1004.
■ AMD White Paper: AMD Fusion Family of APUs. http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf (visited May 2011)
■ N. Bandi, A. Metwally, D. Agrawal, and A. El Abbadi: Fast Data Stream Algorithms Using Associative Memories. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD '07). ACM, New York, NY, USA, 247-256.
■ B. Gedik, P. S. Yu, and R. R. Bordawekar: Executing Stream Joins on the Cell Processor. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07). VLDB Endowment, 363-374.

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Architectural constraints

■ PCIe bottleneck
  □ Direct access to storage devices, network, etc.?
  □ Caching strategies for device memory

■ GPU memory size
  □ Deeper memory hierarchy (e.g., a few GB of fast GDDR + large "slow" DRAM)?

Performance portability challenges

■ Want forward scalability: performance should scale with the next generations of GPUs
  □ Existing work is often optimized for exactly one type of GPU chip

■ Need higher-level programming models
  □ Hide hardware details (processor count, SIMD width, local memory size, …)

■ Cost estimation of operators

Database-specific challenges

■ Big data volume and limited memory
■ An execution plan consists of multiple operators
■ Where to execute each operator?
  □ Trade off the transfer time between CPU and GPU against the computational advantage
  □ Cost models for GPGPU, CPU and hybrid execution
  □ Amdahl's law
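Amdahl's law makes the operator-placement trade-off quantitative: if only a fraction p of a query plan benefits from the GPU, the overall speedup is capped at 1 / (1 - p) no matter how fast the offloaded part runs. A quick illustration (the numbers are hypothetical):

```python
def amdahl(p, s):
    """Overall speedup when a fraction p of the work is accelerated
    by factor s and the remaining (1 - p) runs at the old speed."""
    return 1.0 / ((1.0 - p) + p / s)

# If half the plan stays on the CPU, a 10x GPU gives only ~1.8x overall,
# and even an arbitrarily fast GPU cannot exceed 2x.
print(round(amdahl(0.5, 10.0), 2))   # 1.82
```

This is why hybrid cost models matter: shifting more of the plan to the profitable processor raises p, which often helps more than making the offloaded operators faster.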

Agenda

■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
  □ MonetDB/CUDA
  □ ParStream

General Overview: Hybrid MonetDB

■ Joint project of TU Berlin and CWI
■ Goal: investigate GPGPU in the context of database systems
  □ We extend the open-source database system MonetDB using OpenCL

■ Why MonetDB?
  □ Open source
  □ Column-oriented in-memory database system

■ Why OpenCL?
  □ Open standard with support from many big players
  □ Not restricted to graphics cards

Hybrid MonetDB System Outline

■ Proof-of-concept implementation
  □ Supports only fixed-size columns fitting into device memory
  □ Implemented operators
    − Table Scan
    − Nested-Loop Join
    − Aggregation
    − Group By

■ Implementation consists of:
  □ OpenCL context handling
  □ GPU memory management
  □ Integration into the MonetDB Assembly Language (MAL)
  □ Support for multiple GPUs

ParStream: High-Performance Indexing for Big Data

[Figure: processing pipeline: Index → Compression → Partition → Parallel Execution]

ParStream – Building Blocks

ParStream combines state-of-the-art database technologies with unique technologies:

■ Column Store: fast data access for analytical processing
■ In-Memory Technology: data and indices can stay in memory due to efficient compression
■ High Performance Index: a unique index structure allows highly parallel execution of queries
■ Custom Query Operators: SQL, JDBC and a powerful C++ API enable fast query processing

Patent pending

References & Further Reading

■ S. Hong and H. Kim: An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. SIGARCH Comput. Archit. News 37, 3 (June 2009), 152-163.
■ L. Bic and R. L. Hartmann: AGM: A Dataflow Database Machine. ACM Trans. Database Syst. 14, 1 (March 1989), 114-146.
■ H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun: A Domain-Specific Approach to Heterogeneous Parallelism. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '11). ACM, New York, NY, USA, 35-46.
■ http://www.dima.tu-berlin.de
■ http://www.monetdb.nl
■ http://www.parstream.com

Thank You

[Slide: "Thank You" in many languages: English, German (Danke), French (Merci), Spanish (Gracias), Italian (Grazie), Portuguese (Obrigado), and others]

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.