Firepile: GPU programming in Scala

Nate Nystrom, University of Lugano, Switzerland ([email protected])
Derek White and Kishen Das, University of Texas at Arlington

GPCE, 23 Oct 2011, Portland, Oregon

General-purpose GPUs

GPUs are being used for increasingly general-purpose computation:
• scientific applications
• finance
• computer vision
• gaming

Often roughly a 10x performance improvement over multicore at the same cost.

But, GPU programming is painful

GPU programming languages are impoverished:
• no dynamic memory allocation
• no recursion
• no virtual dispatch*

Complicated memory architecture.

Lots of tedious boilerplate to manage devices and communication between the host and the GPU.

* the latest version of CUDA now supports virtual dispatch

Firepile

A library for GPU programming in Scala.

Goals:
• ease of programming
• performance comparable with C
• write all code in Scala
• the library manages GPU access and generates the code that runs on the GPU

GPU programming (in OpenCL)

Programs are partitioned between host code and kernels.

Host code:
• manages devices
• preallocates GPU memory
• copies data between the host and the GPU
• compiles kernels from strings of C code
• invokes kernels

Kernels:
• data-parallel procedures that run on the GPU
• little or no communication between threads
• written in a restricted subset of C
• no dynamic memory allocation, recursion, or virtual dispatch
• explicit management of the memory hierarchy

Summing an array on a GPU

[Figure: step-by-step walkthrough of computing ∑ i∈1..n x(i) for an 8-element array x:
1. Kernel code is copied to the GPU.
2. The input array is copied to the GPU's global memory.
3. The kernel is invoked.
4. Thread i copies x(i) into local memory, shared between all threads in a block.
5. Each block computes a partial sum in log(blockSize) rounds.
6. The first thread in each block copies its partial sum back to global memory.
7. The partial sums are copied back to the host.
8. The final sum is computed on the host (here 10 + 26 = 36).]

Summing an array in OpenCL

__kernel void sum(__global float* in, __global float* out, __local float* tmp) {
    int i = get_global_id(0);
    int tid = get_local_id(0);
    tmp[tid] = in[i];
    barrier(CLK_LOCAL_MEM_FENCE);
    int s = get_local_size(0) / 2;
    while (s > 0) {
        if (tid < s)
            tmp[tid] += tmp[tid + s];   // the addition
        barrier(CLK_LOCAL_MEM_FENCE);
        s /= 2;
    }
    if (tid == 0) {
        out[get_group_id(0)] = tmp[0];
    }
}

OpenCL host code

float sum(float *x, int vectorLength) {
    // set up the device
    clGetPlatformID(&platform);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    context = clCreateContext(0, 1, &dev, NULL, NULL, &err);
    queue = clCreateCommandQueue(context, dev, 0, &err);

    // compile the kernel
    program = clCreateProgramWithSource(context, 1, (const char **)&source, &kernelLength, &err);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "sum", &err);

    // declare the output array
    szLocalWorkSize = 256;
    szGlobalWorkSize = vectorLength;
    float dst[szGlobalWorkSize / szLocalWorkSize];

    // allocate GPU memory
    cmDevSrc = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_float) * szGlobalWorkSize, NULL, &err);
    cmDevDst = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float) * szGlobalWorkSize, NULL, &err);
    cmDevTmp = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * szGlobalWorkSize, NULL, &err);

    // copy input data to the GPU
    clEnqueueWriteBuffer(queue, cmDevSrc, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, x, 0, NULL, NULL);

    // invoke the kernel
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&cmDevSrc);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&cmDevDst);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&cmDevTmp);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);

    // read back the partial results
    clEnqueueReadBuffer(queue, cmDevDst, CL_TRUE, 0, sizeof(cl_float) * szGlobalWorkSize / szLocalWorkSize, dst, 0, NULL, NULL);

    // reduce the partial results on the host and return the result
    float result = 0.0f;
    for (int i = 0; i < szGlobalWorkSize / szLocalWorkSize; i++)
        result += dst[i];
    return result;
}

Summing an array in Firepile

A.sum


Summing an array in Firepile

A.reduce(0)(_+_)


Reduce in Firepile

class GPUArray[A](array: Array[A]) {
  def reduce(z: A)(f: (A,A) => A)(implicit dev: Device): A = {
    val input = this.array
    val space = dev.defaultPaddedBlockPartition(input.length)
    val output = dev.spawn {
      val out = Array.ofDim[A](space.groups.size)
      for (g <- space.groups)
        ...
      out
    }
    ...
  }
}

• spawn invokes a kernel on the GPU
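The interior of the kernel (the body of the for loop) is elided above. The following is a minimal, self-contained Scala sketch of the per-block reduction it performs, following the steps of the earlier walkthrough (copy into local memory, combine in log(blockSize) rounds, first thread writes the partial sum). It is an illustration only, not Firepile's kernel code; the name blockReduce and the padding with 0.0f are assumptions made for the example.

// Sequential sketch of the block-level GPU reduction. On the GPU, each
// iteration over tid is one thread, with a barrier between rounds.
def blockReduce(input: Array[Float], blockSize: Int)(f: (Float, Float) => Float): Array[Float] = {
  val numBlocks = (input.length + blockSize - 1) / blockSize
  val partialSums = new Array[Float](numBlocks)
  for (b <- 0 until numBlocks) {
    // each "thread" tid copies x(i) into local memory, padding with 0
    val local = Array.tabulate(blockSize) { tid =>
      val i = b * blockSize + tid
      if (i < input.length) input(i) else 0.0f
    }
    // combine in log2(blockSize) rounds (blockSize assumed to be a power of two)
    var s = blockSize / 2
    while (s > 0) {
      for (tid <- 0 until s) local(tid) = f(local(tid), local(tid + s))
      s /= 2
    }
    // the first "thread" writes the block's partial sum to global memory
    partialSums(b) = local(0)
  }
  partialSums
}

For the running example, blockReduce(Array(1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f), 4)(_ + _) yields Array(10.0f, 26.0f), the two partial sums that the host then adds to get 36.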
Code trees

A function value such as x => x + a is represented as the code tree:

Function(List(LocalValue(_, x, scala.Int)),
  Apply(Select(Ident(LocalValue(_, x, _)), Method(scala.Int.$plus, scala.Int)),
    List(Select(Ident(ThisType(A$$anonfun$m$1), _), Field(a$1, scala.Int)))))

Firepile extends Scala's (still experimental) code trees in the scala.reflect package.

Code trees from function values

A function object is just a regular Java object with:
• an apply method that implements the function, and
• a field for each captured local variable

To build a code tree:
• locate the bytecode of the apply method
• recover Scala type information from method and field signatures, and infer local variable types
• parse the bytecode instructions to construct syntax trees
• if needed, recurse on apply's callees

Implemented using the Soot bytecode framework [Vallée-Rai et al. 1999]
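To make that object layout concrete, this is roughly the class the Scala compiler generates for a closure like x => x + a that captures a local variable a. The class and field names mirror the code-tree example above; the exact generated shape is an assumption for illustration, not Firepile output.

// Approximately what scalac emits for `x => x + a` inside method m of class A.
// Firepile locates the bytecode of `apply` and reads the captured field a$1.
class A$$anonfun$m$1(private val a$1: Int) extends Function1[Int, Int] {
  def apply(x: Int): Int = x + a$1
}

// new A$$anonfun$m$1(5).apply(3) == 8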

Versus staging

Why not construct the code tree at compile time and pass the tree into spawn?

• This forces the callers of spawn (reduce, and scan, and filter, and map, ...) to create trees
• These methods are passed function values from the application
• Those function values must also be passed as trees:

  val plus: scala.reflect.Code[(Int,Int) => Int] = _+_
  A.reduce(Literal(0))(plus.tree)

An implementation detail propagates out of the library to the application.

Firepile compiler

• Firepile compiles code trees into C code for the kernel
• Usually straightforward pattern matching on the code trees (a small sketch follows)
• Issues:
  • object representation and method dispatch
  • closures
  • allocation
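As a flavor of that pattern matching, here is a tiny, self-contained sketch. The Tree type below is invented for illustration; Firepile's actual input is the much richer scala.reflect code-tree representation shown earlier.

// Toy translation of code trees to C expressions by pattern matching.
sealed trait Tree
case class Lit(value: Int) extends Tree
case class Var(name: String) extends Tree
case class BinOp(op: String, lhs: Tree, rhs: Tree) extends Tree

def toC(t: Tree): String = t match {
  case Lit(v)          => v.toString
  case Var(n)          => n
  case BinOp(op, l, r) => "(" + toC(l) + " " + op + " " + toC(r) + ")"
}

// toC(BinOp("+", Var("x"), Var("a"))) == "(x + a)"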

Method dispatch

Problem: virtual dispatch is not supported on the GPU

Solution:
• include a type tag in the object representation
• assume a closed world (at run time)
• translate monomorphic calls into static calls
• translate polymorphic calls into a switch on the type tag (sketched below)
• but: this can lead to "warp divergence"
• investigating optimizations to minimize this problem
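A minimal sketch of the switch-on-tag idea, written in Scala for readability even though the generated kernel code is C; the Shape hierarchy and tag values are invented for the example.

// Closed world: only Circle and Square exist, each carrying an integer tag.
sealed trait Shape { def tag: Int }
case class Circle(r: Float) extends Shape { val tag = 0 }
case class Square(s: Float) extends Shape { val tag = 1 }

// A polymorphic call like shape.area becomes a switch on the tag
// instead of a virtual dispatch.
def area(shape: Shape): Float = shape.tag match {
  case 0 =>
    val c = shape.asInstanceOf[Circle]
    3.14159f * c.r * c.r
  case 1 =>
    val q = shape.asInstanceOf[Square]
    q.s * q.s
}

When neighboring threads hit different tags they take different branches of the switch, which is the warp divergence mentioned above.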

Closures

Since closures are just objects, see the previous slide. But closure call sites are highly polymorphic. Instead, Firepile inlines the function's code (if statically known) and generates specialized versions of kernels.

Object allocation

Some common cases are handled specially:
• final case classes
• arrays with fixed sizes

A simple escape analysis determines in which kind of GPU memory objects should be allocated (private vs. local vs. global memory).

Otherwise, punt.

Invoking the kernel

spawn ...
• compiles the kernel to C, then uses JavaCL* to generate native code
• marshals kernel inputs into byte buffers
• copies data to GPU global memory
• invokes the kernel
• copies results back from global memory
• unmarshals the results

* code.google.com/p/javacl/

Buffer-backed arrays

• Avoid unnecessary copying on the host by using buffer-backed arrays
• Can copy directly to the GPU without marshaling into a byte buffer
• Same interface as Array, but more memory-efficient (a sketch of the idea follows the figure)

[Figure: the layout of a BBArray[(Int,Float)] holding the pairs (1, 1.0) through (5, 5.0) in a single backing buffer, compared with the layout of an Array[(Int,Float)] holding the same pairs.]
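A minimal, self-contained sketch of the buffer-backed idea for (Int, Float) elements. This is not Firepile's BBArray, whose interface is generic; it only illustrates storing elements flat in a direct ByteBuffer whose bytes can be handed to the GPU without further marshaling.

import java.nio.{ByteBuffer, ByteOrder}

// (Int, Float) pairs stored flat: 4 bytes of Int plus 4 bytes of Float each.
class IntFloatBBArray(val length: Int) {
  private val buf =
    ByteBuffer.allocateDirect(length * 8).order(ByteOrder.nativeOrder())

  def apply(i: Int): (Int, Float) =
    (buf.getInt(i * 8), buf.getFloat(i * 8 + 4))

  def update(i: Int, v: (Int, Float)): Unit = {
    buf.putInt(i * 8, v._1)
    buf.putFloat(i * 8 + 4, v._2)
  }

  // The backing buffer is what would be copied to GPU global memory.
  def buffer: ByteBuffer = buf
}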

Benchmarks

To evaluate performance, we implemented five GPU kernels, ported from examples in the NVIDIA OpenCL SDK:
• Reduce
• Matrix-vector multiply
• Matrix transpose
• Discrete cosine transform
• Black-Scholes

Experiments

Machine configuration:
• CPU: 3.0 GHz Core 2 Quad, 8 GB RAM, Windows 7
• GPU: NVIDIA 9800GT, 512 MB RAM, 112 SPs
• Software: Scala 2.9.0, JavaCL 1.0, HotSpot JVM 1.6.0

Compared three versions of each benchmark:
• Firepile: Firepile code for both host and kernel
• NVIDIA: C++ code written by NVIDIA for both host and kernel
• Hybrid: NVIDIA host code, Firepile kernel code

Report the average of 16 runs on a warmed-up JVM.

'!!" '#!" &#!"

%!!" '!!" $#!" &#!" $!!" &!!" #!" %#!" !" %!!"

*+,-

Results &!!" %#!" %!!" $#!" $!!"

Reduce

#!"

45"

!"

40"

$#!"

%!!!!!!"

'!!!!!!"

Firepile

0,123

NVIDIA

*+,-

*+,*+,-

Hybrid

*+,-

0,12

(!!!!!!" $)!!!!!!"

0,123

$!!" #!" !"

%!!!!!!" %!!!!!!"'!!!!!!" '!!!!!!" (!!!!!!" (!!!!!!" $)!!!!!!" $)!!!!!!"

35" 30" 25" 20"

Time (ms)

15" 10"

good

5" 0" 2^20"

2^21"

2^22"

2^23"

Problem size

Nate Nystrom –– Firepile –– GPCE 2011

25

'!!" '#!" &#!"

%!!" '!!" $#!" &#!" $!!" &!!" #!" %#!" !" %!!"

*+,-

Results &!!" %#!" %!!" $#!"

DCT 8x8

$!!"

30" 25"

#!"

20"

!"

'!!!!!!"

Black-Scholes

$#!" $!!"

450"

#!"

350"

!"

0,123

NVIDIA

*+,-

*+,*+,-

Hybrid

*+,-

0,12

(!!!!!!" $)!!!!!!"

0,123

400"

300" 250"

%!!!!!!" '!!!!!!" 200"%!!!!!!" '!!!!!!" (!!!!!!" (!!!!!!" $)!!!!!!" $)!!!!!!"

15"

150"

5" 0" 2^16"

2^18"

2^20"

2^22"

Time (ms)

10"

Time (ms)

%!!!!!!"

Firepile

100" 50" 0" 2M"

4M"

8M"

Problem size

Problem size

Matrix-vector multiply

Matrix transpose

140"

60"

120"

50"

100"

16M"

40"

80" 30" 60" 20"

20" 0" 12.1M"

13.2M"

14.3M"

Nate Nystrom –– Firepile –– GPCE 2011 Problem size

15.4M"

Time (ms)

Time (ms)

40"

10" 0" 2^16"

Problem size

2^18"

2^20"

2^22"

26

'!!" '#!" &#!"

%!!" '!!" $#!" &#!" $!!" &!!" #!" %#!" !" %!!"

*+,-

Results &!!" %#!" %!!" $#!"

DCT 8x8

$!!"

30" 25"

#!"

20"

!"

'!!!!!!"

Black-Scholes

$#!" $!!"

450"

#!"

350"

!"

0,123

NVIDIA

*+,-

*+,*+,-

Hybrid

*+,-

0,12

(!!!!!!" $)!!!!!!"

0,123

400"

300" 250"

%!!!!!!" '!!!!!!" 200"%!!!!!!" '!!!!!!" (!!!!!!" (!!!!!!" $)!!!!!!" $)!!!!!!"

15"

150"

5" 0" 2^16"

2^18"

2^20"

2^22"

Time (ms)

10"

Time (ms)

%!!!!!!"

Firepile

100" 50" 0" 2M"

4M"

8M"

Problem size

Problem size

Matrix-vector multiply

Matrix transpose

140"

16M"

60" 50" Firepile kernel is slower

120" 100"

40"

80" 30" 60" 20"

20" 0" 12.1M"

13.2M"

14.3M"

Nate Nystrom –– Firepile –– GPCE 2011 Problem size

15.4M"

Time (ms)

Time (ms)

40"

10" 0" 2^16"

Problem size

2^18"

2^20"

2^22"

26

'!!" '#!" &#!"

%!!" '!!" $#!" &#!" $!!" &!!" #!" %#!" !" %!!"

*+,-

Results &!!" %#!"

Firepile data movement %!!" consistently slower $#!"

DCT 8x8

$!!"

30" 25"

#!"

20"

!"

'!!!!!!"

Black-Scholes

$#!" $!!"

450"

#!"

350"

!"

0,123

NVIDIA

*+,-

*+,*+,-

Hybrid

*+,-

0,12

(!!!!!!" $)!!!!!!"

0,123

400"

300" 250"

%!!!!!!" '!!!!!!" 200"%!!!!!!" '!!!!!!" (!!!!!!" (!!!!!!" $)!!!!!!" $)!!!!!!"

15"

150"

5" 0" 2^16"

2^18"

2^20"

2^22"

Time (ms)

10"

Time (ms)

%!!!!!!"

Firepile

100" 50" 0" 2M"

4M"

8M"

Problem size

Problem size

Matrix-vector multiply

Matrix transpose

140"

16M"

60" 50" Firepile kernel is slower

120" 100"

40"

80" 30" 60" 20"

20" 0" 12.1M"

13.2M"

14.3M"

Nate Nystrom –– Firepile –– GPCE 2011 Problem size

15.4M"

Time (ms)

Time (ms)

40"

10" 0" 2^16"

Problem size

2^18"

2^20"

2^22"

26

'!!" '#!" &#!"

%!!" '!!" $#!" &#!" $!!" &!!" #!" %#!" !" %!!"

*+,-

Results &!!" %#!"

Firepile data movement %!!" consistently slower $#!"

DCT 8x8

$!!"

30" 25"

#!"

20"

!"

'!!!!!!"

Black-Scholes

$#!" $!!"

450"

#!"

350"

!"

0,123

NVIDIA

*+,-

*+,*+,-

Hybrid

*+,-

0,12

(!!!!!!" $)!!!!!!"

0,123

400"

300" 250"

%!!!!!!" '!!!!!!" 200"%!!!!!!" '!!!!!!" (!!!!!!" (!!!!!!" $)!!!!!!" $)!!!!!!"

15"

150"

5" 0" 2^16"

2^18"

2^20"

2^22"

Time (ms)

10"

Time (ms)

%!!!!!!"

Firepile

100" 50" 0" 2M"

4M"

8M"

Problem size

Problem size

Matrix-vector multiply

Matrix transpose

140"

16M"

60" 50" Firepile kernel is slower

120" 100"

40"

80" 30"

Firepile data movement consistently faster

Time (ms)

40" 20" 0" 12.1M"

13.2M"

14.3M"

Nate Nystrom –– Firepile –– GPCE 2011 Problem size

15.4M"

20"

Time (ms)

60"

10" 0" 2^16"

Problem size

2^18"

2^20"

2^22"

26

Future work

• Implementing a static analysis to compute the memory usage of kernels so memory can be pre-allocated
• Mixing static and dynamic code trees
• Exploring the use of code trees in other arenas

Related work

• Thin wrappers around OpenCL or CUDA:
  • JavaCL [Chafik 10, code.google.com/p/javacl/]
  • Pycuda [Klöckner 09]
• Embedded DSLs:
  • Accelerator for F# [Tarditi et al. 06]
  • GPU.gen for Haskell [Lee et al. 09]
  • ScalaCL [Chafik 10]
  • Delite [Brown et al. 11]
• Compilers:
  • x10cuda [Cunningham et al. 10]
  • nvcc [NVIDIA 06]
  • Clyther [GeoSpin 10, clyther.sourceforge.net/]

Firepile

Data-parallel programming in Scala for the GPU.

Kernel bytecode is transformed dynamically into code trees and then into native code to run on the GPU.

Performance is comparable with C.

Thanks!
[email protected]
http://github.com/firepile
http://inf.usi.ch/nystrom