Firepile: GPU programming in Scala

Nate Nystrom (University of Lugano, Switzerland, [email protected])
Derek White, Kishen Das (University of Texas at Arlington)

GPCE, 23 Oct 2011, Portland, Oregon
General-purpose GPUs
GPUs are being used for increasingly general-purpose computation:
• scientific applications
• finance
• computer vision
• gaming
Often roughly 10x the performance of a multicore CPU at the same cost.
Nate Nystrom –– Firepile –– GPCE 2011
But, GPU programming is painful
GPU programming languages are impoverished:
• no dynamic memory allocation
• no recursion
• no virtual dispatch*
Complicated memory architecture.
Lots of tedious boilerplate to manage devices and communication between the host and the GPU.
* the latest version of CUDA now supports virtual dispatch
Firepile
A library for GPU programming in Scala.
Goals:
• ease of programming
• performance comparable with C
• write all code in Scala
• the library manages GPU access and generates the code that runs on the GPU
GPU programming (in OpenCL)
Programs are partitioned between host code and kernels.

Host code:
• manages devices
• preallocates GPU memory
• copies data between the host and the GPU
• compiles kernels from strings of C code
• invokes kernels

Kernels:
• data-parallel procedures that run on the GPU
• little or no communication between threads
• written in a restricted subset of C
• no dynamic memory allocation, recursion, or virtual dispatch
• explicit management of the memory hierarchy
Summing an array on a GPU  [ ∑i∈1..n x(i) ]

[Figure: an 8-element array x = 1 2 3 4 5 6 7 8 is summed in two blocks of four, step by step:]
1. The kernel code is copied to the GPU.
2. The input array is copied to the GPU's global memory.
3. The kernel is invoked.
4. Thread i copies x(i) into local memory, shared between all threads in its block.
5. Each block computes a partial sum in log(blockSize) rounds (block 1: 1+2+3+4 = 10; block 2: 5+6+7+8 = 26).
6. The first thread in each block copies the partial sum back to global memory.
7. The partial sums are copied back to the host.
8. The final sum is computed on the host: 10 + 26 = 36.
Summing an array in OpenCL

    __kernel void sum(__global float* in,
                      __global float* out,
                      __local  float* tmp) {
        int i   = get_global_id(0);
        int tid = get_local_id(0);
        tmp[tid] = in[i];
        barrier(CLK_LOCAL_MEM_FENCE);
        int s = get_local_size(0) / 2;
        while (s > 0) {
            if (tid < s)
                tmp[tid] += tmp[tid + s];   // the addition
            barrier(CLK_LOCAL_MEM_FENCE);
            s /= 2;
        }
        if (tid == 0) {
            out[get_group_id(0)] = tmp[0];
        }
    }
OpenCL host code

    float sum(float *x, int vectorLength) {
      // set up the device
      clGetPlatformID(&platform);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
      context = clCreateContext(0, 1, &dev, NULL, NULL, &err);
      queue = clCreateCommandQueue(context, dev, 0, &err);

      // compile the kernel
      program = clCreateProgramWithSource(context, 1, (const char **)&source,
                                          &kernelLength, &err);
      clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
      kernel = clCreateKernel(program, "sum", &err);

      // declare the output array
      szLocalWorkSize = 256;
      szGlobalWorkSize = vectorLength;
      float dst[szGlobalWorkSize / szLocalWorkSize];

      // allocate GPU memory
      cmDevSrc = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                sizeof(cl_float) * szGlobalWorkSize, NULL, &err);
      cmDevDst = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_float) * szGlobalWorkSize, NULL, &err);
      cmDevTmp = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(cl_float) * szGlobalWorkSize, NULL, &err);

      // copy input data to the GPU
      clEnqueueWriteBuffer(queue, cmDevSrc, CL_FALSE, 0,
                           sizeof(cl_float) * szGlobalWorkSize, x, 0, NULL, NULL);
      clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&cmDevSrc);
      clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&cmDevDst);
      clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&cmDevTmp);

      // invoke the kernel
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &szGlobalWorkSize,
                             &szLocalWorkSize, 0, NULL, NULL);

      // read back the partial results
      clEnqueueReadBuffer(queue, cmDevDst, CL_TRUE, 0,
                          sizeof(cl_float) * szGlobalWorkSize / szLocalWorkSize,
                          dst, 0, NULL, NULL);

      // reduce the partial results and return the result
      float result = 0.0f;
      for (i = 0; i < szGlobalWorkSize / szLocalWorkSize; i++)
        result += dst[i];
      return result;
    }
Summing an array in Firepile

    A.sum

or, with an explicit reduction:

    A.reduce(0)(_+_)
Reduce in Firepile

    class GPUArray[A](array: Array[A]) {
      def reduce(z: A)(f: (A,A) => A)(implicit dev: Device): A = {
        val input = this.array
        val space = dev.defaultPaddedBlockPartition(input.length)
        val output = dev.spawn {
          val out = Array.ofDim[A](space.groups.size)
          for (g <- ...) { ... }
          ...
        }
        ...
      }
    }

• spawn invokes a kernel on the GPU

Code trees
A function value such as x => x + a is represented as the code tree:

    Function(List(LocalValue(_, x, scala.Int)),
      Apply(Select(Ident(LocalValue(_, x, _)),
                   Method(scala.Int.$plus, scala.Int)),
            List(Select(Ident(ThisType(A$$anonfun$m$1), _),
                        Field(a$1, scala.Int)))))

Firepile extends Scala's (still experimental) code trees in the scala.reflect package.
Code trees from function values
A function object is just a regular Java object with:
• an apply method that implements the function, and
• a field for each captured local variable
To build a code tree:
• locate the bytecode of the apply method
• recover Scala type information from method and field signatures, and infer local variable types
• parse the bytecode instructions to construct syntax trees
• if needed, recurse on apply's callees
Implemented using the Soot bytecode framework [Vallée-Rai et al. 1999].
Versus staging
Why not construct the code tree at compile time and pass the tree into spawn?
This forces callers of spawn to create trees: for reduce, and scan, and filter, and map, ....
These methods are passed function values from the application, and those function values must also be passed as trees:

    val plus: scala.reflect.Code[(Int,Int) => Int] = _+_
    A.reduce(Literal(0))(plus.tree)

An implementation detail propagates out of the library into the application.
Firepile compiler
• Firepile compiles code trees into C code for the kernel
• usually straightforward pattern matching on the code trees
• issues:
  • object representation and method dispatch
  • closures
  • allocation
Method dispatch
Problem: virtual dispatch is not supported on the GPU.
Solution:
• include a type tag in the object representation
• assume a closed world (at run time)
• translate monomorphic calls into static calls
• translate polymorphic calls into a switch on the type tag
But this can lead to "warp divergence"; we are investigating optimizations to minimize the problem.
Closures
Since closures are just objects, see the previous slide. But closure calls are highly polymorphic. Instead, Firepile inlines the function's code (if it is statically known) and generates specialized versions of kernels.
Object allocation
Some common cases are handled specially:
• final case classes
• arrays with fixed sizes
A simple escape analysis determines in which kind of GPU memory objects should be allocated (private vs. local vs. global memory). Otherwise, punt.
Invoking the kernel
spawn ...
• compiles the kernel to C, then uses JavaCL* to generate native code
• marshals kernel inputs into byte buffers
• copies data to GPU global memory
• invokes the kernel
• copies results back from global memory
• unmarshals results
* code.google.com/p/javacl/
Buffer-backed arrays
• avoid unnecessary copying on the host by using buffer-backed arrays
• can copy directly to the GPU without marshaling into a byte buffer
• same interface as Array, but more memory-efficient

[Figure: the pairs (1, 1.0) through (5, 5.0) laid out both as a BBArray[(Int,Float)] and as an ordinary Array[(Int,Float)].]
Benchmarks
To evaluate performance, we implemented five GPU kernels, ports of examples from the NVIDIA OpenCL SDK:
• Reduce
• Matrix-vector multiply
• Matrix transpose
• Discrete cosine transform
• Black-Scholes
Experiments
Machine configuration:
• CPU: 3.0 GHz Core 2 Quad, 8 GB RAM, Windows 7
• GPU: NVIDIA 9800 GT, 512 MB RAM, 112 SPs
• Software: Scala 2.9.0, JavaCL 1.0, HotSpot JVM 1.6.0
Compared three versions of each benchmark:
• Firepile code for both host and kernel
• C++ code written by NVIDIA for both host and kernel
• Hybrid: NVIDIA host code, Firepile kernel code
We report the average of 16 runs on a warmed-up JVM.
Results

[Figure: Reduce benchmark. Time (ms) vs. problem size (2^20 to 2^23) for Firepile, NVIDIA, and Hybrid; lower is better.]

[Figure: four panels — DCT 8x8, Black-Scholes, Matrix-vector multiply, and Matrix transpose — Time (ms) vs. problem size for Firepile, NVIDIA, and Hybrid. Callouts across the builds: "Firepile kernel is slower", "Firepile data movement consistently slower", "Firepile data movement consistently faster".]
Future work
• implementing a static analysis to compute the memory usage of kernels, so memory can be pre-allocated
• mixing static and dynamic code trees
• exploring the use of code trees in other arenas
Related work
Thin wrappers around OpenCL or CUDA:
• JavaCL [Chafik 10, code.google.com/p/javacl/]
• PyCUDA [Klöckner 09]
Embedded DSLs:
• Accelerator for F# [Tarditi et al. 06]
• GPU.gen for Haskell [Lee et al. 09]
• ScalaCL [Chafik 10]
• Delite [Brown et al. 11]
Compilers:
• x10cuda [Cunningham et al. 10]
• nvcc [NVIDIA 06]
• Clyther [GeoSpin 10, clyther.sourceforge.net/]
Firepile
Data-parallel programming in Scala for the GPU. Kernel bytecode is transformed dynamically into code trees, and then into native code that runs on the GPU. Performance is comparable with C.

Thanks!
[email protected]
http://github.com/firepile
http://inf.usi.ch/nystrom