GPGPU Programming with CUDA
Leandro Avila University of Northern Iowa
Mentor: Dr. Paul Gray Computer Science Department
University of Northern Iowa
Outline
Introduction
Architecture Description
Introduction to CUDA API
Introduction
The traditional paradigm of sequential programming is shifting towards parallel processing. Scientific computing needs to change in order to deal with vast amounts of data, and hardware changes have contributed to the move towards parallel processing.
Three Walls of Serial Performance
Memory Wall
Discrepancy between memory and CPU performance
Instruction Level Parallelism (ILP) Wall
Effort put into ILP increases with not enough returns
Power Wall
Clock frequency vs. heat dissipation efforts
Manferdelli, J. (2007) The Many-Core Inflection Point for Mass Market Computer Systems
Accelerators
In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload. In the old days (1980s), supercomputers had array processors, for vector operations on arrays, and floating point accelerators. More recently, Field Programmable Gate Arrays (FPGAs) have allowed reprogramming deep into the hardware.
Courtesy of Henry Neeman http://www.oscer.ou.edu/
Accelerators
Advantages
They make your code run faster
Disadvantages
More expensive
Harder to program
Code is not portable from one accelerator to another. (OpenCL attempts to change this)
Introducing GPGPU
General Purpose Computing on Graphics Processing Units. A great example of the trend of moving away from the traditional model.
Why GPUs?
Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks such as image rendering. They became very popular with video gamers because they have produced better and better images, lightning fast, and prices have been extremely good, ranging from three figures at the low end to four figures at the high end. GPUs mostly do work such as rendering images.
This is done through mostly floating point arithmetic – the same stuff people use supercomputing for!
GPU vs. CPU Flop Rate
From Nvidia CUDA Programming Guide
Architecture
Architecture Comparison

                               Nvidia Tesla C1060        Intel i7 975 Extreme
Processing Cores               240                       4
Clock Speed                    1.3 GHz                   3.33 GHz
Memory                         4 GB                      L1 Cache: 32KB/core
                                                         L2 Cache: 256KB/core
                                                         L3 Cache: 8MB (shared)
Memory Bandwidth               102 GB/sec                25 GB/sec
Floating Point Ops/Sec         933 GFLOPS Single         70 GFLOPS Double
                               78 GFLOPS Double
CPU vs. GPU
From Nvidia CUDA Programming Guide
Components
Texture Processors Clusters
Streaming Multiprocessors
Streaming Processor
From http://www.tomshardware.com/reviews/nvidiacudagpu,19547.html
Streaming Multiprocessors
Blocks of threads are assigned to SMs. An SM contains 8 Scalar Processors.
Tesla C1060:
Number of SMs = 30
Number of Cores = 240
The more SMs you have, the better.
Hardware Hierarchy
Stream Processor Array
Contains 10 Texture Processor Clusters
Texture Processor Clusters
Each contains 3 Streaming Multiprocessors
Streaming Multiprocessors
Each contains 8 Scalar Processors
Scalar Processors
They do the work :)
Connecting some dots...
Great! We see the GPU architecture is different from what we see in the traditional CPU.
So... Now what?
What does all this mean?
How do we use it?
Glossary
The HOST – the machine executing the main program
The DEVICE – the card with the GPU
The KERNEL – the routine that runs on the GPU
A THREAD – the basic execution unit on the GPU
A BLOCK – a group of threads
A GRID – a group of blocks
A WARP – a group of 32 threads
CUDA Kernel Execution
Recall that threads are organized in BLOCKS, and BLOCKS in turn are organized in a GRID. The GRID can have 2 dimensions: X and Y.
The BLOCKS can have 3 dimensions: X, Y, and Z.
Maximum size of each dimension of a grid: 65535 x 65535 x 1
Maximum size of each dimension of a block: 512 x 512 x 64 (at most 512 threads per block in total)
Prior to kernel execution, we need to set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS.
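As a minimal sketch, the setup uses the dim3 type to specify both sets of dimensions before the launch (the kernel name myKernel and its argument are placeholders, not from the original slides):

```cuda
// Hypothetical kernel: the configuration below launches a 2x2 grid
// of blocks, each block holding 8 x 8 x 4 = 256 threads (<= 512).
__global__ void myKernel(float *d_data);

void launch(float *d_data)
{
    dim3 dimGrid(2, 2);      // grid dimensions: X and Y only
    dim3 dimBlock(8, 8, 4);  // block dimensions: X, Y, and Z
    myKernel<<<dimGrid, dimBlock>>>(d_data);
}
```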
Scheduling in Hardware
[Figure: the host launches Kernel 1 on the device as Grid 1, a 3x2 arrangement of blocks, Block (0,0) through Block (2,1); Kernel 2 launches as Grid 2. Each block, e.g. Block (1,1), contains a 5x3 arrangement of threads, Thread (0,0) through Thread (4,2).]
Kirk & Hwu – University of Illinois Urbana-Champaign
A grid is launched, and its blocks are distributed to the available SMs. Each SM initiates processing of warps and schedules the warps that are ready. As warps finish and resources are freed, new warps are scheduled. An SM can take up to 1024 threads.
Ex: 4 blocks of 256 threads OR 8 blocks of 128 threads
Memory Layout
Registers and shared memory are the fastest. Local memory is virtual memory carved out of global memory. Global memory is the slowest.
From Nvidia CUDA Programming Guide
Thread Memory Access
Threads access memory as follows
Registers – Read & Write
Local Memory – Read & Write
Shared Memory – Read & Write (block level)
Global Memory – Read & Write (grid level)
Constant Memory – Read (grid level)
Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory.
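A minimal sketch of these access rules in a kernel (the names scale, tile, and scaleAndShare are illustrative, not from the original slides):

```cuda
__constant__ float scale;           // constant memory: readable by every thread in the grid

__global__ void scaleAndShare(float *in, float *out)  // in/out point into global memory
{
    __shared__ float tile[256];     // shared memory: read & write within one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i] * scale;        // v lives in a register: per-thread read & write
    tile[threadIdx.x] = v;          // write to the block's shared memory
    __syncthreads();                // make the block's writes visible to all its threads
    out[i] = tile[threadIdx.x];     // write the result back to global memory
}
```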
CUDA API
Programming Pattern
Host reads input and allocates memory on the device
Host copies data to the device
Host invokes a kernel that is executed in parallel on the device, using the data and hardware there, to do some useful work
Host copies the results back from the device for post-processing
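The steps above can be sketched end to end for a toy task, squaring an array (kernel and variable names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main(void)
{
    const int n = 1024;
    size_t size = n * sizeof(float);
    float h[1024], *d;
    for (int i = 0; i < n; i++) h[i] = (float)i;     // host reads/creates input

    cudaMalloc((void**)&d, size);                    // allocate memory on the device
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);  // copy data to the device
    square<<<n / 256, 256>>>(d, n);                  // kernel does the useful work
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(d);

    printf("%f\n", h[3]);                            // 9.0: post-processing on the host
    return 0;
}
```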
Kernel Setup
__global__ void myKernel(float *d_b, float *d_a);  // declaration
dim3 dimGrid(2,2,1);
dim3 dimBlock(4,8,8);
myKernel<<< dimGrid, dimBlock >>>( d_b, d_a );
Device Memory Allocation
cudaMalloc((void**)&myDataAddress, sizeOfData)
Takes the address of a pointer that will receive the allocated data, and the size of that data.
cudaFree(myDataPointer)
Used to free the allocated memory on the device.
Also check cudaMallocHost() and cudaFreeHost() in the CUDA Reference Manual.
Device Data Transfer
cudaMemcpy()
Requires: pointer to destination, pointer to source, size, type of transfer
Examples:
cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);
Function Declaration                    Executes On    Callable From
__device__ float myDeviceFunc()         Device         Device
__host__ float myHostFunc()             Host           Host
__global__ void myKernel()              Device         Host

__global__ is used to declare a kernel. It must return void.
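A short sketch combining the three qualifiers (the function names are illustrative):

```cuda
__device__ float squareOnDevice(float x)    // runs on the device, called from device code
{
    return x * x;
}

__host__ float squareOnHost(float x)        // ordinary CPU function
{
    return x * x;
}

__global__ void squareAll(float *d, int n)  // kernel: runs on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = squareOnDevice(d[i]);
}
```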
Useful Variables
gridDim.(x|y) – grid dimensions on x and y
blockDim.(x|y|z) – number of threads in a block, per dimension
blockIdx.(x|y) – block index within the grid
threadIdx.(x|y|z) – thread index within a block
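These variables combine to give every thread a unique global index; a common sketch for 1-D data (the kernel name addOne is illustrative):

```cuda
__global__ void addOne(float *data, int n)
{
    // Blocks before this one contribute blockIdx.x * blockDim.x threads;
    // threadIdx.x locates this thread inside its own block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the last block may extend past n
        data[i] += 1.0f;
}
```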
Variable Type Qualifiers
Variable type qualifiers specify the memory location of a variable in the device's memory:
__device__
Declares a variable in device global memory
__constant__
Declares a constant in device constant memory
__shared__
Declares a variable in the shared memory of a thread block
Note: All dynamically allocated (extern __shared__) variables start at the same address. You must use offsets if multiple such variables are declared in shared memory.
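A sketch of the offset technique for dynamically allocated shared memory (the kernel name and the sizes passed at launch are assumptions for illustration):

```cuda
// Launched as: splitSmem<<<grid, block, nFloats*sizeof(float) + nInts*sizeof(int)>>>(nFloats, nInts);
__global__ void splitSmem(int nFloats, int nInts)
{
    extern __shared__ char smem[];          // the single dynamic shared allocation

    float *fbuf = (float *)smem;            // first region: starts at offset 0
    int   *ibuf = (int *)&fbuf[nFloats];    // second region: starts after the floats

    if (threadIdx.x < nFloats) fbuf[threadIdx.x] = 0.0f;
    if (threadIdx.x < nInts)   ibuf[threadIdx.x] = 0;
}
```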