www.bsc.es

Introduction to CUDA Programming: Data Transfer, CUDA Streams, and MPI

CUDA STREAMS: OVERLAPPING COMPUTATION AND DATA TRANSFERS

Allocate/Free Pinned Memory (a.k.a. Page-Locked Memory)

cudaHostAlloc()
– Three parameters
– Address of pointer to the allocated memory
– Size of the allocated memory in bytes
– Option: use cudaHostAllocDefault for now

cudaFreeHost()
– One parameter
– Pointer to the memory to be freed
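The two calls above can be sketched as follows. This is a minimal illustration rather than code from the slides: the helper name `pinned_roundtrip` and the buffer size are assumptions chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocate n floats of pinned host memory, touch them,
// and free them again. Returns true if every CUDA call succeeded.
bool pinned_roundtrip(size_t n) {
    float *h_A = nullptr;
    const size_t bytes = n * sizeof(float);

    // Three parameters: address of the pointer, size in bytes, option flag.
    cudaError_t err = cudaHostAlloc((void **)&h_A, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // The pointer is used like one returned by malloc().
    for (size_t i = 0; i < n; ++i) h_A[i] = 1.0f;

    // One parameter: the pointer to the memory to be freed.
    err = cudaFreeHost(h_A);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFreeHost failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const size_t n = 1 << 20;  // illustrative size (assumption)
    printf("%s\n", pinned_roundtrip(n) ? "pinned alloc/free succeeded"
                                       : "pinned alloc/free failed");
    return 0;
}
```

Pinned memory must be released with cudaFreeHost(), not free(), because it was not obtained from the C allocator.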


Using Pinned Memory
Use the allocated memory and its pointer the same way as those returned by malloc()
The only difference is that the allocated memory cannot be paged out by the OS
The cudaMemcpy function should be about 2X faster with pinned memory (the actual gain depends on the system and the transfer size)
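One way to check the speed-up on a given machine is to time the same host-to-device copy from a pageable buffer and from a pinned buffer. The sketch below is an illustration under assumptions: the helper `time_h2d_ms`, the 64 MB transfer size, and the single warm-up copy are choices made for this example, and the measured ratio will vary from system to system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: time one host-to-device cudaMemcpy, in milliseconds,
// using CUDA events recorded in the default stream.
float time_h2d_ms(const float *h_src, float *d_dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MB, illustrative (assumption)
    float *h_pageable = (float *)malloc(bytes);
    float *h_pinned = nullptr;
    float *d_buf = nullptr;
    if (h_pageable == nullptr ||
        cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault) != cudaSuccess ||
        cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(h_pageable, 0, bytes);
    memset(h_pinned, 0, bytes);

    time_h2d_ms(h_pinned, d_buf, bytes);  // warm-up copy, result discarded

    printf("pageable: %.3f ms\n", time_h2d_ms(h_pageable, d_buf, bytes));
    printf("pinned:   %.3f ms\n", time_h2d_ms(h_pinned, d_buf, bytes));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```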


Serialized Data Transfer and GPU Computation

So far, the way we use cudaMemcpy serializes data transfer and GPU computation

[Timeline, left to right: Trans. A, Trans. B, Vector Add, Transfer C]
– During Trans. A and Trans. B: only one PCIe direction is used, GPU idle
– During Vector Add: PCIe idle
– During Transfer C: only one PCIe direction is used, GPU idle


Device Overlap
Some CUDA devices support device overlap
– Simultaneously execute a kernel while performing a copy between device and host memory

    int Device;
    cudaDeviceProp prop;
    cudaGetDevice(&Device);
    cudaGetDeviceProperties(&prop, Device);
    if (prop.deviceOverlap) …
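The query above can be turned into a small standalone check. The sketch below makes one substitution, stated plainly: it reads the `asyncEngineCount` field of `cudaDeviceProp` instead of `deviceOverlap`, since the CUDA runtime documentation marks `deviceOverlap` as deprecated in favor of `asyncEngineCount`. The helper name `copy_engine_count` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of asynchronous copy engines on the given
// device, or -1 if the properties could not be queried.
int copy_engine_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

int main() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        fprintf(stderr, "no CUDA device available\n");
        return 1;
    }
    int engines = copy_engine_count(device);
    if (engines > 0) {
        // At least one engine: a copy can overlap with kernel execution.
        printf("device %d: %d async copy engine(s), overlap supported\n",
               device, engines);
    } else {
        printf("device %d: copy/compute overlap not reported\n", device);
    }
    return 0;
}
```

A value of 2 indicates that copies in both directions can proceed at the same time as a kernel, which is what the pipelined schedule in these slides relies on.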


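Overlapped (Pipelined) Timing
Divide large vectors into segments
Overlap transfer and compute of adjacent segments

[Timing diagram: one row per segment, each row shifted later in time than the one above so that the transfers of one segment overlap the computation of the previous one]
Segment 1: Trans A.1, Trans B.1, Comp C.1 = A.1 + B.1, Trans C.1
Segment 2: Trans A.2, Trans B.2, Comp C.2 = A.2 + B.2, Trans C.2
Segment 3: Trans A.3, Trans B.3, Comp C.3 = A.3 + B.3
Segment 4: Trans A.4, Trans B.4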

Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and MemCpy with “Streams”
Each stream is a queue of operations (kernels and MemCpys)
Operations in different streams can go in parallel
– “Task parallelism”
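As a rough illustration of the stream model described above, the sketch below queues a copy in, a kernel, and a copy out in each of two streams. It is not the code from these slides: the `scale` kernel, the helper `scale_in_two_streams`, and the sizes are assumptions made for the example. The host buffer is pinned, which cudaMemcpyAsync needs in order to run asynchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel (assumption): multiply each element by a.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical helper: doubles all 2*half elements of the pinned buffer h_x,
// processing each half in its own stream. Returns true on success.
bool scale_in_two_streams(float *h_x, int half) {
    cudaStream_t s[2];
    float *d_x[2];
    const size_t bytes = half * sizeof(float);
    for (int k = 0; k < 2; ++k) {
        if (cudaStreamCreate(&s[k]) != cudaSuccess) return false;
        if (cudaMalloc((void **)&d_x[k], bytes) != cudaSuccess) return false;
    }
    // Each stream is a queue: operations inside one stream run in order,
    // while operations in different streams may overlap.
    for (int k = 0; k < 2; ++k) {
        float *h = h_x + k * half;
        cudaMemcpyAsync(d_x[k], h, bytes, cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(d_x[k], half, 2.0f);
        cudaMemcpyAsync(h, d_x[k], bytes, cudaMemcpyDeviceToHost, s[k]);
    }
    bool ok = true;
    for (int k = 0; k < 2; ++k) {
        // Wait until everything queued in this stream has finished.
        ok = ok && (cudaStreamSynchronize(s[k]) == cudaSuccess);
        cudaFree(d_x[k]);
        cudaStreamDestroy(s[k]);
    }
    return ok;
}

int main() {
    const int half = 1 << 20;  // elements per stream, illustrative
    float *h_x = nullptr;
    if (cudaHostAlloc((void **)&h_x, 2 * half * sizeof(float),
                      cudaHostAllocDefault) != cudaSuccess) return 1;
    for (int i = 0; i < 2 * half; ++i) h_x[i] = 1.0f;
    bool ok = scale_in_two_streams(h_x, half) &&
              h_x[0] == 2.0f && h_x[2 * half - 1] == 2.0f;
    printf("%s\n", ok ? "all elements doubled" : "failed");
    cudaFreeHost(h_x);
    return ok ? 0 : 1;
}
```

Whether the two streams actually overlap depends on the device's copy engines and on the order in which the operations are issued; the code is correct either way, only the timing changes.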


Conceptual View of Streams

[Figure: operations (kernels, MemCpys) are queued in streams and dispatched to the Copy Engine (PCIe up, PCIe down) and the Kernel Engine]
Stream 0: MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1
Stream 1: MemCpy A.2, MemCpy B.2, Kernel 2, MemCpy C.2


A Simple Multi-Stream Host Code

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);
    float *d_A0, *d_B0, *d_C0; // device memory for stream 0
    float *d_A1, *d_B1, *d_C1; // device memory for stream 1
    // cudaMalloc for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here
    for (int i=0; i