cudaHostAlloc() – Three parameters
– Address of the pointer to the allocated memory
– Size of the allocated memory in bytes
– Option flag – use cudaHostAllocDefault for now
cudaFreeHost() – One parameter
– Pointer to the memory to be freed
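As a minimal sketch (the array name and size are illustrative; error checking omitted), pinned allocation and release look like:

```cuda
#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20;
    float *h_A;  // host pointer to pinned memory

    // cudaHostAlloc(address of pointer, size in bytes, option flag)
    cudaHostAlloc((void **)&h_A, N * sizeof(float), cudaHostAllocDefault);

    // ... use h_A like a malloc()'d buffer ...

    cudaFreeHost(h_A);  // one parameter: pointer to the memory to be freed
    return 0;
}
```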
Introduction to CUDA Programming
Using Pinned Memory
Use the allocated memory and its pointer the same way as those returned by malloc()
The only difference is that the allocated memory cannot be paged out by the OS
cudaMemcpy() should be about 2X faster with pinned memory
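A sketch of how the speedup could be measured, comparing a host-to-device copy from a pageable (malloc) buffer against one from a pinned buffer, timed with CUDA events (buffer size and names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes = (1 << 24) * sizeof(float);
    float *h_pageable = (float *)malloc(bytes);   // pageable host buffer
    float *h_pinned, *d_buf;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Transfer from pageable memory
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f ms\n", ms);

    // Transfer from pinned memory (typically ~2X faster)
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f ms\n", ms);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```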
Serialized Data Transfer and GPU Computation
So far, the way we use cudaMemcpy serializes data transfer and GPU computation
[Timeline: Trans. A → Trans. B → Vector Add → Trans. C. During each transfer only one PCIe direction is used and the GPU is idle; during the vector-add kernel the PCIe bus is idle]
Device Overlap Some CUDA devices support device overlap – Simultaneously execute a kernel while performing a copy between device and host memory
int Device;
cudaDeviceProp prop;
cudaGetDevice(&Device);
cudaGetDeviceProperties(&prop, Device);
if (prop.deviceOverlap) …
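One way the check might be completed (the message strings are illustrative, not from the original slide):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    if (prop.deviceOverlap)
        printf("Device %d can overlap kernel execution with copies\n", device);
    else
        printf("Device %d cannot overlap; stream operations will serialize\n", device);
    return 0;
}
```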
Overlapped (Pipelined) Timing
Divide large vectors into segments
Overlap transfer and compute of adjacent segments
[Pipeline timeline for segments 1–4: while segment i is computed (Comp C.i = A.i + B.i) and its result copied back (Trans C.i), the input transfers of the next segment (Trans A.i+1, Trans B.i+1) proceed in parallel]
Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and MemCpys with “streams”
Each stream is a queue of operations (kernels and MemCpys)
Operations in different streams can execute in parallel – “task parallelism”
Conceptual View of Streams
[Diagram: the Copy Engine (PCI up / PCI down) queues MemCpy A.1, MemCpy B.1, MemCpy C.1 from stream 0 and MemCpy A.2, MemCpy B.2, MemCpy C.2 from stream 1; the Kernel Engine queues Kernel 1 (stream 0) and Kernel 2 (stream 1). Each stream is an ordered queue of operations (kernels, MemCpys)]
A Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);
float *d_A0, *d_B0, *d_C0; // device memory for stream 0
float *d_A1, *d_B1, *d_C1; // device memory for stream 1
// cudaMalloc for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here
for (int i = 0; i < n; i += SegSize * 2) {
    …
}
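A hedged sketch of how the loop body might look, assuming pinned host arrays h_A, h_B, h_C of n floats, a segment length SegSize that divides n, and a vecAdd kernel; this is one common two-stream pattern, not necessarily the slide's exact code:

```cuda
// Assumed: h_A, h_B, h_C are pinned host arrays of n floats,
// SegSize divides n, and vecAdd(a, b, c, len) adds elementwise.
for (int i = 0; i < n; i += SegSize * 2) {
    // Stream 0 processes the segment starting at i
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0, SegSize);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float),
                    cudaMemcpyDeviceToHost, stream0);

    // Stream 1 processes the next segment, overlapping with stream 0
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float),
                    cudaMemcpyDeviceToHost, stream1);
}
cudaDeviceSynchronize();  // wait for both streams to drain
```

Because cudaMemcpyAsync and kernel launches only enqueue work, the copies and kernel of one stream can overlap with those of the other, matching the pipelined timeline above.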