cudaHostAlloc() – Three parameters
– Address of the pointer to the allocated memory
– Size of the allocated memory in bytes
– Option flag – use cudaHostAllocDefault for now
cudaFreeHost() – One parameter
– Pointer to the memory to be freed
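As a minimal sketch (the array name and size are illustrative; error checking omitted), pinned allocation and release look like:

```cuda
#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20;
    float *h_A;  // host pointer to pinned memory

    // cudaHostAlloc(address of pointer, size in bytes, option flag)
    cudaHostAlloc((void **)&h_A, N * sizeof(float), cudaHostAllocDefault);

    // ... use h_A like a malloc()'d buffer ...

    cudaFreeHost(h_A);  // one parameter: pointer to the memory to be freed
    return 0;
}
```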
Introduction to CUDA Programming
Using Pinned Memory
Use the allocated memory and its pointer the same way as those returned by malloc()
The only difference is that the allocated memory cannot be paged out by the OS
cudaMemcpy() should be about 2X faster with pinned memory
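A sketch of how the speedup could be measured, comparing a host-to-device copy from a pageable (malloc) buffer against one from a pinned buffer, timed with CUDA events (buffer size and names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes = (1 << 24) * sizeof(float);
    float *h_pageable = (float *)malloc(bytes);   // pageable host buffer
    float *h_pinned, *d_buf;
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Transfer from pageable memory
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f ms\n", ms);

    // Transfer from pinned memory (typically ~2X faster)
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f ms\n", ms);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```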
Serialized Data Transfer and GPU Computation
So far, the way we use cudaMemcpy serializes data transfer and GPU computation
[Timeline: Trans. A → Trans. B → Vector Add → Trans. C. During each transfer only one PCIe direction is used and the GPU is idle; during the vector-add kernel the PCIe bus is idle]
Device Overlap Some CUDA devices support device overlap – Simultaneously execute a kernel while performing a copy between device and host memory
int Device;
cudaDeviceProp prop;
cudaGetDevice(&Device);
cudaGetDeviceProperties(&prop, Device);
if (prop.deviceOverlap) …
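One way the check might be completed (the message strings are illustrative, not from the original slide):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    if (prop.deviceOverlap)
        printf("Device %d can overlap kernel execution with copies\n", device);
    else
        printf("Device %d cannot overlap; stream operations will serialize\n", device);
    return 0;
}
```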
Overlapped (Pipelined) Timing
Divide large vectors into segments
Overlap transfer and compute of adjacent segments
[Pipeline timeline for segments 1–4: while segment i is computed (Comp C.i = A.i + B.i) and its result copied back (Trans C.i), the input transfers of the next segment (Trans A.i+1, Trans B.i+1) proceed in parallel]
Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and MemCpys with “streams”
Each stream is a queue of operations (kernels and MemCpys)
Operations in different streams can execute in parallel – “task parallelism”
Conceptual View of Streams
[Diagram: the Copy Engine (PCI up / PCI down) queues MemCpy A.1, MemCpy B.1, MemCpy C.1 from stream 0 and MemCpy A.2, MemCpy B.2, MemCpy C.2 from stream 1; the Kernel Engine queues Kernel 1 (stream 0) and Kernel 2 (stream 1). Each stream is an ordered queue of operations (kernels, MemCpys)]
A Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);
float *d_A0, *d_B0, *d_C0; // device memory for stream 0
float *d_A1, *d_B1, *d_C1; // device memory for stream 1
// cudaMalloc for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here
for (int i = 0; i < n; i += SegSize * 2) {
    …
}
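A hedged sketch of how the loop body might look, assuming pinned host arrays h_A, h_B, h_C of n floats, a segment length SegSize that divides n, and a vecAdd kernel; this is one common two-stream pattern, not necessarily the slide's exact code:

```cuda
// Assumed: h_A, h_B, h_C are pinned host arrays of n floats,
// SegSize divides n, and vecAdd(a, b, c, len) adds elementwise.
for (int i = 0; i < n; i += SegSize * 2) {
    // Stream 0 processes the segment starting at i
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0, SegSize);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float),
                    cudaMemcpyDeviceToHost, stream0);

    // Stream 1 processes the next segment, overlapping with stream 0
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float),
                    cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float),
                    cudaMemcpyDeviceToHost, stream1);
}
cudaDeviceSynchronize();  // wait for both streams to drain
```

Because cudaMemcpyAsync and kernel launches only enqueue work, the copies and kernel of one stream can overlap with those of the other, matching the pipelined timeline above.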