Computational Science on Graphics Cards using CUDA

Computational Science on Graphics Cards using CUDA Lesson 2

Simone Melchionna Jonas Lätt Francis Lapique

LMMM LMMM DIT

[email protected] [email protected] [email protected]

1

This lecture is based on the book CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot.

GETTING STARTED
INTRODUCTION TO CUDA C
PARALLEL PROGRAMMING IN CUDA C
THREAD COOPERATION
CONSTANT MEMORY AND EVENTS
TEXTURE MEMORY
ATOMICS
STREAMS
CUDA C ON MULTIPLE GPUS

2

Getting Started After Lesson 1, I hope you're excited to learn more. You can think of the CUDA Architecture as the scheme by which NVIDIA has built GPUs that can perform both 3D graphics and general-purpose computation (non-graphics domains).

3

Development Environment Prerequisites for developing code in CUDA C:

1. A CUDA-enabled graphics processor (GeForce GTX 480, Quadro FX, Quadro NVS, ...)

2. An NVIDIA device driver (www.nvidia.com/cuda)

3. A CUDA development toolkit. CUDA C applications are going to be computing on two different processors, so two compilers are needed: one will compile code for your GPU, one will compile code for your CPU. NVIDIA provides the compiler for your GPU code.

4. A standard C compiler (Windows: Visual Studio 2008 Express; Linux: RHEL 5.x; Mac OS X 10.5.7 or later)

4

HELLO, WORLD!

We refer to the CPU and the system's memory as the host, and we refer to the GPU and its memory as the device. A function that executes on the device is called a kernel.

#include <stdio.h>

__global__ void kernel( void ) {
}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}

The __global__ qualifier alerts the compiler that the function should be compiled to run on the device instead of the host. The angle brackets denote parameters that will influence how the runtime launches our device code (they are not arguments to the device code). Arguments to the device code are passed within the parentheses, like any other function invocation.

5

PASSING PARAMETERS

#include <stdio.h>

__global__ void kernel( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );
    kernel<<<1,1>>>( 2, 7, dev_c );
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "Hello, World!\n" );
    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );
    return 0;
}

cudaMalloc(), similarly to the standard C call malloc(), tells the CUDA runtime to allocate memory on the device. The first argument is a pointer to the pointer that will hold the address of the allocated memory, and the second argument is the size of the allocation you want to make.

6

PASSING PARAMETERS

You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host. You can, however, pass pointers allocated with cudaMalloc() to functions that execute on the host.

To free the memory, use a call to cudaFree().

You can access device memory from the host through calls to cudaMemcpy() (analogous to memcpy()): cudaMemcpyDeviceToHost instructs the runtime that the source pointer is a device pointer and the destination pointer is a host pointer; cudaMemcpyHostToDevice covers the opposite situation (see the sketch below).

The statement *c = a + b; is as simple as it looks: it adds the parameters a and b together and stores the result in the memory pointed to by c.

7
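As a minimal sketch of these rules in action (host_val and dev_ptr are illustrative names, and error checking is omitted):

int host_val = 42;
int *dev_ptr;
cudaMalloc( (void**)&dev_ptr, sizeof(int) );

// host -> device: source is a host pointer, destination a device pointer
cudaMemcpy( dev_ptr, &host_val, sizeof(int), cudaMemcpyHostToDevice );

// ... kernels may now read and write through dev_ptr ...

// device -> host: bring the (possibly modified) value back
cudaMemcpy( &host_val, dev_ptr, sizeof(int), cudaMemcpyDeviceToHost );

cudaFree( dev_ptr );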

Querying Devices Systems with multiple GPUs are becoming more and more common. Because there is no guarantee that the CUDA runtime will choose the best or most appropriate GPU for your application, it is useful to know what capabilities each device has. Suppose that we are writing an application that depends on having double-precision floating-point support. After a quick consultation with the appendix of the NVIDIA CUDA Programming Guide, we know that cards with compute capability 1.3 or higher support double precision, so we need to find at least one device of compute capability 1.3 or higher. With cudaGetDeviceCount(&count) we can iterate through the devices and query relevant information about each with cudaGetDeviceProperties(&prop), as in the sketch below. What kind of properties can we retrieve?
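A sketch of such a query loop, applying the compute-capability test just described (variable names are illustrative):

int count;
cudaGetDeviceCount( &count );
for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, i );
    // compute capability 1.3 or higher => double-precision support
    if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
        printf( "Device %d (%s) supports double precision\n", i, prop.name );
}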

8

Querying Devices prop is a structure of type cudaDeviceProp containing:

struct cudaDeviceProp {
    char   name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int    regsPerBlock;
    int    warpSize;
    size_t memPitch;
    int    maxThreadsPerBlock;
    int    maxThreadsDim[3];
    int    maxGridSize[3];
    size_t totalConstMem;
    int    major;
    int    minor;
    int    clockRate;
    size_t textureAlignment;
    int    deviceOverlap;
    int    multiProcessorCount;
    int    kernelExecTimeoutEnabled;
    int    integrated;
    int    canMapHostMemory;
    int    computeMode;
    int    maxTexture1D;
    int    maxTexture2D[2];
    int    maxTexture3D[3];
    int    concurrentKernels;
};

Simple Heating Model Cleanup for the texture-based heat-transfer application unbinds the textures and releases the device buffers and timing events:

void anim_exit( DataBlock *d ) {
    cudaUnbindTexture( texIn );
    cudaUnbindTexture( texOut );
    cudaUnbindTexture( texConstSrc );
    cudaFree( d->dev_inSrc );
    cudaFree( d->dev_outSrc );
    cudaFree( d->dev_constSrc );
    cudaEventDestroy( d->start );
    cudaEventDestroy( d->stop );
}

59

Simple Heating Model It can be convenient to use two-dimensional domains. Let's take a look at how we can modify the application to use two-dimensional textures. We add a dimensionality argument of 2 in order to declare two-dimensional textures:

texture<float,2> texConstSrc;
texture<float,2> texIn;
texture<float,2> texOut;

We need to change the tex1Dfetch() calls to tex2D() calls. We no longer need to use the linearized offset variable to compute the set of offsets top, left, right, and bottom, nor do we have to worry about bounds overflow (out-of-range reads are clamped to the edge of the texture).

__global__ void blend_kernel( float *dst, bool dstOut ) {
    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    float t, l, c, r, b;
    if (dstOut) {
        t = tex2D( texIn, x, y-1 );
        l = tex2D( texIn, x-1, y );
        c = tex2D( texIn, x, y );
        r = tex2D( texIn, x+1, y );
        b = tex2D( texIn, x, y+1 );
    } else {
        t = tex2D( texOut, x, y-1 );
        l = tex2D( texOut, x-1, y );
        c = tex2D( texOut, x, y );
        r = tex2D( texOut, x+1, y );
        b = tex2D( texOut, x, y+1 );
    }
    dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
}

60

Simple Heating Model The CUDA runtime requires that we provide a cudaChannelFormatDesc when we bind two-dimensional textures. We accept the default parameters and simply need to specify that we require a floating-point descriptor:

cudaMalloc( (void**)&data.dev_inSrc,    imageSize );
cudaMalloc( (void**)&data.dev_outSrc,   imageSize );
cudaMalloc( (void**)&data.dev_constSrc, imageSize );

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaBindTexture2D( NULL, texConstSrc, data.dev_constSrc, desc, DIM, DIM, sizeof(float) * DIM );
cudaBindTexture2D( NULL, texIn,       data.dev_inSrc,    desc, DIM, DIM, sizeof(float) * DIM );
cudaBindTexture2D( NULL, texOut,      data.dev_outSrc,   desc, DIM, DIM, sizeof(float) * DIM );

The decision between one- and two-dimensional textures should be made on a case-by-case basis.

61

Atomic Operations We will take a look at some situations where we need to use special primitives in order to safely accomplish things that can be quite trivial in a traditional single-threaded application. Consider a situation where two threads, A and B, both need to increment the value in x, which currently contains 7:

1. Thread A reads the value in x (A reads 7 from x).
2. Thread B reads the value in x (B reads 7 from x).
3. Thread A adds 1 to the value it read (A computes 8).
4. Thread B adds 1 to the value it read (B computes 8).
5. Thread A writes the result back to x (x <- 8).
6. Thread B writes the result back to x (x <- 8).

x ends up containing 8 even though it was incremented twice; the correct result is 9. We need a way to perform the read-modify-write without being interrupted by another thread: an atomic operation (see the sketch below).
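A minimal sketch of such an operation using CUDA's atomicAdd() (the increment kernel and its launch are illustrative, not from the deck):

__global__ void increment( int *x ) {
    // the read-modify-write happens as one uninterruptible operation,
    // so no concurrent increment can be lost
    atomicAdd( x, 1 );
}

Launched as increment<<<1,2>>>( dev_x ) with *dev_x == 7, both increments are applied and x ends up containing 9.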
Measuring Copy Performance cuda_malloc_test() uses CUDA events to time copies between pageable host memory and the device, up (host to device) or down (device to host):

float cuda_malloc_test( int size, bool up ) {
    cudaEvent_t start, stop;
    int *a, *dev_a;
    float elapsedTime;

    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    a = (int*)malloc( size * sizeof( *a ) );
    cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) );

    cudaEventRecord( start, 0 );
    for (int i=0; i<100; i++) {
        if (up)
            cudaMemcpy( dev_a, a, size * sizeof( *dev_a ), cudaMemcpyHostToDevice );
        else
            cudaMemcpy( a, dev_a, size * sizeof( *dev_a ), cudaMemcpyDeviceToHost );
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &elapsedTime, start, stop );

    free( a );
    cudaFree( dev_a );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    return elapsedTime;
}

Multi-GPUs and Portable Memory First, each thread driving a device other than device zero must select its device and put it into a state where it can map host memory (the main thread has already done this for device zero):

if (data->deviceID != 0) {
    cudaSetDevice( data->deviceID );
    cudaSetDeviceFlags( cudaDeviceMapHost );
}

Second, we no longer use cudaMemcpy(); instead we use cudaHostGetDevicePointer() to get valid device pointers for the host memory. We use standard device memory for the partial results:

int size = data->size;
float *a, *b, c, *partial_c;
float *dev_a, *dev_b, *dev_partial_c;

// allocate memory on the CPU side
a = data->a;
b = data->b;
partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

// allocate the memory on the GPU
cudaHostGetDevicePointer( &dev_a, a, 0 );
cudaHostGetDevicePointer( &dev_b, b, 0 );
cudaMalloc( (void**)&dev_partial_c, blocksPerGrid*sizeof(float) );

// offset 'a' and 'b' to where this GPU gets its data
dev_a += data->offset;
dev_b += data->offset;
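For context, a sketch of how the host buffers a and b would be allocated in main() so that any device can map them (flags as in the book's portable pinned-memory example; N is an assumed constant):

float *a, *b;
cudaSetDevice( 0 );
cudaSetDeviceFlags( cudaDeviceMapHost );
// portable: usable by all CUDA contexts; mapped: visible in the device address space
cudaHostAlloc( (void**)&a, N*sizeof(float),
               cudaHostAllocWriteCombined | cudaHostAllocPortable | cudaHostAllocMapped );
cudaHostAlloc( (void**)&b, N*sizeof(float),
               cudaHostAllocWriteCombined | cudaHostAllocPortable | cudaHostAllocMapped );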

99

Multi-GPUs and Portable Memory Since our walk-through of routine() has been somewhat fragmented, here is the entire function:

void* routine( void *pvoidData ) {
    DataStruct *data = (DataStruct*)pvoidData;
    if (data->deviceID != 0) {
        cudaSetDevice( data->deviceID );
        cudaSetDeviceFlags( cudaDeviceMapHost );
    }

    int size = data->size;
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = data->a;
    b = data->b;
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    cudaHostGetDevicePointer( &dev_a, a, 0 );
    cudaHostGetDevicePointer( &dev_b, b, 0 );
    cudaMalloc( (void**)&dev_partial_c, blocksPerGrid*sizeof(float) );

    // offset 'a' and 'b' to where this GPU gets its data
    dev_a += data->offset;
    dev_b += data->offset;

    dot<<<blocksPerGrid,threadsPerBlock>>>( size, dev_a, dev_b, dev_partial_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( partial_c, dev_partial_c, blocksPerGrid*sizeof(float), cudaMemcpyDeviceToHost );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

    cudaFree( dev_partial_c );
    free( partial_c );

    data->returnValue = c;
    return 0;
}

portable.cu
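A sketch of how main() might drive routine() on two GPUs, assuming POSIX threads and a DataStruct with the fields used above (the book uses its own start_thread()/end_thread() helpers instead of pthreads):

#include <pthread.h>

DataStruct data[2];
data[0].deviceID = 0;   data[0].offset = 0;
data[0].size = N/2;     data[0].a = a;  data[0].b = b;
data[1].deviceID = 1;   data[1].offset = N/2;
data[1].size = N/2;     data[1].a = a;  data[1].b = b;

pthread_t thread;
pthread_create( &thread, NULL, routine, &data[1] );  // second GPU on a helper thread
routine( &data[0] );                                 // first GPU on the calling thread
pthread_join( thread, NULL );

float c = data[0].returnValue + data[1].returnValue;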

100

MPI + CUDA Installing MPICH2 Just unpack the software and type

configure
make
make install

If you don't like the default installation directory, add a --prefix option to configure.

Compiling MPICH/MPICH2 application programs Type

mpicc -g -o binary_file_name source_file.c

(If you wish to use C++, use mpicxx instead of mpicc.)

Running MPICH2 application programs

The key to running an MPICH2 application is the mpd daemon, one of which must run on each machine to be used by your program. Make sure you have a file .mpd.conf in your home directory, containing a line secretword=x, with your favorite secret word in place of x.

The primitive way to launch the daemons is to simply type mpd & in a terminal window on each machine. If you want to run k daemons on a machine, type mpd --ncpus=k

For multiple machines, it is more convenient to use mpdboot. To do this, first set up a file mpd.hosts in some directory, with the names of the machines on which you want daemons to be running; list the network names of the machines, one line per machine, e.g.

pc29.cs.ucdavis.edu
pc30.cs.ucdavis.edu

Then type mpdboot. If you want to have more than one mpd process on a given machine, type mpdboot --totalnum=m --ncpus=k where k is the total number of MPI processes you wish to run (and m is the number of machines listed in mpd.hosts).

To run your program, use mpiexec, e.g. mpiexec -l -n 3 prime 100 0

You can run mpdallexit to shut down the daemons.

101

MPI + CUDA All communication occurs within a communicator (MPI_Comm), a group of processes in which each process has a unique identifier. The predefined communicator MPI_COMM_WORLD contains all processes. Each process can determine its rank, and the size of the group it belongs to:

MPI_Comm_rank(MPI_Comm comm, int *rank);
MPI_Comm_size(MPI_Comm comm, int *size);

Initialization is required before MPI can be used:

MPI_Init(int *argc, char ***argv);

where argc and argv are the arguments to the main program ("command line parameters"). Before termination, a program should clean up MPI data structures:

MPI_Finalize();

102

MPI + CUDA

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] ) {
    int me, nprocs, l;
    char procname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );
    MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
    MPI_Get_processor_name( procname, &l );
    printf( "hello, my rank is %d/%d: my name %s\n", me, nprocs, procname );
    MPI_Finalize();
    return 0;
}

103

MPI + CUDA

#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f(double);
double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int main(int argc, char *argv[])
{
    int n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    n = 10000;   /* number of intervals (default value assumed) */
    if (myid == 0)
        startwtime = MPI_Wtime();

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h = 1.0 / (double) n;
    sum = 0.0;
    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("pi is approximately %.16f, error is %.16f\n", pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", endwtime - startwtime);
    }

    MPI_Finalize();
    return 0;
}
