Accelera'ng Code with OpenCL

Accelera'ng Code with OpenCL 30/07/2012 David Black-‐Schaﬀer 30/07/2012 | 2 Uppsala University / Department of Informa'on Tec...

Author: Lionel Montgomery

12 downloads 0 Views 501KB Size

Report

Download PDF

Recommend Documents

EAM benchmark to OpenCL TM with limited code modifications

Accelerating Habanero-Java Programs with OpenCL Generation

OpenCL Fast Fourier Transform

OpenCL Emu Documentation

The OpenCL Specification

SOCL: An OpenCL Implementation with Automatic Multi-Device Adaptation Support

An introduction to many-core parallel computing with OpenCL

Altera SDK for OpenCL

Intel FPGA SDK for OpenCL

Real-Time Shallow Water Simulation with OpenCL for CPUs. Adam Lake, Arnon Peleg software, Intel OpenCL WG, The Khronos Group

The Open Computing Language (OpenCL)

OpenCL 1.1 Enhancements for Multi-GPU Performance

Intel SDK for OpenCL* - Median Filter Sample

COSC6365. Introduction to HPC. Lecture 17 OpenCL. Lennart Johnsson Dept of Computer Science COSC6365. OpenCL Books

Customizing Case Feed with Code

MAKING ART WITH MATH + CODE

Dynamic Code Generation with the E-Code Language

Intel SDK for OpenCL Applications 2016 R3

Vectorisation and Portable Programming using OpenCL

Intel SDK for OpenCL Applications 2015

Intel SDK for OpenCL Applications 2016

OpenCL TM API 1.0 Quick Reference Card

Accelera'ng Code with OpenCL

30/07/2012

David Black-‐Schaﬀer

30/07/2012 | 2

Uppsala University / Department of Informa'on Technology

Let’s Look at Op'mizing an OpenCL Program

Uppsala Programming for Multicore Architectures Research Center

•  Simple PDE solver (2x 4096x4096 grid of ﬂoats; 64MB of data) –  5-‐point stencil computa'on for all points –  Convergence es'ma'on (max-‐min) –  Repeat un'l converged

Accelera'ng Code with OpenCL

!"#

David Black-‐Schaﬀer

Test System: Sandy Bridge i7-‐2600 3.4GHz (4-‐core) AMD Radeon HD 6980M

Assistant Professor, Department of Informa'on Technology Uppsala University

David Black-‐Schaﬀer

30/07/2012 | 3

David Black-‐Schaﬀer

C Code 1.2

" "for (int x=1; x LIMIT) {! } " "// Calculation! " "update(in, out);! " "! float find_range(float *data, int size) { "float max, min; " "// Compute Range! "max = min = 0.0f; " "range = find_range(out, SIZE*SIZE);! "// Iterate over the data and find the "swap(&in, &out);! "!

}

30/07/2012 | 4

1. Baseline

1.0

0.8

All performance data will be normalized to the C-‐code. Overhead

0.6

Range Update

min/max

"for (int i=0; i LIMIT) {!

–  Compile program –  Setup the buﬀers –  Copy the data to the device

! " " " "

"// Calculation! "start_perf_measurement(&update_perf);! "update_cl(get_in_buffer(), get_out_buffer());! "stop_perf_measurement(&update_perf);!

" " " " " " " " " " "

"// Read back the data! "start_perf_measurement(&read_perf);! "read_back_data(get_out_buffer(), out);! "stop_perf_measurement(&read_perf);! "! "// Compute Range! "start_perf_measurement(&range_perf);! "range = find_range(out, SIZE*SIZE);! "stop_perf_measurement(&range_perf);! "! "iterations++;!

•  We s'll need to do these every 'me:

" "}

"printf("Iteration %d, range=%f.\n", iterations, range);! "!

!

–  Read back the results (for range()) –  Enqueue the kernel (to do the calcula'on) –  Wait for it to ﬁnish !

© 2012 David Black-‐Schaﬀer

3

Accelera'ng Code with OpenCL

David Black-‐Schaﬀer

30/07/2012

30/07/2012 | 19

Uppsala University / Department of Informa'on Technology

David Black-‐Schaﬀer

New update_cl() and read_back_data()

30/07/2012 | 20

Uppsala University / Department of Informa'on Technology

Performance With Reduced Overhead

void update_cl(cl_mem in_b, cl_mem out_b) {! "cl_int error;! 1.2 "// Set the kernel arguments! 4. Overhead Outside of Loop "error = clSetKernelArg(update_kernel, 0, sizeof(in_b), &in_b);! "checkError(error, "clSetKernelArg in");! 1 "error = clSetKernelArg(update_kernel, 1, sizeof(out_b), &out_b);! "checkError(error, "clSetKernelArg out");! "! "// Enqueue the kernel! 0.8 "size_t global_dimensions[3] = {SIZE,SIZE,0}; // Ignore the border! "error = clEnqueueNDRangeKernel(opencl_queue, update_kernel, 2, NULL, global_dimensions, NULL, 0, NULL, NULL);! "checkError(error, "clEnqueueNDRangeKernel");! 0.6 "clFinish(opencl_queue);! }! !

0.4

!

void read_back_data(cl_mem buffer_to_read_from, float *result_buffer) {! "cl_int error;! 0.2 "// Enqueue a read to get the data back! Update is about 4x "error = clEnqueueReadBuffer(opencl_queue, buffer_to_read_from, CL_FALSE, 0, SIZE_BYTES, result_buffer, NULL, on the 40, -‐core CPU. NULL);! "checkError(error, "clEnqueueReadBuffer");! 0 "clFinish(opencl_queue);! C-‐CPU CL-‐CPU CL-‐CPU }!

David Black-‐Schaﬀer

30/07/2012 | 21

Uppsala University / Department of Informa'on Technology

David Black-‐Schaﬀer

Create Buﬀers Compile Program Read Data Range

Lots of 'me reading data…

Update

Update is very fast on the GPU. CL-‐GPU

CL-‐GPU

30/07/2012 | 22

Next Step

•  Update is now very fast on the GPU and 4x faster on the 4-‐core CPU. (Good work!) •  However…

•  To eliminate the 'me reading the data we need to keep the data on the device (GPU) •  To do this we need to move the range() func'on to the GPU.

–  GPU:

–  CPU:

Cleanup

Write Data

Uppsala University / Department of Informa'on Technology

Analysis

•  34% reading (transferring the data) •  46% range (on the CPU)

Overhead

Finish

Why are we spending 'me transferring the data for the OpenCL CPU version?

•  But the range() is a reduc'on, so we need synchroniza'on across all threads on the device…

•  16% reading (transferring the data??) •  48% range (We could use OpenCL’s map and upmap func'ons to

map the data into the applica'on’s space and thereby avoid this overhead on the CPU.)

David Black-‐Schaﬀer

30/07/2012 | 23

Uppsala University / Department of Informa'on Technology

David Black-‐Schaﬀer

range() kernel

Pupng range() on the Device !

1) Divide the data into 4096 chunks and calculate min/maxes for each in parallel on the device 2) Read back the 4096 min/max values and calculate the ﬁnal min/max on the CPU

$! $" $# $* $+ $, $-

* * *

%&' %&' %&' %&' %&' %&' %&'

* * *

" " " "

"// Calculation! "start_perf_measurement(&update_perf);! "update_cl(get_in_buffer(), get_out_buffer());! "stop_perf_measurement(&update_perf);!

" " " " " " " " " " " " " " " "

"// Range! "start_perf_measurement(&range_perf);! "range_cl(get_out_buffer());! "stop_perf_measurement(&range_perf);! "! "// Read back the data! "start_perf_measurement(&read_perf);! "read_back_data(range_buffer, range_data);! "stop_perf_measurement(&read_perf);! "! "// Compute Range! "start_perf_measurement(&reduction_perf);! "range = find_range(range_data, RANGE_SIZE*2);! "stop_perf_measurement(&reduction_perf);! "! "iterations++;!

" "}

"printf("Iteration %d, range=%f.\n", iterations, range);! "!

!

•  This reduces the data transfer from the full data set to just 4096 values. !"#$ %&'()$

© 2012 David Black-‐Schaﬀer

30/07/2012 | 24

"// ======== Compute! "while (range > LIMIT) {!

•  The range() func'on is a reduc'on. That means we need synchroniza'on across the whole kernel. •  So we do this in two steps:

! ! ! ! ! ! ! " " " " " " " # # # # # # # * * * * * * + + + + + + , , , , , - - - - - - - * * *

Uppsala University / Department of Informa'on Technology

%() %() %() %() %() %() %()

!

Calculate the min/max for 4096 chunks of the data in parallel.

Read back the 4096 min/max values. (32kB instead of 64MB.)

On the CPU do the ﬁnal min/max reduc'on over the 4096 min/ max values.

4

Accelera'ng Code with OpenCL

David Black-‐Schaﬀer

30/07/2012

30/07/2012 | 25

Uppsala University / Department of Informa'on Technology

David Black-‐Schaﬀer

range() kernel

30/07/2012 | 26

Uppsala University / Department of Informa'on Technology

Pupng range() on the Device

kernel void range(global float *data, int total_size, global float *range) {! "float max, min;! !

1.2

}!

"// Find out which items this work-item processes! "int size_per_workitem = total_size/get_global_size(0);! "int start = size_per_workitem*get_global_id(0);! "int stop = start+size_per_workitem;! "! "// Finds the min/max for our chunk of the data! "min = max = 0.0f;! "for (int i=start; i