GPU: Power vs Performance

IT 14 015

Examensarbete 30 hp June 2014

GPU: Power vs Performance

Siddhartha Sankar Mondal

Institutionen för informationsteknologi Department of Information Technology

Abstract

Faculty of Science and Technology, UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web: http://www.teknat.uu.se/student

GPUs are widely used to meet the ever-increasing demands of high-performance computing, and high-end GPUs are among the largest consumers of power in a computer. Power dissipation has always been a major concern for computer architects; power-efficiency demands have pushed modern CPUs towards multicore architectures, while GPUs are already massively parallel architectures. DVFS has produced encouraging power-efficiency results on CPUs, and the vision is that a similar approach would do the same for GPUs. In this thesis we analyze the power and performance characteristics of a GPU at different frequencies. To that end, we have written a set of microbenchmarks with different levels of memory-boundedness and thread counts, and we also use benchmarks from the CUDA SDK and Parboil. Our platform is a Fermi-based GTX580 GPU, together with a hardware infrastructure we built to accurately measure the power consumed by the GPU.

Supervisor: Stefanos Kaxiras
Subject reviewer: Stefanos Kaxiras
Examiner: Ivan Christoff
IT 14 015
Sponsor: UPMARC
Printed by: Reprocentralen ITC

Acknowledgement

I would like to thank my supervisor Prof. Stefanos Kaxiras for giving me the wonderful opportunity to work on this interesting topic. I would also like to thank his PhD students Vasileios Spiliopoulos and Konstantinos Koukos for helping me get started with the benchmarks and the hardware setup. The LaTeX community has been a great help in producing this document. Also, I would like to thank the spirit that lives in the computer. This thesis was funded by UPMARC.

Contents

Acknowledgement
1 Introduction
2 Background
   2.1 CUDA Programming model
   2.2 GPU Architecture
   2.3 Power issues
   2.4 Latency
   2.5 Previous work
3 Methodology
   3.1 Experimental Setup
   3.2 Power measurement
   3.3 DVFS
   3.4 Microbenchmarks
   3.5 Benchmarks
      3.5.1 Matrix Multiplication
      3.5.2 Matrix Transpose
      3.5.3 Histogram
      3.5.4 Radix Sort
      3.5.5 Merge Sort
      3.5.6 Conjugate Gradient
      3.5.7 BFS (Breadth First Search)
      3.5.8 Eigenvalues
      3.5.9 Black-Scholes option pricing
      3.5.10 3D FDTD (3D Finite Difference Time Domain method)
      3.5.11 Scalar Product
4 Evaluation
   4.1 Microbenchmarks
      4.1.1 Microbenchmark 1
      4.1.2 Microbenchmark 2
      4.1.3 Microbenchmark 3
      4.1.4 Microbenchmark 4
   4.2 Benchmarks
   4.3 Memory Bandwidth
      4.3.1 Black-Scholes
      4.3.2 Eigenvalue
      4.3.3 3D FDTD
      4.3.4 Scalar Product
5 Conclusion
Bibliography
Appendix: Assembly code
Appendix: Power measurements

Chapter 1

Introduction

Over the last few years CPU and GPU architectures have been evolving rapidly. Since the introduction of programmable shaders in GPUs around the turn of the century, the scientific computing community has used the GPU as a powerful computational accelerator to the CPU. For many compute-intensive workloads a GPU gives a few orders of magnitude better performance than a CPU, and the introduction of languages like CUDA and OpenCL has made GPUs much easier to program.

In our evaluation we use the Nvidia GTX580 GPU, which is based on the Fermi architecture [15]. With 512 compute units, called CUDA cores, it can handle 24,576 active threads [9, 15]. Its off-chip memory bandwidth is around 192 GB/s, considerably higher than a CPU's main-memory bandwidth (around 32 GB/s for a Core i7). Theoretically it can deliver over 1.5 TFLOPS (single precision, with FMA).

Even with such impressive numbers, GPUs have a few bottlenecks to consider before sending a whole workload to the GPU. One is power; the other is the cost of transferring data from host to device (CPU to GPU) and back from device to host (GPU to CPU). The bandwidth of 16x PCI Express is around 8 GB/s.

Power is a very important constraint for every computer architecture [7]. Modern high-end GPUs have a thermal design power (TDP) of over 200 watts; the GTX580 has a TDP of 244 watts, whereas a high-end multicore CPU like the Intel Core i7 has a TDP of around 100 watts. In terms of GFLOPS/watt, GPUs can be considered more power efficient than CPUs, but for a GPU acting as a co-processor this is still a very large share of the total power budget.

Although GPUs have such high off-chip memory bandwidth, accessing off-chip global memory is very expensive, with a latency of around 400 to 800 clock cycles. GPUs hide this latency by scheduling a very large number of threads at a time. For memory-bound applications it becomes hard to hide this latency, and the compute resources sit idle, leaving slack. On CPUs we can take advantage of such slack by applying DVFS to reduce dynamic power.
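The intuition behind DVFS can be made concrete with the standard CMOS dynamic-power model (a textbook relation, not a result of this thesis), where \(\alpha\) is the activity factor, \(C\) the switched capacitance, \(V\) the supply voltage and \(f\) the clock frequency:

```latex
P_{\mathrm{dyn}} = \alpha C V^{2} f
```

Because the supply voltage must scale roughly in proportion to frequency, dynamic power grows approximately cubically with frequency: scaling both \(V\) and \(f\) down by 15% gives \(0.85^{3} \approx 0.61\), i.e. roughly a 39% reduction in dynamic power, while a memory-bound workload that is stalled on DRAM loses comparatively little performance.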


In this thesis we investigate the power and performance characteristics of a GPU under DVFS (dynamic voltage and frequency scaling). To that end we use microbenchmarks with different levels of memory-boundedness and thread counts, and then apply DVFS to a number of additional benchmarks. Apart from the introduction in this chapter, the thesis is structured as follows. In Chapter 2 we discuss GPU programming, GPU architecture, the idea behind DVFS-based power savings, and previous work. In Chapter 3 we describe the method used to measure power and to implement DVFS on the GPU, and present the benchmarks used in the evaluation. In Chapter 4 we evaluate the performance and power consumption of the benchmarks under different numbers of threads, shader frequencies and memory bandwidths. In Chapter 5 we reflect on the conclusions drawn from the evaluation. In Appendix: Assembly code, we provide the assembly code of the microbenchmarks we used. In Appendix: Power measurements, we provide all the execution times and power consumption readings from the evaluation.

Chapter 2

Background

2.1 CUDA Programming model

CUDA is a GPGPU programming language for writing programs for NVIDIA GPUs. It is an extension of the C programming language. Code written in CUDA is a mix of host (CPU) code and device (GPU) code [9]. The NVCC compiler separates the device code from the host code and compiles it; a C compiler can then compile the host code, in which the device-code parts have been replaced by calls to the compiled device code. A data-parallel function that runs on the device is called a kernel. A kernel creates many threads to compute its data-parallel workload; a thread block consists of many threads, and a group of thread blocks forms a grid.

__global__ void SimpleKernel(float *A, float *B, float *C, int N)
{
    // calculate the global thread id
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // guard needed because the grid may contain more threads than elements;
    // the body after "if (i" is truncated in the source, and an element-wise
    // operation such as the vector add below is a typical completion
    if (i < N)
        C[i] = A[i] + B[i];
}
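A kernel like the one above only runs when launched from host code with a grid and block configuration. The sketch below shows a minimal host-side launch (hedged: the array size, block size and the vector-add kernel body are illustrative assumptions, since the source listing is truncated; error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// illustrative completion of the truncated kernel: element-wise vector add
__global__ void SimpleKernel(float *A, float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (i < N)                                      // guard for the partial last block
        C[i] = A[i] + B[i];
}

int main(void)
{
    const int N = 1 << 20;                 // 1M elements (assumed size)
    size_t bytes = N * sizeof(float);

    // host buffers
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // device buffers; host-device copies cross PCIe (~8 GB/s for 16x)
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // one thread per element: round the number of blocks up
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    SimpleKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, N);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("hC[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

The rounding-up of the block count is what makes the `if (i < N)` guard in the kernel necessary: the last block generally contains threads whose index falls past the end of the arrays.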