High-speed parallel processing on CUDA-enabled Graphics Processing Units

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE in COMPUTER ENGINEERING

by

Johan A. Huisman
born in Sliedrecht, The Netherlands

Computer Engineering
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology

High-speed parallel processing on CUDA-enabled Graphics Processing Units

by Johan Huisman

Abstract

A recent trend in computing is the use of multi-core processors and of Graphics Processing Units (GPUs) for general-purpose, high-speed parallel computing. TNO wants to respond to this trend and to investigate on which architectures it should parallelize and run its applications. This thesis discusses the implementation of two applications used by TNO on a CUDA-enabled GPU and on a multi-core processor. The applications chosen for the implementation are object detection and ultrasound simulation (UMASIS). The development time and performance gain were measured during the implementation of both cases. With these measurements, estimates for future projects can be made.

Laboratory: Computer Engineering
Codenumber: CE-MS-2010-05

Committee Members:

Advisor: Dr. Ir. Ben H.H. Juurlink, Associate Professor, CE, TU Delft
Advisor: Dr. Ir. Arjan J. van Genderen, Assistant Professor, CE, TU Delft
Advisor: Ir. Maurits S. van der Heiden, Senior Research Scientist, TNO I&T Delft
Chairperson: Dr. Ir. Koen L.M. Bertels, Associate Professor, CE, TU Delft
Member: Ir. Maurits S. van der Heiden, Senior Research Scientist, TNO I&T Delft
Member: Dr. Ir. Stephan Wong, Assistant Professor, CE, TU Delft
Member: Dr. Ir. Charles Botha, Assistant Professor, Mediamatics, TU Delft


Acknowledgement

First of all, I am grateful to Nelleke Huisman-van den Berg, my loving wife, and Arie Huisman, my father, who greatly helped, supported and encouraged me throughout my study. I want to thank TNO Industrie & Techniek and the Computer Engineering Department of the Delft University of Technology for giving me the opportunity to commence this thesis. My special thanks go to my supervisor at TNO, Maurits van der Heiden, for his patience, time and help. Special thanks also to Ben Juurlink and Arjan van Genderen for their advice, support, and encouragement during the MSc project.

Johan Huisman
Delft, The Netherlands
June 7, 2010


Table of Contents

Acknowledgement
List of Figures
List of Tables
List of Acronyms
Chapter 1: Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Analysis strategy
  1.4 Contributions
  1.5 Related Work
  1.6 Structure of this thesis
Chapter 2: Background
  2.1 Current and future developments in processing
  2.2 Multi-core Architecture
  2.3 Definition of parallelism
  2.4 NVIDIA CUDA (GPU)
  2.5 Application programming interfaces (API)
  2.6 Development process evaluation
Chapter 3: Case Study: Object detection
  3.1 Object detection (Computer Vision)
  3.2 Object detection application
  3.3 Profiling
  3.4 Implementation on GPU
  3.5 Optimization
  3.6 Analyses
  3.7 Summary
Chapter 4: Case Study: UMASIS
  4.1 UMASIS
  4.2 Application
  4.3 Profiling
  4.4 Realization
  4.5 Optimization
  4.6 Analyses
  4.7 Summary
Chapter 5: Discussion
  5.1 Comparison between the two cases
  5.2 Performance comparison between various CUDA-enabled GPUs
  5.3 OpenMP vs. CUDA
  5.4 Maturity of CUDA
  5.5 Summary
Chapter 6: Conclusions
  6.1 Summary
  6.2 Contributions
  6.3 Future Suggestions
Appendix A: NVIDIA GPU Architecture (CUDA)
  Kernel, grid, threads and Blocks
  Streaming Multiprocessors
  SIMT
  Warp
  Memory
Bibliography


List of Figures

Figure 1. Clock frequency increase and later stagnation over the years [38]
Figure 2. Example of homogeneous multi-core processor
Figure 3. Example of a speedup between serial vs. parallel implementation
Figure 4. Ideal vs. typical speedup
Figure 5. Various speedups with different α (non-parallelizable part)
Figure 6. Schematic of the integral image calculations, the green node is the first node
Figure 7. First step to perform the integral image in parallel in the vertical direction
Figure 8. Second step to perform the integral image in parallel in the horizontal direction
Figure 9. CUDA architecture
Figure 10. OpenCL, merging two architectural divisions into one computing language [35]
Figure 11. Graph showing development process, divided in three development phases
Figure 12. Sample of object detection, detection of cars from behind
Figure 13. Object detection, above: training phase, under: detection phase
Figure 14. Examples of various types of haar features
Figure 15a. Original image (left) converted to the integral image (right)
Figure 16. Left: Detection window is scaled, size of the image is fixed
Figure 17. Flow chart of the object detect application
Figure 18. Data Structure of the Cascade, N number of Classifiers where each classifier has a ...
Figure 19. Left: Flow chart of the detect object function (SCALE_IMAGE)
Figure 20. Flow chart of the Features Extraction function
Figure 21. Profiling Results for the Scale window vs. Scale image modes
Figure 22. Design of the fixed data structure
Figure 23. Flow chart feature extraction step running parallel in two axes (X,Y)
Figure 24. Flow chart non-parallel calculation of the integral image
Figure 25. Left: Flow chart step 1, sum of values in the columns in parallel
Figure 26. Implementation results, scale window and scale image modes
Figure 27. Flow chart feature detection step running parallel in three axes (X,Y,Scale)
Figure 28. Optimization results for SCALE_WINDOW
Figure 29. Object detection GPU activity time plot
Figure 30. Three types of memory reads
Figure 31. Example on how the image can be split in multiple images for various scales
Figure 32. Occupation of the GPU over time
Figure 33. Pipelined object detection with two threads
Figure 34. Implementation process divided in three development phases
Figure 35. UMASIS tools
Figure 36. Above: Velocity function. Under: Weighting vectors for 4th order equation
Figure 37. Flow chart UMASIS solver
Figure 38. Flow chart calculate velocities
Figure 39. Flow chart calculate stresses
Figure 40. Flow chart calculate boundaries
Figure 41. Flow chart mirror stresses
Figure 42. Flow chart calculate beam
Figure 43. Flow chart apply taper
Figure 44. Flow chart insert source points
Figure 45. Flow chart record data points
Figure 46. Simulation model
Figure 47. Profiling results
Figure 48. UMASIS realization results
Figure 49. Left: glob. mem. fetch single calculation, right: glob. mem. fetch 3x3 calculations
Figure 50. Global memory fetch for 4 blocks of 4x4 calculations
Figure 51. Optimization results UMASIS
Figure 52. Beam results from PC (left) and GPU (right)
Figure 53. Differences in precision between PC and GPU
Figure 54. Graph divided in development phases
Figure 55. Speedup vs. development time of object detection and UMASIS
Figure 56. Comparison between GPUs and reference PC, lower is better
Figure 57. Comparison OpenMP vs. GPU for object detection
Figure 58. Comparison OpenMP vs. GPU for UMASIS
Figure 59. CUDA organization of kernels, grids, blocks and threads
Figure 60. NVIDIA CUDA architecture
Figure 61. CUDA architecture


List of Tables

Table 1. Specification comparison between GPUs
Table 2. Specification comparison between GPU and CPU
Table 3. Specification of the reference PC used for profiling
Table 4. Profiling results, scale window and scale image modes
Table 5. Specification of the GPU used for the implementation
Table 6. Implementation results, scale window and scale image modes
Table 7. Optimization results for SCALE_WINDOW. The speedup is the host CPU runtime ...
Table 8. Profiling Results
Table 9. UMASIS results. The speedup is the host runtime divided by the GPU runtime
Table 10. Opt. Results. The Tot. speedup is the host runtime divided by the GPU runtime
Table 11. Comparison between GPUs
Table 12. Comparison between GPUs and reference PC
Table 13. Specification of the reference PC used for profiling


List of Acronyms

CPU      Central Processing Unit
CUDA     Compute Unified Device Architecture
DMA      Direct Memory Access
FLOPS    Floating-point Operations Per Second
FPGA     Field Programmable Gate Array
FDTD     Finite Difference Time Domain
GPU      Graphics Processing Unit
HPC      High Performance Computing
OpenCL   Open Computing Language
OpenCV   Open Computer Vision
PC       Personal Computer
SIMD     Single Instruction Multiple Data
SIMT     Single Instruction Multiple Thread
UMASIS   Ultrasonic Modeling and AnalySIS



Chapter 1: Introduction

In the early days, computing was done on centralized computer systems called mainframes. These mainframes were large and powerful systems on which all of the computations for a group of users were performed. Execution of the calculations was scheduled in time. Later on, mainframes evolved into systems able to execute multiple concurrent computations at a time. It was not until personal computers (PCs) were introduced that mainframes became less popular, as they demanded a lot of space and had a high energy consumption. Although PCs were less powerful than mainframes, users had their own computer and could execute programs on demand, instead of having to wait until their program was executed by the scheduler of the mainframe. After this, PCs became more powerful with each generation due to smaller production processes, resulting in higher clock frequencies.

Figure 1. Clock frequency increase and later stagnation over the years [38].

Recent developments in the computer industry show an upper limit in core frequency and reveal a trend towards a parallel approach to computing, instead of a serial one, to overcome this limit. Modern processors are composed of multiple processing elements (multi-core) instead of one processing unit that is more powerful than the previous generation (Figure 1). To benefit fully, programs should be optimized to make use of all available processing elements in a processor. Older programs should therefore be adapted to multi-core processors by rewriting the processing-intensive parts. The time when a speedup could be realized simply by replacing old hardware with newer, faster hardware seems to be over.


That is why parallelization of the code is needed to make use of future computing power. In general this task is not trivial and requires a certain amount of knowledge of the processor architecture and of parallelization techniques. Highly parallel processing is used extensively in the gaming and graphics segment on specialized hardware. This hardware, the graphics processing unit (GPU) [1], is a many-core processor built for graphics acceleration purposes. Recently, GPUs have been enabled to execute general purpose tasks by means of the CUDA architecture [2]. Compared to today's multi-core processors, which typically have up to four processing elements, these GPUs provide up to hundreds of processing elements, so parallel execution can be exploited very well on this kind of computing architecture.

1.1 Motivation

This project was conducted at TNO. Within TNO I&T there are multiple high performance computing (HPC) applications. These applications require high computing power to achieve acceptable performance. Some of these applications are written for single-thread execution. We must therefore respond to the current trend of parallel computing and investigate on which architectures we want to run and parallelize our applications. The interest in alternative architectures is driven by the specifications that architectures like the IBM Cell B.E. [3] and GPUs can, in theory, offer. Important specifications like memory bandwidth and massively parallel computing power make these architectures very promising. For example, the memory bandwidth of 25 GB/s for the Cell/B.E. and more than 100 GB/s for the GPU is much higher than the roughly 8 GB/s of current CPUs. Besides that, the Cell/B.E. can theoretically compute 250 GFlops, where the GPU can even reach up to 1 TFlops. These specifications are impressive in comparison to a single-core processor, which can deliver approximately 15 GFlops. Even small clusters, like the Cray XD1, which delivers about 0.6 TFlops per rack [4], perform considerably worse than these new architectures in terms of size, price and energy. These new architectures are currently available for a reasonable price. However, development of HPC applications on these new architectures is relatively new and complex, and only a limited number of programming tools is available. Regardless of which platform is chosen to run the HPC applications, parallelization of the code is necessary, since almost all modern computing architectures use parallelism to improve their performance.


1.2 Research Goals

The research goals of the project are defined using the following criteria with respect to each computing platform and application:

- Parallelization of code to a CUDA-enabled GPU: What steps are required to perform a successful implementation on a CUDA-enabled GPU? What are the difficulties and limitations?
- Efficiency: What can be achieved in theory and what performance is gained in practice?
- Costs: How do the hardware and development costs compare with the performance achieved by the implementations?
- Maturity of CUDA: Is it profitable to invest in further development of CUDA-enabled applications? Is there enough ground and assurance that current CUDA applications will be future proof? How has CUDA developed thus far, and can predictions for the future be made? All these questions are related to the maturity of CUDA.

A selection has been made of computing platforms on which the applications used and developed within TNO can run. These computing platforms are selected from the available modern computing architectures. The computing architectures that will be used to run our applications are:

- Single-core (reference)
- Multi-core CPU (OpenMP)
- GPU (CUDA)

Two use-cases are defined from a selection of HPC applications developed and used inside TNO. They should be made able to run on all the selected computing platforms. The criterion for the selection of applications is that they are used within different fields and that they make use of diverse algorithms. With this criterion the following applications were selected:

- Object detection (Computer Vision)
- Ultrasonic finite difference modelling (UMASIS)

The first application is object detection, used within computer vision. In this application an algorithm based on so-called 'haar features' is used to detect objects in a computer image [5], by making use of a decision tree in which many small calculations are performed to achieve a match in the image. Due to this decision tree, the runtime of the calculations varies considerably. UMASIS is an application that simulates ultrasonic pulse measurements to perform inspections on materials. The algorithm relies heavily on Finite Difference Time Domain (FDTD) calculations, which are very computationally intensive and generate lots of data, requiring in some cases multiple days of computation. At the moment UMASIS is executed task-parallel on a Cray XD1 supercomputer, which runs multiple simulation jobs concurrently.


These two cases are both already operational but are written as single core/thread applications.

1.3 Analysis strategy

The analysis strategy can be broken up into five phases:

- Define target platforms.
- Define realistic cases.
- Profile the individual cases.
- Implement and realize the cases on the target platforms.
- Benchmark and analyze the results.

Within these phases multiple aspects are worked out.

Profiling the individual cases:
- What are the most time consuming parts?
- What schemes and algorithms are used?
- Is it parallelizable?
- What is the maximum speedup that can be realized theoretically?
- How do we map the algorithm on the specific target?

Analysis of the results:
- What is the realized speedup on each target platform?
- What is the effort taken to realize the speedup?
- What can we conclude from these results?
- What went wrong and what went very well?
- Which further improvements are possible in the near future?

1.4 Contributions

The contributions of this thesis are:

- Implementation of object detection on a CUDA-enabled GPU: The object detection application demands high-speed computing to achieve real-time performance. By implementing object detection on a CUDA-enabled GPU this can be achieved.
- Implementation of UMASIS on a CUDA-enabled GPU: The UMASIS application is very time consuming. To reduce its runtime, the UMASIS application is implemented on a CUDA-enabled GPU.
- Analysis of the development process for CUDA based on two implementations: The development time and performance gain are recorded during the implementation of both use cases. With this information, performance and development-time estimates can be made for future projects. To ease this process, some important factors related to the success rate of an implementation need to be identified.


1.5 Related Work

1.5.1 CUDA

Experience with applying the NVIDIA CUDA programming model to accelerate applications is still recent, and the number of publications is small but growing rapidly. Nevertheless, there are a couple of publications on experiences from academic research and industrial products in which CUDA was used to achieve significant parallel speedups. The applications fall into a variety of application domains, including machine learning [6], database processing [7], bioinformatics [8,9], numerical linear algebra [10,11], medical imaging [12], and physical simulation [13,14]. From these individual experiences the following can be concluded:

- CUDA provides a straightforward means of describing inherently parallel computations, particularly for data-parallel algorithms.
- NVIDIA GPU architectures deliver high computational throughput.
- CUDA-enabled GPUs are relatively small, less expensive and highly available compared to systems delivering comparable performance.

However, it was found that this performance can only be realized when the specific architecture is taken into account, obeying the following rules:

- Realize a sufficient amount of fine-grained parallelism, by splitting the code into many small subtasks.
- Make use of blocked computations that fit the CUDA thread-block abstraction. Threads are defined in blocks; by dividing the problem over a large number of small blocks the GPU is used as efficiently as possible.
- Make data-parallel programs efficient: threads of a warp should follow the same execution path. Set the right number of threads per block and avoid the use of too many registers and branches. A good analysis tool for this is the CUDA occupancy calculator.
- Make full use of the high-speed, low-latency per-block shared memory. Use the fast registers and shared memory as much as possible and avoid overuse of the global memory.

David Luebke gives a more in-depth coverage of the CUDA programming model [15]. Knowledge of these functional concepts is required to make sure that the programmer can exploit the full potential of the GPU.
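To make the thread-block abstraction used by these rules concrete, the following minimal sketch (not taken from the cited publications; the kernel name, data size and scale factor are chosen purely for illustration) assigns one thread to each array element and groups the threads into blocks of 256:

    #include <cuda_runtime.h>

    // Each thread scales one element; blocks of 256 threads tile the array.
    __global__ void scale(const float *in, float *out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n)                                      // guard the last, partial block
            out[i] = in[i] * factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));   // a real program would copy input data from the host

        // Fine-grained decomposition: one thread per element, grouped into blocks.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(in, out, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }

The number of threads per block (256 here) is exactly the kind of tuning parameter the occupancy calculator mentioned above helps to choose.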

1.5.2 Image processing on GPUs

GPUs were originally intended to perform graphics-related calculations, primarily for games. As stated earlier, GPUs can now be used to perform other calculations as well. There are several publications concerning GPUs for image processing tasks. The following papers [16,17,18] are reviewed to examine the application of image processing on GPUs.

1.5.2.1 GPUCV

The field of image processing is very broad and cannot be covered entirely on the GPU. One attempt has been made to introduce a framework that exploits the GPU processing power to accelerate image processing and computer vision [16].


This framework, named GPUCV, is an open library that is similar to OpenCV [19]. The main disadvantage is that GPUCV provides much less functionality than OpenCV. The library also uses the old pixel-shader programming tools and does not use the more advanced CUDA programming tools. Comparing GPUCV with the native OpenCV library results in speedups of the individual functions ranging between 4x and 18x. The biggest advantage of the framework is the ease of applying the accelerated functions in existing code written for OpenCV: the user of this library only has to replace the original OpenCV procedures by their GPUCV equivalents. A disadvantage of the current GPUCV library is that it is no longer up-to-date due to the lack of CUDA support. Still, the results give a good insight into the full potential of GPU processing.

1.5.2.2 Mapping of algorithms

There are several reasons to use GPUs for image processing, for example highly parallel floating point computations, increased memory bandwidth and multi-GPU computing. The CUDA programming model can be used for these purposes. According to [17] it is important to investigate the mapping of algorithms to GPUs, in order to realize more advanced optimizations and achieve speedups of 10x or even higher, instead of the typical speedups of 2x-3x for 'naive' implementations.

1.5.2.3 Canny edge detection

Canny edge detection is widely used for edge feature detection, very often in the computer vision domain. Therefore a Canny edge detector has been created that makes use of the CUDA programming model to accelerate the detection [18]. Most of the code written for this application is meant to optimize for the memory usage, alignment, and size properties of the GPU architecture. This results in a significant speedup compared to a single-core processor. The results compared to multi-core processors are more moderate because of the use of hysteresis, which was not implemented on the GPU.

1.5.3 Comparison between various architectures

It is not easy to make a fair comparison between various architectures, especially if systems are non-generic and have very different architectures. Still, there is a high demand from software developers and system architects to know what the best target system is for their application. The developers of Simbiosys [20], an application for the bio-industry, report on the choices and findings related to the acceleration of their application. It compares the Cell, the GPU and an FPGA, where the Cell was chosen as the best architecture for their application. The main downside of the FPGA is the lack of support for floating point operations. The GPU and Cell are more similar to each other in hardware architecture than to the FPGA. Some differences between these architectures are: the Cell supports double precision floating point calculations; GPUs use caches, where the Cell puts complete control in the hands of the programmer through direct DMA programming; GPUs use wider registers (256-bit vs. the 128-bit of the Cell); the Cell is able to execute two operations in a single cycle; and finally, the Cell has a higher throughput than the GPU. A cost comparison between all platforms is made. To compare the individual platforms, the processing power of 400 general purpose computers was used as reference, and the number of GPUs and Cells required to deliver the same processing power was calculated.


The results show that in 2008, for their application, the Cell was the most cost-efficient architecture (using PS3 game consoles incorporating Cell processors), followed respectively by the GPU and the FPGA.

1.6 Structure of this thesis

The introduction, motivation, research goals, analysis strategy and related work of the project are given in Chapter 1. Background information on the different architectures and the definition of parallelism can be found in Chapter 2. The implementation, realization, optimization and analysis of the two cases can be found in Chapters 3 and 4; these chapters cover the process of translating the existing serial code to parallel code. In Chapter 5 the analysis and findings are discussed. The conclusions and final recommendations are given in Chapter 6.


Chapter 2: Background

This chapter provides the background for this thesis. Section 2.1 discusses current and future developments in processing. Sections 2.2 and 2.4 briefly introduce the two main architectures, the multi-core CPU and the NVIDIA CUDA-enabled GPU, respectively. Section 2.3 discusses the levels of parallelism and gives examples of parallelizing code. Finally, Section 2.5 discusses some APIs and gives code examples.

2.1 Current and future developments in processing

2.1.1 From serial to parallel

For a long time we profited from Moore's law [21] and benefited from an exponential speed increase due to an almost exponential growth of the clock frequency, until a few years ago the maximum clock frequency of modern computers reached a limit of approximately 3 GHz. This is due to power constraints [22] that occur at higher frequencies and to the constraint of the finite speed of light [40]. Several solutions have been introduced to deal with this problem. One of these solutions is integrating multiple independent processors on one chip. Certain parts of the code can then be executed concurrently on each core, which will theoretically result in a shorter runtime.

2.1.2 Multi-core Architecture Trends

The general trend in processor development is from multi-core to many-core: from dual-, quad- and octo-core chips to ones with tens or even hundreds of cores. In addition, multi-core chips mixed with simultaneous multithreading, memory-on-chip and special-purpose "heterogeneous" cores (Cell/B.E.) promise further performance and efficiency gains, especially in processing multimedia, recognition and networking applications. There is also a trend of improving energy efficiency by focusing on performance-per-watt.

2.1.3 General Purpose Graphics Processing Units (GPGPU)

The use of co-processors specialized for certain tasks is not new in computing history. One of these specialized co-processors is the Graphics Processing Unit (GPU), which has as its main task the computation of computer images and the transmission of the computed data to a computer screen. As computing evolved and graphics calculations moved from mainly 2D to 3D, GPUs evolved from single-core to many-core co-processors. Only recently have GPUs become usable for general purpose, non-graphics applications. Big GPU vendors such as NVIDIA and AMD/ATI are expanding their drivers to support the processing of user-generated code, often written in a C-like language, on their GPUs. Hereby a new platform is created on which numerical calculations can run as massively parallel programs.


2.2 Multi-core Architecture

A multi-core processor combines two or more independent cores into a single package on a single integrated circuit (IC). A dual-core processor contains two cores, and a quad-core processor four. A processor with all cores on a single chip is called a monolithic processor [23]. Each core applies optimizations such as superscalar execution, pipelining and multithreading. A system with N cores is used efficiently when it can compute N or more threads concurrently. The most common multi-core processors are those used in personal computers, primarily from Intel and AMD. However, the technology is also widely used in other areas, especially in embedded processors and in GPUs.

2.2.1 SIMD

General processor cores are designed to execute instructions sequentially. These instructions perform a single operation on the processor, like add, subtract, read or load. Over time, extra instruction sets, like MMX and SSE for the Intel x86 or VMX/AltiVec for the IBM PowerPC, were introduced by the various processor manufacturers. These instruction sets can handle single instruction multiple data (SIMD) operations: by performing multiple operations in parallel with one instruction, instead of performing those operations sequentially, a speedup can be achieved.
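As a small illustration (added here, not part of the original text), the fragment below uses the SSE intrinsics available on x86 processors to add four single-precision floats per instruction; the function name is arbitrary and, for brevity, the length n is assumed to be a multiple of four.

    #include <xmmintrin.h>   /* SSE intrinsics such as _mm_add_ps */

    /* c[i] = a[i] + b[i]; SSE processes four floats per instruction. */
    void add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(b + i);
            __m128 vc = _mm_add_ps(va, vb);    /* one SIMD add = 4 scalar adds */
            _mm_storeu_ps(c + i, vc);          /* store 4 results */
        }
    }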

2.2.2 Architecture composition

One of the biggest areas for variety in multi-core architecture is the composition and balance of the cores themselves. Some architectures apply one core design and repeat it consistently (homogeneous), while others apply a mixture of different cores, each optimized for a different role (heterogeneous).

2.2.3 Caches

Caches are primarily used to reduce the latencies involved in memory fetches, by creating small but fast memory spaces close to the processing cores. The cores share the same interconnect to the rest of the system, which can in some cases cause issues in providing sustainable memory bandwidth. For example, if all cores perform a read/write on the memory, congestion on the interconnect can occur: the interconnect then fails to facilitate all the requests at once. To prevent this problem there are multiple levels of caches inside the processor (Figure 2). A core will first request data from the caches before it addresses the global memory, thereby lowering the amount of communication on the interconnect.


Figure 2. Example of homogeneous multi-core processor

2.3 Definition of parallelism

There are various types of parallelism in computing, of which the most important are:

- Task-level parallelism
- Data-level parallelism
- Instruction-level parallelism

2.3.1 Task-level parallelism

Task-level parallelism is achieved in a multiprocessor system when each processor executes a different thread or process on the same or different data. The threads may execute the same or different code. Threads can communicate with each other, usually by passing data from one to another. An example of task-level parallelism is when there are two tasks and two processors: both tasks can then run concurrently in parallel, with each processor assigned to one of the two tasks.
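A minimal OpenMP sketch of this two-tasks/two-processors situation is given below (added for illustration; the two task functions are placeholders for real workloads):

    #include <stdio.h>
    #include <omp.h>

    /* Placeholder tasks; in practice these would be the real workloads. */
    static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
    static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

    int main(void)
    {
        /* Task-level parallelism: each section is assigned to a different thread. */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            task_a();

            #pragma omp section
            task_b();
        }   /* implicit barrier: both tasks have finished here */
        return 0;
    }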

2.3.2 Data-level parallelism

Data-level parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls the operations on all pieces of data; in others, different threads control the operation, but they execute the same code. An example of a case where data-level parallelism can be used is the subtraction of two matrices of size NxN. We can divide the matrix into P parts, such that 1 ≤ P ≤ N², where P is the number of processors, and then let each processor subtract its part of the matrix and save the result to the corresponding part. In theory, if the bandwidth is unlimited, the matrix subtraction should in this case be P times faster than when performed on a single processor.
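A sketch of this matrix subtraction in CUDA is given below (illustrative, not part of the original text); every element is handled by its own thread, so the P parts described above become individual threads grouped into 16x16 blocks:

    #include <cuda_runtime.h>

    // One thread per element: C[y][x] = A[y][x] - B[y][x] for an N x N matrix
    // stored in row-major order.
    __global__ void mat_sub(const float *A, const float *B, float *C, int N)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < N && y < N)
            C[y * N + x] = A[y * N + x] - B[y * N + x];
    }

    // Host-side launch; A, B and C are assumed to be device pointers that
    // already hold the data. The 16x16 block size is an arbitrary common choice.
    void launch_mat_sub(const float *A, const float *B, float *C, int N)
    {
        dim3 block(16, 16);
        dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
        mat_sub<<<grid, block>>>(A, B, C, N);
    }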

2.3.3 Instruction-level parallelism

An application consists of a set of instructions which are executed by a processor one after another. These instructions can be re-ordered and combined into groups.


These groups are then executed in parallel without changing the result of the program. This principle is called instruction-level parallelism. Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage. A processor with an N-stage pipeline can have up to N different instructions in different stages. An example of a pipelined processor is a RISC processor with five stages: instruction fetch, decode, execute, memory access, and write back. Some processors can issue more than one instruction at a time; these are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them.
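As a small added illustration (variable names are arbitrary), the first three statements below form a dependency chain and must execute in order, whereas the last two are independent and can be issued together by a superscalar core:

    /* Illustrative only. */
    float ilp_demo(float a, float b, float c, float d, float e, float f)
    {
        /* Dependency chain: each statement needs the previous result. */
        float s = a + b;
        float t = s * c;
        float u = t - d;

        /* Independent statements: no data dependencies between them, so they
           can be issued in the same cycle or reordered freely. */
        float x = c * d;
        float y = e - f;

        return u + x + y;
    }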

2.3.4 Dependencies

When implementing parallel algorithms, it is important to understand data dependencies. The basic rule is that a program cannot run quicker than its longest chain of dependent calculations, also known as the critical path. Calculations that depend on previous calculations in the chain must be executed one after another. If there are no dependencies between the computations of an algorithm, then it is possible to execute independent parts of the algorithm in parallel (Figure 3).

Figure 3. Example of a speedup between serial vs. parallel implementation

2.3.5 Amdahl's law and Gustafson's law

The speedup from parallelism should theoretically be linear: when the number of processors is doubled, the processing time should be halved. In general, however, the speedup of algorithms does not approach this theoretical optimum. Typically the speedup is near the optimum for a limited number of processors, and beyond a certain number of processors it converges to a constant value. This typical behavior is shown in Figure 4.


This behavior is in most cases the result of hardware constraints limiting the maximum performance of the system, for example the global memory I/O speed or the maximum throughput of the peripheral bus.

Figure 4. Ideal vs. typical speedup

Amdahl's law [25] can be used to determine the potential speedup of an algorithm on a parallel computing platform. The law states that the portion of the program which cannot be parallelized will limit the overall speedup available from parallelization. Almost any algorithm consists of parallelizable parts and non-parallelizable parts. This relationship is given by the equation:

S = 1 / ((1 - P) + P / N)        (Equation 1)

The speedup S is determined by the number of processing elements N and the proportion P of parallelizable code in the algorithm. If the parallelizable part of the program accounts for 50% of the total runtime, then a speedup higher than 2 cannot be achieved, no matter how many processors are added; in theory the total runtime can at best approach the runtime of the non-parallelizable part. Amdahl's law assumes a fixed problem size and that the size of the sequential section is independent of the number of processors. To deal with this, Gustafson's law [26] was formulated. This law is closely related to Amdahl's law but includes a scaling factor. The law can be formulated as:

S = P - α (P - 1)        (Equation 2)

where P is the number of processors, S is the speedup, and α the non-parallelizable part of the process. An example of Amdahl's law is given in Figure 5.


Figure 5. Various speedups with different α (non-parallelizable part)
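To make the two laws concrete, the small program below (added for illustration, not part of the thesis) evaluates both equations for a program that is 90% parallelizable, i.e. P = 0.9 in Equation 1 and α = 0.1 in Equation 2; in the code, N denotes the number of processors for both laws.

    #include <stdio.h>

    /* Amdahl (fixed problem size):      S = 1 / ((1 - P) + P / N)  */
    static double amdahl(double P, int N)        { return 1.0 / ((1.0 - P) + P / N); }

    /* Gustafson (scaled problem size):  S = N - alpha * (N - 1)    */
    static double gustafson(double alpha, int N) { return N - alpha * (N - 1); }

    int main(void)
    {
        double P = 0.9, alpha = 1.0 - P;
        for (int N = 1; N <= 256; N *= 4)
            printf("N = %3d   Amdahl = %6.2f   Gustafson = %6.2f\n",
                   N, amdahl(P, N), gustafson(alpha, N));
        /* Amdahl converges to 1 / (1 - P) = 10x, while Gustafson keeps growing with N. */
        return 0;
    }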

2.3.6 Parallelization of code

Parallelization of code is a process that is best done in a number of steps. The basic steps in designing parallel applications are:

- Partitioning: the partitioning stage is intended to find opportunities for parallel execution. The focus is to define a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem.
- Communication: the tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data from another (previous) task, so data must be transferred between tasks before computation can proceed. This information flow is specified in the communication phase of a design.
- Agglomeration: in this stage there is a move from the abstract toward the concrete situation. Decisions made in the partitioning and communication phases are revisited with the goal of obtaining an algorithm that will execute efficiently on a parallel computer system. In particular, it must be considered whether it is useful to combine tasks identified by the partitioning phase into a smaller number of larger tasks. It is also determined whether it is worthwhile to replicate data and/or computation.
- Mapping: this is the final stage of the parallel algorithm design process, in which it is specified where each task is executed. This mapping problem does not arise on systems providing automatic task scheduling.


2.3.7 Example of a parallelization with mutual independencies

The following is an example of a parallelization of code with no mutual dependencies between the computations. In this example a matrix B is computed in T(X * Y) time steps.

    Function Calc_B
      For each y
