Two Dimensional Loop Nest Scheduling with OpenMP

Howard Price

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2006

Abstract

Standard OpenMP scheduling options, such as static and dynamic, can be used to parallelise a nested loop structure by distributing the iterations of the outer-most loop. However, this can be very inefficient in some circumstances. For example, in a double nested loop that is used to traverse the elements of a two-dimensional array, parallelising the outer loop is effectively a one-dimensional decomposition of a two-dimensional problem. This can create large quantities of elements on the boundaries between the chunks of iterations, and increase inter-processor communication costs as a result. Ultimately this can be detrimental to performance.

Using OpenMP, two-dimensional decomposition and scheduling algorithms have been developed that are based on 1D equivalents. The implementation of these algorithms is based on decomposition of both levels of a double nested loop so that the problem is divided into two-dimensional chunks, or “patches”. This report details extensions of previous work on 2D guided scheduling and 2D feedback guided loop scheduling (FGLS), as well as the logic and implementation behind new algorithms for 2D versions of static, dynamic, affinity, and dynamic affinity scheduling.

Performance benchmarking has shown that 2D scheduling algorithms can offer significant performance advantages over the standard OpenMP scheduling options. When a good combination of 2D decomposition and scheduling algorithm is used, excellent scaling can be observed in situations in which the 1D scheduling algorithms perform badly. It has also been found that the performance of the scheduling algorithms can be extremely sensitive to overheads incurred by the two-dimensional nature of the problem. The simplest and most efficient schedulers can perform best in practice, and for this reason an optimised formulation of two-dimensional FGLS is found to have the most practical value of the nested loop schedulers tested.



Contents

1 Introduction
   1.1 Overview of Thesis

2 Background Theory
   2.1 Shared Memory Programming
       2.1.1 OpenMP
       2.1.2 Parallel Metrics
   2.2 Loop Level Parallelism
       2.2.1 Standard OpenMP Scheduling Options
       2.2.2 Alternative Scheduling Algorithms

3 Underlying Implementation of 2D Scheduling Algorithms
   3.1 Legacy Code
   3.2 Primary Data Structures
       3.2.1 The Patch
       3.2.2 The Partition
       3.2.3 The Domain
   3.3 Loads
   3.4 Timing
       3.4.1 Communication Time
       3.4.2 Computation Time
       3.4.3 Data Array Padding
   3.5 Code Verification
   3.6 False Sharing
   3.7 Assessment of Code

4 Two-Dimensional Nested Loop Scheduling
   4.1 Domain Decomposition Algorithms
       4.1.1 2D Recursive Guided
       4.1.2 2D Isotropic Guided Decomposition
       4.1.3 RCB Decomposition
   4.2 Two-Dimensional Scheduling Algorithms
       4.2.1 2D Static and Dynamic Scheduling
       4.2.2 2D Guided Scheduling
       4.2.3 2D Affinity Scheduling
       4.2.4 2D Dynamic Affinity Scheduling
       4.2.5 2D Feedback Guided Loop Scheduling

5 Results and Analysis
   5.1 Benchmark Tests
   5.2 Hardware Considerations
   5.3 Presentation of Results
   5.4 Constant Load
       5.4.1 1D Scheduling Algorithms
       5.4.2 2D Static Scheduling
       5.4.3 2D Dynamic Scheduling
       5.4.4 2D Guided Scheduling
       5.4.5 2D Affinity Scheduling
       5.4.6 2D Dynamic Affinity Scheduling
       5.4.7 2D Feedback Guided Loop Scheduling
   5.5 Gaussian Load
       5.5.1 1D Scheduling Algorithms
       5.5.2 2D Static Scheduling
       5.5.3 2D Dynamic Scheduling
       5.5.4 2D Guided Scheduling
       5.5.5 2D Affinity Scheduling
       5.5.6 2D Dynamic Affinity Scheduling
       5.5.7 2D Feedback Guided Loop Scheduling

6 Discussion
   6.1 Comparison of the Feedback Guided Schedulers

7 Conclusions
   7.1 Project Management
   7.2 Future Work

A Code Usage

B Glossary

List of Figures

1.1  Example 1D and 2D decompositions
2.1  Block and cyclic decompositions
3.1  Example of the simple filter
3.2  Two dimensional Gaussian load
3.3  Illustration of cache line alignment with a patch in memory
4.1  “Olkhovskyy” Guided Decomposition of a square domain
4.2  2D Isotropic Guided Decomposition of a square domain
4.3  RCB decomposition of a square domain into 8 sub-domains
4.4  RCB decomposition of a square domain into 5 sub-domains
4.5  RCB Guided Decomposition of a square domain
4.6  Comparison of guided decomposition algorithm chunk size distribution
4.7  2D static/dynamic decomposition onto 4 processors
4.8  Patch re-allocation in affinity scheduling
4.9  Example of 2D affinity scheduling in action
4.10 Example of 2D dynamic affinity scheduling in action
4.11 Example 2D FGLS decomposition and 1D piecewise workload approximation
4.12 Example 2D FGLS decomposition of a Gauss load on 16 processors
4.13 Load imbalance per iteration for 2D FGLS
5.1  Parallel speed-up of code using static scheduling with a constant load
5.2  Parallel speed-up of code using dynamic scheduling with a constant load
5.3  Parallel speed-up of code using 2D static scheduling with a constant load
5.4  2D block decomposition details in a square domain of side length 192
5.5  Parallel speed-up of code using 2D dynamic scheduling with a constant load
5.6  Parallel speed-up of code using 2D guided scheduling using RCB guided decomposition with a constant load
5.7  Parallel speed-up of code using 2D guided scheduling using 2D isotropic guided decomposition with a constant load
5.8  Parallel speed-up of code using 2D affinity scheduling using RCB affinity decomposition with a constant load
5.9  Parallel speed-up of code using 2D affinity scheduling using 2D isotropic guided decomposition with a constant load
5.10 Parallel speed-up of code using 2D dynamic affinity scheduling using RCB guided decomposition with a constant load
5.11 Parallel speed-up of code using 2D dynamic affinity scheduling using 2D isotropic affinity decomposition with a constant load
5.12 Parallel speed-up of code using 2D FGLS using a constant load
5.13 Parallel speed-up of code using static scheduling with a Gauss load
5.14 Parallel speed-up of code using dynamic scheduling with a Gauss load
5.15 Example 2D static scheduling decomposition and patch allocation
5.16 Parallel speed-up of code using 2D static scheduling with a Gauss load
5.17 Parallel speed-up of code using 2D dynamic scheduling with a Gauss load
5.18 Parallel speed-up of code using 2D guided scheduling using 2D isotropic guided decomposition with a Gauss load
5.19 Parallel speed-up of code using 2D guided scheduling using 2D RCB guided decomposition with a Gauss load
5.20 Parallel speed-up of code using 2D affinity scheduling using 2D isotropic guided decomposition with a Gauss load
5.21 Parallel speed-up of code using 2D affinity scheduling using 2D RCB guided decomposition with a Gauss load
5.22 Parallel speed-up of code using 2D dynamic affinity scheduling using 2D isotropic guided decomposition with a Gauss load
5.23 Parallel speed-up of code using 2D dynamic affinity scheduling using RCB guided decomposition with a Gauss load
5.24 Parallel speed-up of code using 2D FGLS with a Gauss load
6.1  Optimum observed parallel efficiency for each schedule type acting on a constant load in the “high computation” regime
6.2  Optimum observed parallel efficiency for each schedule type acting on a Gauss load in the “high computation” regime
6.3  Best parallel efficiency of all schedule types with a constant load in the “communication only” regime
6.4  Best parallel efficiency of all schedule types with a Gauss load in the “communication only” regime
6.5  Best parallel efficiency of all schedule types with a constant load in the “balance” regime
6.6  Best parallel efficiency of all schedule types with a Gauss load in the “balance” regime
6.7  Iteration times for 2D dynamic affinity and 2D FGLS schedules working on a Gauss load

Listings

1.1  2D Nested Loop structure of interest in this report
2.1  1D loop with guided scheduling
2.2  Reversed 1D loop with guided scheduling
3.1  Timed work routine

Acknowledgements

This project started off slowly and greatly accelerated as it progressed. I would like to thank Mark Bull for his patience and supervision throughout. I would also like to thank him, and the rest of the EPCC staff, for the extraordinarily high standard of teaching I have had the pleasure of witnessing during my time here. This year’s MSc cohort have also been exceedingly generous in their expert advice. I would especially like to doff my cap to the community that develops and maintains the wonderful software that I have generously been allowed to use to aid me in producing this report: Linux, Emacs, WinSCP, IrfanView, Inkscape, CVS, LaTeX... And I would also like to take the opportunity to thank my wonderful girlfriend Catherine for putting up with me this summer.

Chapter 1

Introduction

The purpose of this report is to detail the performance of nested loop-level parallelisation techniques implemented in OpenMP [1]. User-defined scheduling routines have been constructed and tested, and the results are compared and contrasted with the standard OpenMP schedule options and previous work on this subject [2]. The performance of a simple test code is assessed in this report, although the methods should be employable on production codes.

Most scientific codes consist of a small number of iterative loops that perform the majority of the useful work. The iterations of a loop can be shared between the processors in a multiprocessor system to achieve the same result in less time. When parallelising code in this way it is important to assess the independence of the loop iterations to maintain the correct functionality of the application. The two-level nested loops that this report is concerned with take the form:

Listing 1.1: 2D Nested Loop structure of interest in this report

    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            update(a[i][j]);

where a is a two-dimensional array and N and M are the domain extents in each dimension. The element at (i, j) is updated via a calculation that may or may not involve neighbouring elements.

The OpenMP standard contains loop level decomposition methods (see Section 2.2.1) that can be used to distribute the iterations of a loop between threads. This can be an effective parallelisation strategy for many problems, and a careful choice of decomposition method can result in a well balanced load across the processors. However, any inter-thread communication that is required by the algorithm in question will affect the performance of the code. In the code snippet given above, for example, to update the value of an element a[i][j] the update subroutine may require information from the neighbouring array elements to carry out the computation, e.g. a 4/5 point stencil from a particular two-dimensional discretisation. If neighbour information is required, then a one-dimensional decomposition of the outermost loop, while giving a good load balance, may also result in a large, performance-hindering communication-to-computation ratio due to the shape of the sub-domains that are created. This is illustrated in Figure 1.1.


Figure 1.1: Example of 1D and 2D domain decompositions onto 4 processors. Here, the boundary between neighbouring threads is 3 times the domain length for the 1D decomposition, but only 2 times the domain length for the 2D decomposition, so the fraction of boundary elements per patch (i.e. the boundary-to-size ratio) is lower for each thread. Therefore the 2D decomposition requires less communication to exchange boundary data.

Two-dimensional decomposition strategies are the focus of this report. They aim both to (i) achieve a good load balance and (ii) minimise communication. The techniques studied involve a distribution of each level of the nested loop onto threads in a fashion that creates contiguous 2D regions of iteration space, referred to as chunks or patches, that are processed as an entity. The computational domain is split into sub-domains that can be mapped effectively onto available processors and which have a low boundary-to-size ratio to minimise unnecessary inter-processor communication.

1.1 Overview of Thesis

Chapter 2 covers the background material that is necessary to understand nested loop scheduling and the remainder of the thesis. The essentials of shared memory programming are covered with particular emphasis on the OpenMP standard. The reader is reminded of the key parallel metrics by which the scalability of a code is measured. Finally, the well known one-dimensional loop scheduling algorithms are outlined. Chapter 3 formulates a simple problem that can be approached using loop level parallelism. The main features of the accompanying code that allow the easy implementation of custom decomposition and scheduling algorithms are explained, along with important in-built load models. Code verification and performance are also considered. Chapter 4 explains the decomposition and scheduling techniques that have been developed during this project. The methodology and implementation of each are described in turn with reference to the data structures introduced earlier. Chapter 5 reports the performance and scalability of the implemented schedule functions during a series of benchmark tests. The results are discussed in Chapter 6 and finally the conclusions drawn from the project are given in Chapter 7.

Chapter 2

Background Theory

2.1 Shared Memory Programming

Parallel shared memory programming involves implicit interprocessor communication via memory accesses to a global address space. A serial code can be parallelised using the shared memory paradigm without a great deal of effort, which makes shared memory programming appealing to computational scientists who wish to quickly parallelise legacy codes. Parallelising codes using a distributed memory message passing approach, such as MPI [3], will generally involve a much greater initial expenditure of effort. The downside to shared memory programming is that the code will typically not be scalable to very large numbers of processors. Physical and economic limitations are evident in the available hardware: the aggregate memory bandwidth available to the processors and cache coherency issues lead to a bottleneck that ultimately limits the scalability of codes running on a shared memory platform.

Shared memory programming (also known as shared variable programming) is based on the notion of threads. Much in the same way that multiple threads of execution may run concurrently in an OS on a single processor machine, multiple threads may run simultaneously on a system with multiple processors, usually with one thread per processor. Each thread is uniquely identifiable and has access to shared variables. In a shared memory program there may be shared data that is accessible to all threads, as well as private data that is only accessible to the thread that “owns” it.

There is no explicit inter-thread communication in shared memory programming. All communication must be orchestrated via reads and writes to the shared address space. This form of implicit communication can be very fast on symmetric multiprocessor (SMP) and cache-coherent non-uniform memory access (cc-NUMA) machines. Assuming a good load balance across processors, the performance of a code is ultimately limited by the memory hierarchy of the system it is running on.

Synchronisation is a major issue in shared memory programming. With increasing numbers of parallel threads each having the ability to access a common memory, the possibility for undesirable conflicts increases. Concurrent reads may be acceptable, but concurrent writes to the same address by different threads will produce undefined results. This can lead to correctness errors in the output from the parallelised algorithm. The errors in poorly synchronised code may not immediately present themselves during testing, as the probability of a memory access clash may be statistically very low. The shared memory programmer must be vigilant in addressing synchronisation within his code, but be careful not to over-synchronise, which can severely inhibit performance and scalability.

2.1.1 OpenMP

OpenMP [1] is an API for programming shared memory parallel computers. Implementations of the standard allow OpenMP to be used on true shared memory machines as well as distributed shared memory machines such as SMP clusters. The shared memory programming paradigm is based on the single program multiple data (SPMD) programming model: a serial code can be “marked up” with directives to parallelise it, which results in different threads operating on different data at run-time.

OpenMP allows serial code to be parallelised via the addition of compiler directives. These directives are preceded by a sentinel that effectively makes the directive a comment to a serial compiler. This is highly advantageous in reducing development costs as the same code can be compiled and run on both serial and parallel platforms; only one copy of the code needs to be maintained, so algorithmic changes to the code can be tested in serial and quickly executed in parallel. Conversely, codes parallelised using the distributed memory programming paradigm usually require major modifications to algorithms and data structures – an approach that necessitates far more effort to maintain a single code-base that can be easily compiled for execution on serial and parallel platforms.

In OpenMP the thread team is made up of a master thread as well as zero or more other threads that can fork and join with the master thread of execution during the lifetime of the executing program. For performance reasons it is sensible to have no more threads than physical processors. Work sharing directives are responsible for distributing the workload between the threads in an OpenMP program (see Section 2.2.1).

As inter-thread communication occurs only via reads and writes to shared memory, access to shared variables must often be protected. In OpenMP a critical section of code is a block that only one thread may be executing at any one time. Another method of protecting updates to shared data is to use a lock. Locks are simply variables that can be set by a thread; a lock can only be un-set by the thread that locked it, and if another thread encounters a lock in the code it cannot proceed while the lock is set. Whilst locks are no more expensive to use than critical sections, they can make code difficult to comprehend and manage if used without due care and attention.
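As a minimal illustration of this directive-based style (a generic sketch, not taken from the project code; the array and loop bounds are arbitrary), the first loop below is shared between threads and a critical section protects the update to a shared accumulator in the second:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void)
    {
        double a[N];
        double sum = 0.0;
        int i;

        /* Work-sharing directive: the iterations of the loop are divided
         * between the threads in the team. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = (double) i;

        /* The critical section serialises updates to the shared variable
         * sum (a reduction clause would be more efficient in practice). */
        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            #pragma omp critical
            sum += a[i];
        }

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Without the critical directive, the concurrent writes to sum would be exactly the kind of unprotected update discussed in the previous section.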

2.1.2 Parallel Metrics

The key metrics for measuring parallel performance are parallel speed-up and parallel efficiency. For a given problem size, parallel speed-up is defined as the ratio of the best serial execution time to the parallel execution time on p processors. Parallel efficiency is the parallel speed-up divided by the number of processors. Therefore perfect scaling is indicated by a speed-up of p and a parallel efficiency of 1. Performance is presented using these two metrics in this report.
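Written out, with T_serial the best serial execution time and T_p the execution time on p processors, these definitions are:

    S(p) = \frac{T_{\text{serial}}}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_{\text{serial}}}{p \, T_p}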


Figure 2.1: Example of block (left) and cyclic (right) decompositions onto 2 processors. Block decomposition results in one chunk per thread. A cyclic decomposition can result in many separate chunks per thread and can increase the number of boundary elements.

2.2 Loop Level Parallelism

The majority of the work inside scientific codes takes place within loops. Therefore the most obvious place to start implementing parallelism in code is in the most time consuming loops, be they single-level or nested. The independence of loop iterations must always be at the forefront of the developer’s mind when parallelising a loop: when the responsibility for executing the iterations of a loop has been divided between a team of parallel threads, the order in which the iterations are processed is no longer defined.

2.2.1 Standard OpenMP Scheduling Options

The OpenMP standard includes a number of simple schedule clauses that can be used to control how the iterations of a loop are distributed. These are placed at the end of an OpenMP pre-processor directive in the source code that instructs the compiler to parallelise the immediately following do/for loop in a particular manner. An important notion in the schedules is the chunk, which is a distinct, contiguous region of iteration space that is processed as a whole by a thread.

Static Scheduling

In a do/for loop preceded by the STATIC schedule clause, loop-level parallelism is implemented by distributing contiguous chunks of iteration space of a specified size onto threads. If a chunk size is not specified then there will be p chunks, each of which has a size of approximately N/p, where N is the problem size and p is the number of processors. This gives a block decomposition, in which each thread is allocated only one chunk to process. If a chunk size of less than N/p is specified then this schedule clause will result in a cyclic distribution of chunks between the processors, in which iteration space is divided up into more chunks than there are threads. The chunks are allocated to threads in a round robin manner. Block and cyclic distributions are illustrated in Figure 2.1.


A block decomposition can give a good load balance in circumstances in which the load is the same at every region of the solution domain. Cyclic distributions can be useful for sharing an uneven load between processors, but can be inefficient for some applications as the decomposition can result in a lot of data being communicated between threads to perform a particular calculation.
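For reference, the two cases look like this in code (a generic sketch in the style of the later listings; update_row is a hypothetical per-iteration work routine):

    /* Block decomposition: one contiguous chunk of roughly N/p iterations
     * per thread. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        update_row(i);

    /* Cyclic distribution: chunks of 4 iterations dealt out to threads in a
     * round robin manner. */
    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < N; i++)
        update_row(i);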

Dynamic Scheduling

The decomposition strategy for dynamic scheduling is the same as for static scheduling. As with STATIC scheduling the regular chunk size can be controlled. However, the dynamic scheduling algorithm differs in that the distribution of chunks onto threads is not determined before the main loop commences. The iteration space is divided into chunks and distributed to threads as and when they are available to work during the execution of the code. Therefore DYNAMIC scheduling is an example of a dynamic load-balancing algorithm. Dynamic load balancing routines can be useful when the load is not known in advance, but can harm performance as data locality may not be at all well preserved between successive executions of the parallel loop.
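The clause differs only in name and in when the chunk-to-thread mapping is decided (again a generic sketch with a hypothetical work routine):

    /* Chunks of 8 iterations are handed out to whichever thread is free at
     * run time, rather than being assigned before the loop starts. */
    #pragma omp parallel for schedule(dynamic, 8)
    for (i = 0; i < N; i++)
        update_row(i);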

Guided Scheduling

As in dynamic scheduling, the assignment of chunks onto threads is not pre-determined for guided scheduling [4]. Instead, chunks are allocated to threads as required during program execution when a thread is available to work. Therefore guided scheduling is also an example of a dynamic load balancing algorithm. In guided scheduling there is one global queue of chunks that is accessible by all threads. The chunk size is initially large and gets successively smaller down the queue in a specific pattern. The size of a chunk is approximately half the number of iterations remaining divided by the number of threads in the team:

    \text{chunk size} = \frac{r}{2p}    (2.1)

where r is the number of remaining iterations and p is the number of processors (or threads). A minimum chunk size can be specified by the user, as it can be inefficient to process many small chunks individually. Strictly speaking the OpenMP standard defines the chunk size as r/(kp) for guided scheduling, but most implementations set k = 2.

The chunks in the queue start out large and become progressively smaller, so the algorithm performs well for problems in which the work load is largest in regions of the domain in which the chunk size is smallest. However, if the workload is high where the chunk size is large then the cost of the first chunks in the queue can vastly outweigh the cost of chunks further down the queue. In this circumstance it is possible for the first chunk in the queue to take more time to process than the rest of the queue combined. The thread that picks up and processes this first chunk may be the last to finish! Clearly this defeats the dynamic load balancing remit of guided scheduling and may severely inhibit the chances of attaining a good parallel efficiency.


One way around this specific problem is to simply reverse the loop order, e.g.

Listing 2.1: 1D loop with guided scheduling

    #pragma omp for schedule(guided)
    for (i = 0; i < N; i++)
        doWork(b[i]);

where the workload is highest for low i, would become:

Listing 2.2: Reversed 1D loop with guided scheduling

    #pragma omp for schedule(guided)
    for (i = N - 1; i >= 0; i--)
        doWork(b[i]);

This process ensures that the smaller chunks are located where the workload density is highest. For obvious reasons this approach is termed reverse guided scheduling. It is commonly used as it can be implemented quickly using the standard OpenMP GUIDED schedule clause with reversed loop limits. The trick is to spot its applicability.

2.2.2 Alternative Scheduling Algorithms

The standard OpenMP loop iteration scheduling options are effective for dealing with many real world problems. Reverse guided scheduling (mentioned above) is a simple variation on a standard algorithm that can be useful when the loading associated with a problem is known in advance. The drive for optimum performance over a wide range of disparate problems has led to the formulation of other loop scheduling algorithms, each of which has advantages and disadvantages for particular problems.

Trapezoidal Scheduling

Trapezoidal self scheduling (TSS) [9] is a variation on guided scheduling. As in guided scheduling there is one global queue for all threads. The main difference is the variation in chunk size down the queue. In TSS the chunk size decreases linearly between an upper and lower limit. This leads to a queue of chunks in which each chunk is smaller than the previous by a constant amount, and the length of the queue is usually shorter than the guided equivalent, with fewer very small chunks. This serves to reduce the synchronisation cost at the expense of a possible loss of load balance.
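A sketch of how such a linearly decreasing queue might be built is given below; the choice of first and last chunk sizes (N/2p and 1) and the helper enqueue_chunk are assumptions made for illustration, not a statement of the published algorithm's exact parameters.

    /* Build a trapezoidal queue: chunk sizes fall linearly from `first`
     * to `last`, giving far fewer (and fewer tiny) chunks than guided. */
    void build_tss_queue(int N, int p)
    {
        int first   = N / (2 * p);                 /* assumed upper size   */
        int last    = 1;                           /* assumed lower size   */
        int nchunks = (2 * N) / (first + last);    /* approximate length   */
        double step;
        int c, size;

        if (nchunks < 2)
            nchunks = 2;
        step = (double)(first - last) / (nchunks - 1);

        for (c = 0; c < nchunks; c++) {
            size = first - (int)(c * step);        /* linearly decreasing  */
            enqueue_chunk(size);                   /* hypothetical helper  */
        }
    }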

Affinity Scheduling

In affinity scheduling each thread involved in the parallel loop is given a contiguous chunk of iterations at the start of the main loop [5]. Each thread receives approximately N/p iterations, where N is the size of the problem (primary array dimension) and p is the number of threads. The local allocation is broken down into a local queue of smaller chunks for processing. To carry out the computation, each thread works successively through chunks of size r_i/(2p) from its local queue until the queue is exhausted, where r_i is the number of iterations remaining in the local queue. In this way affinity scheduling is analogous to guided self-scheduling running in parallel on each thread in the team. However, in affinity scheduling when a thread completes its own queue of chunks it can take a chunk of r_j/(2p) iterations from an unfinished thread's queue to work on, where r_j is the number of iterations remaining on the remote queue selected. The re-allocated chunk is deliberately taken from the thread with the most iterations remaining. Essentially, during affinity scheduling each thread continues working on its own chunks or “stealing” chunks from other threads until each iteration of the loop has been executed. By allocating the same regions of iteration space to the same threads at the start of each iteration, some data locality should result (to a greater or lesser extent). This can offer affinity scheduling a performance advantage over the simpler dynamic schedulers (i.e. dynamic and guided) in some circumstances.
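The per-thread logic can be sketched as follows (simplified and one-dimensional; queue_size, most_loaded_queue, take_from_front and do_work are hypothetical helpers, and the locking needed to make the queue operations thread-safe is omitted):

    /* Executed by every thread inside the parallel region. */
    void affinity_worker(void)
    {
        int me = omp_get_thread_num();
        int p  = omp_get_num_threads();
        int victim, r, chunk, start, i;

        for (;;) {
            /* Work from the local queue first, then steal from the most
             * heavily loaded remaining queue. */
            victim = (queue_size(me) > 0) ? me : most_loaded_queue();
            if (victim < 0)
                break;                              /* every queue is empty */

            r     = queue_size(victim);
            chunk = (r + 2 * p - 1) / (2 * p);      /* roughly r/(2p), >= 1 */
            start = take_from_front(victim, chunk);

            for (i = start; i < start + chunk; i++)
                do_work(i);
        }
    }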

Dynamic Affinity Scheduling

Dynamic affinity scheduling extends affinity scheduling [6]. The process relies on the entire work-loop being executed many times in succession. This is a common occurrence in many iterative simulations; for example, a CFD simulation is made up of repeated iterations of the solver and perhaps physical time-steps. A system is initialised with a particular partitioning of the solution domain onto processors. If the load balance is sub-optimal then some re-distribution of the workload, based on statistical data from previous iterations, is carried out to attempt to improve performance through increased data locality.

Dynamic affinity scheduling records the total number of iterations executed by each thread at each complete iteration. The number processed by each thread is used to best-guess the optimum starting size for its local queue on the next iteration. The aim of this method is to dynamically load balance the computation at each iteration and ultimately maximise data locality between successive iterations through effective repartitioning. The equilibrium decomposition should be reached reasonably quickly for a static load, i.e. an uneven load that does not change from iteration to iteration. Optimum queue sizes should be reached within a small fraction of the total run-time if any performance gain (relative to affinity scheduling) is to be noticed. If the load is transient (i.e. time/iteration dependent) then the dynamic queue allocations will continue to gradually re-adjust over the course of the entire simulation, always tending toward an evolving local minimum.
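The feedback step amounts to re-sizing each local queue in proportion to the work a thread managed to complete last time. A sketch, assuming iters_done[t] holds the number of iterations thread t executed on the previous pass (all names hypothetical):

    /* Choose the next initial queue sizes in proportion to the iterations
     * each thread actually executed, so less stealing is needed next time. */
    void resize_queues(const long iters_done[], int next_size[],
                       int p, int total_iters)
    {
        long total = 0;
        int t;

        for (t = 0; t < p; t++)
            total += iters_done[t];

        for (t = 0; t < p; t++)
            next_size[t] = (int)(((double)iters_done[t] / total) * total_iters);
    }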

Feedback Guided Loop Scheduling

Abbreviated to FGLS, feedback guided loop scheduling [10] is similar to dynamic affinity scheduling in that it relies on data from successive iterations to continually improve the decomposition. Both techniques are therefore feedback guided processes; however, in FGLS each processor is given a single, contiguous, indivisible chunk of iteration space to process each iteration. This serves to reduce run-time overheads and leave more time for useful processing.

Whereas in dynamic affinity scheduling the fraction of the computational domain covered by each thread guides the weighted decomposition at each iteration, in FGLS the combined processing time of all of the chunks by each thread is the key variable. The algorithm assumes a constant workload over all of the region of space described by the chunk that a particular thread has processed, and a piecewise constant approximation to the true workload is constructed. The approximated workload is divided between the threads to give the revised decomposition of the domain at each iteration. FGLS does not offer a mechanism to balance the workload during iterations; good performance is only achieved if the pre-iteration decomposition balances the load well. A two dimensional FGLS scheduling algorithm has also been developed [10] in which a 2D map of the workload is created and recursive bisectioning is used to decompose the workload and the domain prior to processing at each iteration.

Chapter 3

Underlying Implementation of 2D Scheduling Algorithms

The hypothesis of this thesis is that two-dimensional nested loop scheduling algorithms can offer significant performance advantages over traditional one-dimensional methods in certain scenarios. In order to test this it is necessary to formulate and construct two-dimensional decomposition and scheduling algorithms and compare the parallel performance of these against the established and existing schedule types.

As a test case, a simple image filter is applied in a 2D square domain of side N. The domain is discretised to form a regular Cartesian grid of points in the x-y plane. The filter consists of a five-point stencil that performs a sweep over the interior points. At each iteration the value of each interior point is replaced by a value dependent on those of its four nearest neighbours:

    u^{t+1}_{i,j} = \frac{1}{4}\left(u^{t}_{i-1,j} + u^{t}_{i+1,j} + u^{t}_{i,j-1} + u^{t}_{i,j+1}\right)    (3.1)

The boundary values are fixed and are never updated over the course of the run. At each iteration the interior values better approximate the final solution. The system eventually reaches a state that does not change from iteration to iteration. At this stage it is said to have converged. An example of the algorithm at work is shown in Figure 3.1.

As updates are not performed in place, two arrays of size N × N are required to store the data. Updates to points in the domain are completely independent of one another and take place in alternate arrays at alternate iterations. This means that threads can simultaneously process separate points in the domain in any order and correct results are ensured.

    A_{i,j} = \frac{1}{4}\left(B_{i-1,j} + B_{i+1,j} + B_{i,j-1} + B_{i,j+1}\right) \quad \text{on odd iterations}    (3.2)

    B_{i,j} = \frac{1}{4}\left(A_{i-1,j} + A_{i+1,j} + A_{i,j-1} + A_{i,j+1}\right) \quad \text{on even iterations}    (3.3)

Rather than having two separate arrays in the code, a single array of extent 2 × N × N is used. On alternate iterations the data is updated on alternate sides of the array from data in the other half. This technique is shown in Section 3.4.

Figure 3.1: The time-line above shows the filter being used in a square domain with fixed boundary conditions. The lower and left boundaries are fixed to a low value (black), while the upper and right boundaries are fixed at a high value (white). The interior points in the domain are initialised to a low value (far left). Over the course of the run the interior points approach the solution until convergence is reached (far right).
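A serial sketch of one sweep and the alternating-halves storage scheme described above is given below; the function name and the fixed side length are assumptions made for illustration.

    #define N 192                       /* example domain side length     */

    /* One sweep of the five-point filter.  u holds both copies of the grid;
     * the half written to alternates with the iteration number, as in
     * Equations 3.2 and 3.3.  Boundary rows and columns are left untouched. */
    void sweep(double u[2][N][N], int iteration)
    {
        int odd = iteration % 2;
        int i, j;

        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++)
                u[odd][i][j] = 0.25 * (u[1 - odd][i - 1][j] + u[1 - odd][i + 1][j]
                                     + u[1 - odd][i][j - 1] + u[1 - odd][i][j + 1]);
    }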

3.1 Legacy Code

An archive of C code that was used by Olkhovskyy to produce a report on nested loop scheduling [2] was made available at the start of the project. Unfortunately this code proved to be difficult to comprehend due to poor structure and naming conventions throughout. The code was developed in a short space of time for execution on an EPCC machine that has since been decommissioned. Compilation from the Makefile proved difficult initially due to absent dependencies, and the visualisation subroutines that serve to illustrate the correctness of the decomposition algorithms are non-functional as they rely on deprecated library calls. Further, the code is sparsely commented, has little accompanying documentation, and is not described in any detail in the final report.

Two elements of the legacy code that have been re-cycled for use in the revised code that accompanies this report are the loading functions (see Section 3.3), which originate from work on 2D FGLS scheduling [10], and the basic “patch” data structure (see Section 3.2). The revised suite of code has been created in order to ensure code correctness, provide the necessary output and allow the easy implementation of new decomposition and schedule algorithms. The remainder of this chapter focuses on the most important aspects of the code: the data structures, important model parameters, verification tests and performance considerations. The practical usage of the code is briefly outlined in Appendix A and in the code readme document.

3.2 Primary Data Structures

The scheduling and decomposition algorithms have been implemented in the C programming language, which is not designed as an object oriented (OO) language but still facilitates the use of objects to represent the key identifiable elements in a program. Data structures have been created to ease the creation of the geometric decomposition and scheduling algorithms. They are fundamental to the decomposition and scheduling algorithms and are outlined in this section.


3.2.1 The Patch

A patch is the smallest unit of iteration space that is processed as a whole by a single thread during a particular scheduling algorithm. Like its parents - the domain and the partition, covered later - a patch is a contiguous two dimensional region of iteration space and has fields to describe its size and position. The patch contains a further data structure to record the time taken to process the collection of points in the domain that the patch represents (see Section 3.4). This is useful information for feedback guided scheduling algorithms (2D dynamic affinity and 2D FGLS) and for post-processing of results, e.g. to measure the load balance in the system. Each patch exists under a partition and forms part of its collective linked list. Therefore each patch contains a pointer to the next patch in the list. During affinity scheduling (covered later in Section 4.2.3) the list of patches in each partition can be altered by other threads as individual patches are stolen from other threads. In order to preserve the original structure in this circumstance (and avoid the need for expensive repartitioning at each iteration) the patch object contains a back-up pointer to the next patch in the queue to allow the original decomposition to be restored easily.

3.2.2 The Partition

A partition serves one of two purposes: it can simply be a container for a single patch of the same size and location, as in 2D static scheduling; or a partition can be attributed to a specific thread and decomposed into many patches, each of which is initially assigned to that thread for processing, as in 2D affinity scheduling. Like the patch, the partition is a geometric abstraction of a contiguous region of iteration space, and so has fields describing its size and position within the domain. The partitions in a domain are stored as a linked list structure, so each partition contains a pointer to the next in the list, or a null pointer if it resides at the end of the list. Each partition contains a pointer to the head of a list of patches. As the region of space that the partition represents may be subdivided into many patches, the partition also contains a field that defines how many times the partition should be sub-divided. This field is used during recursive coordinate bisectioning of the partition into patches for 2D affinity and dynamic affinity scheduling (see Section 4.1.3). For the purposes of two-dimensional feedback guided loop scheduling (see Section 4.2.5) there is a field for storing the average processing time per point for the partition.

A partition should be instantiated with the partition_alloc subroutine, which takes the range of the partition as parameters. As the partition will contain one or more patch objects, the memory associated with a partition should be freed using the partition_free destructor. The partition destructor is used throughout the code in order to avoid memory leaks. These can be particularly important during 2D dynamic affinity and 2D FGLS scheduling, in which re-partitioning of the domain occurs at each iteration. The amount of available memory can quickly vanish with successive iterations if all of the memory underneath the object that is to be freed is not re-claimed by the system. This can lead to bugs that will cause segmentation faults during execution at large numbers of iterations that would usually go un-noticed during day-to-day testing. A helper utility to free all the partitions in the domain exists to ease this process.


3.2.3 The Domain

The highest-level geometric data structure in the model is the domain. This two-dimensional geometric abstraction of iteration space has fields that define its range in each dimension, the corresponding width and height, and the number of points in the entire domain. A domain contains one or more partitions that are stored in a linked list. As a domain normally contains partition objects, each of which contains patch objects, a destructor domain_free is provided to free up any memory when the domain is no longer of use.
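The relationship between the three structures can be sketched as C type definitions; the field names below are illustrative rather than a listing of the actual project code.

    /* Smallest schedulable unit: a contiguous 2D region of iteration space. */
    struct patch {
        int x1, x2, y1, y2;           /* extent of the patch                 */
        double comm_time, comp_time;  /* timings recorded while processing   */
        struct patch *next;           /* next patch in the partition's queue */
        struct patch *next_backup;    /* restores the queue after stealing   */
    };

    /* A thread's share of the domain, holding a queue of patches. */
    struct partition {
        int x1, x2, y1, y2;           /* extent of the partition             */
        int n_subdivisions;           /* how many times to bisect into patches */
        double time_per_point;        /* used by 2D FGLS                     */
        struct patch *patches;        /* head of the local patch queue       */
        struct partition *next;       /* next partition in the domain        */
    };

    /* The whole iteration space. */
    struct domain {
        int x1, x2, y1, y2;           /* range in each dimension             */
        int width, height, npoints;
        struct partition *partitions; /* head of the partition list          */
    };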

3.3 Loads

In an actual scientific computation the load at each point in a domain may not be known in advance of running a simulation. For example, a cosmological simulation of galaxy formation may require more computation in regions in which the particle density is highest. In such a case the choice of decomposition is critical to the performance of the code. The ideal decomposition should be able to distribute the load so that it is well balanced across all processors, whilst also minimising the amount of interprocessor data communication required. For the purposes of testing the two-dimensional scheduling algorithms covered in this document some simple synthetic two-dimensional loads have been identified.

The most straight-forward two-dimensional load is a constant load, in which each point in the x-y plane of the domain requires the same amount of computational time to process. This may be true of an image reconstruction algorithm, for example. A constant load is useful for benchmarking scheduling algorithms as it is totally homogeneous, and can be easily load balanced by the standard one-dimensional scheduling algorithms described in Section 2.2. This allows the relative performance and cost of the new approaches to be assessed.

One dimensional decompositions will be unable to decompose a highly two-dimensional load without resorting to very thin chunks. A good example of this type of load is a 2D Gaussian distribution, which is illustrated in Figure 3.2 and described by Equation 3.4.

    f(x, y) \propto e^{-(x^2 + y^2)/s^2}    (3.4)

in which s is the standard deviation of the Gaussian distribution. It is not possible to achieve a good load balance between a large number of processors using a one-dimensional decomposition without using very thin chunks. This type of load is very inhomogeneous and a 2D decomposition allows the number of boundary elements per patch to be minimised whilst maintaining a good load balance.
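A sketch of how such a synthetic per-point cost might be evaluated is shown below; the centring of the peak and the base cost constant are assumptions made for illustration.

    #include <math.h>

    #define BASE_COST 100.0             /* assumed work units at the peak   */

    /* Relative computational cost of the point (x, y) for a Gaussian load
     * centred on (cx, cy) with width parameter s, following Equation 3.4. */
    double gauss_load(double x, double y, double cx, double cy, double s)
    {
        double dx = x - cx;
        double dy = y - cy;
        return BASE_COST * exp(-(dx * dx + dy * dy) / (s * s));
    }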

3.4 Timing

Timing in the code is achieved via calls to the OpenMP omp_get_wtime function. The implementation of the code suite allows the code to either execute a prescribed number of iterations, or keep iterating for a specified amount of time. Each individual benchmark test has been run so that the run time – the time spent doing useful work after initialisation and initial decomposition – is at least 10 seconds. It was found that this gives consistent results on repeated experiments.

Figure 3.2: A two dimensional Gaussian load. The load is highly biased towards the centre of the domain. (Image reproduced from [2])

There are two distinct sections to the function that processes patches. These are shown in Listing 3.1, in which a patch with range (x_1 : x_2, y_1 : y_2) is processed. The odd variable oscillates between 0 and 1 on successive iterations to update alternate sides of the array.

Listing 3.1: Timed work routine

    odd = iteration % 2;

    // *** Measure the COMMUNICATION time ***
    startTime = omp_get_wtime();
    for (i = x1; i


Figure 4.10: Example of 2D dynamic affinity scheduling on 16 processors after 1 iteration (left) and 30 iterations (right). Each coloured block in the lower diagrams represents a patch processed by a particular thread ID. A Gaussian load is in effect, which biases the load heavily towards the centre of the domain. At each iteration the domain is decomposed using an RCB decomposition to form 16 partitions, with weights determined from the number of points covered by each thread on the previous iteration. The initial equi-partitioning (left) allocates the majority of the work to the four partitions in the centre of the domain. The surrounding, relatively lightly loaded threads finish their own queues much faster than the interior threads, and steal chunks from the centre to dynamically balance the load (centre of lower-left figure). After 30 iterations the dynamic algorithm has re-partitioned the domain so that each processor receives an (almost) equal amount of work in its local queue and very little stealing occurs.


… N > p, so N dominates and the algorithm is effectively O(N). This can be an extremely expensive operation to perform at each iteration – especially if the actual computational algorithm per point is quite cheap.

Implementation

A 2D FGLS decomposition algorithm has been developed to address the issue of using an expensive re-partitioning at each iteration. The asymptotic cost of the new method is far lower than that of the previous implementation, so the process has lower re-partitioning overheads at each iteration. Instead of sweeping over the region to be bisected line-by-line to determine the cut position, a one-dimensional piecewise approximation to the true workload is constructed using geometrical considerations to find the bisection point. The algorithm is explained by example in this section.

The large rectangular box at the top of Figure 4.11 represents the domain to be bisected. The smaller rectangles that make up the domain are the patches into which the domain was decomposed for processing on the last iteration. In this example there are six threads involved in the calculation and, as this is FGLS, each thread processes one patch and one patch only. On the last iteration, each thread recorded the time taken to process its patch. Dividing the patch processing time by the area of the patch gives the mean time per unit area for each patch. These values are represented by the pink numbers. The map of data that results is a two-dimensional approximation of the true workload in the domain. Feedback guided loop scheduling aims to give each thread an even share of the workload by dividing the approximation to the workload into equal sections. The total workload is represented by the combined time taken to process each patch:

    \text{Total work} = \sum_{i=1}^{t} T_i    (4.2)

where T_i is the time taken to process patch i on the last iteration. If we wish to split the domain between the six threads involved in the calculation, then we need to recursively bisect the domain shown at the top of Figure 4.11 such that each of the six new patches has an equal share of the total workload. The first stage in an RCB domain decomposition into six patches of equal load is to bisect the domain once so that 3/6 of the work lies on each side of the domain. This is not the same as bisecting so that 3/6 of the area is on each side. In this example we will use the x-coordinate to perform the first bisection. The aim is to determine the point on the x-axis at which to split the domain such that half of the work is on each side. In order to do this a one-dimensional approximation to the workload distribution is constructed in the x-direction. This is represented by the plot at the bottom of Figure 4.11. The value at each point represents the time required to process a strip of points at that point, and the area under the curve represents the total work. The point on the x-axis that bisects the area under the curve is the bisection point for the RCB decomposition. This will result in an equal share of the work on each side of the split.

The 1D workload approximation can be constructed very quickly by consideration of the position of the patches in the domain. These form the four sections that make up the 1D workload approximation in our example. If each patch has a range in the x-direction x_1 : x_2, then in general the number of sections in the approximation is equal to the number of distinct values of x_1. This can easily be seen from inspection of the original decomposition.


Figure 4.11: Example 2D FGLS decomposition (top) and 1D piecewise workload approximation (bottom). The pink numbers represent the number of time units per point in a particular patch. To split evenly in the x direction the area under the piecewise workload approximation must be bisected.


The time per point for each section in the workload approximation can be easily calculated by adding up the contributions of each patch that has an overlap with the section in the x-direction. For example, the workload per point in section b is calculated from contributions from two patches: the upper-left patch, which has a time per area of 1, overlaps the range of section b in the x-direction so it contributes a fraction; and the lower patch, which has a time per area of 2, fully contributes. For arithmetic simplicity in the example, we have taken the side length of this square lower patch to be 1. In reality we would be working in units of points. The total workload in section b is the sum of the contributions from both patches. The work contributed by each is the product of the time-per-point multiplied by the area that contributes. So the lower patch contributes 2 × 1, and the upper patch overlaps by an area of 1 so contributes 1 × 1. This gives a total work in this section of 3. The workload per point in the x-direction is the total work divided by the width of the section, which is the same as the side length of the lower patch: 1. This gives the workload per point for section b to be 3/1 = 3. This process can be performed extremely quickly and is repeated for each section to build up the profile over the whole range of x.

With the constant workload approximation in place all that is required is to bisect the area under the curve to give the bisection location that splits the workload evenly. In this example we can observe visually from the workload approximation that if we add up from left to right, half of the work will comprise all of section a, all of section b and a fraction of section c. Algorithmically this is done with a simple loop over the sections from left to right, keeping a running total of the cumulative area. When the running total, which represents the processing time, is greater than the desired amount for the lower half of the bisection (half of the total work in this example), then the last section to be added only contributes a fraction of its width. The distance into the last section that is required to make up the remainder of the work can be calculated simply as the ratio of the remaining work required after the fully contributing sections to the total work represented by this last part-contributing section. When this distance has been found the bisection location in x is known and the domain can be bisected. This process sounds long winded but is performed extremely efficiently as it involves simple arithmetic that scales as the number of threads. In this example the whole domain was bisected, but this process can apply to any sub-section (patch or partition) that needs to be split. When a sub-section is to be bisected, care must be taken to include only patches that overlap this sub-section in the workload approximation. In general there will be ln p bisections so the process scales as O(p ln p).
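The search over the piecewise constant approximation can be sketched as follows; the section structure and its field names are assumptions made for illustration.

    /* Each section spans [x, x + width) and has a constant workload per
     * unit width w.  Return the x-coordinate that places `target` work to
     * the left of the cut. */
    struct section { double x, width, w; };

    double bisect_x(const struct section *s, int nsec, double target)
    {
        double done = 0.0, work, frac;
        int k;

        for (k = 0; k < nsec; k++) {
            work = s[k].w * s[k].width;
            if (done + work >= target) {
                /* Only a fraction of this section is needed. */
                frac = (target - done) / work;
                return s[k].x + frac * s[k].width;
            }
            done += work;
        }
        return s[nsec - 1].x + s[nsec - 1].width;   /* cut at the right edge */
    }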

Further Optimisation of 2D FGLS

Feedback guided loop scheduling is designed to be a fast and efficient scheduling algorithm. Each thread receives one contiguous chunk of iterations to process, which minimises inter-thread communication costs as no inter-thread synchronisation is required during processing. The size of each thread's allocation is chosen to give a good load balance, so provided that the decomposition is sound each thread should process its allocation in approximately the same amount of time. The decomposition algorithm in the implementation reported in this document has been completely re-written to reduce its asymptotic runtime considerably. However, because FGLS is such a streamlined process, the inherently serial decomposition overhead can quickly start to curb the performance of a code if the iteration time is not many times greater than that overhead.


Figure 4.12: Example 2D FGLS decomposition of a Gauss load on 16 processors. Note the similarity between this and the 2D dynamic affinity decomposition shown in Figure 4.10.

The load imbalance after any iteration can be estimated by calculating the standard error in the processing time for each thread in the team.3 The load imbalance over the first 50 iterations of a run for 16 threads using 2D FGLS to work on a Gaussian load is shown in Figure 4.13. Initially the load imbalance is high as the domain is equi-partitioned. The FGLS algorithm is highly efficient and quickly finds a near-optimal decomposition; this is shown in Figure 4.12. Small variations in thread times, system overheads, and the discrete nature of the problem result in small oscillations around a 2.5% base level in this example. It is proposed that, for a particular load, there is no benefit to be gained from repartitioning the domain if the overall load imbalance is low. To this end a repartitioning threshold level has been introduced to the algorithm in its present implementation: the domain is only re-partitioned if the load imbalance is above this level. A threshold level of 0% is equivalent to the standard FGLS algorithm and repartitions at every iteration. If the threshold level is too high then a large load imbalance will be tolerated and the performance and scalability of the code will suffer as a consequence. It should be possible in practice to find a sensible threshold level that reduces the number of repartitioning events to a minimum whilst maintaining an acceptable overall load balance. The introduction of selective repartitioning should also be applicable to transient loads.

3 The standard error in this instance is defined as the standard deviation in a sample of size n divided by the square root of n. It provides a convenient mechanism for estimating the statistical percentage error in a set of measurements and is commonly used by experimental scientists.
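To make the threshold test concrete, the sketch below computes the percentage standard error of the per-thread times from the previous iteration and compares it with the repartitioning threshold. It is an illustration under the footnote's definition; the function and variable names are hypothetical and the report's implementation may differ in detail.

#include <math.h>

/* Decide whether to repartition, given the per-thread processing times
 * from the previous iteration (hypothetical helper, illustration only). */
static int should_repartition(const double *thread_time, int nthreads,
                              double threshold_percent)
{
    double mean = 0.0, var = 0.0;

    for (int t = 0; t < nthreads; t++)
        mean += thread_time[t];
    mean /= nthreads;

    for (int t = 0; t < nthreads; t++) {
        double d = thread_time[t] - mean;
        var += d * d;
    }
    var /= nthreads;                        /* population variance */

    /* Standard error = standard deviation / sqrt(n), expressed here as a
     * percentage of the mean thread time.                                */
    double std_error_percent = 100.0 * sqrt(var / nthreads) / mean;

    return std_error_percent > threshold_percent;
}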



Figure 4.13: Load imbalance per iteration for 2D FGLS, expressed as the standard error in the combined computation and communication time for each thread. In this example 16 threads are working on a Gauss load.

Chapter 5

Results and Analysis

Each of the scheduling and decomposition algorithms developed in this report has been benchmarked to assess its parallel scalability in a variety of scenarios. There are many parameters that can be set in the model to adjust the decomposition, load, and schedule type. Time constraints and contention for HPC resources have imposed an upper limit on the number of runs that could be carried out to test the array of schedule types. Two load types have been tested: a constant load and a Gaussian load. A constant load does not pose a load balancing problem, but can give a good indication of the serial and communication overheads associated with a particular schedule. A Gaussian load does pose a load balancing problem. The performance of each of the schedule types on these loads is the primary focus of this chapter. Due to the stability issues previously discussed, the decomposition types that have been studied are block, RCB and isotropic guided. Chunk size or patch side (abbreviated to n throughout the analysis) can have different meanings for different schedule and decomposition types: for example, in static scheduling the chunk size is the width of a chunk of iterations, whereas in guided scheduling it is the minimum permissible chunk size. Data has been collected across a sensible range of n for each schedule. The only schedule that does not have a tunable chunk size parameter is FGLS, as each thread by definition has only one chunk for this schedule type; for 2D FGLS the repartitioning threshold level is the parameter of interest.

5.1 Benchmark Tests

Section 3.4 discussed the three “regimes” of interest when testing the performance of the code. When computation dominates, the communication time is of little importance and therefore good scaling will be observed provided a good load balance can be achieved. When the computation and communication times are of the same order of magnitude, both a good load balance and a low number of boundary elements are required for good performance. When there is minimal computation (effectively communication only), the communication speed of the hardware dominates the performance and a lower boundary size will in general give better scaling. Ideally, one would like to choose the problem size that will allow a simulation to scale well up to the


number of processors available while still executing in a reasonable amount of time. In such a situation the user may aim for upwards of 90% hardware (processor) utilisation. This situation describes the “balance” regime that has been studied. Benchmarking in this regime results in computation to communication ratios of approximately 1-20, which is equivalent to parallel efficiencies in the range 50-95%. Increasing the problem size so that the communication time is a very small fraction of the iteration time will utilise the processors fully for almost all of the time; in the benchmarking, this regime gives computation to communication ratios of the order of hundreds. The scaling of problems executed in this regime depends only on a good load balance, and one could argue that in this circumstance more processors should be used to complete the calculation faster. Each regime is represented by a particular communication to computation imbalance. Three distinct computation weighting coefficients have been chosen for the benchmarking, each of which represents one of the three regimes of interest. This gives three distinct, consistent and identifiable runs that can be used to test the performance of each schedule-load combination. The coefficient is chosen to give a computation to communication ratio in the desired regime:

• High computation: weight coefficient = 5000 ⇒ ratio approximately 200
• Balance of computation and communication: weight coefficient = 200000 ⇒ ratio = 1-20
• Communication only: no computation ⇒ ratio = 0

5.2 Hardware Considerations

The performance of the code in each regime has been tested on up to 16 processors of an SMP machine at EPCC: a 48-processor batch-queue section of a Sun Fire E15k server.1 The machine consists of 52 900 MHz UltraSparc III processors in a single cabinet, each with 1 GB of memory associated with it. The level 1 cache on the UltraSparc III is 64 KB, 4-way set-associative with 32-byte lines; the level 2 cache is 8 MB, direct-mapped with 64-byte lines [7]. In order to prevent super-linear effects from appearing when assessing the scalability of the code it is important that the entire data array fits into the cache of a single processor. The data array is a doubled square array of double precision floating point elements (see Chapter 3). The data array is padded to prevent the aspect ratio of the decomposition from affecting the results (see Section 3.4.3), which increases the size of the array. The dimension of the array should also be chosen so that it is easily divisible, to allow a good range of chunk sizes to be tested for each decomposition and schedule type. With all of these factors borne in mind, a square domain of side 192 points was chosen for the benchmarking experiments; the doubled 192 × 192 array of 8-byte elements occupies roughly 0.6 MB, comfortably within the 8 MB level 2 cache. Executables have been compiled from source with the Forte Developer 7 C Compiler with the -fast flag to ensure good serial optimisation on the hardware.

1 “Lomond” is actually a cc-NUMA machine, although the inter-processor communication cost across the separate boards in the machine is comparable to that between processors on the same board. The machine can therefore be thought of as an SMP.

5.3 Presentation of Results

In this chapter performance data is mainly presented graphically to convey as much information as possible in a small amount of space. On the charts, colour represents number of threads and the three regimes of interest are represented by the different styles of line. Often data for low numbers of threads (i.e. two and four) can get a little convoluted, but the main focus is scaling to larger numbers of threads (i.e. eight and 16) for which the performance is almost always clearly visible, as there is a separate optimum speed-up level for each number of threads.

5.4 Constant Load

The “constant” load type requires the same amount of work to process every point in the domain. A constant load does not pose a load balancing problem, and can give a good indication of the serial and communication overheads associated with a particular schedule. The standard OpenMP schedule clauses (Section 2.2.1) should be able to balance this type of load easily provided that sensible chunk sizes are chosen. However, one might expect these 1D schedules to under-perform if the chunk size is small (or the number of processors is large), as the aspect ratio of the 2D chunks (patches) will be excessive; this increases the amount of communication necessary at each iteration. A two-dimensional decomposition of this domain should result in a lower boundary-to-area ratio per patch and should therefore reduce the communication overhead at each iteration. If the communication time is comparable to the computation time then one might expect the 2D decomposition and scheduling algorithms to outperform the standard 1D algorithms.

5.4.1 1D Scheduling Algorithms

Standard OpenMP scheduling algorithms will only distribute the iterations from the outer-most loop of the nested loop. Static and dynamic scheduling have been tested.
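For reference, the standard clauses only ever partition the outer loop of such a nest, roughly as in the sketch below. This is illustrative only: N, chunk, the arrays and stencil() are placeholders rather than names from the benchmark code.

/* Sketch only: N, chunk, aold, anew and stencil() are assumed to exist.
 * Only the outer (i) loop is distributed, so each chunk is a strip of
 * complete rows - a 1D decomposition of the 2D domain.                  */
#pragma omp parallel for schedule(static, chunk)   /* or schedule(dynamic, chunk) */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        anew[i][j] = stencil(aold, i, j);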

Static Scheduling

Figure 5.1 shows the parallel performance of standard OpenMP static scheduling on a constant load. It can be seen that when computation dominates, static scheduling is capable of achieving near peak performance for certain chunk sizes. There are maxima when the domain side length of 192 divides exactly between the threads, e.g. at n = 96 for two threads and n = 12 for 16 threads. In such cases the load is perfectly balanced between the active threads. In this regime good performance is achieved for any chunk size that divides the domain into a number of chunks that is exactly divisible by the number of threads, e.g. for eight threads a chunk size of six, eight or twelve gives a speed-up of eight. However, if n = 16 then there will be twelve chunks in total: eight of them will be processed in parallel at 100% hardware utilisation, and the remaining four will then be processed by four processors in parallel in the same amount of time while the other four processors sit idle, thus utilising only 50% of the hardware. This equates to an efficiency of 75%, which corresponds to the speed-up of six that is observed.

Figure 5.1: Parallel speed-up of code using static scheduling on a constant load in a square domain of side length 192 (double range shown). Scaling is best when the number of chunks divides exactly by the number of threads.

When the communication time is of the same order of magnitude as the computation time, the code does not scale very well to high numbers of threads. The general form of the curves is very similar to that seen when computation dominates, due to the identical decomposition, but the maximum speed-up observed on eight threads is around 6.5, and on 16 threads the maximum speed-up observed is only eight, which equates to 50% efficiency. The 1D decomposition that static scheduling applies to the domain creates a lot of boundary elements per chunk, so communication costs are high and scaling is poor. A balance of computation and communication gives better performance when there are two chunks per thread rather than only one chunk per thread. This is observed for 16 threads (pink dashes) between n = 12 (one chunk per thread) and n = 6 (two chunks per thread), and also for eight threads (blue dashes) between n = 24 and n = 12. This feature is possibly down to the implementation of static scheduling in the OpenMP library available on Lomond.

When communication is the dominant factor in the performance of the code, it does not scale at all well. As there is so little computation required at every point in a chunk, the relative communication overhead rises sharply with increasing thread count (and thus boundary size). A maximum speed-up of two is observed for thread counts up to 16.

Figure 5.2: Parallel speed-up of code using dynamic scheduling on a constant load in a square domain of side length 192.

Dynamic Scheduling

Figure 5.2 shows the parallel performance of standard OpenMP dynamic scheduling on a constant load. When computation dominates, the performance is very similar to that of static scheduling, discussed previously, as a perfect load balance can easily be achieved for certain thread count/chunk-size combinations. When communication becomes a factor, in the “balance” regime, dynamic scheduling scales much worse than static scheduling. To attain the computation to communication ratio that represents this regime, the computational cost of the workload is reduced while the communication cost remains unaltered. As the relative communication cost is much higher, the iteration time becomes more sensitive to data locality. In static scheduling data locality is ensured from iteration to iteration because each thread processes the same chunks each time. In dynamic scheduling chunks are processed by whichever thread is available at the time, so there is increasing scope for the migration of data between the caches of different processors as the processor count grows. The overall cost per iteration is relatively low in the balance regime, and can be increased by a noticeable amount by frequent cache misses. This is reflected in the poor scalability in this particular regime. The phenomenon is further amplified when the computational part of the workload is turned off, i.e. in the “communication only” case.

5.4.2 2D Static Scheduling

Figure 5.3 shows the parallel performance of 2D static scheduling on a constant load. A block decomposition into patches of a pre-determined side length is used. The performance characteristics can be contrasted with those of 1D static scheduling shown in Figure 5.1.
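As a rough illustration of the idea (not the report's source code), a 2D static schedule can be pictured as numbering the square patches in raster order and binding patch k permanently to thread k mod nthreads. In the sketch below, N, patch and process_patch() are assumed names.

#include <omp.h>

extern void process_patch(int i0, int j0, int ni, int nj);

/* 2D static scheduling sketch: square patches allocated cyclically to
 * threads.  The same patches go to the same thread every iteration,
 * preserving data affinity.                                             */
void run_2d_static(int N, int patch)
{
    int per_side = (N + patch - 1) / patch;      /* patches per side     */
    int npatches = per_side * per_side;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        for (int k = tid; k < npatches; k += nthreads) {
            int pi = (k / per_side) * patch;     /* patch origin (row)   */
            int pj = (k % per_side) * patch;     /* patch origin (col)   */
            process_patch(pi, pj,
                          pi + patch > N ? N - pi : patch,
                          pj + patch > N ? N - pj : patch);
        }
    }
}

Because the mapping never changes, each thread revisits the same patches at every iteration, which is what preserves data affinity in this schedule.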


Figure 5.3: Parallel speed-up of code using 2D static scheduling on a constant load in a square domain of side length 192 (2x range shown). The domain is decomposed into square patches of a prescribed side length.

patch side   no. patches   patches/side   patches per thread
                                          2 threads   4 threads   8 threads   16 threads
    1          36864          192.0        18432.0     9216.0      4608.0      2304.0
    2           9216           96.0         4608.0     2304.0      1152.0       576.0
    4           2304           48.0         1152.0      576.0       288.0       144.0
    8            576           24.0          288.0      144.0        72.0        36.0
   12            256           16.0          128.0       64.0        32.0        16.0
   16            144           12.0           72.0       36.0        18.0         9.0
   24             64            8.0           32.0       16.0         8.0         4.0
   32             36            6.0           18.0        9.0         4.5         2.3
   48             16            4.0            8.0        4.0         2.0         1.0
   64              9            3.0            4.5        2.3         1.1         0.6
   80              6            2.4            2.9        1.4         0.7         0.4
   96              4            2.0            2.0        1.0         0.5         0.3
  112              3            1.7            1.5        0.7         0.4         0.2
  128              2            1.5            1.1        0.6         0.3         0.1

Figure 5.4: 2D block decomposition details in a square domain of side length 192. The main block of data shows the number of patches per thread. The domain is decomposed into square patches of a given side length.


When computation dominates, the performance of 2D static scheduling is very similar to that of 1D static scheduling, because a perfect load balance can easily be achieved for certain thread count/patch-size combinations. However, with a 2D decomposition good performance can be achieved over a much broader range of patch sizes. The code generally performs best when the number of patches divides exactly by the number of threads. For example, for 16 threads, 16 patches of side 48 fit exactly into the 192x192 domain – this is also true for patch side 24 and lower. Data regarding the decomposition is given in Figure 5.4 and is useful for explaining the behaviour observed. When there is a balance of computation and communication, 2D static scheduling outperforms its 1D equivalent dramatically, and over a much broader range of patch sizes. For very small patches the large amount of boundary data raises the communication cost per iteration and the computation to communication ratio drops towards the “balance” regime; this accounts for the fall-off at low patch sides. N.b. this is not observed for 1D static scheduling because there is less boundary data at the lower values on the relevant chart: it is a 1D decomposition in which the parameter is chunk width rather than square patch side. For example, for chunk width 2 there are 192/2 = 96 chunks, which gives a total boundary size of approximately 96 ∗ 192 = 18432 elements, whereas for patch side 2 there are 192²/2² = 9216 patches with a boundary size of 9216 ∗ 8 = 73728 elements. 2D static scheduling scales much better than 1D static scheduling when communication and computation are in equal proportion, due to the decreased number of boundary elements. At very high patch counts (low patch side), however, the performance drops off. Scaling for the communications-only case is slightly better than that of the 1D equivalent, again due to the decreased boundary element fraction.

5.4.3 2D Dynamic Scheduling

Figure 5.5 shows the parallel performance of 2D dynamic scheduling on a constant load. As in 2D static scheduling, a block decomposition into patches of a pre-determined side length is used. The chunk data regarding the decomposition for each run is shown in Figure 5.4. The performance of 2D dynamic scheduling is far better than that of dynamic scheduling. A broader range of patch sizes gives good performance, and when communication becomes a factor the efficiency remains high. 2D dynamic scheduling is much less tolerant of small patch sizes than 2D static scheduling when communication is a factor. This is due to on-the-fly chunk re-allocation decreasing data locality between successive iterations.
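The dynamic variant can be pictured as the same patch grid served from a shared counter instead of a fixed mapping. The sketch below is illustrative only (hypothetical names, and a critical section rather than any particular atomic construct); patches go to whichever thread asks next, which is exactly what sacrifices the iteration-to-iteration affinity discussed above.

#include <omp.h>

extern void process_patch(int i0, int j0, int ni, int nj);

/* 2D dynamic scheduling sketch: patches are taken from a shared counter
 * on a first-come, first-served basis, so the mapping of patches to
 * threads can change from iteration to iteration.                       */
void run_2d_dynamic(int N, int patch)
{
    int per_side = (N + patch - 1) / patch;
    int npatches = per_side * per_side;
    int next = 0;                       /* shared index of next patch    */

    #pragma omp parallel shared(next)
    {
        for (;;) {
            int k;
            #pragma omp critical (patch_counter)
            { k = next++; }
            if (k >= npatches)
                break;

            int pi = (k / per_side) * patch;
            int pj = (k % per_side) * patch;
            process_patch(pi, pj,
                          pi + patch > N ? N - pi : patch,
                          pj + patch > N ? N - pj : patch);
        }
    }
}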

Figure 5.5: Parallel speed-up of code using 2D dynamic scheduling on a constant load in a square domain of side length 192 (2x range shown). The domain is decomposed into square patches of a prescribed side length.

5.4.4 2D Guided Scheduling

Two decomposition types have been tested with 2D guided scheduling: 2D isotropic guided and guided RCB. Using a block decomposition with the guided scheduling algorithm is equivalent to 2D dynamic scheduling. Figure 5.6 and Figure 5.7 show the parallel performance of 2D guided scheduling on a constant load for the isotropic and RCB guided decompositions respectively. The performance of 2D guided scheduling is much the same as that of 2D dynamic scheduling for the constant load, which is no surprise as they both use the same scheduling algorithm. The only difference of note is that guided scheduling can be more tolerant of smaller patch sizes. This is because the patch size in 2D dynamic scheduling refers to the dimension of every patch, whereas in 2D guided scheduling it refers to the minimum patch side. A small number of small patches can be beneficial in smoothing out any unevenness in the load balance that may occur due to small run-time fluctuations in thread performance. Comparing Figure 5.6 with Figure 5.7, it can be seen that the RCB decomposition out-performs the 2D isotropic guided decomposition at very small chunk sizes. This is because the isotropic decomposition creates more chunks than the RCB decomposition – approximately the square of the number – and so there are more communications due to the increased number of boundary elements. Further, the isotropic decomposition leaves two corners of the domain that can have very elongated patches (as illustrated previously in Figure 4.2), which adds more boundary elements to the system.

Figure 5.6: Parallel speed-up of code using 2D guided scheduling using isotropic guided decomposition with a constant load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Figure 5.7: Parallel speed-up of code using 2D guided scheduling using 2D RCB guided decomposition with a constant load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

5.4.5 2D Affinity Scheduling

The performance of 2D affinity scheduling looks markedly different from that of the static, dynamic and guided schedules. This is shown in Figure 5.8 and Figure 5.9. Each thread starts with one contiguous partition of iterations that is decomposed into a local queue, and each thread receives an equally sized partition. For a constant load this implies that each thread will initially receive an equal share of the work and there will be little to no need for patch re-allocation to balance the load at runtime. This is borne out by the performance profiles. Each thread's partition is decomposed into a queue of patches of decreasing size, and during execution an overhead is incurred in retrieving and processing each patch. If the patch size tails off to a very small area towards the end of the queue there will be a large number of patches and the overheads may become noticeable. This will also increase the chance of patch migration between threads, which decreases data locality between successive iterations. This is reflected in the poor performance of both decomposition types at low minimum patch size when used with 2D affinity scheduling. At higher minimum patch sizes the overhead of processing all of the patches is lower, as there are fewer to retrieve. Each thread starts with an equal share of the workload in its local queue, and there is less chance of the inevitable small fluctuations in thread times causing patch re-allocation if the patches at the end of the queue are larger; this reduces synchronisation overheads that could otherwise reduce performance. It is noted that the graph reaches and maintains a maximum when there is one patch per thread, at which point the algorithm is effectively 2D static scheduling with an RCB decomposition. As the computation to communication ratio drops, each patch takes less time to process, the overheads become more noticeable, and a larger minimum patch size is required to achieve good scaling.

Figure 5.8: Parallel speed-up of code using 2D affinity scheduling using isotropic guided decomposition with a constant load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches. The speed-up of zero observed for several runs is due to run-time problems during benchmarking.

Figure 5.9: Parallel speed-up of code using 2D affinity scheduling using 2D RCB guided decomposition with a constant load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Performance Optimisations

The benefit of using one lock per thread, rather than a single critical section, to protect updates to the local queues has not been observed during benchmarking. This is probably due to the small numbers of threads over which the tests were performed. For a larger thread team there would be more contention for a critical section and the performance may start to suffer.
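The two locking schemes being compared can be sketched as follows; omp_init_lock, omp_set_lock and omp_unset_lock are the standard OpenMP lock routines, while the queue type, queue_push() and MAX_THREADS are hypothetical placeholders.

#include <omp.h>

#define MAX_THREADS 64

/* Hypothetical local-queue type and push operation, for illustration. */
typedef struct patch_queue patch_queue;
extern void queue_push(patch_queue *q, int patch_id);

static omp_lock_t queue_lock[MAX_THREADS];     /* one lock per local queue */

void init_queue_locks(int nthreads)
{
    for (int t = 0; t < nthreads; t++)
        omp_init_lock(&queue_lock[t]);
}

/* Moving a patch onto thread dest's queue only contends with threads
 * touching that particular queue, rather than serialising every queue
 * update behind one global critical section.                             */
void push_to_queue(patch_queue **queues, int dest, int patch_id)
{
    omp_set_lock(&queue_lock[dest]);
    queue_push(queues[dest], patch_id);
    omp_unset_lock(&queue_lock[dest]);
}

With a single critical section every queue update would serialise against every other; with one lock per queue, only threads touching the same queue contend, which is why the per-thread locks would be expected to matter more as the thread team grows.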

5.4.6 2D Dynamic Affinity Scheduling

Two-dimensional dynamic affinity has proven to scale worse than 2D affinity scheduling unless the computation time associated with a patch far outweighs the communication time. This is shown in Figure 5.10 and Figure 5.11. This is because dynamic affinity scheduling incurs an overhead due to re-partitioning and subsequent partition decomposition into patches at each iteration. Aside from this drop in performance the general form of the performance with variation in chunk size is very similar to that of affinity scheduling. Partition decomposition into queues terminating in small chunk sizes is detrimental to performance as the costs of creating, processing and re-allocating them combine to slow down the code. The potential benefits of targeting an optimal initial queue size cannot be observed in this case because for a constant load the initial partitioning will be optimal.

Figure 5.10: Parallel speed-up of code using 2D dynamic affinity scheduling using isotropic guided decomposition with a constant load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Figure 5.11: Parallel speed-up of code using 2D dynamic affinity scheduling using 2D RCB guided decomposition with a constant load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

5.4.7 2D Feedback Guided Loop Scheduling

Figure 5.12 illustrates the performance of 2D FGLS on a constant load. The overhead associated with repartitioning is extremely detrimental to performance. This can be seen by observing the performance of the code in the “balance” regime as the threshold is raised from zero. It is clear that repartitioning at less than 1% load imbalance is unnecessary and slows down the code considerably; for 16 processors a threshold of around 2% seems to be optimal. It should be noted that the FGLS algorithm is extremely efficient and can produce a very good load balance for an unbalanced load in only a few iterations. The curves in the figure level off quickly because a near-optimal decomposition is attained for a constant load by default, i.e. the domain is initially split into equal-sized chunks and one is given to each processor. This means that no re-partitioning is actually necessary for this test case, but it will still occur at low threshold levels due to small fluctuations in the thread times. These fluctuations occur due to slight differences in the state of the system from iteration to iteration and the discrete nature of the problem.

Figure 5.12: Parallel speed-up of code using 2D FGLS scheduling with a constant load in a square domain of side length 192. The only parameter in the FGLS algorithm is the repartitioning threshold, which is the criterion for domain repartitioning at each iteration.

5.5 Gaussian Load

A Gaussian load (previously illustrated in Figure 3.2) was chosen for benchmarking as it is highly inhomogeneous and may prove difficult to load balance between large numbers of processors using 1D decompositions without resorting to very thin chunks. A two-dimensional decomposition should allow a load balance to be achieved more easily whilst maintaining an acceptable patch aspect ratio. The constant of proportionality for the Gaussian has been chosen so that the total amount of work is equal to the total amount of work associated with the constant load used in the previous section. This helps to ensure a fair test.

5.5.1 1D Scheduling Algorithms

Static Scheduling

For the particular problem size being addressed, a 192-point square domain, 1D static scheduling does not scale well to 16 processors for the unbalanced Gaussian load, as shown in Figure 5.13. This is because the load is extremely high in the centre of the domain but low elsewhere. The 1D cyclic decomposition that static scheduling applies at small n does not distribute the most expensive section of the domain equally between threads. Even if the chunks are very thin, the ones nearest the centre of the domain will have more work than those further out, and the distribution onto threads does not balance the load adequately. A larger problem size on the same number of processors would not suffer this effect to the same extent; similarly, a smoother, more even load would not pose so much of an obstacle to the static load balance. If the communication time is of the same order of magnitude as the computation time, the high number of thin chunks required to attempt to balance the load has a very high fraction of boundary elements. The communication cost rises sharply as a result, and as the cost of computation in the “balance” and “comms only” runs is low, the parallel efficiency of the code is low.

Figure 5.13: Parallel speed-up of code using static scheduling on a Gauss load in a square domain of side length 192 (full range shown).

Dynamic Scheduling

Dynamic scheduling is better suited to dealing with an uneven load than static scheduling. The chunks are not pre-allocated onto threads, so the same decomposition can be processed much more flexibly by the thread team. Figure 5.14 shows the performance of dynamic scheduling over a range of chunk sizes. When the computation cost is high relative to the communication cost, a good load balance is all that is required to achieve good scaling. In this circumstance dynamic scheduling can deal with the Gaussian load provided that the chunk size is small: on 16 processors the maximum chunk size that gives good scaling in the high computation test is four points. Above this width, single chunks contain proportionally too much work to ever allow the load to be dynamically balanced. In the “balance” regime, in which the computational load has dropped to a similar level to that of the communication cost, dynamic scheduling performs very poorly. Although the many long, slender chunks of such a decomposition can achieve a good load balance, this is counteracted by the large fraction of boundary elements that must be communicated at each iteration. For this reason the performance in the “communications only” test is extremely poor.

Figure 5.14: Parallel speed-up of code using dynamic scheduling on a Gauss load in a square domain of side length 192.

5.5.2 2D Static Scheduling

A two-dimensional decomposition used with the static scheduling algorithm far outperforms the 1D equivalent when dealing with the Gaussian load. As the domain is decomposed into square patches, the 2D load in the domain can be tiled far more evenly by the available processors. If the patch side length is small enough, then the raster-like cyclic allocation of patches onto threads can spread the workload evenly between the available processors. Figure 5.16 shows how the speed-up of the code varies with the patch size. Performance is poor for a very small patch side, as the number of patches will be very large in this region – the square of the number in a 1D static decomposition of equivalent chunk width. For most small patch sizes in the high computation regime it can be seen that a good load balance can be achieved up to 16 processors. The notable exception is at patch side 12, at which point there are 16 patches per side, which is equivalent to a 1D decomposition. The block decomposition that 2D static scheduling uses produces regular square patches, so the fraction of boundary elements per patch is very low. This helps to keep the performance high when the computation to communication ratio drops into the “balance” regime, but only for the specific patch sizes that give a decent patchwork distribution of work onto threads. A 2D static decomposition that gives good performance is shown in Figure 5.15.


Figure 5.15: Example 2D static scheduling decomposition and patch allocation. Here the domain measures 192 points square and the patch side is 16. This gives 12 patches per side. There are 16 threads involved in this example, each represented by a different hue in the figure on the left. The resultant static allocation onto threads balances the load well.


Figure 5.16: Parallel speed-up of code using 2D static scheduling on a Gauss load in a square domain of side length 192 (2x range shown). The domain is decomposed into square patches of a prescribed side length.


Figure 5.17: Parallel speed-up of code using 2D dynamic scheduling on a Gauss load in a square domain of side length 192 (2x range shown). The domain is decomposed into square patches of a prescribed side length.

5.5.3 2D Dynamic Scheduling

2D dynamic scheduling offers advantages over both its 1D equivalent and 2D static scheduling. The decomposition into square patches allows the two-dimensional Gaussian load to be decomposed more regularly, and at the same time reduces the fraction of boundary elements per patch. As the scheduling algorithm can allocate patches to available processors on the fly, load balancing is also less of an issue. Inspection of the performance data illustrated in Figure 5.17 reveals that 2D dynamic scheduling can give a good load balance and keep the communication overhead down for a broad range of patch sizes. This is reflected in the high speed-up attained in the “high computation” and “balance” tests. For 16 threads, a patch side of four to 16 gives a good speed-up when computation is high (and load balance is all that matters to performance). This is equivalent to decomposing the domain into 2304 to 144 patches (refer to the data in Figure 5.4).

5.5.4 2D Guided Scheduling

The performance of 2D guided scheduling with a 2D isotropic guided decomposition on the Gaussian load is almost identical to that of 2D dynamic scheduling. This is shown in Figure 5.18. For all of the reasons discussed above, 2D guided scheduling seems to be a suitable choice of scheduling type for practical use with the type of nested loop structure addressed in this document. For reasons that are as yet unknown, 2D guided scheduling with a guided RCB decomposition consistently gives results in the “high computation” regime that are identical to those in the “balance” regime (see Figure 5.19). One would expect the results to be broadly similar to those of the same run with the isotropic decomposition, as most of the model parameters are the same and the decompositions are not wildly disparate. The same phenomenon has been observed in repeated experiments conducted on separate occasions with different builds and meticulously checked model parameters. It is assumed presently that an undiscovered bug in the source code is responsible, although it is not infeasible that this is a feature of this particular schedule/decomposition/load/problem size/hardware combination.

Figure 5.18: Parallel speed-up of code using 2D guided scheduling using 2D isotropic guided decomposition with a Gauss load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Figure 5.19: Parallel speed-up of code using 2D guided scheduling using 2D RCB guided decomposition with a Gauss load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

5.5.5 2D Affinity Scheduling

Figure 5.20 and Figure 5.21 show the performance curves for 2D affinity scheduling on a Gaussian load for both of the decomposition types, as a function of minimum patch side. 2D affinity scheduling does not scale as well as 2D dynamic scheduling in the high computation tests when the 2D isotropic guided decomposition algorithm is used to form the local queues of patches. When the minimum patch size is low, there are small patches in every local queue, and the overheads involved in processing these small patches outweigh any load balance improvement that the decomposition offers. The RCB decomposition gives better performance at low patch sizes, as the total number of patches it creates is the same as for the equivalent 1D guided decomposition, whereas the 2D isotropic guided decomposition produces approximately the square of this number. The scaling with the RCB decomposition can be quite good for reasonably small minimum patch sizes. The observed tail-off at low patch sizes is less of a problem for 2D affinity scheduling than it is for 2D dynamic scheduling because the patch size corresponds to a minimum patch side. However, the synchronisation overheads incurred in re-allocating and processing the very small and inexpensive patches towards the end of the local queues will reduce the performance in this region. Further, 2D dynamic scheduling has a regular decomposition of perfectly square patches in a square domain, so its fraction of boundary elements per patch is as low as possible for the decomposition.

5.5.6 2D Dynamic Affinity Scheduling

Figure 5.22 and Figure 5.23 show the performance of 2D dynamic affinity scheduling when dealing with a Gaussian load. It was found that an under-relaxation factor in the region of 0.5 was required to damp down wild oscillations in partition sizes between iterations, which slows the equilibration of the partition sizes somewhat. The algorithm is designed to be able to balance the load at execution time. In the 2D implementation, however, the overheads associated with decomposition at each iteration are high, and even in the “high computation” regime the performance is not as good as might be expected. Small patches are necessary before equilibration to balance the load, but, as was the case for 2D affinity scheduling, small patches incur a high overhead to retrieve. As the relative cost of computation comes down towards the cost of communication, the performance of 2D dynamic affinity scheduling is not at all good, regardless of the choice of decomposition. Overall, 2D dynamic affinity scheduling performs poorly throughout due to the high decomposition costs at each iteration.
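The under-relaxation referred to above is simply a damped update of each partition's target size. A one-line sketch (alpha being the hypothetical under-relaxation factor) makes the idea concrete:

/* Under-relaxed partition-size update (illustrative only).  alpha = 1 applies
 * the full correction suggested by the last iteration's timings; alpha near
 * 0.5 damps the oscillations described in the text.                          */
static double relax_partition_size(double old_size, double target_size, double alpha)
{
    return old_size + alpha * (target_size - old_size);
}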

Figure 5.20: Parallel speed-up of code using 2D affinity scheduling using 2D isotropic guided decomposition with a Gauss load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Figure 5.21: Parallel speed-up of code using 2D affinity scheduling using 2D RCB guided decomposition with a Gauss load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Figure 5.22: Parallel speed-up of code using 2D dynamic affinity scheduling using 2D isotropic guided decomposition with a Gauss load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.

Figure 5.23: Parallel speed-up of code using 2D dynamic affinity scheduling using RCB guided decomposition with a Gauss load in a square domain of side length 192 (2x range shown). Note: a minimum patch side is specified - there may be larger patches.


Figure 5.24: Parallel speed-up of code using 2D FGLS scheduling with a Gauss load.

5.5.7 2D Feedback Guided Loop Scheduling

There are no decomposition parameters for FGLS. The only run-time parameter is the load imbalance repartitioning threshold, below which re-decomposition at an iteration is not performed. This is designed to reduce overheads and improve performance, and was shown to be the case for a constant load earlier in this chapter. When the 2D FGLS algorithm has to deal with the uneven Gaussian load, it is actually required to do some useful work to balance the load. The performance is illustrated in Figure 5.24. When the cost of computation vastly outweighs the cost of communication (the “high computation” regime), 2D FGLS must achieve a good load balance to give good scaling. This is shown to be the case by the high speed-up observed at a 0% re-partitioning threshold, i.e. with repartitioning occurring at every iteration. When the threshold is lifted slightly to 1%, repartitioning occurs less frequently and the performance improvement is noticeable. This effect is far more pronounced in the “balance” regime. Here the cost of the computation has come right down, but the repartitioning cost is fixed for a given number of threads (regardless of problem size), so the performance hit taken for re-partitioning at every iteration is exceptionally high. As the threshold is raised, the algorithm becomes much more efficient. Even though a slightly sub-optimal decomposition is used, excellent performance benefits can be realised in what has effectively become a static scheduling algorithm: processor affinity is high, inter-thread communication is low, and an acceptable load balance is in effect. The step-up in performance with increasing threshold level occurs at the point at which the threshold exceeds the random fluctuations in load imbalance between successive iterations. In the example shown in Figure 4.13 in Section 4.2.5, the threshold level at which this background “noise” is ignored would be around 4%. In the present case the optimum threshold level can be inferred from the performance profile and appears to be around 2% for the “balance” run.

Chapter 6

Discussion

In the high computation regime, in which the amount of computation per chunk vastly outweighs the amount of communication necessary to process a chunk, all of the schedules scale well for a constant load. A good load balance can be achieved in each case, and efficiencies near 100% are observed throughout. This is shown in Figure 6.1. Dynamic scheduling performs the worst, due to the relatively high communication costs associated with the 1D decomposition and the lack of data affinity between iterations. 2D dynamic affinity scheduling also performs badly, but this is mainly due to the overheads incurred at each iteration by re-partitioning. All of the other schedule/decomposition combinations are capable of performing exceptionally well for this case, and 2D FGLS out-performs all others due to its extremely low overheads.

Figure 6.2 shows the optimum performance of each of the schedule types for a Gauss load in the same regime (n.b. the anomalous performance of 2D guided scheduling with the RCB decomposition is thought to be due to a bug and is ignored in the analysis). Static scheduling scales terribly for this load as it cannot decompose the domain sufficiently to balance the load between 16 processors. The performance of each of the others is in the approximate 90-100% efficiency range, indicating that they are capable of attaining a good load balance in this particular case. Interestingly, 2D static scheduling scales slightly better than the others. This occurs when the domain is decomposed into a patchwork of very small patches (side eight) that, with help from the symmetry of the Gaussian load, balance the load very evenly; this is effectively a cyclic decomposition with diminishingly low communication costs. 2D dynamic affinity scheduling again performs poorly due to the constant re-partitioning overhead eating into the useful processing time.

When the calculation is communication only, with just a single sweep over each point in the domain with a five-point stencil at each iteration, all of the schedule types perform exceptionally poorly. This is illustrated in Figure 6.3 for the constant load and Figure 6.4 for the Gaussian load. It is important to note that exactly the same behaviour is observed for each load because in this regime the communications pattern alone is responsible for performance and scalability. The decomposition is the same regardless of load, and the scheduling algorithm in each case is held up only by the communication costs, not the computation cost for each patch or chunk of the domain. As there is too little work to keep the processors busy while the inter-thread communications are taking place, poor parallel efficiencies result. The only performance of note in this regime comes from the optimised 2D FGLS algorithm, which allocates only one chunk per thread, has a highly accurate load balancing mechanism, and can choose not to re-partition at all between iterations to reduce the serial fraction of the cycle.


Figure 6.1: Optimum observed parallel efficiency for each schedule type acting on a constant load in the “high computation” regime. Please note the magnified scale.


Figure 6.2: Optimum observed parallel efficiency for each schedule type acting on a Gauss load in the “high computation” regime. Please note the scale. The performance of 2D guided scheduling with RCB decomposition is thought to be due to a bug.


Figure 6.3: Best parallel efficiency of all schedule types with a constant load in the “communication only” regime.

These qualities can transform 2D FGLS into a highly optimised static scheduler with very low overheads, which gives it a performance advantage over the more complicated schedulers. The behaviour in the high computation and high communication regimes each elucidates only a small amount of information on the behaviour of the schedule types: the ability of each to attain a good overall load balance is assessed when the computational load is high, and the baseline communication and algorithmic overheads associated with each are assessed when communications alone occur. The most interesting regime is the one in which the theoretical maximum efficiency is in the region of 90% due to the combination of serial algorithm, model size and hardware. In this circumstance peak performance demands a good load balance, high data affinity, and minimised communication and serial overheads. This is the balance regime that has been studied.

Figure 6.5 shows the best performance of each schedule type on a constant load in the case chosen to represent the balance regime. A broad range of performance characteristics is observed. Dynamic scheduling is the worst performer, due to the high communication costs incurred by the one-dimensional decomposition and poor data locality; large numbers of cache misses at each iteration curb performance. Static scheduling has the same decomposition but increased data locality and performs a little better. The heavy burden of re-partitioning at each iteration inhibits 2D dynamic affinity scheduling, and understandably it performs best for the constant load when there is one patch per thread, which effectively makes it 2D dynamic scheduling with additional overheads. With the lowest overheads, 2D FGLS performs the best. 2D guided, 2D affinity, 2D dynamic and 2D static scheduling perform well with reasonably large (minimum) patch sizes – around 5% less efficiently than FGLS. One point of note is the performance of 2D static scheduling on eight threads: the regular block decomposition used for this schedule cannot decompose the domain into eight patches of equal size, and any other decomposition is sub-optimal on this number of threads. This shows that the use of an RCB method to decompose the domain for 2D static and dynamic scheduling could be beneficial in some circumstances.


Figure 6.4: Best parallel efficiency of all schedule types with a Gauss load in the “communication only” regime.

An RCB decomposition is used to create one partition per thread in 2D affinity scheduling, and good performance is seen at all thread counts in that case. For affinity scheduling the decomposition of the local allocation into a queue of chunks only serves to incur an additional cost in this case, because the load is constant throughout the domain.

The relative performance of each schedule type for the balance regime run with a Gauss load is shown in Figure 6.6. The overall performance characteristics observed are very similar to those for the constant load. All of the schedule types except the one-dimensional static and dynamic algorithms are capable of balancing the load and keeping the boundary element count down to acceptable levels. The large overheads that dynamic affinity scheduling has to bear in its current form counteract the gains from dynamic load balancing. In general each of the other schedulers again performs reasonably well. However, it should be noted that the performance of 2D static scheduling is quite sensitive to the patch size and it only performs well for specific values; 2D dynamic and guided scheduling perform well over a broader range of chunk sizes, which gives them more practical value. 2D FGLS is the most efficient algorithm tested. The streamlined 2D FGLS formulation and implementation scales well with repartitioning threshold levels between approximately 2-6%: just high enough to ignore fluctuations around the base level, but low enough to attain a reasonable load balance. The synchronisation necessary for chunk re-allocation between local queues in affinity scheduling may be responsible for its relatively poor performance. There is a conflict of interest here: if the minimum chunk size is small then the domain is finely decomposed, which allows the load to be distributed more evenly onto processors; however, smaller chunks mean more of them, which increases the synchronisation overheads.


[Chart: "Constant Load - Balance Regime"; parallel efficiency (y-axis, 0.00-1.00) against number of threads (x-axis, 2-16) for each schedule type.]

Figure 6.5: Best parallel efficiency of all schedule types with a constant load in the “balance” regime.

[Chart: "Gauss Load - Balance Regime"; parallel efficiency (y-axis, 0.00-1.00) against number of threads (x-axis, 2-16) for each schedule type.]

Figure 6.6: Best parallel efficiency of all schedule types with a Gauss load in the “balance” regime.

6.1 Comparison of the Feedback Guided Schedulers

2D dynamic affinity scheduling and 2D FGLS both rely on data from previous iterations to re-partition the domain at each iteration, and incur overheads as a result. 2D dynamic affinity has the additional overhead of decomposing the partitions into patches. This affords it a mechanism to re-allocate iterations to idle threads during execution and so balance the load dynamically during patch processing. FGLS does not have this ability, and its performance relies solely on a good decomposition at the start of each iteration.

Figure 6.7 shows the iteration times over the course of a run with a Gauss load using 2D dynamic affinity scheduling and 2D FGLS scheduling. In feedback guided loop scheduling the initial decomposition gives a poor load balance, so the cost per iteration is initially high. As the run progresses the decomposition approaches an optimal state. The partition re-sizing at each iteration can be under-relaxed, although this clearly gives no performance advantage here: 2D FGLS achieves an optimal decomposition typically after only 3-5 iterations with no under-relaxation.

For 2D dynamic affinity scheduling the cost of the first iteration is low, as the domain has been decomposed prior to the start of the iterative loop. On subsequent iterations there is a near fixed cost associated with re-partitioning the domain and then building the local queues. Any load imbalance can be accounted for by chunk re-allocation, therefore the iteration time is almost constant over the course of the run in this example, and processor affinity seems to offer little or no performance gain. Under-relaxation of the partition size updates between iterations is necessary in dynamic affinity scheduling to prevent wild oscillations occurring. In the example illustrated the system reaches an equilibrium state after approximately 25 iterations. This can be seen by inspection of the decomposition between iterations (see www.qaop.org for an animation) but cannot be seen by inspecting this particular chart, as the dynamic chunk re-allocation balances the load in situ and maintains an almost constant iteration time.

The decomposition algorithm for 2D FGLS (discussed in Section 4.2.5) is asymptotically inexpensive and scales as the square of the number of threads. It has consistently proved to be faster than the decomposition typically required by dynamic affinity at each iteration, primarily because the thread partitions require further decomposition in dynamic affinity scheduling. The relative cost is reflected in the iteration times shown on the chart.

The overheads come down when the re-partitioning in 2D FGLS scheduling is set to occur only if the load imbalance on the previous iteration is above a certain threshold level. The blue line in Figure 6.7 represents 2D FGLS with a threshold level of 2%. It can clearly be seen that the re-partitioning overhead is a large fraction of the iteration time in this particular case. In this example, re-partitioning is only necessary for the first five iterations, until the performance levels off. Once the imbalance falls below the threshold and the re-partitioning is turned off, a 50% speed increase is achieved. In this particular example the threshold level is exceeded again at iteration number seven and a further re-partitioning occurs, which increases the iteration time considerably.
When the load is unbalanced, for example when dealing with a Gaussian load distribution, and a dynamic scheduling algorithm is in use, there is always a trade-off between minimising overheads by keeping the patch count small and having many small patches so that the algorithm can balance the load dynamically. Without knowing the load distribution in advance, it is difficult to tailor the decomposition to the problem. Dynamic affinity has the potential to get around this problem: the feedback-guided re-partitioning at each iteration should eventually give each processor an even amount of work, and if a large enough minimum patch size is used then each thread should only have a small number of patches in its local queue.
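To make the comparison above concrete, the following sketch shows the essential feedback-guided update in its 1D form: partition widths are rescaled in proportion to the measured per-thread processing rate, the update may be under-relaxed, and it is skipped entirely when the measured imbalance is below a threshold. The function and variable names, and the particular imbalance metric, are illustrative assumptions rather than the project's implementation.

    /* Feedback-guided resizing of the per-thread partition widths, shown in
     * its 1D form for clarity (the 2D scheduler applies the same idea along
     * both axes).
     *
     *   width[p]  : current number of rows owned by thread p (sums to nrows)
     *   t[p]      : time thread p spent processing its partition last iteration
     *   relax     : under-relaxation factor (1.0 means no under-relaxation)
     *   threshold : skip re-partitioning if the imbalance is below this (e.g. 0.02)
     *
     * Returns 1 if the partition was changed, 0 if it was left alone.
     * Assumes nthreads <= 64 for brevity.
     */
    int fgls_repartition(int nthreads, int nrows, int width[],
                         const double t[], double relax, double threshold)
    {
        double tmax = 0.0, tsum = 0.0;
        for (int p = 0; p < nthreads; p++) {
            if (t[p] <= 0.0) return 0;          /* no usable timing data yet */
            tsum += t[p];
            if (t[p] > tmax) tmax = t[p];
        }
        double imbalance = tmax / (tsum / nthreads) - 1.0;  /* 0 = perfect balance */
        if (imbalance < threshold)
            return 0;                           /* cheap path: keep the old partition */

        /* Ideal width is proportional to the measured rate, width[p] / t[p]. */
        double rate[64], norm = 0.0;
        for (int p = 0; p < nthreads; p++) {
            rate[p] = (double)width[p] / t[p];
            norm += rate[p];
        }
        int assigned = 0;
        for (int p = 0; p < nthreads; p++) {
            double target  = rate[p] / norm * nrows;
            double blended = relax * target + (1.0 - relax) * width[p];
            width[p] = (p == nthreads - 1) ? nrows - assigned     /* absorb rounding */
                                           : (int)(blended + 0.5);
            if (width[p] < 1) width[p] = 1;
            assigned += width[p];
        }
        return 1;
    }

Calling this with relax = 1.0 and threshold = 0.0 would correspond to plain FGLS; relax = 0.2 and a 2% threshold correspond to the under-relaxed and threshold-limited variants plotted in Figure 6.7.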


[Chart: iteration time (s) against iteration number (0-30) for Dynamic Affinity 2D, FGLS 2D with relax = 0.2, FGLS 2D, and FGLS 2D with a 2% threshold.]

Figure 6.7: Iteration times for 2D dynamic affinity and 2D FGLS schedules working on a Gauss load. Both schedules have a decomposition overhead at each iteration. Dynamic affinity scheduling can balance the load during execution by re-allocating iterations to threads, whereas in FGLS each thread must process the single chunk that it is allocated at the start of the iteration. The blue line labelled "limit" represents modified FGLS in which re-partitioning only occurs when the load imbalance exceeds a threshold limit.


Unfortunately, the overheads associated with 2D dynamic affinity scheduling in its current implementation are high, and this benefit cannot currently be realised in practice. 2D FGLS, on the other hand, is a fast, efficient and optimised algorithm; with only one patch per thread it is not susceptible to this trade-off.

Chapter 7

Conclusions

The following conclusions have been drawn from the scheduling algorithms created, benchmarked and reported in this document:

1. The OpenMP standard static and dynamic schedule types can be inefficient at parallelising computations involving highly two-dimensional loads such as a 2D Gaussian distribution. The poor performance stems from the need to use a very small chunk size to distribute the load equally between threads. If nearest-neighbour interactions are a feature of the algorithm then very thin chunks can lead to large boundary-data communication overheads, which reduces the usefulness of these schedule types. Dynamic scheduling can perform worse than static scheduling in some cases as it does not encourage data affinity between successive iterations of the solver.

2. Simple two-dimensional equivalents of static and dynamic scheduling, which use a 2D block decomposition, can offer significant performance improvements over the 1D schedule types as (i) the fraction of boundary elements present in the decomposition is reduced (see the short sketch after this list) and (ii) the load can be split more evenly between the available threads. 2D static scheduling can scale well but is sensitive to the decomposition parameters. 2D dynamic scheduling is far less sensitive to the choice of decomposition parameters and a range of patch sizes give good performance.

3. The two-dimensional equivalents of guided scheduling constructed and tested offer a slight performance advantage over 2D static and 2D dynamic scheduling. The gradient in patch size across the domain gives the load balancing algorithm a better chance of balancing the load accurately than 2D dynamic scheduling does, so a slight performance increase is observed.

4. The 2D isotropic guided decomposition algorithm in its current formulation creates a large number of patches. This can reduce the performance of the patch scheduler that it is used alongside. RCB decomposition results in a lower number of patches for the specified input weights (patch sizes), and a much smoother distribution can be attained. Both algorithms are very robust, unlike the step-size decomposition strategies (Olkhovskyy and recursive).

5. 2D affinity scheduling can offer good scaling when dealing with uneven loads. However, the data affinity that the method hopes to achieve does not improve its performance noticeably.

6. The cost of decomposing the domain at each iteration cripples the performance of 2D dynamic affinity scheduling in its current implementation. Due to the data structures used to store the geometric decomposition information, and the serial two-tier decomposition strategy, the overheads are extremely prohibitive.


7. 2D feedback guided loop scheduling with load-imbalance threshold-controlled re-partitioning is a fast and efficient nested loop scheduling algorithm. It performs well over a good range of input parameters on the loads tested, and therefore has real practical value.
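As a rough illustration of point 2, the snippet below estimates the number of boundary (halo) elements produced by a 1D strip decomposition and by a 2D block decomposition of the same square domain; the numbers are a back-of-the-envelope estimate for nearest-neighbour interactions, not output from the benchmark code.

    #include <math.h>
    #include <stdio.h>

    /* Halo-element estimate for an N x N domain split over p threads.
     *   1D strips : (p - 1) internal cuts, each exposing 2*N boundary elements.
     *   2D blocks : a px * py grid exposes (px - 1)*2*N + (py - 1)*2*N elements. */
    int main(void)
    {
        int N = 1000, p = 16;
        int px = (int)sqrt((double)p), py = p / px;   /* assumes p is a perfect square */

        long strips = (long)(p - 1) * 2 * N;
        long blocks = (long)(px - 1) * 2 * N + (long)(py - 1) * 2 * N;

        printf("1D strips: %ld boundary elements\n", strips);  /* 30000 */
        printf("2D blocks: %ld boundary elements\n", blocks);  /* 12000 */
        return 0;
    }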

7.1 Project Management

This project was described in the MSc Project Proposals 2005-06 on the MSc homepage:

"This project will investigate algorithms for scheduling nested loops on shared memory multiprocessors. Many algorithms exist for scheduling of single loops, some of which are included in OpenMP, for example. These can also be used for nested loops, by collapsing the nest, but this solution is not always optimal, especially in the presence of inter-thread dependencies. This project will investigate scheduling of multiple loop nests, using both existing and new algorithms. This project will require code development from scratch, so is suitable for either C or Fortran programmers. Suitable test cases and benchmarking framework will be developed to test the algorithms, and benchmarking will be carried out on Lomond and on single nodes of HPCx. This project will give the student an opportunity to do some original research, and if the quality of the work is sufficient, may lead to a published paper."

A collection of legacy code relating to the project was made available at the start of the project. It became apparent that the aim of the project should be to extend and improve work that had been previously undertaken in the field of nested loop scheduling. The primary focus at the start of the project was the work undertaken by Olkhovskyy, a summer intern at the EPCC in 2000.

The overall quality of the legacy code proved to be a setback during the early phases of the project. The sheer volume, poor structure and naming conventions, and lack of documentation and commenting hindered initial progress. Rather than re-using much of the code as hoped, a great deal had to be rewritten to ensure correctness and allow extension into the areas of current interest. This put the project back somewhat during the first month. In retrospect it may have been far less effort to start from scratch at the outset, but it seemed wasteful to "re-invent the wheel" at the time.

It was agreed between the author and the supervisor that the aim of the project was to attempt to reproduce the results detailed by Olkhovskyy, improve upon his 2D guided scheduling algorithm and then extend it into 2D affinity scheduling. It was not possible to accurately reproduce the performance characteristics previously observed, as there was a lack of data concerning the runs that had been carried out. Further, the machine on which the benchmarking had been performed had long since been decommissioned.


To this end, each of the schedule types previously reported was re-tested in the new environment.

The suite of code that was created to allow benchmarking of the existing and novel scheduling algorithms was designed from the outset to be intuitive, transparent and easily extendible. As a result the project took on a much more development-oriented slant in its middle phase. The final code suite comprises considerably more functionality than was initially envisaged:

• A facility to accurately time and assess the performance of the standard OpenMP and novel scheduling algorithms for a range of loads

• 2D versions of static and dynamic scheduling

• 2D guided scheduling with a choice of several decomposition algorithms and control over decomposition cell aspect ratios

• 2D affinity scheduling in which the decomposition algorithms created can be employed to form the local queues

• 2D dynamic affinity scheduling

• An optimised version of 2D feedback guided loop scheduling

• Full visualisation of domain decomposition and patch processing data for each of the 2D schedules

• Statistical analysis of the load balance and other metrics during execution

• In-execution data/visual output if required

• In-built validation tests

The large range of combinations of schedule type, model size, load type, decomposition type and algorithm-specific parameters has the potential to produce vast amounts of data, much of which would be of little value to the research at hand or to any interested parties. The benchmarking runs detailed in this report (Chapter 5) were therefore agreed upon to test the functionality and performance of each of the schedulers efficiently in the time available. The range of experiments performed has given some very useful information on the relative merits of each of the schedulers and decompositions, and on when they are applicable. It would have been interesting to test the performance on a wider range of loads, possibly time-dependent loads, and on alternative architectures; however, a lack of time and resources in the latter half of the project prevented this. Lack of computational resources also restricted the number of processors that could be used to assess scalability. It would also have been useful to have the code solving a real-world problem such as fluid flow instead of the simple image filter, but this was not the focus of the report and would not have affected the observed characteristics of the algorithms.

Concurrent Versions System (CVS) revision control software has been used throughout development, benchmarking and reporting to reduce risks associated with loss of data. Lomond was chosen as the home of the repositories as it is the most stable data resource available. Benchmarking commenced as soon as it was possible to do so, to increase the likelihood of producing good-quality, clean data to present at the end of the project, and to allow maximum time to utilise the HPC resources.


Regular minuted project meetings have helped steer the project successfully towards its goals. It is unclear whether or not any of the work is of a publishable standard. A more accessible interface to the new schedule types would increase their practical value, and further benchmarking and validation should be carried out to fully understand the functionality and applicability of each of the novel nested loop schedulers developed.

7.2 Future Work

The suite of code as it stands does not represent a final product. There are many optimisations that could be made to the algorithms to improve their performance and usability, and the performance of each has yet to be exhaustively tested. The following areas may be of interest if this work is continued at a future date:

1. The 2D isotropic guided decomposition algorithm in its current formulation produces the square of the number of patches created by the 1D guided scheduling algorithm. This may be responsible for reducing the performance of scheduling algorithms used with this decomposition. It may be beneficial to scale the number of divisions per side by a factor to give a lower number of patches.

2. It is possible to decompose the domain for use in 2D static and 2D dynamic scheduling with an RCB decomposition rather than a block decomposition. This would allow the algorithms to be more flexible and produce a desired number of patches per thread.

3. Both 2D FGLS and 2D dynamic affinity scheduling involve re-partitioning of the domain between iterations. 2D dynamic affinity scheduling incurs an additional overhead as each partition must then be decomposed into a local queue of patches. It is possible that this stage of the decomposition algorithm could be parallelised to reduce the re-partitioning overhead.

4. When using 2D FGLS it has been observed that for a static load the load imbalance falls to a base level and proceeds to oscillate about that level. Presently the value of the load imbalance is used as the re-partitioning criterion. This presents a lack of generality, as the base level is model dependent, although it is easily identified by inspection of iteration times. To increase the generality and usability of the code, some transient statistical analysis of the load imbalance could be used as the re-partitioning criterion instead. A successful implementation of such a scheme would make 2D FGLS a robust, efficient and parameterless scheduler.

5. The overhead associated with re-partitioning of the domain in 2D dynamic affinity and 2D FGLS scheduling can inhibit scaling at large processor counts. In the current implementation of 2D FGLS scheduling it is possible to reduce this effect by only re-partitioning if the load imbalance (as calculated by analysis of thread processing times) exceeds a prescribed amount. This is not an option in 2D dynamic affinity scheduling, and the observed performance is poor as a result. If this were implemented in dynamic affinity scheduling then the re-partitioning criterion should not be based on thread timings, as the dynamic load balancing mechanism should ensure all threads are well balanced during execution. Instead, the criterion should be based on the fraction of each thread's initial allocation of iterations that is processed by that thread, which is a measure of how much inter-thread chunk migration occurs (a sketch of such a criterion is given after this list).


6. More work should be done to minimise the overheads associated with the schedule types; the 2D FGLS analysis has shown that this is critical to performance. The 2D static and 2D dynamic schedule algorithms might scale better if their overheads were reduced. It is difficult to keep the code maintainable, comprehensible and extendible without the use of abstract data types, but flat data structures can offer more scope for optimisation and improve performance. This may be a consideration if 2D scheduling algorithms are to be used to parallelise loops in production codes.

7. The 2D RCB algorithms have a parameter that can be used to steer the decomposition towards cells (patches/partitions) of a target aspect ratio. In the benchmarking carried out to produce the results presented in this report, padding of the main data arrays was used to nullify the effect of aspect ratio on performance. This increased the size of the arrays and therefore reduced the problem size that could be studied (due to cache size considerations and the potential for superlinear scaling). However, in practice one would like to put all of the available memory to good use and not pad. In this case the decomposition aspect ratio becomes a factor in the performance of the scheduling algorithm, and its effect should be investigated.

8. Transient loads often occur in simulations. It would be interesting to observe how each of the scheduling algorithms copes with a transient load. Transient loads are built into the code for this purpose.
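A possible form of the re-partitioning criterion proposed in point 5 is sketched below; the function name, counters and the 5% trigger value are assumptions for illustration only.

    /* Hypothetical criterion for future-work point 5: trigger a re-partition of
     * a dynamic affinity schedule when too many iterations migrate away from
     * the thread they were initially allocated to, rather than using timings.
     *
     *   initial[p] : iterations initially allocated to thread p's local queue
     *   kept[p]    : of those, the number actually processed by thread p
     */
    int needs_repartition(int nthreads, const long initial[], const long kept[],
                          double max_migrated_fraction)
    {
        long alloc = 0, moved = 0;
        for (int p = 0; p < nthreads; p++) {
            alloc += initial[p];
            moved += initial[p] - kept[p];   /* iterations taken by other threads */
        }
        double migrated = (alloc > 0) ? (double)moved / (double)alloc : 0.0;
        return migrated > max_migrated_fraction;   /* e.g. 0.05 for 5% */
    }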

Appendix A

Code Usage

The revised code that accompanies this report can be used to build a single executable that comprises a suite of nested loop scheduling algorithms and domain decomposition routines, which together can be applied to a range of in-built load types. Each of the individual scheduling, decomposition and load functions has various parameters; a set of "sensible" values for each is set by default. The key parameters for each function can be provided by the user in an input deck that accompanies the executable at runtime. An input deck presents clearly to the user the parameters that are going into the model. The input deck is parsed at the start of the program, immediately after the default parameter values are set, and therefore takes precedence over the defaults.

Some fundamental parameters (such as the chunk/patch size for 2D scheduling) can be set through environment variables for convenience when undertaking scripted benchmarking, much in the way that OpenMP uses environment variables such as OMP_NUM_THREADS and OMP_SCHEDULE. One such variable that has been introduced is OMP_2D_N, which represents a patch side parameter for the 2D decompositions. The existence of these environment variables can be important when running interactively, as they automatically override any default values or values from the input deck. When submitting a job to a job queue the environment variables will need to be set in the relevant script. In the bash shell an environment variable can be set and removed with the export and unset commands. More information is available in the read-me document that accompanies the source code.
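The precedence described above (defaults, then the input deck, then environment variables) could be handled along the following lines; apart from OMP_2D_N itself, the function and parameter names are hypothetical and not taken from the accompanying code.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the 2D patch-side parameter: start from a default, let an input-deck
     * value (if any) replace it, and let the OMP_2D_N environment variable
     * override both.  A deck_value of -1 means "not present in the input deck". */
    int get_patch_side(int default_value, int deck_value)
    {
        int n = default_value;
        if (deck_value > 0)
            n = deck_value;

        const char *env = getenv("OMP_2D_N");
        if (env != NULL) {
            int v = atoi(env);
            if (v > 0)
                n = v;
            else
                fprintf(stderr, "Ignoring invalid OMP_2D_N=%s\n", env);
        }
        return n;
    }

For example, running export OMP_2D_N=32 in a bash shell before an interactive run would override both the default and any value given in the input deck.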

Auto-tuning of Computation to Communication Ratio

The code can assess the computation to communication ratio at each iteration and, in response to a target ratio, automatically adjust the computation weighting coefficient on the fly. However, the process has a fixed cost that does not scale, and it can dominate performance at large processor counts where the iteration time is very low. Auto-tuning is useful for finding a coefficient that corresponds to a particular computation to communication regime for a given problem; the coefficient found can then be used in subsequent benchmarking with the auto-tuning mechanism turned off.
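A minimal sketch of this kind of on-the-fly adjustment is given below, assuming that the computation weighting coefficient scales the per-element work and that the measured ratio is available at the end of each iteration; the proportional update rule shown is an illustrative assumption rather than the project's actual tuning logic.

    /* Nudge the computation weighting coefficient towards a target
     * computation-to-communication ratio.  'gain' controls how aggressively
     * the coefficient is corrected each iteration (e.g. 0.5). */
    double tune_coefficient(double coeff, double measured_ratio,
                            double target_ratio, double gain)
    {
        if (measured_ratio <= 0.0)
            return coeff;                      /* nothing measured yet */
        double correction = target_ratio / measured_ratio;
        return coeff * (1.0 + gain * (correction - 1.0));
    }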


Appendix B

Glossary

RCB: Recursive Coordinate Bisection.

FGLS: Feedback guided loop scheduling, in which the time taken to process regions of iteration space determines the decomposition on the succeeding iteration.

domain: The n-dimensional range of iteration space over which an algorithm operates.

partition: A subdivision of a domain that is attributed to a particular thread.

patch: The 2D equivalent of a chunk (in the OpenMP sense) of iteration space that will be processed as a single item by a thread.

cycle: An iteration over every point in the computational domain by the solver.

cc-NUMA: Cache-coherent non-uniform memory access multiprocessor.

SMP: Symmetric multiprocessor.


Bibliography

[1] "OpenMP: Simple, Portable, Scalable SMP Programming", www.openmp.org

[2] O. Olkhovskyy (2001) "Feedback Guided Scheduling for Two-Dimensional Loops", EPCC-SPP-2001-2002 [was this published?]

[3] "The Message Passing Interface (MPI) standard", http://www-unix.mcs.anl.gov/mpi/ (01/12/2005)

[4] D. L. Eager and J. Zahorjan (1992) "Adaptive Guided Self Scheduling", Technical Report 92-01-01, Department of Computer Science and Engineering, University of Washington.

[5] E. P. Markatos and T. J. LeBlanc (1994) "Using Processor Affinity in Loop Scheduling on Shared Memory Multiprocessors", IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, pp. 379-400.

[6] S. Subramaniam and D. L. Eager, "Affinity Scheduling of Unbalanced Workloads", in Proceedings of Supercomputing '94, IEEE Comp. Soc. Press, pp. 214-226.

[7] http://www.epcc.ed.ac.uk/computing/services/sun/documents/hpc-intro/html/Hardware.html (01/12/2005)

[8] "Timing from within C Programs", http://www.epcc.ed.ac.uk/computing/training/courses/course-faq/timing/c.html (01/12/2005)

[9] T. H. Tzen and L. M. Ni (1993) "Trapezoidal Self-Scheduling: A Practical Scheduling Scheme for Parallel Computers", IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 1, pp. 87-98.

[10] J. M. Bull (1998) "Feedback Guided Dynamic Loop Scheduling: Algorithms and Experiments", Proceedings of EuroPar '98, Lecture Notes in Computer Science, Springer-Verlag.

[11] C. Ball and M. Bull (2003) "Point-to-Point Synchronisation on Shared Memory Architectures", UKHEC Technical Report, http://www.ukhec.ac.uk/publications/reports/neigh_synch.pdf (30/11/2005)

[12] S. Hummel, E. Schonberg and L. Flynn (1992) "Factoring: A Practical and Robust Method for Scheduling Parallel Loops", Communications of the ACM, 35(8):90-101.

[13] "User Notes on Fortran Programming (UNFP)", section 1-2 "Comparison between FORTRAN and C", http://www.ibiblio.org/pub/languages/fortran (22/08/2006)