Exploring performance of Xeon Phi co-processor

Mateusz Iwo Dubaniowski

August 21, 2015

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2015

Abstract

The project aims to explore the performance of the Intel Xeon Phi co-processor. We use various parallelisation and vectorisation methods to port an LU decomposition library to the co-processor. The popularity of accelerators and co-processors is growing due to their good energy efficiency characteristics and the large potential for further performance improvements. These two factors make co-processors suitable to drive innovation in high performance computing forward, towards the next goal of achieving Exascale-level computing. Due to increasing demand, Intel has delivered a co-processor designed to fit the requirements of the HPC community, the Intel MIC architecture, of which the most prominent example is Intel Xeon Phi. The co-processor utilises the many-core principle: it provides a large number of slower cores supplemented with vector processing units, thus forcing a high level of parallelisation upon its users.

LU factorisation is an operation on matrices used in many fields to solve linear systems, invert matrices, and calculate matrix determinants. In this project we port an LU factorisation algorithm, which uses the Gaussian elimination method to perform the decomposition, to the Intel Xeon Phi co-processor. We use various parallelisation techniques including Intel LEO, OpenMP 4.0 pragmas, Intel's Cilk array notation, and the ivdep pragma. Furthermore, we examine the effect of data transfer to the co-processor on the overall execution time. The results obtained show that the best level of performance on Xeon Phi is achieved with the use of Intel Cilk array notation to vectorise and OpenMP 4.0 to parallelise the code. Intel Cilk array notation, on average across sparse and dense benchmark matrices, results in a speed-up of 27 times over the single-threaded performance of the host processor. The peak speed-up achieved with this method, across the attempted benchmarks, is 49 times better than that of a single thread of the host processor.

Contents

Chapter 1 Introduction
1.1 Obstacles and diversions from the original plan
1.2 Structure of the dissertation
Chapter 2 Co-processors and accelerators in HPC
2.1 Importance of energy efficiency in HPC
2.2 Co-processors and the move to Exascale
2.3 Intel Xeon Phi and other accelerators
2.4 Related work
Chapter 3 Intel MIC architecture
3.1 Architecture of Intel MICs
3.2 Xeon Phi in EPCC and Hartree
3.3 Xeon Phi programming tools
3.4 Intel Xeon – host node
3.5 Knights Landing – the future of Xeon Phi
Chapter 4 LU factorization – current implementation
4.1 What is LU factorization?
4.2 Applications of LU factorization
4.3 Initial algorithm
4.4 Matrix data structure

Chapter 5 Optimisation and parallelisation methods
5.1 Intel "ivdep" pragma and compiler auto-vectorisation
5.2 OpenMP 4.0
5.3 Intel Cilk array notation
5.4 Offload models
Chapter 6 Implementation of the solution
6.1 Initial profiling
6.2 Parallelising the code
6.2.1 Hotspots analysis
6.3 Offloading the code
6.4 Hinting vectorisation with ivdep
6.4.1 Hotspots for further vectorisation
6.5 Ensuring vectorisation with Intel Cilk and OpenMP simd
6.5.1 Intel Cilk array notation
6.5.2 OpenMP simd pragma
Chapter 7 Benchmarking the solution
7.1 Matrix format
7.2 University of Florida sparse matrix collection
7.3 Dense benchmarks
7.4 Summary of benchmarks' characteristics
Chapter 8 Analysis of performance of Xeon Phi
8.1 Collection of results
8.2 Validation of the results
8.3 Overview of results

8.4 Speed-up with different optimisation options
8.5 Native speed-up on Intel Xeon and on Intel Xeon Phi
8.6 Offloading overhead
8.7 Speed-up on the host with different optimisation options
8.8 Running NICSLU on the host
Chapter 9 Summary and conclusions
9.1 Future work
Bibliography


List of Tables

Table 2-1: Overview of available co-processors and accelerators by vendor
Table 3-1: Overview of versions of Intel Xeon Phi available
Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31]
Table 5-2: Intel Cilk array notation example
Table 6-1: gprof profile of running the LU factorization algorithm with ranmat4500 input on 4 host threads
Table 6-2: Intel Cilk array notation use in lup_od_omp function
Table 6-3: OpenMP simd pragma usage in lup_od_omp
Table 7-1: Characteristics of benchmark matrices
Table 8-1: Execution times (in seconds) of running benchmarks offloaded to Xeon Phi with different parallelisation methods
Table 8-2: Speed-up values summary against single-threaded host execution time
Table 8-3: Code snippets explaining performance difference between simd pragma and Intel Cilk array notation
Table 8-4: Overview of performance improvements due to optimisation methods on the host – Intel Xeon
Table 8-5: Results of running NICSLU on the host


List of Figures

Figure 1-1: Performance of Top500 list systems over the past 11 years [2]
Figure 2-1: Increasing share of co-processors/accelerators in the systems from Top500 list over the past 4 years
Figure 3-1: Simple outline of a single Intel MIC core
Figure 3-2: More detailed outline of Intel MIC core
Figure 3-3: Hartree's Intel Xeon Phi racks
Figure 4-1: Matrix data structure implementation
Figure 5-1: Intel Xeon Phi execution models with Intel Xeon as the host processor [33]
Figure 5-2: Intel Xeon Phi software stack [23]
Figure 6-1: VTune threads utilisation in lup_od_omp function pre and post optimisation
Figure 7-1: Sample matrix and its Matrix Market Exchange format representation
Figure 8-1: Speed-up of benchmarks with different parallelisation techniques used
Figure 8-2: Speed-up of Xeon Phi with varying number of threads (speed-up=1 when n=8)
Figure 8-3: Offloading and execution times as percentages of the total execution time for the OpenMP 4.0 simd pragma
Figure 8-4: Offloading and execution times as percentages of the total execution time for Intel Cilk array notation
Figure 8-5: Speed-up of benchmarks including offload time with different parallelisation techniques used
Figure 8-6: Speed-up on the host – Intel Xeon – with different optimisation methods and benchmarks


Acknowledgements

Writing this master's thesis would not have been possible without the many remarkable people I have met during this year. With deep gratitude, I would like to thank my supervisor Adrian Jackson for his guidance, encouragement and thoughtful comments that made writing this work a real adventure. For motivation and support, I would like to thank my personal tutor Mark Bull. Further, I am grateful for being given the chance to take part in the HPC Conference in Frankfurt, and to all the people I encountered there, for it exposed me to a variety of ideas that have driven this work. Furthermore, I would like to thank everyone at STFC – Hartree Centre for their help and continuing support with accessing the Xeon Phis at Hartree. And, of course, thanks to Pat for some proof-reading, submitting, and general support. This dissertation makes use of the template provided by Dr. David Henty, which is based on the template created by Prof. Charles Duncan.


Chapter 1 Introduction

Co-processors and accelerators are fast becoming a hot topic in the high performance computing community. Their ability to achieve a high level of performance relative to their energy consumption attracts a significant amount of attention, and the amount of research devoted to designing, and subsequently utilising, these devices in the most efficient fashion is growing. Moreover, the performance of co-processors has not yet been fully exploited. There is no single formula for how to achieve the best performance on accelerators; in fact, there are widely varying opinions as to how accelerators should be designed, and how they should be used, to achieve their goal of improving the performance of computers, both in terms of time and energy consumption.

The other aspect attracting growing attention to co-processors and alternative ways of boosting performance is the "power wall" of multiprocessors. Multi-core systems are beginning to experience a decrease in the rate of performance improvements. [1] These effects can be seen in Figure 1-1, which shows that the rate of performance growth of the Top500 systems has dropped over the last 3 years. The rate of growth has dropped for the top of the list, the bottom of the list, and the sum of all systems on the Top500 list, and the large majority of Top500 systems are built primarily from multi-core processors. A similar issue was experienced with single-core systems a decade ago, when the frequency could no longer be increased due to energy consumption and cooling limitations. Co-processors provide an alternative way to achieve high performance through highly parallel, many-core architectures, while at the same time attaining a good level of energy efficiency per unit of computation. Moreover, many systems combine traditional CPUs and co-processors in heterogeneous configurations.

The usage of co-processors is becoming increasingly prominent, with more and more systems employing accelerators to boost their performance. On the most recent Top500 list, released in July 2015, which ranks the 500 best performing supercomputers in the world, accelerators are used in 88 out of the 500 systems. This is an increase from 75 systems using accelerators or co-processors in November 2014, when the previous Top500 list was released. [2] This suggests that co-processors are being adopted by more and more systems in order to achieve the best performance. The two best performing systems on the July 2015 Top500 list both use accelerators to achieve their performance.


Figure 1-1: Performance of Top500 list systems over the past 11 years [2]

Similarly, the most energy-efficient systems use accelerators or co-processors to achieve their level of energy efficiency. On the Green500 list released in June 2015, the first 32 most energy-efficient systems all use co-processors or accelerators. This is an increase of almost 40% over the November 2014 release of the Green500 list, where the top 23 systems employed accelerators or co-processors. Furthermore, in the November 2014 release of the Green500 list, a system broke the barrier of 5 GFLOPS/Watt (billions of floating point operations per second per watt) for the first time; that system employed an accelerator to achieve this level of energy efficiency. In the June 2015 release of the Green500, the most energy-efficient system, RIKEN's Shoubu, achieves more than 7 GFLOPS/Watt. [3]

Although the 32 most energy-efficient systems use accelerators, only 88 systems overall on the Green500 list released in June 2015 utilise accelerators. While these numbers are growing, the proportion of systems using accelerators remains relatively small. Developers and scientists are often disinclined to use co-processors or heterogeneous architectures. The process of developing for, or porting software to, accelerators is complicated and not standardised. Additionally, these efforts carry the risk of conflicting with optimisations and developments for regular CPU-only architectures.

The OpenMP 4.0 standard aims to unify the programming model of co-processors and heterogeneous systems by providing a standard set of pragmas for programming these devices. Furthermore, OpenMP pragmas can easily be switched off when compiling the code for a CPU-only system. Similarly, Intel provides utilities such as Intel Cilk to aid programming of its Intel MIC architecture devices, and AMD and NVIDIA support OpenACC, which, similarly to OpenMP, is a set of pragmas designed to help with programming heterogeneous systems. Furthermore, Intel's new generation of Intel Xeon Phi co-processors will be self-bootable and compatible with Intel Xeon's ISA (instruction set architecture), so the execution and programming of heterogeneous systems will become simpler.

Intel Xeon Phi is Intel's co-processor implementing the Intel MIC architecture. It is designed on the many-core principle, where a large number of simple cores enable parallel execution of applications. This is in contrast to multi-core processors, where each chip contains a lower number of more sophisticated cores. The motivations behind this approach are improved floating point performance through a wider issue of floating point instructions, better memory utilisation, and lower energy consumption due to reduced clock frequency.

The Intel Xeon Phi co-processor is aimed primarily at the scientific and research community and is meant to meet the needs of these fields. This is in contrast to GPU (graphics processing unit) systems, which often stem from the multimedia and gaming industries and thus might have different priorities. Consequently, Intel provides a wide range of tools for optimisation and development of scientific code on Intel Xeon Phi. These include Intel MKL, a maths operations library, and Intel Parallel Studio XE, a software development suite.

LU factorisation, or decomposition, is a mathematical operation on matrices which is widely used in solving linear systems and deriving the inverse or determinant of a matrix. These operations have wide industrial applications, among others in design automation, machine learning, and signal processing. Since implementations of LU decomposition tend to contain many loops and operations on rows of matrices, LU factorisation is highly parallelisable and can potentially be vectorised.

The purpose of this project was to explore the performance of the Intel Xeon Phi co-processor. We aimed to analyse how the performance of Intel Xeon Phi varies when different parallelisation and vectorisation techniques are employed. This was achieved by porting an LU factorisation library to Intel Xeon Phi using various models of execution.
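To illustrate the directive-based offloading mentioned above, the sketch below shows a generic OpenMP 4.0 target region in C. It is a minimal illustration rather than code from this project; the function name and array size are hypothetical, and if the pragmas are disabled the loop simply runs on the host.

#define N 4096

/* Minimal sketch (not project code): a vector update offloaded with
 * OpenMP 4.0 target directives.  The map clauses describe the data
 * transferred to and from the co-processor. */
void vec_update(double *a, const double *b, const double *c)
{
    #pragma omp target map(to: b[0:N], c[0:N]) map(from: a[0:N])
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + 2.0 * c[i];
    }
}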

1.1 Obstacles and diversions from the original plan

During the course of the project, we ran into several obstacles which we had anticipated during the project planning stage. Initially, we aimed to port the NICSLU library [4] to Intel Xeon Phi in order to explore the performance of the latter. NICSLU is a library which efficiently implements LU factorisation aimed at electronic design automation applications. However, during the course of the project it emerged that the original NICSLU code, implemented in C, is not compatible with the Intel Xeon Phi hardware. The original NICSLU code was written for the Intel386 architecture, which allows unaligned data accesses, whereas the K1OM architecture implemented by Xeon Phi does not allow improperly aligned data accesses. The workspace part of NICSLU's application memory, used for performing operations on matrix arrays, is implemented with the use of unaligned accesses. Therefore, it became impossible to allocate the workspace memory on Xeon Phi in a similar fashion to the original. The unaligned accesses to the workspace affect many different parts of the library, and changing the allocation method of the workspace proved to cause too many legacy inconsistencies throughout its code.

Since the NICSLU code base is considerably large, we decided to move to a different LU factorisation implementation. The LU decomposition code used in this project is a combination of two implementations adapted by us for this project through the introduction of OpenMP directives. [5] [6] This allowed us to continue the project without major interruptions. The move to a different code base had been anticipated in the risk analysis at the project preparation stage, so we had a contingency plan in place. [7]
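As an illustration of the alignment constraint described above (and not NICSLU's actual workspace code), the sketch below shows one common way to obtain 64-byte aligned storage in C; the function name and sizes are hypothetical.

#include <stdlib.h>

/* Illustrative sketch of 64-byte aligned allocation (not NICSLU code):
 * buffers accessed by Xeon Phi's 512-bit vector loads generally need to
 * be 64-byte aligned, so a plain malloc()-based workspace may have to be
 * replaced by an aligned allocation such as posix_memalign(). */
double *alloc_aligned_workspace(size_t n_doubles)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n_doubles * sizeof(double)) != 0)
        return NULL;            /* allocation failed */
    return (double *)p;         /* release with free() as usual */
}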

1.2 Structure of the dissertation

In this report, we present the work carried out throughout the dissertation project. From this point onwards, the report is structured in the following manner. Initially, in Chapter 2, we introduce the concept of co-processors and accelerators and consider them in detail in the context of high performance computing; we discuss energy efficiency and the move to Exascale, introduce Intel Xeon Phi, and compare it to other accelerators. Subsequently, in Chapter 3, we describe the Intel MIC architecture, explain the systems we used throughout the project, and outline the future of Xeon Phi – Knights Landing. Then, in Chapter 4, we introduce LU factorisation, describe its applications and the algorithm used to perform the computation, and explain the data structure used to store matrices.

In Chapter 5, we introduce the methods used to parallelise and vectorise the solution, such as OpenMP pragmas and Intel Cilk array notation. In Chapter 6, we describe the implementation of the solution and how we went about parallelising the code. In Chapter 7, we outline the benchmarks used to compare the different parallelisation and vectorisation methods. In Chapter 8, we present and analyse the results obtained after running the benchmarks on Xeon Phi with different parallelisation and vectorisation options; we also report the speed-ups and show the impact of offloading on the execution time. Finally, in Chapter 9, we present a summary and conclusions, and introduce ideas for future work that could be completed if the project continued beyond the prescribed timeframe.


Chapter 2 Co-processors and accelerators in HPC

Co-processors and accelerators are widely regarded as the future of high performance computing. Their prevalence is growing, and with recent advancements in engineering technology they have become an important avenue for improving the performance of HPC systems. This is hardly surprising given the current state of the supercomputing industry: the limits on single-thread performance force researchers and industrial institutions to focus on developing highly parallel systems. Co-processors meet this demand by providing substantial parallelism to achieve high levels of performance.

This chapter helps to establish the position of this dissertation within the wider context of the field of high performance computing. We examine the motivations and background behind investigating co-processors, and Intel Xeon Phi in particular. We outline the aspects of energy efficiency and how accelerators contribute to the move to Exascale computing (systems able to perform at least 10^18 floating point operations per second, i.e. one exaFLOPS), widely considered to be the next step in the evolution of HPC systems. Moreover, we describe recent advancements in co-processors and how they shape the landscape of high performance computing. We present the major features of accelerators that contribute to their growing popularity, as well as a selection of accelerators and co-processors widely used in modern systems. Finally, we perform a brief review of similar and related work on exploring the performance of accelerators.

2.1 Importance of energy efficiency in HPC

The continued progress in performance of HPC systems is predicated on the energy efficiency of these systems. Systems constantly need to meet the demand for higher speeds and more operations performed every second, and this goal naturally stands in contradiction to lower energy consumption. The ability to sustain the current trend of increasing performance while keeping energy consumption at a level acceptable to operators is the "holy grail" of high performance computing.

The appearance of the "power wall" over the past decade has forced research to focus on parallelisation rather than further increases in frequency or decreases in the size of chips. Chip designers were no longer able to provide faster computation solely through increases in frequency, because the energy consumed by, and required to cool, such chips was becoming unsustainable. This has resulted in the emergence of parallel architectures in all fields of computer science, and at all levels of computer architecture. In the June 2015 issue of the Top500 list, over 96% of systems use chip multiprocessors with more than 6 cores. [2]

Currently, however, the rate of performance improvements in new multiprocessors is decreasing. This suggests that we are approaching the point where multiprocessor technologies hit the "power wall": the benefits of placing additional cores on a single chip will become outweighed by the large energy consumption of these chips and by memory bottlenecks. Consequently, further improvements in chip multiprocessors will gradually become more difficult to sustain. This especially applies to large-scale systems such as those used in high performance computing, which consist of many multiprocessors, so each processor within such a system needs to be as energy efficient as possible.

The importance of energy consumption in HPC systems can be seen in various places. For instance, almost a third of the energy consumed by the University of Edinburgh is consumed by ARCHER, the UK's national supercomputer maintained by the University. [8] Consequently, various clever and sustainable cooling techniques are used throughout HPC systems to keep energy efficiency high.

The above examples show that developments in the field of HPC depend largely on decreasing the energy consumption of HPC systems. The large amount of energy that supercomputers use needs to be reduced in order for supercomputers to be sustainable in the future. Traditional multiprocessors are seen as no longer being able to keep up with both demands: for better performance and for better energy efficiency. Co-processors, however, present good power efficiency characteristics while at the same time having high theoretical peak performance. As a result, these chips have low energy consumption per unit of computation, expressed as high FLOPS (floating point operations per second) per Watt. What is more, co-processors' performance has not been fully explored and optimised, and there is still plenty of room for further research. Co-processors can therefore be used to deliver the highly desired energy efficiency to HPC systems.

However, co-processors are not ideal for all purposes. Applications which are not highly parallelisable and do not scale well will not benefit from accelerators such as Intel's Xeon Phi. Similarly, if very good single-thread performance is required, the co-processor might not be able to deliver it: Xeon Phi's single-thread performance is below currently accepted standards. Furthermore, many accelerators do not have sufficient functionality to run all programs (for instance, GPUs cannot run standard operating systems or business applications). Similarly, Xeon Phi's in-order execution of instructions is an obstacle to using the co-processor with applications that would benefit from out-of-order execution. Consequently, there exist areas of computing which might not benefit from the use of accelerators.

2.2 Co-processors and the move to Exascale

Energy consumption is one of the major issues in the move to Exascale, which is widely considered to be the next step in high performance computing. As such, the future of HPC is highly dependent on the ability to decrease the energy consumption of processors. This is required so that an Exascale computer, once commissioned, will be economically sustainable to maintain. Future systems must use much less energy per operation while still providing room for performance enhancements.

As mentioned in the previous section, these characteristics are similar to those of co-processors. Co-processors have high energy efficiency per operation, since their operating frequency is not as high as that of regular processors, and at the same time they have good peak performance due to their high level of parallelism. As a result, they have a very good FLOPS/Watt ratio. Consequently, the ability to utilise the energy efficiency potential of co-processors is conditioned upon developing algorithms and programming methodologies that can take advantage of their performance.

Research into efficient utilisation of co-processors is needed in order to find methods that enable users of these systems to program them efficiently. This becomes especially relevant since future Exascale systems will likely contain many co-processors. There are countless methods and libraries provided for programming co-processors, including OpenMP 4.0 pragmas, OpenACC, CUDA, Intel Cilk, and Intel LEO (Language Extensions for Offload). However, no single standard has been established that is widely accepted and recognised as the method of choice for programming accelerators. Therefore, it is valuable to explore the space of these parallelisation methods across various applications in order to understand which methods provide the best performance for those applications.
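For comparison with the OpenMP 4.0 example in Chapter 1, the following sketch shows the general shape of an Intel LEO offload region in C. It is illustrative only, not code from this project; the function name, array length and clauses are assumptions chosen for the example.

#define N 4096

/* Illustrative sketch of an Intel LEO offload region (not project code).
 * The in/inout clauses control what is copied across the PCIe bus to the
 * co-processor and back. */
void scale_add(double *y, const double *x, double alpha)
{
    #pragma offload target(mic:0) in(x:length(N)) inout(y:length(N))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] += alpha * x[i];
    }
}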


Figure 2-1: Increasing share of co-processors/accelerators in the systems from the Top500 list over the past 4 years (share of systems using accelerators per Top500 list release, Jun-11 to Jun-15)

Four out of the top ten supercomputers in the world according to the Top500 list use co-processors to achieve their high level of performance. In Figure 2-1, we can also see the growth in the share of co-processors in the Top500 list. Although systems utilising accelerators are still in the minority, we clearly notice a substantial growth in their share among the top HPC systems: the share has increased by a factor of 1.7 since June 2013. Therefore, it becomes necessary to understand the programming models these co-processors use. A better understanding would allow us to make better use of HPC systems, which are increasingly based on co-processors. The top two systems on the Top500 list both use co-processors to boost their computational power. [2] Given the rate of growth of the number of accelerators on the Top500 list, it is likely that the first Exascale computer will use co-processors to reach Exascale performance.

2.3 Intel Xeon Phi and other accelerators

There are various different implementations of accelerators and co-processors, and more and more HPC systems make use of them to accelerate their performance. Vendors produce different architectures to accelerate different applications. Historically, accelerators, or as some prefer to call them, co-processors, have been used to deal with a particular application for which they were specialised. The architecture of an accelerator was developed in a way that enabled it to deal with one particular task, so that the main processor did not have to execute it. Consequently, the main CPU saves the time it would otherwise spend executing code for which it was not optimised. Accelerators were primarily developed for multimedia applications, such as GPUs for graphics processing or audio cards for audio signal processing.

The potential of having additional computational power in the system has since been discovered by the HPC community, especially as parallelisation has spread throughout these systems. Since many scientific applications exhibit parallelism, having highly parallel SIMD (single instruction, multiple data) co-processors available allows scientists to offload the computation of these parts of the code. GPUs often consist of huge arrays of simple processing units capable of performing simple instructions. The code is then propagated (streamed) throughout these units to arrive at the final result; the code executed by each unit is called, in CUDA terminology, a "kernel". GPUs are often optimised for floating point instructions. Due to an architecture that maximises the area of the chip devoted to computation, GPUs are able to perform many hundreds of floating point instructions in lockstep in a stream. This allows GPUs to perform many floating point operations in parallel, while sacrificing branch and control instructions. As a result, GPUs can achieve a very high level of floating point performance, which is often desired not only in graphics processing but also in many scientific applications.

Currently, offloading to a GPU or a dedicated accelerator is becoming an increasingly widespread procedure across various applications of HPC. The most popular accelerators used in modern HPC systems from the Top500 list are shown in Table 2-1, with their respective vendors.

Table 2-1: Overview of available co-processors and accelerators by vendor

Vendor  | Co-processor or accelerator
AMD     | FirePro S9150, S9050
Intel   | Xeon Phi (Knights Corner)
NVIDIA  | Tesla K20, K40, K80

Table 2-1 shows the main vendors supplying accelerators: Intel, NVIDIA, and AMD. They provide their own systems and methods to program these accelerators, such as Intel Cilk and LEO from Intel, or NVIDIA's CUDA. However, there is also a wide range of open environments available, with OpenMP, OpenACC, and OpenCL being the three main examples. These accelerators are constantly being developed into newer, higher performing generations. In this project, we focused on the Intel Xeon Phi co-processor; we present the architecture of the system in Chapter 3, where we also show future enhancements to be included in the new version of the accelerator, due to be released in the fall of 2015.

2.4 Related work

There is significant ongoing activity in researching the performance of co-processors and accelerators. The exploration of their performance is considered to influence the design of future HPC systems, which tend to make increasing use of accelerators. Consequently, there is a growing need to explore their performance. Research activity in the field of accelerators focuses both on establishing the most beneficial architectures for accelerators and on researching the most efficient methods of implementing solutions on them. Furthermore, there is ongoing research into applications that could benefit from being offloaded, and into localising parts of applications that are suitable candidates for offloading. Research into different aspects of accelerators has also featured in many dissertations at EPCC over the past years. [9] [10]

In this project, we attempt to port an LU factorization application to Xeon Phi and subsequently explore its performance using various parallelisation, vectorisation, and offloading methods. Similar research has been conducted in porting climate dynamics [11], astrophysics [12], and indeed LU factorization algorithms [13] [14] [15]. These works primarily aim to explore the performance of Xeon Phi by optimising the code for it, and they compare the performance on Xeon Phi to other accelerators in order to understand which accelerators offer the best performance for the respective applications.

There is plenty of research carried out into exploring kernels that could be offloaded to accelerators and GPUs. [16] [17] [18] These works focus on kernels that could be offloaded and ported to co-processors, and on exploring their performance. There exists a set of problems which are considered to be highly parallelisable and of significant importance to exploring the performance of co-processors. These problems, called the Berkeley dwarfs, are often used in benchmarking co-processors. [17] The Berkeley dwarfs consist of 13 problem categories considered to be of significant importance to the high performance computing community. These problems are highly scalable and can be executed with the use of computational kernels; consequently, they are often used to benchmark HPC solutions. LU factorization is one of the Berkeley dwarfs. [17]

GPUs are often used to implement highly parallel problems. Their ability to exploit large amounts of data-level parallelism is used in various applications that exhibit this parallelism, and the process of mapping applications to GPU hardware is extremely important in utilising it. One example of this is the work on targetDP [17], which highlights the benefits of abstracting hardware issues away from the programmer, so that programmers do not spend excessive time on developing porting patterns instead of solving the actual problem at hand. The emergence of various different GPUs and accelerators has made this problem even more prevalent, as different systems require different methods of porting. Porting a complex fluid lattice Boltzmann application, Ludwig, to GPUs demonstrated performance improvements from using GPUs in the computation process. The targetDP abstraction layer targets the data-parallel hardware of different platforms, enabling applications to take better advantage of accelerators by abstracting memory-, task-, thread-, and instruction-level parallelism. The abstraction simplifies the process of porting the code and provides good performance for the application. [19] These methods of abstracting parallelism emerge as a response to increasingly complex underlying GPU hardware.

Similar issues exist with porting code to Xeon Phi, where the placement of threads on the cores of Xeon Phi becomes crucial to exploiting performance.
A case study of porting the CP2K code to Xeon Phi by Iain Bethune and Fiona Reid explored some of these issues emerging in code ported to Xeon Phi. [20] [21] Porting the code to Xeon Phi proved to be relatively easy if the source is already parallelised. However, the work showed that efficient placement of threads on cores is important to ensure good performance. Finding enough parallelism to fill the threads sufficiently also proved to be a significant issue. Similarly, it confirmed that complex functions and calculations perform worse on Xeon Phi than on modern host CPU nodes. Overall, the ported CP2K without additional optimisations performed 4 times slower on Xeon Phi in comparison to 16 cores of an Intel Xeon E5-2670 Sandy Bridge host node; after further optimisations, CP2K achieved a similar level of performance to the host node.

Finally, there is ongoing work on comparing the various methods available for optimising and parallelising code on accelerators and GPUs. A comparison of various methods for porting code to GPUs on a Cray XK7 was performed by Berry, Schuchart and Henschel of Indiana University and Oak Ridge National Laboratory. [22] They analyse the benefits and issues of different methods when porting a molecular dynamics library to the NVIDIA Kepler K20 GPU, comparing CUDA, OpenACC, and OpenMP+MPI implementations. In the molecular dynamics application, the use of OpenACC on the GPU resulted in a speed-up factor of 13 times over 32 OpenMP threads run on an AMD Interlagos 6276 CPU.

The above examples of work undertaken in the field of accelerators and GPUs show that there is significant potential in utilising these devices. We can see performance improvements due to introducing them, and HPC systems can clearly benefit from their introduction. Furthermore, we notice that accelerators, including Xeon Phi, are not yet fully optimised and their optimal programming models are far from defined. Therefore, ongoing research in this area is important to better understand the nature of programming such devices and what can be achieved with them. Finally, not all applications benefit equally from the use of accelerators, and research is being carried out into determining the scope of potential beneficiaries of co-processor hardware.


Chapter 3 Intel MIC architecture

The architecture of Xeon Phi is described in this chapter, and we outline the co-processor's programming models. Furthermore, different versions of Xeon Phi are discussed, and future releases of the co-processor are mentioned; the planned developments of Xeon Phi are presented in order to relate the current results to future versions of the co-processor. Moreover, we summarise some of the tools used with Xeon Phi to aid with porting the solution to the hardware.

3.1 Architecture of Intel MICs

In Chapter 2, we discussed the general architectures of co-processors and accelerators and explained the ideas guiding their development. In this section, we focus in particular on the Intel MIC architecture. Intel MIC stands for Intel Many Integrated Core architecture. It is a new generation of many-core systems introduced by Intel to compete with accelerators and GPUs in the field of high performance computing. These systems combine many older, but fully functional, cores on a ring network to accelerate computations.

Intel Xeon Phi, an example of the Intel MIC architecture, consists of up to 61 cores. Each core is supplemented with a wide vector processing unit (VPU), whose main purpose is to enable vectorisation on a large scale. This carries over the idea from GPUs, where significant parallelism is introduced through SIMD instructions. Consequently, the VPUs in Intel MICs are designed to handle the data-level parallelism achieved through vectorisation. The vector processing units on Intel MIC are 512 bits wide, which corresponds to performing up to 16 single-precision or 8 double-precision operations simultaneously on a VPU. Each VPU is supplemented with 32 512-bit registers.

The large number of available VPUs and vector registers needs to be utilised in order to achieve efficiency on Intel Xeon Phi. Complete vectorisation is critical to achieving a high level of performance on the co-processor. Similarly, higher-level parallelisation is desired too, since the scalar cores of Xeon Phi do not implement a state-of-the-art architecture. In fact, the scalar core design in Xeon Phi is taken from the Pentium P54C architectural design. These cores do not run at modern clock frequencies and, as a result, cannot achieve the single-threaded speed of modern processor cores. It therefore becomes necessary to utilise parallelism across the cores to achieve performance on Xeon Phi; single-core, single-threaded execution on Xeon Phi would not be competitive in speed with modern CPUs.
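As a rough illustration of two vectorisation hints used later in this project (the ivdep pragma and Intel Cilk array notation), the sketch below shows a simple loop written both ways. The function names and array size are hypothetical, not taken from the project's code; on the 512-bit VPU, each vectorised iteration of the double-precision loop covers 8 elements.

#define N 1024

/* Illustrative loops only, not the project's actual code. */
void axpy_ivdep(double *y, const double *x, double a)
{
    #pragma ivdep          /* assert there is no loop-carried dependence */
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];
}

void axpy_cilk(double *y, const double *x, double a)
{
    /* Intel Cilk array notation: the compiler maps the whole section
     * operation onto the 512-bit VPU (8 doubles per vector). */
    y[0:N] += a * x[0:N];
}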

Each core of Intel MIC supports multithreading: up to four threads can run per core at any time. In the Intel MIC architecture it is impossible to issue instructions from the same thread back-to-back in the same functional unit. Therefore, to fully utilise the scalar units of Intel MIC, at least two threads are required per core; to increase the chances of the functional units being fully utilised, we should aim to run four threads per core.

Each core is supplemented with two levels of cache. The caches are fully coherent, and the level 2 cache of each core is connected in a ring interconnect with the L2 caches of the other cores of the Xeon Phi. The level 1 cache is divided into two 32-kilobyte parts, for instructions and data respectively, with 3-cycle access, while the level 2 cache is unified and stores 512 kilobytes of data per core. The level 2 caches are joined by a bi-directional ring interconnect to form a large common level 2 cache accessible by all the cores, with latencies from 11 cycles upwards. Intel MIC communicates with the host node over a PCI Express (PCIe) bus, and the most common programming model involves work being offloaded to Xeon Phi from the host node, with data transferred to Xeon Phi and back over the PCIe bus. The memory available to Xeon Phi is up to 16 GB of GDDR5, distributed over 16 memory channels.

Figure 3-1: Simple outline of a single Intel MIC core

In Figure 3-1 [23], we can see an outline of a single core of Intel MIC, showing the scalar unit and the vector unit present in each core. The figure also shows how the caches are structured and connected to the ring interconnect, which runs across all the cores.

Figure 3-2: More detailed outline of Intel MIC core

In Figure 3-2 [24], we can see a more detailed outline of a single Intel MIC core. There are 4 threads that can issue instructions, and we can see the 512-bit VPU, the instruction and data caches, as well as the dual issue of certain scalar instructions.

The architecture of Intel MIC is innovative in the way that it aims to reduce the size of the computational segment of the die, making space for memory and so bringing memory closer to the computation. We can notice in Figure 3-2 that the computational logic specific to the x86 architecture takes up less than 2% of the total area of the die. This is an upcoming trend in the field of high performance computing, and in computing in general: as advancements in computer architecture have progressed, the ALUs (arithmetic and logic units) often occupy less than 15% of the chip. [25] Most of the area is devoted to addressing efficient memory access. By reducing the significance of each computational unit to the overall system, and distributing the units across a larger area, we gain the possibility of introducing more of the very fast access memories such as L1 and L2 caches. These, in turn, help to speed up highly parallelised applications. This follows, and extends to the field of computer architecture, the well-established principle of distributed computing, responsible among others for the MapReduce framework: it is cheaper to bring computation to data than data to computation.
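The following minimal sketch (an assumption-laden example rather than code from this project) queries, inside an offloaded region, how many OpenMP threads the co-processor exposes; on a 60-core Xeon Phi with four hardware threads per core one would expect up to 240.

#include <stdio.h>
#include <omp.h>

/* Minimal sketch (not project code): count OpenMP threads available in an
 * offloaded region.  The scalar nthreads is copied back to the host by the
 * Intel offload runtime's default data handling. */
int main(void)
{
    int nthreads = 0;

    #pragma offload target(mic:0)
    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();
    }

    printf("OpenMP threads on the co-processor: %d\n", nthreads);
    return 0;
}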

3.2 Xeon Phi in EPCC and Hartree

Throughout the project we used Intel Xeon Phi nodes at EPCC, on a system called hydra, as well as at the Hartree Centre. Hydra has two Intel MIC devices available, both connected to one host node with 2 sockets of 8-core Intel Xeon processors. Hartree has 42 Intel Xeon Phis available, each connected to its own host; Hartree host nodes consist of 2 sockets of 12-core Intel Xeon processors each.

Both the Hartree and EPCC systems use the same version of the Intel MIC architecture: the 60-core Intel Xeon Phi 5110P, belonging to the 5100 series. This version of Xeon Phi has a base operating frequency of 1.053 GHz and can service up to 8 GB of memory over 16 channels with a bandwidth of 320 GB/s. The Intel Xeon Phi 5110P is manufactured in 22 nm technology, and the total amount of L2 cache available to the co-processor is 30 MB. Table 3-1 shows details of the different series of Intel Xeon Phi currently available on the market, including the 5100 series, which is available at EPCC and Hartree and was used in this project. Although the versions of Intel Xeon Phi at EPCC and Hartree are the same, their toolchains varied slightly, with different compiler versions installed on the two systems. EPCC has the Intel C compiler version 15.0.2, which was used to compile the code for Xeon Phi on hydra; Hartree's version of the compiler is 15.0.1, and it was used to compile the programs on Hartree.

Table 3-1: Overview of versions of Intel Xeon Phi available

Xeon Phi version | No. of cores | Base frequency | Maximum memory size | Maximum memory bandwidth | TDP
3100 series      | 57           | 1.1 GHz        | 6 GB                | 240 GB/s                 | 300 W
5100 series      | 60           | 1.053 GHz      | 8 GB                | 320-352 GB/s             | 225-245 W
7100 series      | 61           | 1.238 GHz      | 16 GB               | 352 GB/s                 | 270-300 W

TDP (thermal design power) is the average power, in watts, that the processor dissipates when operating at its base frequency with all cores active under an Intel-defined, high-complexity workload.

In Figure 3-3, we can see the Phase-2 Wonder iDataPlex to which the Xeon Phi nodes are connected at the Hartree Centre in Daresbury Laboratory.

Figure 3-3: Hartree’s Intel Xeon Phi racks

3.3 Xeon Phi programming tools

There exists a large array of tools aiming to help with programming specifically for many-core systems such as Intel MIC. These tools help to understand how many cores are utilised, and which parts of the program can undergo further optimisation in order to maximise vectorisation or parallelisation. Since efficient parallelisation and vectorisation are key requirements of efficient performance on Intel Xeon Phi, these tools are especially beneficial. Tools aimed at the Intel MIC architecture used throughout this project include:

- Intel VTune Amplifier – helps with parallelisation and utilisation of cores
- Intel Vectorization Adviser – helps with vectorisation and improving the efficiency of vectorisation
- Allinea DDT – a debugger for multi-core and multi-threaded systems


3.4 Intel Xeon – host node

Since we compare the results to the performance of Intel Xeon processors, in this section we present the details of the Intel Xeon processor used in this project as the host node for Xeon Phi. The host node we used for benchmarking and comparison to Intel Xeon Phi is the host node of hydra. This is an Intel Xeon system with 2 sockets, each consisting of 8 cores with one thread per core, giving a 16-thread system; therefore, the maximum number of threads we used on the host node was 16. The Intel Xeon on hydra has 32 KB each of level 1 data and instruction cache, 256 KB of level 2 cache, and 20480 KB of level 3 cache. It is the Intel Xeon E5-2650 model, and its cores run at a frequency of 2 GHz, compared to Intel Xeon Phi's frequency of 1.053 GHz.

The above shows the different characteristics of the Intel Xeon Phi and Intel Xeon processors. We note the large difference in operating frequency, the number of threads per core available on each architecture, and the total number of cores. These specifications outline the basic differences between multi-core processors, represented here by the Intel Xeon host node of hydra, and many-core processors, represented by Intel Xeon Phi.

3.5 Knights Landing – the future of Xeon Phi

The version of Intel Xeon Phi on which this project was carried out is codenamed Knights Corner (KNC). Towards the end of 2015, a new version of the Intel MIC architecture, and consequently a new version of Intel Xeon Phi, is going to be released, codenamed Knights Landing (KNL). [26] The new version of Xeon Phi will introduce significant developments, which will allow the many-core concept to be extended to a larger set of problems. This shows that the MIC architecture is being adopted as the way to proceed with increasing the performance of processors and computing systems in general.

KNL, in contrast to KNC, will be self-bootable and compatible with Intel x86 executables. This means that the new version of Xeon Phi will be able to execute the same machine code as Intel Xeon processors, and potentially both can use the same compilers. This development comes, among other things, from the fact that improving performance when optimising for Xeon Phi has been shown to improve performance on the host node as well. This is something we experienced throughout the course of this project too: optimising code for execution on Xeon Phi often resulted in faster execution on Intel Xeon as well. Furthermore, the ability of KNL to self-boot means that it will be able to run without being connected to a host, so the costs associated with offloading could potentially be diminished. Also, systems containing only Xeon Phi processors will become possible once the need for a host disappears.


Moreover, KNL will have up to 72 cores, up from 61, the maximum for KNC. It will introduce the ability to schedule instructions from the same thread back-to-back, so that the utilisation of multithreading improves. Each core will have two vector processing units instead of one. Furthermore, there is a planned change in the interconnect from the current ring to a 2D mesh. Also, KNL will contain high-performance, high-speed memory on the chip, which furthers the principle of moving data to computation rather than computation to data. These are the most significant changes implemented in the new version of the Intel MIC architecture.

In principle, the above changes mean that KNL will not be a co-processor, but a standalone many-core processor that can efficiently run parallelised code on its own. It will still have much slower single-threaded performance than regular CPUs, but it will be possible for Xeon Phi to run as the main processor. This signifies a large step in the development of many-core systems and will have a noteworthy impact on the HPC community. The design of systems, and the guiding principles for programming them, will need to shift in order to utilise many-core systems efficiently, and the need for parallelism will be emphasised even further. Similarly, we might begin to see commercial and home-user processors shift towards architectures more closely resembling many-core systems.


Chapter 4 LU factorization – current implementation

We explain what lower upper (LU) factorization is, what its main applications are, and its significance. Furthermore, in this chapter, we describe the Gaussian elimination method, the algorithm used throughout this project to perform LU factorization. Finally, we explain the data structure used to represent the matrices in memory throughout the computation. LU factorization is interchangeably called LU decomposition in the literature and throughout this report.

4.1 What is LU factorization?

LU factorization is a method of decomposing a single matrix into two matrices. In doing so, we express the original matrix as a product of two triangular matrices. One of these matrices is an upper triangular matrix, while the other is a lower triangular matrix. This means that one of the matrices has only "zero" entries below the main diagonal, while the other has such entries only above the main diagonal. Below is the mathematical representation and an example of LU factorization for a 3x3 matrix:

$$A = LU$$

$$
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
=
\begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix}
\begin{bmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{bmatrix}
$$

In the above equation, we can also see the general form of the lower triangular and upper triangular matrices. In order to ensure that the matrix can be factorized, it might be necessary to reorder the rows of the matrix. In such a case, the decomposition is called LU factorization with partial pivoting. We rearrange the rows of the matrix in such a way that the elimination can proceed, which means that the LU factorization in fact has the following mathematical representation:

\[ PA = LU \]

where P is the pivoting (permutation) matrix, which ensures that the ordering of rows is such that the LU factorization will materialize. In this form, an LU decomposition exists for any square matrix.
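As a small concrete illustration (an assumed example, not taken from the original text), a 2x2 matrix can be factorised in the Doolittle convention used later in this chapter, in which L has a unit diagonal:

\[
\begin{bmatrix} 4 & 3 \\ 6 & 3 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ 1.5 & 1 \end{bmatrix}
\begin{bmatrix} 4 & 3 \\ 0 & -1.5 \end{bmatrix}
\]

Multiplying the two factors back together reproduces the original matrix, which is a quick way to check any LU decomposition by hand.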


4.2 Applications of LU factorization

Lower-upper decomposition is widely used in algorithms and mathematics. It is used in various optimisation problems because it is an efficient method of solving linear equations. Having derived the LU factorization, it is easy to subsequently solve a system of linear equations based on that decomposition. Similarly, LU factorization is often used to derive the inverse of a matrix quickly, especially for larger matrices. The computation of a matrix determinant can also be sped up with the use of LU factorization. For larger matrices, these operations gain significant performance improvements from reusing the LU factors. Consequently, there exist many algorithms that specialise in performing LU decomposition on particular kinds of matrices, or on matrices derived from particular problems. There are plenty of industrial applications in which the computation of LU factorization is used; a small subset of these is presented in this section. LU decomposition is used in EDA10 to aid with the placement and routing of components on circuit boards. Since the components and the distances between them are often modelled with the use of matrices and linear algebra, using LU factorization is often beneficial. Moreover, LU decomposition is used to simulate such circuits in order to predict their behaviour once the chips are produced. The manufacturing process of chips is very costly, and computer simulations help to minimise these costs through optimisation of the testing and design stages. LU factorization forms a great part of this process. Furthermore, it is used in climate simulations, machine learning, and virtually any application that involves linear algebra in its computation. For example, LU factorization is used in singular value decomposition calculations, which in turn are widely used in signal and data processing. As a result, the computation of LU decomposition is widely used in many fields and applicable to many real-world problems.
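To make the linear-solve use case concrete, the standard two-stage procedure (a textbook scheme, not specific to this project) is as follows: once PA = LU is known, a system Ax = b is solved with one forward and one backward triangular solve,

\[ Ly = Pb \quad \text{(forward substitution)}, \qquad Ux = y \quad \text{(back substitution)} \]

so the expensive factorization is computed once and can be reused for many right-hand sides b.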

4.3 Initial algorithm

In this section, we describe the initial structure and outline of the algorithm. In the next section, we describe the data structures used in the algorithm to represent matrices. The initially implemented algorithm was a simple version of the LU factorization algorithm, using the Gaussian elimination method to factorize matrices. Gaussian elimination uses basic matrix transformations in order to remove non-zero entries from the relevant part of the matrix. Elementary row transformations are performed until the matrix is transformed into an upper triangular matrix. At the same time, the algorithm creates a unit lower triangular matrix11.

10 EDA – electronic design automation

This method of LU factorization is called the Doolittle method. [27] The obtained triangular matrices are the results of the LU factorization. [28] The operations required to perform Gaussian elimination consist of multiple nested loops and use an additional function for matrix multiplication in order to pivot the matrix. An "if" statement ensures numerical stability by stopping the algorithm from continuing the calculation on critically small pivot values. The general core idea behind the algorithm has not changed throughout the implementation. We mainly introduced optimisations, refactoring, and pragmas in order to observe how the performance changes as a result of these modifications combined with offloading to Xeon Phi. These changes constitute the exploration space of Xeon Phi performance in the context of this project.
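As an illustrative sketch only (not the project's actual source code), the core of a Doolittle-style Gaussian elimination without pivoting can be written as a triple-nested loop with a guard against critically small pivots:

#include <math.h>

/* In-place Doolittle LU factorization by Gaussian elimination (no pivoting).
 * After the call, the strictly lower part of a holds L (unit diagonal implied)
 * and the upper part, including the diagonal, holds U.
 * Returns 0 on success, -1 if a pivot is critically small. */
int lu_factorize(double **a, int n)
{
    for (int k = 0; k < n; k++) {
        if (fabs(a[k][k]) < 1e-12) {          /* numerical stability check */
            return -1;
        }
        for (int i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];               /* multiplier, stored as L(i,k) */
            for (int j = k + 1; j < n; j++) {
                a[i][j] -= a[i][k] * a[k][j]; /* eliminate entry below the pivot */
            }
        }
    }
    return 0;
}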

4.4 Matrix data structure

This section explains the data structure used to store matrices in the application implementing LU factorization. Matrices were implemented as two-dimensional arrays. In order to ensure spatial locality for efficient cache accesses, and to improve the ability to vectorise, the arrays were constructed in two layers. Firstly, a big one-dimensional array was created, which contained the whole matrix. Subsequently, an array of pointers to the first element of each row within the big one-dimensional array was created. This is shown in Figure 4-1. We can see in the figure that the whole matrix is allocated with one malloc call and assigned to a[0]. Subsequently, a one-dimensional array of n pointers is filled with addresses into a[0] at intervals corresponding to the number of columns. Therefore, we end up with an array of pointers to the relevant parts of the single large contiguous array that contains the actual matrix. Orange arrows in Figure 4-1 represent the malloc calls for each variable, while blue arrows show where each pointer points.

11 Unit lower triangular matrix – a lower triangular matrix with “1”s on the diagonal

[Figure 4-1 (diagram): the row-pointer array a holds a[0], a[0]+n, a[0]+2n, ..., each entry pointing to the start of a row (a[0][0..n-1], a[1][0..n-1], a[2][0..n-1], ...) within the single contiguous block allocated at a[0].]
Figure 4-1: Matrix data structure implementation

The data structure presented in Figure 4-1 results in the matrix being stored as one long contiguous row in memory. Therefore, spatial locality is emphasised and preserved. Consequently, vectorisation is easy to detect and implement.
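A minimal sketch of this two-layer allocation scheme follows; the function names are illustrative and the project's own allocation code may differ in detail:

#include <stdlib.h>

/* Allocate an n x n matrix as one contiguous block plus a row-pointer array,
 * matching the layout sketched in Figure 4-1. Returns NULL on failure. */
double **alloc_matrix(int n)
{
    double **a = malloc(n * sizeof(double *));
    if (a == NULL) return NULL;

    a[0] = malloc((size_t)n * n * sizeof(double));   /* single contiguous block */
    if (a[0] == NULL) { free(a); return NULL; }

    for (int i = 1; i < n; i++) {
        a[i] = a[0] + (size_t)i * n;   /* row i starts n elements after row i-1 */
    }
    return a;
}

void free_matrix(double **a)
{
    if (a != NULL) {
        free(a[0]);   /* frees the contiguous data block */
        free(a);      /* frees the row-pointer array */
    }
}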


Chapter 5 Optimisation and parallelisation methods

In this chapter we describe the techniques attempted in order to optimise the solution for the Intel Xeon Phi processor. These include techniques used to improve the speed of execution of the program in general, as well as specific features of the programming languages that improved the performance. Additionally, several methods of parallelising the code were attempted, including pragmas and auto-vectorisation performed by the compiler. We discuss in this chapter the methods we attempted and analysed throughout the project in order to improve the execution of the program. The variations between the different methods enable us to explore the performance of Intel Xeon Phi. The methods attempted throughout this project include OpenMP 4.0 pragmas, Intel Cilk array notation, and the ivdep vectorisation pragma. Furthermore, we discuss the compiler options and associated optimisations that were attempted in the process of improving the performance of executing the code on Xeon Phi. In the next chapter, we show how the techniques described in this chapter are applied to the LU factorization code in order to explore the performance of Xeon Phi.

5.1 Intel “ivdep” pragma and compiler auto-vectorisation

The Intel ivdep pragma is a non-binding pragma which is used in the Intel compiler to aid the process of auto-vectorisation. It prevents the compiler from treating assumed dependencies as proven dependencies. [29] Thus, it helps the compiler to vectorise the code. Assumed dependencies are based on variables that are independent of the loop index. Because such variables do not depend on the loop index, there could be dependencies on them between loop iterations, and the compiler conservatively assumes that these dependencies exist. However, the ivdep pragma can be used to specify that such “assumed” dependencies are not in fact dependencies at all, and that the loop can be safely vectorised by the compiler. This aids the process of auto-vectorisation performed by the compiler. The compiler is more inclined to vectorise loops that include the ivdep pragma; however, there are still other kinds of dependencies and obstacles to vectorisation which the ivdep pragma cannot overcome. The ivdep pragma is available in Intel compilers, and gcc also provides its own implementation of the pragma.
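A short illustrative example (assumed code, not taken from the project) of the kind of loop where the pragma helps:

/* Without the pragma, the compiler may conservatively assume that dst and src
 * overlap, or that the offset k creates a loop-carried dependence, and refuse
 * to vectorise. The ivdep pragma tells it to ignore such assumed dependencies. */
void scaled_copy(double *dst, const double *src, int n, int k, double alpha)
{
#pragma ivdep
    for (int i = 0; i < n; i++) {
        dst[i + k] = alpha * src[i];
    }
}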

5.2 OpenMP 4.0

OpenMP [30] is a standard used across industry and academia to aid parallelising code with the use of pragmas. It is an API for shared memory programming using the C/C++ or FORTRAN programming languages. It provides a set of pragmas, directives and library routines which are used to specify parallelism in applications. The standard OpenMP pragmas include parallelisation directives, which distribute tasks among various threads in multithreaded systems. These can contribute to distributing various tasks to different processors, but equally they allow individual for loops to be distributed among multiple threads. This is done with the use of the parallel for pragma presented below:

#pragma omp parallel for

Additionally, OpenMP provides a set of schedulers for the distribution of loop iterations between threads. The schedules allow the workload to be distributed adequately depending on the application and the workload of each particular loop. As a result, the scheduling options help to balance the workload between the various threads of the system. This is an important feature of OpenMP, which we exploit to optimise the performance of the code on Xeon Phi. The different scheduling options available in OpenMP and their descriptions are presented in Table 5-1.

Table 5-1: Outline of scheduling options available in OpenMP 4.0 [31]

static: Divide the loop into equal-sized chunks, or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop_count/number_of_threads.

dynamic: Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1.

guided: Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop_count/number_of_threads.

auto: The decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to the threads in the team.

However, in the context of many-core systems such as Xeon Phi, it is important to utilise the vector processing units available on the chip. Therefore, OpenMP 4.0 provides a simd pragma which aids the compiler in vectorising the code marked with the pragma.


The simd pragma informs the compiler that the code enclosed by the pragma can be safely vectorised and does not contain any dependencies. OpenMP is an efficient method of parallelising code in a shared memory context and across many threads. It also includes mechanisms for aiding with vectorisation. The usage of OpenMP is relatively simple and it is available across multiple platforms. Moreover, it is not bound to a specific implementation or compiler; most compilers include their own implementation of OpenMP. Furthermore, using OpenMP on the Intel MIC architecture allows 4 threads per core to be populated, which helps Xeon Phi to run at its peak efficiency.
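An illustrative sketch of how these pragmas combine (assumed code, not the project's source): the outer loop over rows is shared between threads with a scheduling clause, while the inner loop is marked for vectorisation.

/* Hypothetical elimination update for pivot column k: rows are distributed
 * across threads, and the per-row column update is vectorised with simd. */
void eliminate_below_pivot(double **a, int n, int k)
{
    #pragma omp parallel for schedule(static)
    for (int i = k + 1; i < n; i++) {
        const double m = a[i][k];
        #pragma omp simd
        for (int j = k + 1; j < n; j++) {
            a[i][j] -= m * a[k][j];
        }
    }
}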

5.3 Intel Cilk array notation

Intel Cilk array notation [32] is Intel’s proprietary extension to the C language, which allows for better expression of SIMD parallelism. It involves extensions to array notation which allow a programmer to state vectorisation explicitly while writing the code. In Table 5-2, we present how a for loop could be replaced with Intel Cilk array notation.

Table 5-2: Intel Cilk array notation example

for loop | Intel Cilk array notation equivalent

for(i=0; i
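As an assumed illustration of the equivalence described above (the names are purely illustrative, and the array-notation form requires a compiler with Cilk Plus support):

/* Plain C for loop: element-wise vector addition. */
void add_scalar(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Intel Cilk array notation equivalent: the [start:length] section syntax
 * expresses the same element-wise operation in a single statement and makes
 * the intended vectorisation explicit to the compiler. */
void add_array_notation(double *c, const double *a, const double *b, int n)
{
    c[0:n] = a[0:n] + b[0:n];
}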
