Experiences with Intel Xeon Phi in the Max-Planck Society

Markus Rampp (RZG) & Hendryk Bockelmann (DKRZ)

ENES Workshop on “Exascale Technologies & Innovation in HPC for Climate Models”

Outline

●  coprocessors/accelerators in HPC: technology basics, motivation and context
●  some practical experiences with Xeon Phi (MIC) and remarks on GPU
●  conclusions

Acknowledgements: A. Duran, M. Klemm, G. Zitzlsberger (Intel), A. Koehler, P. Messmer (NVidia), F. Merz, F. Thomas (IBM), T. Dannert, A. Marek, K. Reuter, S. Heinzel (RZG), T. Feher, M. Haefele, R. Hatzky (HLST/IPP)

Introduction

High-performance computing facilities in the Max-Planck Society (MPG):

●  DKRZ (Hamburg): earth sciences
●  RZG (Garching near Munich): materials and bio sciences, astrophysics, plasma physics, ...

→ operation of HPC system(s) + HPC application groups + data services + ...

Main characteristics of applications

MPG develops and operates highly scalable codes (10k ... 100k cores):

●  GENE (IPP), FHI-aims (FHI), VERTEX (MPA), GADGET (MPA, HITS), …
●  ICON, MPIOM, ECHAM (MPI-M), FESOM, COSMO, EMAC, METRAS (AWI, HZG, Uni-HH)
●  Fortran, C, C++; MPI, hybrid MPI/OpenMP
●  all major HPC platforms (x86, IBM Power, BlueGene, CRAY, NEC SX, ...)
●  significant external resources (e.g. EU Tier-0, US, …)
●  decade(s) of research and development, typically O(10k … 100k) lines of code

RZG and DKRZ actively co-develop and optimize applications together with MPG scientists
→ we have to support these “legacy” applications and prepare them for upcoming HPC architectures towards exascale: GPU, manycore, …

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Technology basics

Basic principle (today's GPUs, MICs):

●  accelerator cards for standard cluster nodes: PCIe (or on-board HyperTransport/QPI)
●  many (~50...500) “lightweight” cores (~ 1 GHz)
●  high thread concurrency, fast (local) memories

System architecture:

●  currently: x86 “Linux clusters” with nodes comprising 2 x CPUs (2 x 6-12 cores) + 2 x MIC/GPU
   => speedup := T(2n CPU) / T(2n CPU + 2n GPU)
●  future: smaller CPU component (extreme: “host-less”, true many-core chips: Knights Landing), OpenPower architecture, …

[Node schematic: accelerator card with ~1 TFlop/s, 8 GB RAM, 150 GB/s memory bandwidth, attached via PCIe (~6 GB/s) to host CPUs with ~0.25 TFlop/s, 32 GB RAM, 40 GB/s each; ~30 GB/s host interconnect]

Programming paradigms:

A) offload (execute parts of the code on the coprocessor)
B) native MIC (execute the code only on the coprocessor)
C) symmetric cluster (distribute the code across host and coprocessor)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Technology basics

Overview: Xeon Phi & GPU vs. CPU

                            | Intel MIC (Knights Corner)    | NVidia GPU (Kepler)          | Intel Xeon (Ivy Bridge)
                            | → Nov 2012                    | → 2012/2013                  | → Q3/2013
 model                      | Xeon Phi 5110P                | K20x                         | Xeon E5-2680v2
 processors                 | 60 Pentium x86_64 cores       | 14 GK110 streaming           | 10 Xeon x86_64 cores
                            | (1 GHz)                       | multiprocessors (0.7 GHz)    | (2.8 GHz)
 per-processor concurrency  | 4 hyperthreads x 8-wide       | 192 CUDA cores (SIMT)        | 2x (add, mult) 4-wide
                            | (512 bit) SIMD units          |                              | (256 bit) SIMD units
                            |                               |                              | [x2 hyperthreads]
 total nominal concurrency  | 1920 = 60x4x8                 | 2688 = 14x192                | 80 = 10x4 [x2]
 performance (DP)           | ~ 1 TFlop/s                   | ~ 1 TFlop/s                  | ~ 0.2 TFlop/s
 memory                     | 8 GB                          | 6 GB                         | 64 GB
 data transfer w/ host CPU  | PCIe Gen2 (8 GB/s)            | PCIe Gen2 (8 GB/s)           | ---
 programming model /        | OpenMP + SIMD vectorization,  | CUDA, (OpenCL), OpenACC,     | OpenMP + SIMD vectorization,
 software stack             | OpenCL, Intel compilers,      | NVidia libraries, tools      | OpenCL, ...
                            | libraries, tools +            |                              |
                            | proprietary offload directives|                              |

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Programming Paradigms

A) Offload (like GPU):

●  use the CPU for program control, communication and maximum single-thread performance
●  “offload” data-parallel parts to the accelerator for maximum throughput performance
●  programming approach:
   ●  MPI across nodes (CPU processes), OpenMP & SIMD on the MIC
   ●  explicit (LEO) or automatic (MKL DGEMM) offload directives
●  load balancing, asynchronous computations, hiding/overlapping data transfer (PCIe bandwidth); a hedged sketch follows the directive example below
●  most prospective for tapping all node resources (RAM, compute, communication)
●  becoming obsolete in the future? → Knights Landing
●  batch-system friendly

  !DIR$ OFFLOAD target(mic:MIC_DEVICE) in(etanp:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(omcck:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(qcck:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(cqq0:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(cqq1:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(cqqc:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(tabe:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(kijminc:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(nnmax, kmaxc, kRange, ijmaxc, nenukjt) &
  !DIR$   nocopy(pi0im_s1:alloc_if(.true.),free_if(.false.)) &
  !DIR$   nocopy(pi0re_s1:alloc_if(.true.),free_if(.false.))
  call charged(1, 1, 2, etanp, omcck, qcck, cqq0, cqq1, cqqc, &
               tabe, pi0im_s1, pi0re_s1, nnmax, kmaxc, kRange, &
               kijminc, ijmaxc, nenukjt)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann
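Addendum on the asynchronous-overlap item above: a minimal, hedged sketch (not from the talk) of how transfer/compute overlap can be expressed with Intel's LEO signal/wait clauses; the array size, the helper routines and the device number are illustrative assumptions only.

  subroutine offload_async(etanp, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: etanp(n)
    integer :: sig                           ! tag variable identifying the transfer
    !DIR$ ATTRIBUTES OFFLOAD : mic :: charged_kernel

    ! start copying the input to the coprocessor without blocking the host ...
    !DIR$ OFFLOAD_TRANSFER target(mic:0) in(etanp: alloc_if(.true.), free_if(.false.)) signal(sig)

    call other_host_work()                   ! hypothetical host work, overlapped with the PCIe transfer

    ! ... and launch the kernel only once the data has arrived on the MIC
    !DIR$ OFFLOAD target(mic:0) wait(sig) nocopy(etanp)
    call charged_kernel(etanp, n)            ! hypothetical offloaded routine
  end subroutine offload_async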

Programming Paradigms

B) Native MIC:

●  use the MIC as a standalone processor for the complete program
●  memory limits (~ 8 GB)
●  programming approach: MPI, OpenMP, (Cilk, TBB, OpenCL, ...); a minimal sketch follows below
●  very useful for prototyping, algorithmic evaluation, ... on a single coprocessor
●  preview to KNL / future exascale systems (reminiscent of BlueGene)
●  effective at scale today? → the CPUs essentially act as a communication bottleneck
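To make the “MPI, OpenMP” item concrete: a minimal sketch (our illustration, not from the talk) of a hybrid program that runs natively on the coprocessor when cross-compiled for it, e.g. with Intel's -mmic flag; nothing MIC-specific appears in the source.

  program native_hello
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, rank, nthreads

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    nthreads = 0
    !$omp parallel
    !$omp master
    nthreads = omp_get_num_threads()   ! up to 4 x 60 = 240 HW threads on a 5110P
    !$omp end master
    !$omp end parallel

    print '(a,i0,a,i0,a)', 'rank ', rank, ' uses ', nthreads, ' OpenMP threads'
    call MPI_Finalize(ierr)
  end program native_hello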

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Programming Paradigms

C) Symmetric, heterogeneous cluster (CPU + MIC):

●  use CPUs and MICs as a heterogeneous cluster (distribute MPI ranks across both components)
●  memory limits, load balancing, PCIe (and QPI) bandwidth & latency
●  programming approach: MPI, OpenMP, …
●  hybrid MPI/OpenMP appears mandatory for load-balancing MPI tasks on CPU and MIC; a load-balancing sketch follows the reference below
●  useful for coarse-grained (“workflow”-like) parallelism?
●  effective at scale?
●  QPI bottleneck (remedy: 2 InfiniBand HCAs per node)

V. Karpusenko, A. Vladimirov: Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors (2014)
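One hedged illustration of the load-balancing point (our sketch, not from the talk): in a symmetric run, a rank can detect whether it executes on the host or on a coprocessor, e.g. via the processor name, and take a smaller work share on the MIC. The “-mic” hostname convention and the weights are assumptions.

  program symmetric_balance
    use mpi
    implicit none
    integer :: ierr, rank, nlen
    character(len=MPI_MAX_PROCESSOR_NAME) :: pname
    real(8) :: work_share

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Get_processor_name(pname, nlen, ierr)

    if (index(pname(1:nlen), '-mic') > 0) then
       work_share = 0.6d0     ! illustrative weight for a coprocessor rank
    else
       work_share = 1.0d0     ! host (CPU) rank
    end if
    print *, 'rank', rank, 'on', trim(pname(1:nlen)), 'takes work share', work_share
    ! ... distribute grid columns / particles proportionally to work_share ...

    call MPI_Finalize(ierr)
  end program symmetric_balance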

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Hardware performance characteristics

Compute and memory performance, “Knights Corner” vs. “Ivy Bridge” (8 byte/DP):

●  Intel Xeon E5-2680v2 (10 cores / 2.8 GHz):
   2.8 GHz x 4 lanes x 2 ops = 22.4 GFlop/s/core (1 thread per core); 40 GB/s (STREAM)
●  Intel Xeon Phi 5110P (60 cores / 1.053 GHz, ECC on):
   1.053 GHz x 8 lanes x 2 ops = 16.8 GFlop/s/core (4 threads per core); 160 GB/s (STREAM)

=> a balanced ~4.5x performance increase per node (the arithmetic is spelled out below)
=> ~5.3x weaker performance per HW thread: high scalability and SIMD vectorization required

ENES Workshop, HH, Mar 17-19, 2014
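One way to arrive at the quoted factors (our reading, assuming the 2 CPU + 2 MIC node from the system-architecture slide, so per-device ratios equal per-node ratios):

  \[
    \frac{60 \times 16.8~\mathrm{GFlop/s}}{10 \times 22.4~\mathrm{GFlop/s}} \approx 4.5 ,
    \qquad
    \frac{160~\mathrm{GB/s}}{40~\mathrm{GB/s}} = 4
    \quad \text{(hence ``balanced'')}
  \]
  \[
    \frac{22.4~\mathrm{GFlop/s\ per\ CPU\ core\ (1\ thread)}}{16.8/4~\mathrm{GFlop/s\ per\ MIC\ HW\ thread}} \approx 5.3
  \]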

M. Rampp & H. Bockelmann

Hardware performance characteristics

Basic scalability measurements

[Figures: OpenMP overhead (native); STREAM memory bandwidth (native)]

Extensive microbenchmarks: see arXiv:1310.5842

Results consistent with, e.g.: T. Cramer, D. Schmidl, M. Klemm, D. an Mey: OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison (2012)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Motivation (… why bother?)

1) Compute performance:
●  substantial nominal gains in compute performance (wrt. multi-core CPU): 5x...10x...100x?
●  2x...3x sustained speedups (GPU vs. multi-core CPU: nowadays called an “apples-to-apples” comparison in the GPU community!)
●  porting and achieving application performance requires hard work: porting an HPC application to Xeon Phi is a project (like GPU)

2) Energy efficiency:
●  substantial nominal energy-efficiency gains: 2x...3x (a must for exascale: 50x...100x required!)
●  cost effectiveness at fixed budget (based on TDP values and assuming scalable applications): sustained application speedups of 2x are reasonable

3) Existing resources and technology readiness:
●  significant GPU- and MIC-based resources around the world → competition aspects: grants, impact, ...

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Porting MPG applications to MIC & GPU

Context: assessment of accelerator technology for the MPG (RZG, starting 2012)

●  porting of HPC application codes developed in the MPG to GPU/MIC
●  assessment of existing GPU/MIC applications (e.g. MD: GROMACS, NAMD, …) relevant for the MPG

=> input for the configuration of the new HPC system of the MPG (“spend x% of the budget on MIC and/or GPU”)

General strategy and methodology

●  we target heterogeneous GPU/MIC-cluster applications, leveraging the entire resource
●  programming:
   ●  MIC: guided auto-vectorization and moderate code changes (loop interchange, …) only!
   ●  GPU: CUDA kernels (no choice so far)
●  performance comparison: we always compare with highly optimized (SIMD, multi-core) CPU code!
●  platforms:
   ●  Intel Xeon Phi (5110P, 7120P) vs. Sandy/Ivy Bridge (E5-2670, 8 cores @ 2.6 GHz; E5-2680v2, 10 cores @ 2.8 GHz)
   ●  NVidia Kepler (K20x) vs. Intel Sandy/Ivy Bridge (E5-2670, 8 cores @ 2.6 GHz; E5-2680v2, 10 cores @ 2.8 GHz)

ENES Workshop, HH, Mar 17-19, 2014

GPEC Poisson solver

GPEC (MPI f. Plasma Physics, K. Lackner)

●  a real-time solver (~ 1 ms) for the Grad-Shafranov equation (2-dim. MHD equilibrium in a tokamak): required for diagnostics (“shape” and position of the magnetic field) and steering: ASDEX, ITER
●  parallel algorithm based on the classical Fourier method for boundary value problems (e.g. Numerical Recipes), implemented on CPU (Rampp et al., Fusion Science & Technology, 2012)
●  GPU: abandoned (Zhukov, Master Thesis, TUM, 2010)
●  MIC: experimental, no specific optimization efforts; native mode (1 hour for “porting”)
●  motivation: combined “microbenchmarks”:
   ●  M x 1D-DFT (size N); a DFTI setup sketch follows below
   ●  N x tridiagonal solve (size M): vectorized & parallelized over N (inspired by R. Fischer, 1998)
   ●  matrix-vector multiplication, size 4(N+M)²
●  (M,N) = (32,64) … (256,512): “weak scaling” CPU → MIC

Timings (number of threads in parentheses):

            CPU (1)    MIC (1)    CPU (10)   MIC (180)
  DFT        1.7        28.6       0.2        0.6
  DGEMV      0.97       4.89       0.2        0.14
  TRID       0.68       11.22      0.15       0.48
  total      3.58       47.0       0.59       1.55

ENES Workshop, HH, Mar 17-19, 2014
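For the “M x 1D-DFT (size N)” microbenchmark, a hedged sketch of the kind of batched real-to-complex setup one would use via MKL's DFTI interface (our reconstruction; routine and variable names and the omitted error handling are illustrative, and the same source runs natively on the MIC):

  subroutine batched_r2c(x, y, n, m)
    use MKL_DFTI
    implicit none
    integer, intent(in)       :: n, m
    real(8), intent(inout)    :: x(n*m)          ! m real input signals of length n
    complex(8), intent(inout) :: y((n/2+1)*m)    ! m half-spectra
    type(DFTI_DESCRIPTOR), pointer :: h
    integer :: st

    st = DftiCreateDescriptor(h, DFTI_DOUBLE, DFTI_REAL, 1, n)
    st = DftiSetValue(h, DFTI_NUMBER_OF_TRANSFORMS, m)    ! batch of m transforms
    st = DftiSetValue(h, DFTI_INPUT_DISTANCE,  n)         ! stride between input signals
    st = DftiSetValue(h, DFTI_OUTPUT_DISTANCE, n/2+1)     ! stride between output spectra
    st = DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    st = DftiSetValue(h, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    st = DftiCommitDescriptor(h)
    st = DftiComputeForward(h, x, y)                      ! all m FFTs in one call
    st = DftiFreeDescriptor(h)
  end subroutine batched_r2c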

MNDO (QM/MM)

MNDO (MPI f. Coal Research, W. Thiel)

●  semiempirical quantum chemistry calculations
●  highly optimized for multicore CPUs; shared-memory FORTRAN code
●  hotspots: eigenvalue solver with matrix sizes O(1k...10k), DGEMM, Jacobi rotations
●  GPU experience:
   ●  successfully ported to single CPU-GPU (Wu, Koslowski & Thiel, JCTC, 2012)
   ●  ported to 2 GPUs (X. Wu, K. Reuter) employing MAGMA 1.4: 2-GPU SYEVD
   ●  ~ 2x speedup wrt. CPU (node)

MIC results

●  ported to a single MIC by K. Reuter (RZG) within a day
●  offload mode (Jacobi) + automatic offload (DGEMM); a sketch follows below
●  performance competitive with a single GPU (K20x): ~ 2x speedup wrt. CPU (socket)
   → required: multi-MIC eigenvalue solver (MAGMA-MIC, MKL?)
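A hedged sketch of the “automatic offload (DGEMM)” route: with MKL automatic offload enabled in the environment, an ordinary DGEMM call can be split between host and coprocessor without code changes; the matrix sizes and names here are illustrative, not taken from MNDO.

  ! Run with MKL automatic offload enabled, e.g.
  !   export MKL_MIC_ENABLE=1
  ! (optionally MKL_MIC_WORKDIVISION to control the host/MIC split).
  subroutine big_dgemm(a, b, c, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: a(n,n), b(n,n)
    real(8), intent(inout) :: c(n,n)

    ! plain MKL DGEMM call; for sufficiently large n, MKL may offload
    ! part of the work to the Xeon Phi automatically
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  end subroutine big_dgemm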

ENES Workshop, HH, Mar 17-19, 2014

VERTEX

VERTEX (H.-Th. Janka, Max-Planck-Inst. for Astrophysics)

●  radiation hydrodynamics code (Boltzmann equation) for first-principles Type-II supernova simulations (Rampp & Janka, A&A 2002; Buras et al., A&A 2006)
●  FORTRAN code with hybrid MPI/OpenMP parallelization
●  in production and under continuous development since 2001
●  ported to all major HPC architectures (CPU, Tier-0 class)
●  achieved scalability up to O(100k) cores: 0.25 PFlop/s sustained on 131000 cores of SuperMUC@LRZ, 10% FP efficiency (Marek et al., Proc. of ParCO 2013)
●  achieved 2x application speedup (strong scaling) on GPU (Dannert et al., Proc. of ParCO 2013, arXiv:1310.1485)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

VERTEX

Results on Xeon Phi (offload mode); main characteristics of the kernel (local physics):

●  numerical integration on a 5-dimensional grid (~ 150 x 10^6 grid points, reduction to 3 dims)
●  high degree of data parallelism and arithmetic intensity
●  CPU implementation: outer loops: OpenMP, inner loops: SIMD, + reductions (a schematic sketch follows below)
●  small amount of data transfer: ~ 150 DP Flop per transferred byte
●  SIMD vectorization and scalability of the CPU version improved
●  caveat: measurements used a stripped-down prototype of the kernel, not counting OpenMP overheads
   => not production-ready on Xeon Phi

ENES Workshop, HH, Mar 17-19, 2014
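A schematic sketch of the parallelization pattern just described (our illustration, not the actual VERTEX source): OpenMP across the outer energy bins, SIMD across the vectorized radius loop (OpenMP 4.0 notation), with the reduction over the j/k loops accumulated into per-radius partial sums; the placeholder kernel term is made up.

  subroutine integrate_rates(rate, kijmin, kijmax, n_ebins, n_radius)
    implicit none
    integer, intent(in)  :: n_ebins, n_radius
    integer, intent(in)  :: kijmin(n_ebins,n_ebins), kijmax(n_ebins,n_ebins)
    real(8), intent(out) :: rate(n_radius, n_ebins)
    real(8) :: tmp(n_radius)
    integer :: i, j, k, l

    !$omp parallel do private(i, j, k, l, tmp)
    do i = 1, n_ebins                        ! outer loop: OpenMP threads
       tmp(:) = 0.0d0                        ! partial sums over radius for energy bin i
       do j = 1, n_ebins
          do k = kijmin(i,j), kijmax(i,j)
             !$omp simd                      ! inner loop: SIMD vectorization
             do l = 1, n_radius
                tmp(l) = tmp(l) + dble(i+j+k+l)   ! placeholder for the physics kernel
             end do
          end do
       end do
       rate(:, i) = tmp(:)                   ! reduced result for energy bin i
    end do
    !$omp end parallel do
  end subroutine integrate_rates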

VERTEX

The GPU experience: achieved 2x sustained application speedup on GPU (Dannert et al., Proc. of ParCO 2013, arXiv:1310.1485)

●  fast CUDA-C kernels: 7x speedup of individual routines (vs. CPU; 54x vs. a single core)
●  not trivial to implement and optimize for GPU (e.g. reductions)
●  CUDA Fortran not competitive (PGI compiler on the CPU → work in progress with PGI)
   → CUDA C version encapsulated via the ISO-C bindings of Fortran 2003
   → Fortran wrappers for all major CUDA C functions (cudaMalloc, cudaFree, ...); a sketch of such a wrapper follows after the figure below
●  asynchronous CPU-GPU scheduling scheme (application-specific)

=> a major effort for porting and optimization:
●  A. Marek (RZG, member of the VERTEX development team): 6 months (FTE)

[Figure: application-specific asynchronous CPU-GPU scheduling: the rate KJT offloads of the 4 CPU threads to GPU 1 are staggered against the CPU-side computation of rates 1-4 (see also the backup slides)]

A speedup of 2x...3x (strong scaling!) is not easily achievable for VERTEX by other means (CPU optimization).
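A hedged sketch of what such an ISO-C-binding wrapper looks like for two of the CUDA runtime calls named above (our illustration of the technique, not the VERTEX source):

  module cuda_runtime_bindings
    use iso_c_binding
    implicit none
    interface
       ! cudaError_t cudaMalloc(void **devPtr, size_t size)
       integer(c_int) function cudaMalloc(devPtr, nbytes) bind(c, name='cudaMalloc')
         import :: c_ptr, c_size_t, c_int
         type(c_ptr)              :: devPtr    ! passed by reference (void **)
         integer(c_size_t), value :: nbytes
       end function cudaMalloc

       ! cudaError_t cudaFree(void *devPtr)
       integer(c_int) function cudaFree(devPtr) bind(c, name='cudaFree')
         import :: c_ptr, c_int
         type(c_ptr), value :: devPtr          ! passed by value (void *)
       end function cudaFree
    end interface
  end module cuda_runtime_bindings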

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Further projects

GENE, Eulerian gyrokinetics (IPP, RZG):

●  suspended due to poor FFT performance on MIC
●  O(1000) independent, short 1-d FFTs (r2c) from MKL
●  using MKL (DFTI interface), with internal parallelization over individual FFTs
●  PCIe limited (cf. Dannert et al., arXiv:1310.1485)

Einstein@home, GW data analysis methods (AEI):

●  successful on GPU (Allen et al., arXiv:1303.0028)
●  MIC: abandoned due to poor FFT performance

NSCOUETTE, pseudospectral Navier-Stokes solver (MPI-DS, U. Erlangen, IST, RZG):

●  hybrid code (Shi et al., arXiv:1311.2481), promising for MIC offload
●  hotspot (linear systems) has recently disappeared after thorough CPU optimization → FFTs dominate again

JOREK, MHD (IPP/HLST), by T. Feher (HLST/IPP):

●  work in progress, depends on sparse solvers from the PaStiX library
●  some promising first results (native mode) on vectorization

ENES Workshop, HH, Mar 17-19, 2014

Summary and Conclusions

MIC and GPU technology can no longer be ignored for HPC:

●  existing resources and applications, technology evolution (SIMD, threads, memory, ...)
●  current MIC/GPU clusters can in principle deliver 2x...4x speedups for HPC applications
●  in practice: GPU: yes (MPG and community codes); Xeon Phi: where are the success stories?
●  porting and optimization require considerable efforts:
   ●  GPU: > 6 months (HPC specialists with intimate knowledge of the code); sustainability? (→ OpenACC, OpenMP, ... ?)
   ●  MIC: initial porting is relatively straightforward, achieving good performance is not

Worth the effort?

●  computational scientists' (our) point of view: definitely yes
   ●  developed thorough expertise on the technology => consulting
   ●  rethinking of algorithms and implementations pays off
●  scientific application developers' point of view: … not immediately obvious
   ●  2x...3x speedups do not enable qualitatively new science objectives => reluctance to “sacrifice” human effort, code maintainability, …
   ●  regular CPUs still do a very good job and roadmaps promise further performance increases

=> … business as usual?

ENES Workshop, HH, Mar 17-19, 2014

Summary and Conclusions

New HPC system of the MPG: Hydra@RZG

●  4000 Ivy/Sandy Bridge nodes
●  676 GPUs (K20x), 24 Xeon Phi (5110P)

Employing Xeon Phi as …

●  a vehicle to improve SIMD (> 256 bit) and multi-threading (> 40 threads) in our codes
●  a vehicle to advance hybrid programming and emphasize high-level programming models (vs. GPU/CUDA)
   … but how to economically utilize hundreds of Xeon Phis in a large HPC cluster today?
●  a preview to Knights Landing (the next “BlueGene”?) and similar developments towards exascale

The full public information on Knights Landing is this (software.intel.com, Jan 2014):

●  Knights Landing is the code name for the 2nd-generation product in the Intel® Many Integrated Core Architecture
●  Knights Landing targets Intel’s 14 nanometer manufacturing process
●  Knights Landing will be productized as a processor (running the host OS) and a coprocessor (a PCIe end-point device)
●  Knights Landing will feature on-package, high-bandwidth memory
●  Flexible memory modes for the on-package memory include: flat, cache, and hybrid modes
●  Knights Landing will support Intel® Advanced Vector Extensions AVX-512, details of which are already published
●  Any information beyond that is rumour and speculation which we cannot confirm or deny.

Rumour and speculation: 72 “Airmont” (Atom) cores, 16 GB memory, 500 GB/s, 3 TFlop/s, GA: 2014/2015, ... (Wikipedia; see also http://www.realworldtech.com/knights-landing-details/)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

BACKUP

Motivation (… why bother?)

1) Compute performance

●  substantial nominal gains in compute performance: 5x … 10x (DP floating point)
   ●  time to solution matters, strong scaling within nodes appears prospective
   ●  weak scaling is often becoming less relevant/attractive
●  caveat: sustained performance on “real-world” scientific applications
   ●  from aggressive marketing (ECC off, single-core speedups, “free lunch” of porting) …
   ●  … towards more realistic attitudes (… thanks to the NVidia-Intel competition?)
   [from: Accelerating Computational Science Symposium, ORNL (2012)]
●  2x...3x sustained speedups (GPU vs. multi-core CPU: nowadays called an “apples-to-apples” comparison in the GPU community!)
●  porting and achieving decent application performance requires hard work: porting an application to Xeon Phi is a project (like GPU)
   [from: Intel Xeon Phi Product Family Performance, Rev 1.4, 12/30/13]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Motivation (… why bother?)

2) Energy efficiency

●  substantial nominal energy-efficiency gains (GFlops/Watt): 2x...3x (a must for exascale: 50x...100x required!)
●  caveat: sustained efficiency on “real-world” clusters

Cost effectiveness at fixed budget (based on TDP values and assuming scalable applications):

●  investment: exchange ratio r < 1: n nodes (2 CPU) → r x n nodes (2 CPU + 2 GPU)
   => s ≥ 1/r  (time to solution, Flops/€)
●  operation: n x (2 x 115 W) → r x n x (2 x 115 W + 2 x 235 W)
   => s ≥ r² x 3.04  (energy to solution, Flops/W)

=> sustained application speedups of 2x are reasonable (assuming r ~ 0.5); the constants are spelled out below

ENES Workshop, HH, Mar 17-19, 2014
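For reference, the numbers behind the constants (our reading; the power figures are the TDP values quoted above, 2 x 115 W for the CPUs and 2 x 235 W for the accelerators per node):

  \[
    \frac{2\times115\,\mathrm{W} + 2\times235\,\mathrm{W}}{2\times115\,\mathrm{W}}
    = \frac{700\,\mathrm{W}}{230\,\mathrm{W}} \approx 3.04 ,
    \qquad
    r \sim 0.5 \;\Rightarrow\; \frac{1}{r} = 2 .
  \]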

M. Rampp & H. Bockelmann

Motivation (… why bother?)

3) Existing resources and technology evolution

●  by many, the technology is considered inevitable for the future (→ exascale)
●  competition aspects:
   ●  computing-time grants
   ●  relevance/impact of codes (e.g. GENE, FHI-aims, GROMACS, ...)
●  caveat: the price to pay for application development?

[Figure: Highlights of the 42nd TOP500 List, SC13, Denver, CO]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

VERTEX

Rate KJT: structure of the code

●  ~ 2500 lines of code
●  executed 2 times (neutral and charged current)
●  dimensions: #Ebins ~ 17, k-loop ~ 22000, #radius bins ~ 300 ⇒ ~ 150e6 iterations, AI ~ 10

  do h = 1, #flavours
    do i = 1, #Ebins
      do j = 1, #Ebins
        do k = kijmin(i,j), kijmax(i,j)
          do l = 1, #radiusBins
            ...                       ! (10 times)
          enddo
        enddo
      enddo
      rate(l,i,h) = ...               ! reduction here: kills parallelism
    enddo
  enddo

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Scheduling (CPU)

[Figure: baseline CPU-only scheduling: each of the 4 CPU (OpenMP) threads computes rate 1, rate 2, rate 3, rate 4 and rate KJT itself]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Naive Scheduling (GPU)

[Figure: naive GPU scheduling: each of the 4 CPU threads computes rates 1-4 on the CPU and then offloads rate KJT to GPU 1]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Advanced Scheduling (GPU)

[Figure: advanced GPU scheduling: the rate KJT offloads of the 4 CPU threads to GPU 1 are staggered against the CPU-side computation of rates 1-4, keeping GPU 1 busy while the CPU threads compute other rates]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann
