Experiences with Intel Xeon Phi in the Max-Planck Society

Markus Rampp (RZG) & Hendryk Bockelmann (DKRZ)

ENES Workshop on “Exascale Technologies & Innovation in HPC for Climate Models”

Outline

●  coprocessors/accelerators in HPC: technology basics, motivation and context
●  some practical experiences with Xeon Phi (MIC) and remarks on GPU
●  conclusions

Acknowledgements: A. Duran, M. Klemm, G. Zitzlsberger (Intel), A. Koehler, P. Messmer (NVidia), F. Merz, F. Thomas (IBM), T. Dannert, A. Marek, K. Reuter, S. Heinzel (RZG), T. Feher, M. Haefele, R. Hatzky (HLST/IPP)

Introduction

High-performance computing facilities in the Max-Planck Society (MPG):

●  DKRZ (Hamburg): earth sciences
●  RZG (Garching near Munich): materials and bio sciences, astrophysics, plasma physics, ...

→ operation of HPC system(s) + HPC application groups + data services + ...

Main characteristics of applications

MPG develops and operates highly scalable codes (10k ... 100k cores):

●  GENE (IPP), FHI-aims (FHI), VERTEX (MPA), GADGET (MPA, HITS), …
●  ICON, MPIOM, ECHAM (MPI-M), FESOM, COSMO, EMAC, METRAS (AWI, HZG, Uni-HH)
●  Fortran, C, C++; MPI, hybrid MPI/OpenMP
●  all major HPC platforms (x86, IBM Power, BlueGene, CRAY, NEC SX, ...)
●  significant external resources (e.g. EU Tier-0, US, …)
●  decade(s) of research and development, typically O(10k … 100k) lines of code

RZG and DKRZ actively co-develop and optimize applications together with MPG scientists
→ we have to support these “legacy” applications and prepare them for upcoming HPC architectures towards exascale: GPU, manycore, …

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Technology basics

Basic principle (today's GPUs, MICs):

●  accelerator cards for standard cluster nodes: PCIe (or on-board HyperTransport/QPI)
●  many (~50...500) “lightweight” cores (~ 1 GHz)
●  high thread concurrency, fast (local) memories

System architecture:

●  currently: x86 “Linux clusters” with nodes comprising 2 x CPUs (2 x 6-12 cores) + 2 x MIC/GPU
   => speedup := T(2n CPU) / T(2n CPU + 2n GPU)
●  future: smaller CPU component (extreme: “host-less”, true many-core chips: Knights Landing), OpenPower architecture, …

[Node schematic: accelerator card with ~1 TFlop/s, 8 GB RAM, 150 GB/s memory bandwidth, attached via PCIe (~6 GB/s) to host CPUs with ~0.25 TFlop/s, 32 GB RAM, 40 GB/s each; ~30 GB/s host interconnect]

Programming paradigms:

A) offload (execute parts of the code on the coprocessor)
B) native MIC (execute the code only on the coprocessor)
C) symmetric cluster (distribute the code across host and coprocessor)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Technology basics

Overview: Xeon Phi & GPU vs. CPU

                            | Intel MIC (Knights Corner)    | NVidia GPU (Kepler)          | Intel Xeon (Ivy Bridge)
                            | → Nov 2012                    | → 2012/2013                  | → Q3/2013
 model                      | Xeon Phi 5110P                | K20x                         | Xeon E5-2680v2
 processors                 | 60 Pentium x86_64 cores       | 14 GK110 streaming           | 10 Xeon x86_64 cores
                            | (1 GHz)                       | multiprocessors (0.7 GHz)    | (2.8 GHz)
 per-processor concurrency  | 4 hyperthreads x 8-wide       | 192 CUDA cores (SIMT)        | 2x (add, mult) 4-wide
                            | (512 bit) SIMD units          |                              | (256 bit) SIMD units
                            |                               |                              | [x2 hyperthreads]
 total nominal concurrency  | 1920 = 60x4x8                 | 2688 = 14x192                | 80 = 10x4 [x2]
 performance (DP)           | ~ 1 TFlop/s                   | ~ 1 TFlop/s                  | ~ 0.2 TFlop/s
 memory                     | 8 GB                          | 6 GB                         | 64 GB
 data transfer w/ host CPU  | PCIe Gen2 (8 GB/s)            | PCIe Gen2 (8 GB/s)           | ---
 programming model /        | OpenMP + SIMD vectorization,  | CUDA, (OpenCL), OpenACC,     | OpenMP + SIMD vectorization,
 software stack             | OpenCL, Intel compilers,      | NVidia libraries, tools      | OpenCL, ...
                            | libraries, tools +            |                              |
                            | proprietary offload directives|                              |

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Programming Paradigms

A) Offload (like GPU):

●  use the CPU for program control, communication and maximum single-thread performance
●  “offload” data-parallel parts to the accelerator for maximum throughput performance
●  programming approach:
   ●  MPI across nodes (CPU processes), OpenMP & SIMD on the MIC
   ●  explicit (LEO) or automatic (MKL DGEMM) offload directives
●  load balancing, asynchronous computations, hiding/overlapping data transfer (PCIe bandwidth); a hedged sketch follows the directive example below
●  most prospective for tapping all node resources (RAM, compute, communication)
●  becoming obsolete in the future? → Knights Landing
●  batch-system friendly

  !DIR$ OFFLOAD target(mic:MIC_DEVICE) in(etanp:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(omcck:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(qcck:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(cqq0:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(cqq1:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(cqqc:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(tabe:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(kijminc:alloc_if(.true.),free_if(.false.)) &
  !DIR$   in(nnmax, kmaxc, kRange, ijmaxc, nenukjt) &
  !DIR$   nocopy(pi0im_s1:alloc_if(.true.),free_if(.false.)) &
  !DIR$   nocopy(pi0re_s1:alloc_if(.true.),free_if(.false.))
  call charged(1, 1, 2, etanp, omcck, qcck, cqq0, cqq1, cqqc, &
               tabe, pi0im_s1, pi0re_s1, nnmax, kmaxc, kRange, &
               kijminc, ijmaxc, nenukjt)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann
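Addendum on the asynchronous-overlap item above: a minimal, hedged sketch (not from the talk) of how transfer/compute overlap can be expressed with Intel's LEO signal/wait clauses; the array size, the helper routines and the device number are illustrative assumptions only.

  subroutine offload_async(etanp, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: etanp(n)
    integer :: sig                           ! tag variable identifying the transfer
    !DIR$ ATTRIBUTES OFFLOAD : mic :: charged_kernel

    ! start copying the input to the coprocessor without blocking the host ...
    !DIR$ OFFLOAD_TRANSFER target(mic:0) in(etanp: alloc_if(.true.), free_if(.false.)) signal(sig)

    call other_host_work()                   ! hypothetical host work, overlapped with the PCIe transfer

    ! ... and launch the kernel only once the data has arrived on the MIC
    !DIR$ OFFLOAD target(mic:0) wait(sig) nocopy(etanp)
    call charged_kernel(etanp, n)            ! hypothetical offloaded routine
  end subroutine offload_async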

Programming Paradigms

B) Native MIC:

●  use the MIC as a standalone processor for the complete program
●  memory limits (~ 8 GB)
●  programming approach: MPI, OpenMP, (Cilk, TBB, OpenCL, ...); a minimal sketch follows below
●  very useful for prototyping, algorithmic evaluation, ... on a single coprocessor
●  preview to KNL / future exascale systems (reminiscent of BlueGene)
●  effective at scale today? → the CPUs essentially act as a communication bottleneck
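To make the “MPI, OpenMP” item concrete: a minimal sketch (our illustration, not from the talk) of a hybrid program that runs natively on the coprocessor when cross-compiled for it, e.g. with Intel's -mmic flag; nothing MIC-specific appears in the source.

  program native_hello
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, rank, nthreads

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    nthreads = 0
    !$omp parallel
    !$omp master
    nthreads = omp_get_num_threads()   ! up to 4 x 60 = 240 HW threads on a 5110P
    !$omp end master
    !$omp end parallel

    print '(a,i0,a,i0,a)', 'rank ', rank, ' uses ', nthreads, ' OpenMP threads'
    call MPI_Finalize(ierr)
  end program native_hello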

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Programming Paradigms

C) Symmetric, heterogeneous cluster (CPU + MIC):

●  use CPUs and MICs as a heterogeneous cluster (distribute MPI ranks across both components)
●  memory limits, load balancing, PCIe (and QPI) bandwidth & latency
●  programming approach: MPI, OpenMP, …
●  hybrid MPI/OpenMP appears mandatory for load-balancing MPI tasks on CPU and MIC; a load-balancing sketch follows the reference below
●  useful for coarse-grained (“workflow”-like) parallelism?
●  effective at scale?
●  QPI bottleneck (remedy: 2 InfiniBand HCAs per node)

V. Karpusenko, A. Vladimirov: Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors (2014)
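One hedged illustration of the load-balancing point (our sketch, not from the talk): in a symmetric run, a rank can detect whether it executes on the host or on a coprocessor, e.g. via the processor name, and take a smaller work share on the MIC. The “-mic” hostname convention and the weights are assumptions.

  program symmetric_balance
    use mpi
    implicit none
    integer :: ierr, rank, nlen
    character(len=MPI_MAX_PROCESSOR_NAME) :: pname
    real(8) :: work_share

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Get_processor_name(pname, nlen, ierr)

    if (index(pname(1:nlen), '-mic') > 0) then
       work_share = 0.6d0     ! illustrative weight for a coprocessor rank
    else
       work_share = 1.0d0     ! host (CPU) rank
    end if
    print *, 'rank', rank, 'on', trim(pname(1:nlen)), 'takes work share', work_share
    ! ... distribute grid columns / particles proportionally to work_share ...

    call MPI_Finalize(ierr)
  end program symmetric_balance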

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Hardware performance characteristics

Compute and memory performance, “Knights Corner” vs. “Ivy Bridge” (8 byte/DP):

●  Intel Xeon E5-2680v2 (10 cores / 2.8 GHz):
   2.8 GHz x 4 lanes x 2 ops = 22.4 GFlop/s/core (1 thread per core); 40 GB/s (STREAM)
●  Intel Xeon Phi 5110P (60 cores / 1.053 GHz, ECC on):
   1.053 GHz x 8 lanes x 2 ops = 16.8 GFlop/s/core (4 threads per core); 160 GB/s (STREAM)

=> a balanced ~4.5x performance increase per node (the arithmetic is spelled out below)
=> ~5.3x weaker performance per HW thread: high scalability and SIMD vectorization required

ENES Workshop, HH, Mar 17-19, 2014
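One way to arrive at the quoted factors (our reading, assuming the 2 CPU + 2 MIC node from the system-architecture slide, so per-device ratios equal per-node ratios):

  \[
    \frac{60 \times 16.8~\mathrm{GFlop/s}}{10 \times 22.4~\mathrm{GFlop/s}} \approx 4.5 ,
    \qquad
    \frac{160~\mathrm{GB/s}}{40~\mathrm{GB/s}} = 4
    \quad \text{(hence ``balanced'')}
  \]
  \[
    \frac{22.4~\mathrm{GFlop/s\ per\ CPU\ core\ (1\ thread)}}{16.8/4~\mathrm{GFlop/s\ per\ MIC\ HW\ thread}} \approx 5.3
  \]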

M. Rampp & H. Bockelmann

Hardware performance characteristics

Basic scalability measurements

[Figures: OpenMP overhead (native); STREAM memory bandwidth (native)]

Extensive microbenchmarks: see arXiv:1310.5842

Results consistent with, e.g.: T. Cramer, D. Schmidl, M. Klemm, D. an Mey: OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison (2012)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Motivation (… why bother?)

1) Compute performance:
●  substantial nominal gains in compute performance (wrt. multi-core CPU): 5x...10x...100x?
●  2x...3x sustained speedups (GPU vs. multi-core CPU: nowadays called an “apples-to-apples” comparison in the GPU community!)
●  porting and achieving application performance requires hard work: porting an HPC application to Xeon Phi is a project (like GPU)

2) Energy efficiency:
●  substantial nominal energy-efficiency gains: 2x...3x (a must for exascale: 50x...100x required!)
●  cost effectiveness at fixed budget (based on TDP values and assuming scalable applications): sustained application speedups of 2x are reasonable

3) Existing resources and technology readiness:
●  significant GPU- and MIC-based resources around the world → competition aspects: grants, impact, ...

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Porting MPG applications to MIC & GPU

Context: assessment of accelerator technology for the MPG (RZG, starting 2012)

●  porting of HPC application codes developed in the MPG to GPU/MIC
●  assessment of existing GPU/MIC applications (e.g. MD: GROMACS, NAMD, …) relevant for the MPG

=> input for the configuration of the new HPC system of the MPG (“spend x% of the budget on MIC and/or GPU”)

General strategy and methodology

●  we target heterogeneous GPU/MIC-cluster applications, leveraging the entire resource
●  programming:
   ●  MIC: guided auto-vectorization and moderate code changes (loop interchange, …) only!
   ●  GPU: CUDA kernels (no choice so far)
●  performance comparison: we always compare with highly optimized (SIMD, multi-core) CPU code!
●  platforms:
   ●  Intel Xeon Phi (5110P, 7120P) vs. Sandy/Ivy Bridge (E5-2670, 8 cores @ 2.6 GHz; E5-2680v2, 10 cores @ 2.8 GHz)
   ●  NVidia Kepler (K20x) vs. Intel Sandy/Ivy Bridge (E5-2670, 8 cores @ 2.6 GHz; E5-2680v2, 10 cores @ 2.8 GHz)

ENES Workshop, HH, Mar 17-19, 2014

GPEC Poisson solver

GPEC (MPI f. Plasma Physics, K. Lackner)

●  a real-time solver (~ 1 ms) for the Grad-Shafranov equation (2-dim. MHD equilibrium in a tokamak): required for diagnostics (“shape” and position of the magnetic field) and steering: ASDEX, ITER
●  parallel algorithm based on the classical Fourier method for boundary value problems (e.g. Numerical Recipes), implemented on CPU (Rampp et al., Fusion Science & Technology, 2012)
●  GPU: abandoned (Zhukov, Master Thesis, TUM, 2010)
●  MIC: experimental, no specific optimization efforts; native mode (1 hour for “porting”)
●  motivation: combined “microbenchmarks”:
   ●  M x 1D-DFT (size N); a DFTI setup sketch follows below
   ●  N x tridiagonal solve (size M): vectorized & parallelized over N (inspired by R. Fischer, 1998)
   ●  matrix-vector multiplication, size 4(N+M)²
●  (M,N) = (32,64) … (256,512): “weak scaling” CPU → MIC

Timings (number of threads in parentheses):

            CPU (1)    MIC (1)    CPU (10)   MIC (180)
  DFT        1.7        28.6       0.2        0.6
  DGEMV      0.97       4.89       0.2        0.14
  TRID       0.68       11.22      0.15       0.48
  total      3.58       47.0       0.59       1.55

ENES Workshop, HH, Mar 17-19, 2014
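For the “M x 1D-DFT (size N)” microbenchmark, a hedged sketch of the kind of batched real-to-complex setup one would use via MKL's DFTI interface (our reconstruction; routine and variable names and the omitted error handling are illustrative, and the same source runs natively on the MIC):

  subroutine batched_r2c(x, y, n, m)
    use MKL_DFTI
    implicit none
    integer, intent(in)       :: n, m
    real(8), intent(inout)    :: x(n*m)          ! m real input signals of length n
    complex(8), intent(inout) :: y((n/2+1)*m)    ! m half-spectra
    type(DFTI_DESCRIPTOR), pointer :: h
    integer :: st

    st = DftiCreateDescriptor(h, DFTI_DOUBLE, DFTI_REAL, 1, n)
    st = DftiSetValue(h, DFTI_NUMBER_OF_TRANSFORMS, m)    ! batch of m transforms
    st = DftiSetValue(h, DFTI_INPUT_DISTANCE,  n)         ! stride between input signals
    st = DftiSetValue(h, DFTI_OUTPUT_DISTANCE, n/2+1)     ! stride between output spectra
    st = DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
    st = DftiSetValue(h, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    st = DftiCommitDescriptor(h)
    st = DftiComputeForward(h, x, y)                      ! all m FFTs in one call
    st = DftiFreeDescriptor(h)
  end subroutine batched_r2c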

MNDO (QM/MM)

MNDO (MPI f. Coal Research, W. Thiel)

●  semiempirical quantum chemistry calculations
●  highly optimized for multicore CPUs; shared-memory FORTRAN code
●  hotspots: eigenvalue solver with matrix sizes O(1k...10k), DGEMM, Jacobi rotations
●  GPU experience:
   ●  successfully ported to single CPU-GPU (Wu, Koslowski & Thiel, JCTC, 2012)
   ●  ported to 2 GPUs (X. Wu, K. Reuter) employing MAGMA 1.4: 2-GPU SYEVD
   ●  ~ 2x speedup wrt. CPU (node)

MIC results

●  ported to a single MIC by K. Reuter (RZG) within a day
●  offload mode (Jacobi) + automatic offload (DGEMM); a sketch follows below
●  performance competitive with a single GPU (K20x): ~ 2x speedup wrt. CPU (socket)
   → required: multi-MIC eigenvalue solver (MAGMA-MIC, MKL?)
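A hedged sketch of the “automatic offload (DGEMM)” route: with MKL automatic offload enabled in the environment, an ordinary DGEMM call can be split between host and coprocessor without code changes; the matrix sizes and names here are illustrative, not taken from MNDO.

  ! Run with MKL automatic offload enabled, e.g.
  !   export MKL_MIC_ENABLE=1
  ! (optionally MKL_MIC_WORKDIVISION to control the host/MIC split).
  subroutine big_dgemm(a, b, c, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: a(n,n), b(n,n)
    real(8), intent(inout) :: c(n,n)

    ! plain MKL DGEMM call; for sufficiently large n, MKL may offload
    ! part of the work to the Xeon Phi automatically
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  end subroutine big_dgemm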

ENES Workshop, HH, Mar 17-19, 2014

VERTEX

VERTEX (H.-Th. Janka, Max-Planck-Inst. for Astrophysics)

●  radiation hydrodynamics code (Boltzmann equation) for first-principles Type-II supernova simulations (Rampp & Janka, A&A 2002; Buras et al., A&A 2006)
●  FORTRAN code with hybrid MPI/OpenMP parallelization
●  in production and under continuous development since 2001
●  ported to all major HPC architectures (CPU, Tier-0 class)
●  achieved scalability up to O(100k) cores: 0.25 PFlop/s sustained on 131000 cores of SuperMUC@LRZ, 10% FP efficiency (Marek et al., Proc. of ParCO 2013)
●  achieved 2x application speedup (strong scaling) on GPU (Dannert et al., Proc. of ParCO 2013, arXiv:1310.1485)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

VERTEX

Results on Xeon Phi (offload mode); main characteristics of the kernel (local physics):

●  numerical integration on a 5-dimensional grid (~ 150 x 10^6 grid points, reduction to 3 dims)
●  high degree of data parallelism and arithmetic intensity
●  CPU implementation: outer loops: OpenMP, inner loops: SIMD, + reductions (a schematic sketch follows below)
●  small amount of data transfer: ~ 150 DP Flop per transferred byte
●  SIMD vectorization and scalability of the CPU version improved
●  caveat: measurements used a stripped-down prototype of the kernel, not counting OpenMP overheads
   => not production-ready on Xeon Phi

ENES Workshop, HH, Mar 17-19, 2014
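A schematic sketch of the parallelization pattern just described (our illustration, not the actual VERTEX source): OpenMP across the outer energy bins, SIMD across the vectorized radius loop (OpenMP 4.0 notation), with the reduction over the j/k loops accumulated into per-radius partial sums; the placeholder kernel term is made up.

  subroutine integrate_rates(rate, kijmin, kijmax, n_ebins, n_radius)
    implicit none
    integer, intent(in)  :: n_ebins, n_radius
    integer, intent(in)  :: kijmin(n_ebins,n_ebins), kijmax(n_ebins,n_ebins)
    real(8), intent(out) :: rate(n_radius, n_ebins)
    real(8) :: tmp(n_radius)
    integer :: i, j, k, l

    !$omp parallel do private(i, j, k, l, tmp)
    do i = 1, n_ebins                        ! outer loop: OpenMP threads
       tmp(:) = 0.0d0                        ! partial sums over radius for energy bin i
       do j = 1, n_ebins
          do k = kijmin(i,j), kijmax(i,j)
             !$omp simd                      ! inner loop: SIMD vectorization
             do l = 1, n_radius
                tmp(l) = tmp(l) + dble(i+j+k+l)   ! placeholder for the physics kernel
             end do
          end do
       end do
       rate(:, i) = tmp(:)                   ! reduced result for energy bin i
    end do
    !$omp end parallel do
  end subroutine integrate_rates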

VERTEX

The GPU experience: achieved 2x sustained application speedup on GPU (Dannert et al., Proc. of ParCO 2013, arXiv:1310.1485)

●  fast CUDA-C kernels: 7x speedup of individual routines (vs. CPU; 54x vs. a single core)
●  not trivial to implement and optimize for GPU (e.g. reductions)
●  CUDA Fortran not competitive (PGI compiler on the CPU → work in progress with PGI)
   → CUDA C version encapsulated via the ISO-C bindings of Fortran 2003
   → Fortran wrappers for all major CUDA C functions (cudaMalloc, cudaFree, ...); a sketch of such a wrapper follows after the figure below
●  asynchronous CPU-GPU scheduling scheme (application-specific)

=> a major effort for porting and optimization:
●  A. Marek (RZG, member of the VERTEX development team): 6 months (FTE)

[Figure: application-specific asynchronous CPU-GPU scheduling: the rate KJT offloads of the 4 CPU threads to GPU 1 are staggered against the CPU-side computation of rates 1-4 (see also the backup slides)]

A speedup of 2x...3x (strong scaling!) is not easily achievable for VERTEX by other means (CPU optimization).
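A hedged sketch of what such an ISO-C-binding wrapper looks like for two of the CUDA runtime calls named above (our illustration of the technique, not the VERTEX source):

  module cuda_runtime_bindings
    use iso_c_binding
    implicit none
    interface
       ! cudaError_t cudaMalloc(void **devPtr, size_t size)
       integer(c_int) function cudaMalloc(devPtr, nbytes) bind(c, name='cudaMalloc')
         import :: c_ptr, c_size_t, c_int
         type(c_ptr)              :: devPtr    ! passed by reference (void **)
         integer(c_size_t), value :: nbytes
       end function cudaMalloc

       ! cudaError_t cudaFree(void *devPtr)
       integer(c_int) function cudaFree(devPtr) bind(c, name='cudaFree')
         import :: c_ptr, c_int
         type(c_ptr), value :: devPtr          ! passed by value (void *)
       end function cudaFree
    end interface
  end module cuda_runtime_bindings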

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Further projects

GENE, Eulerian gyrokinetics (IPP, RZG):

●  suspended due to poor FFT performance on MIC
●  O(1000) independent, short 1-d FFTs (r2c) from MKL
●  using MKL (DFTI interface), with internal parallelization over individual FFTs
●  PCIe limited (cf. Dannert et al., arXiv:1310.1485)

Einstein@home, GW data analysis methods (AEI):

●  successful on GPU (Allen et al., arXiv:1303.0028)
●  MIC: abandoned due to poor FFT performance

NSCOUETTE, pseudospectral Navier-Stokes solver (MPI-DS, U. Erlangen, IST, RZG):

●  hybrid code (Shi et al., arXiv:1311.2481), promising for MIC offload
●  hotspot (linear systems) has recently disappeared after thorough CPU optimization → FFTs dominate again

JOREK, MHD (IPP/HLST), by T. Feher (HLST/IPP):

●  work in progress, depends on sparse solvers from the PaStiX library
●  some promising first results (native mode) on vectorization

ENES Workshop, HH, Mar 17-19, 2014

Summary and Conclusions

MIC and GPU technology can no longer be ignored for HPC:

●  existing resources and applications, technology evolution (SIMD, threads, memory, ...)
●  current MIC/GPU clusters can in principle deliver 2x...4x speedups for HPC applications
●  in practice: GPU: yes (MPG and community codes); Xeon Phi: where are the success stories?
●  porting and optimization require considerable efforts:
   ●  GPU: > 6 months (HPC specialists with intimate knowledge of the code); sustainability? (→ OpenACC, OpenMP, ... ?)
   ●  MIC: initial porting is relatively straightforward, achieving good performance is not

Worth the effort?

●  computational scientists' (our) point of view: definitely yes
   ●  developed thorough expertise on the technology => consulting
   ●  rethinking of algorithms and implementations pays off
●  scientific application developers' point of view: … not immediately obvious
   ●  2x...3x speedups do not enable qualitatively new science objectives => reluctance to “sacrifice” human effort, code maintainability, …
   ●  regular CPUs still do a very good job and roadmaps promise further performance increases

=> … business as usual?

ENES Workshop, HH, Mar 17-19, 2014

Summary and Conclusions

New HPC system of the MPG: Hydra@RZG

●  4000 Ivy/Sandy Bridge nodes
●  676 GPUs (K20x), 24 Xeon Phi (5110P)

Employing Xeon Phi as …

●  a vehicle to improve SIMD (> 256 bit) and multi-threading (> 40 threads) in our codes
●  a vehicle to advance hybrid programming and emphasize high-level programming models (vs. GPU/CUDA)
   … but how to economically utilize hundreds of Xeon Phis in a large HPC cluster today?
●  a preview to Knights Landing (the next “BlueGene”?) and similar developments towards exascale

The full public information on Knights Landing is this (software.intel.com, Jan 2014):

●  Knights Landing is the code name for the 2nd-generation product in the Intel® Many Integrated Core Architecture
●  Knights Landing targets Intel’s 14 nanometer manufacturing process
●  Knights Landing will be productized as a processor (running the host OS) and a coprocessor (a PCIe end-point device)
●  Knights Landing will feature on-package, high-bandwidth memory
●  Flexible memory modes for the on-package memory include: flat, cache, and hybrid modes
●  Knights Landing will support Intel® Advanced Vector Extensions AVX-512, details of which are already published
●  Any information beyond that is rumour and speculation which we cannot confirm or deny.

Rumour and speculation: 72 “Airmont” (Atom) cores, 16 GB memory, 500 GB/s, 3 TFlop/s, GA: 2014/2015, ... (Wikipedia; see also http://www.realworldtech.com/knights-landing-details/)

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

BACKUP

Motivation (… why bother?)

1) Compute performance

●  substantial nominal gains in compute performance: 5x … 10x (DP floating point)
   ●  time to solution matters, strong scaling within nodes appears prospective
   ●  weak scaling is often becoming less relevant/attractive
●  caveat: sustained performance on “real-world” scientific applications
   ●  from aggressive marketing (ECC off, single-core speedups, “free lunch” of porting) …
   ●  … towards more realistic attitudes (… thanks to the NVidia-Intel competition?)
   [from: Accelerating Computational Science Symposium, ORNL (2012)]
●  2x...3x sustained speedups (GPU vs. multi-core CPU: nowadays called an “apples-to-apples” comparison in the GPU community!)
●  porting and achieving decent application performance requires hard work: porting an application to Xeon Phi is a project (like GPU)
   [from: Intel Xeon Phi Product Family Performance, Rev 1.4, 12/30/13]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Motivation (… why bother?)

2) Energy efficiency

●  substantial nominal energy-efficiency gains (GFlops/Watt): 2x...3x (a must for exascale: 50x...100x required!)
●  caveat: sustained efficiency on “real-world” clusters

Cost effectiveness at fixed budget (based on TDP values and assuming scalable applications):

●  investment: exchange ratio r < 1: n nodes (2 CPU) → r x n nodes (2 CPU + 2 GPU)
   => s ≥ 1/r  (time to solution, Flops/€)
●  operation: n x (2 x 115 W) → r x n x (2 x 115 W + 2 x 235 W)
   => s ≥ r² x 3.04  (energy to solution, Flops/W)

=> sustained application speedups of 2x are reasonable (assuming r ~ 0.5); the constants are spelled out below

ENES Workshop, HH, Mar 17-19, 2014
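For reference, the numbers behind the constants (our reading; the power figures are the TDP values quoted above, 2 x 115 W for the CPUs and 2 x 235 W for the accelerators per node):

  \[
    \frac{2\times115\,\mathrm{W} + 2\times235\,\mathrm{W}}{2\times115\,\mathrm{W}}
    = \frac{700\,\mathrm{W}}{230\,\mathrm{W}} \approx 3.04 ,
    \qquad
    r \sim 0.5 \;\Rightarrow\; \frac{1}{r} = 2 .
  \]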

M. Rampp & H. Bockelmann

Motivation (… why bother?)

3) Existing resources and technology evolution

●  by many, the technology is considered inevitable for the future (→ exascale)
●  competition aspects:
   ●  computing-time grants
   ●  relevance/impact of codes (e.g. GENE, FHI-aims, GROMACS, ...)
●  caveat: the price to pay for application development?

[Figure: Highlights of the 42nd TOP500 List, SC13, Denver, CO]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

VERTEX

Rate KJT: structure of the code

●  ~ 2500 lines of code
●  executed 2 times (neutral and charged current)
●  dimensions: #Ebins ~ 17, k-loop ~ 22000, #radius bins ~ 300 ⇒ ~ 150e6 iterations, AI ~ 10

  do h = 1, #flavours
    do i = 1, #Ebins
      do j = 1, #Ebins
        do k = kijmin(i,j), kijmax(i,j)
          do l = 1, #radiusBins
            ...                       ! (10 times)
          enddo
        enddo
      enddo
      rate(l,i,h) = ...               ! reduction here: kills parallelism
    enddo
  enddo

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Scheduling (CPU)

[Figure: baseline CPU-only scheduling: each of the 4 CPU (OpenMP) threads computes rate 1, rate 2, rate 3, rate 4 and rate KJT itself]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Naive Scheduling (GPU)

[Figure: naive GPU scheduling: each of the 4 CPU threads computes rates 1-4 on the CPU and then offloads rate KJT to GPU 1]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann

Advanced Scheduling (GPU)

[Figure: advanced GPU scheduling: the rate KJT offloads of the 4 CPU threads to GPU 1 are staggered against the CPU-side computation of rates 1-4, keeping GPU 1 busy while the CPU threads compute other rates]

ENES Workshop, HH, Mar 17-19, 2014

M. Rampp & H. Bockelmann
