CPU alternatives for future high-performance systems

Author: Alyson Phelps
www.bsc.es

CPU alternatives for future high-performance systems
Nikola Puzović
Barcelona Supercomputing Center

Motivation

Custom Pure Vector CPUs

Commodity

???

Server CPUs HPC Accelerators Nikola Puzovic - CPU Alternatives for Future HPC Systems


Outline

A little bit of history
– From vector CPUs to commodity components

Killer mobile processors
– Overview of current trends for mobile CPUs

Our experiences
– Low-power prototypes at BSC

Disclaimer: All references to unavailable products are speculative, taken from web sources. No commitment from ARM, Samsung, Intel, or others is implied.


In the beginning ... there were only supercomputers

Built to order
– Very few of them

Special purpose hardware
– Very expensive

Control Data

Cray-1
– 1975, 160 MFLOPS
• 80 units, 5-8 M$

Cray X-MP
– 1982, 800 MFLOPS

Cray-2
– 1985, 1.9 GFLOPS

Cray Y-MP
– 1988, 2.6 GFLOPS

... Fortran + vectorizing compilers


Then, commodity took over special purpose

ASCI Red, Sandia
– 1997, 1 TFLOPS (Linpack), 9298 processors at 200 MHz, 1.2 TBytes, 850 kWatts
– Intel Pentium Pro
• Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS

ASCI White, Lawrence Livermore Lab.
– 2001, 7.3 TFLOPS, 8192 proc. RS6000 at 375 MHz, 6 TBytes, (3+3) MWatts
– IBM Power 3

Message-Passing Programming Models

“Killer microprocessors”

[Figure: MFLOPS (log scale, 10 to 10.000) versus year (1974 to 1999). Vector CPUs: Cray-1, Cray-C90, NEC SX4, SX5. Microprocessors: Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200.]

Microprocessors killed the Vector supercomputers
– They were not faster ...
– ... but they were significantly cheaper and greener

10 microprocessors approx. 1 Vector CPU
– SIMD vs. MIMD programming paradigms

Finally, commodity hardware + commodity software

MareNostrum
– Nov 2004, #4 Top500
• 20 TFLOPS, Linpack
– IBM PowerPC 970 FX
• Blade enclosure
– Myrinet + 1 GbE network
– SuSe Linux


2008 – 1 PFLOPS – IBM RoadRunner

Los Alamos National Laboratory (USA)

Hybrid architecture
– 1 x AMD dual-core Master blade
– 2 x PowerXCell 8i Worker blade

Hybrid MPI + Task off-load model

296 racks
– 6.480 Opteron processors
– 12.960 Cell processors
• 128-bit SIMD

Infiniband interconnect
– 288-port switches

2.35 MWatt (425 MFLOPS / W)

2009 – Cray Jaguar (1.8 PFLOPS)

Oak Ridge National Laboratory (USA)

Multi-core architecture
– Hybrid MPI + OpenMP programming

230 racks

224.256 AMD Opteron processors
– 6 cores / chip

Cray Seastar2+ interconnect
– 3D-mesh using AMD Hypertransport

7 MWatt (257 MFLOPS / W)


2012 – Cray Titan (17.6 PFLOPS)

DOE/SC/Oak Ridge National Laboratory
– Jaguar GPU upgrade

200 racks

224.256 Cray XK7 nodes
– 16-core AMD Opteron
– Nvidia Tesla K20X GPU

8.2 MWatt (2.142 MFLOPS / W)
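The MFLOPS / W figures quoted for RoadRunner, Jaguar and Titan can be checked by dividing the quoted performance by the quoted power. The sketch below is only an illustration using the rounded numbers from the slides; the function and dictionary names are hypothetical. Because the inputs are rounded, the results land close to, but not exactly on, the slide values (the Titan slide writes its figure with a dot as the thousands separator).

```python
# Energy efficiency = performance / power, using the rounded slide figures.

def mflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert a (PFLOPS, MW) pair into MFLOPS per watt."""
    mflops = pflops * 1e9    # 1 PFLOPS = 1e9 MFLOPS
    watts = megawatts * 1e6  # 1 MW = 1e6 W
    return mflops / watts

# (performance in PFLOPS, power in MW) as quoted on the slides
systems = {
    "RoadRunner (2008)": (1.0, 2.35),
    "Jaguar (2009)": (1.8, 7.0),
    "Titan (2012)": (17.6, 8.2),
}

efficiency = {name: mflops_per_watt(p, mw) for name, (p, mw) in systems.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.0f} MFLOPS/W")
```

The GPU-accelerated Titan comes out several times more efficient than either earlier system, which is the trend the three slides illustrate.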

Outline

A little bit of history
– From vector CPUs to commodity components

Killer mobile processors
– Overview of current trends for mobile CPUs

Our experiences
– Low-power prototypes in BSC


The next step in the commodity chain

HPC
– Total cores in Nov'12 Top500: 14.9M cores

Servers

Desktop

Mobile
– Tablets sold 2012: > 100M tablets
– Smartphones sold 2012: > 712M phones

Current trends in mobile CPUs

We want to see how mobile SoCs behave with HPC apps
– Current test systems have limited memory
– Use a set of HPC-specific micro-kernels

Micro-kernels
– Stress different architectural features
– Cover a wide range of HPC application domains
– Reduce porting effort to new architectures

Single core and multi-core evaluations
– Goal is to see how mobile CPUs change through generations…
– …and to compare them to modern HPC cores


ARM Cortex-A9

Smartphone CPU

OoO superscalar processor
– Issue width of 4

VFP for 64-bit Floating Point
– DP: 1 FMA every 2 cycles

The first ARM CPU truly usable for testing HPC workloads


NVIDIA Tegra2

Dual-core Cortex-A9 @ 1 GHz
– VFP for 64-bit Floating Point
• 2 GFLOPS (1 FMA / 2 cycles)

Low-power Nvidia GPU
– OpenGL only, CUDA not supported

Several (not useful for HPC) accelerators
– Video encoder-decoder
– Audio processor
– Image processor

SECO Q7 board

2 GFLOPS ~ 0.5 Watt

NVIDIA Tegra3

Quad-core Cortex-A9 @ 1.3 GHz
– VFP for 64-bit Floating Point
• 5.2 GFLOPS (1 FMA / 2 cycles)
– NEON for 32-bit Floating Point SIMD

Low-power Nvidia GPU
– 3x faster than Tegra2
• For graphics only
– CUDA not supported

SECO Q7 board


ARM Cortex-A15

Next generation of Cortex
– Improved uArch
– Improved performance
– Virtualization support
– Improved multiprocessing capabilities

Floating point performance
– DP: 1 FMA per cycle


Samsung Exynos 5 Dual

Dual-core ARM Cortex-A15 @ up to 1.7 GHz
– VFP for 64-bit Floating Point
• 6.8 GFLOPS (1 FMA / cycle)
– NEON for 32-bit Floating Point SIMD

Quad-core ARM Mali T604
– Compute capable
• OpenCL 1.1
• 68 GFLOPS (SP)

Shared memory between CPU and GPU
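The double-precision GFLOPS figures quoted for Tegra2, Tegra3 and Exynos 5 Dual follow from cores x clock x FLOPs per cycle, counting one FMA as two floating-point operations (the usual convention, and the one that reproduces the slide numbers). A minimal sketch, with hypothetical function and variable names:

```python
# Peak double-precision throughput from core count, clock and FMA issue rate.

def peak_dp_gflops(cores: int, ghz: float, fma_per_cycle: float) -> float:
    """Peak DP GFLOPS, counting each FMA as 2 FLOPs (multiply + add)."""
    flops_per_cycle = 2 * fma_per_cycle
    return cores * ghz * flops_per_cycle

# (cores, clock in GHz, DP FMAs per cycle per core) as quoted on the slides
socs = {
    "Tegra2": (2, 1.0, 0.5),         # Cortex-A9: 1 FMA every 2 cycles
    "Tegra3": (4, 1.3, 0.5),         # Cortex-A9: 1 FMA every 2 cycles
    "Exynos 5 Dual": (2, 1.7, 1.0),  # Cortex-A15: 1 FMA per cycle
}

peaks = {name: peak_dp_gflops(*spec) for name, spec in socs.items()}

for name, gflops in peaks.items():
    print(f"{name}: {gflops:.1f} GFLOPS")
```

Note that the dual-core Exynos 5 Dual exceeds the quad-core Tegra3 on peak because the per-core FMA rate doubles and the clock rises, even though the core count halves.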


Micro-kernels

Benchmark | Properties
Vector Operation (vecop) | Common operation in regular codes
Dense Matrix-Matrix Multiplication (dmmm) | Common operation: measures data reuse and compute performance
3D stencil (3dstc) | Strided memory accesses (7-point 3D stencil)
2D Convolution (2dcon) | Spatial locality
Fast Fourier Transform (fft) | Peak floating-point, variable-stride accesses
Reduction (red) | Varying levels of parallelism (scalar sum)
Histogram (hist) | Histogram with local privatisation, requires reduction stage
Merge Sort (msort) | Barrier operations
N-Body (nbody) | Irregular memory accesses
Atomic Monte-Carlo Dynamics (amcd) | Embarrassingly parallel: peak compute performance
Sparse Vector-Matrix Multiplication (spwm) | Load imbalance
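To make one of the table entries concrete, the sketch below illustrates the pattern named for the histogram kernel: each worker fills a private histogram, and a reduction stage then sums the private copies. This is not the benchmark code used in the talk; the function names, the chunking scheme and the use of Python threads are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def local_histogram(chunk, num_bins, lo, hi):
    """Count one chunk into a private histogram (no shared writes)."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for x in chunk:
        # Clamp so that x == hi falls into the last bin.
        b = min(int((x - lo) / width), num_bins - 1)
        counts[b] += 1
    return counts

def privatised_histogram(data, num_bins, lo, hi, workers=4):
    """Histogram with local privatisation followed by a reduction stage."""
    # One contiguous chunk per worker.
    step = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda c: local_histogram(c, num_bins, lo, hi), chunks)
        )
    # Reduction stage: element-wise sum of the private histograms.
    total = [0] * num_bins
    for part in partials:
        for b, n in enumerate(part):
            total[b] += n
    return total

data = [i % 10 for i in range(1000)]
hist = privatised_histogram(data, num_bins=10, lo=0, hi=10)
print(hist)
```

Python threads will not speed this up; the point is the structure. Privatisation avoids contended updates to shared bins, at the cost of the extra reduction pass the table mentions.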


Performance – Double precision FP (single core)

Performance normalized to Tegra2 board (single core)
– Benefits due to increased frequency and improved micro-architecture
– Better memory technology gives additional performance benefits


Performance – Double precision FP (multi-core)

Performance normalized to Tegra2 board (OpenMP)
– Threads used: 2 in Tegra2, 4 in Tegra3, 2 in Exynos
– New generation of ARM cores shows benefits despite smaller number of cores being used

Energy Efficiency – Double precision FP (multi-core)

Energy efficiency in GOps/Watt
– Gains proportional to improvements in execution time
– Increased frequency and complex architecture do add power…
– …but “overhead” is too large for this to have an effect

Results – single core summary

[Figure: two bar charts, Double Precision and Single Precision, showing speedup (axis 0 to 5) for Tegra 2, Tegra 3 and Exynos 5 Dual]

Results as expected
– T3 faster than T2: frequency increase
– Exynos faster than T3: uArch improvements

ARMv8 architecture

Cortex-A57 64-bit processor
– Improved performance in all workloads
• Up to 3x announced at same power budget
– Interoperability with ARM Mali GPUs

Double-precision floating-point performance
– DP support in the NEON instruction set
• 128-bit words
– Should double performance with respect to Cortex-A15

big.LITTLE processing
– Cortex-A53
– Low-performance, low-power companion to the A57
– Can pair A57s with A53s

What if…

[Figure: bar chart, Double Precision (single core), showing speedup (axis 0 to 6) for Tegra 2, Tegra 3, Exynos 5 Dual and ARMv8]

ARMv8 could improve double-precision floating-point performance
– We keep the same frequency for our projection (pessimistic)
– Increase the DP capability 2x (optimistic?)
– Just a speculation, but could happen…

What about current HPC CPUs?

[Figure: bar chart, Double Precision (single core), showing speedup (axis 0 to 16) for Tegra 2, Tegra 3, Exynos 5 Dual, ARM v8 and Sandy Bridge, annotated with 4.7x and 2.3x]

Comparison with single core of Intel SandyBridge-EP E5-2670 @ 2.6 GHz

Current “mobile champion” still far away
– Next generation mobile CPUs could bring significant improvements
– Also, back in time, SX5 was significantly faster than Pentium II…
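The two annotations on the chart (4.7x and 2.3x) appear to be consistent with the projection rule stated earlier in the talk: same frequency, 2x double-precision capability. The sketch below checks that arithmetic. It assumes, as an interpretation that the slide text does not state, that 4.7x is the single-core gap between Sandy Bridge and Exynos 5 Dual, and that micro-kernel performance scales linearly with DP capability. All names are hypothetical.

```python
# Remaining gap to a reference core after scaling the mobile core's DP rate.

def projected_gap(current_gap: float, dp_scaling: float,
                  freq_scaling: float = 1.0) -> float:
    """Gap to the reference core after the mobile core speeds up.

    Assumes performance scales linearly with DP capability and frequency.
    """
    return current_gap / (dp_scaling * freq_scaling)

# Assumed reading of the chart annotation: Sandy Bridge vs Exynos 5 Dual.
gap_exynos = 4.7

# Projection rule from the talk: same frequency, DP capability doubled.
gap_armv8 = projected_gap(gap_exynos, dp_scaling=2.0)

print(f"Projected gap: {gap_armv8:.2f}x")
```

Under these assumptions the projected gap is about half the current one, in line with the second annotation on the chart.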

The Killer Mobile processors™

[Figure: MFLOPS (log scale, 100 to 1.000.000) versus year (1990 to 2015) for Alpha, Intel, AMD, Nvidia Tegra and Samsung Exynos, plus a point labelled 4-core ARMv8 1.5 GHz]

History may be about to repeat itself …
– Mobile processors are not faster …
– … but they are significantly cheaper and greener

Then and now

Then: Vector vs Commodity

Now: Commodity vs Mobile

Today's situation looks very familiar
– “Mobile vs. Server” similar to “Server vs. Vector”
– Significantly lower cost of mobile CPUs (thousands vs hundreds of $)
– Same programming model, larger scale
• Will need more parallelism (probably less than one order of magnitude)

Of course, this does not prove anything
– Mobile CPUs will become a viable alternative, but there's no guarantee that they will make it to mainstream HPC systems

Outline

A little bit of history
– From vector CPUs to commodity components

Killer mobile processors
– Overview of current trends for mobile CPUs

Our experiences
– Low-power prototypes in BSC


BSC ARM-based prototype roadmap

[Figure: GFLOPS / W versus year (2011 to 2014) for Tibidabo (ARM multicore), Pedraforca (ARM + GPU) and Integrated ARM + GPU]

Prototypes are critical to accelerate software development
– System software stack + applications

Tibidabo: The first ARM multicore cluster

Q7 Tegra 2: 2 x Cortex-A9 @ 1 GHz, 2 GFLOPS, 5 Watts (?), 0.4 GFLOPS / W

Q7 carrier board: 2 x Cortex-A9, 2 GFLOPS, 1 GbE + 100 MbE, 7 Watts, 0.3 GFLOPS / W

1U rackable blade: 8 nodes, 16 GFLOPS, 65 Watts, 0.25 GFLOPS / W

2 racks: 32 blade containers, 256 nodes, 512 cores, 9 x 48-port 1 GbE switch, 512 GFLOPS, 3.4 kWatt, 0.15 GFLOPS / W

Proof of concept
– It is possible to deploy a cluster of smartphone processors

Enable software stack development
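The Tibidabo figures show efficiency dropping at each packaging level, from the bare module to the full cluster. One way to read them is as power per node: every level adds overhead (network, carrier board, power supply, switches) on top of the 5 W module. The sketch below only rearranges the numbers quoted on the slide; the level names and the `summarise` helper are hypothetical.

```python
# Efficiency and per-node power at each Tibidabo packaging level.
# Tuples: (level name, nodes, peak GFLOPS, power in watts), from the slide.
levels = [
    ("Q7 Tegra 2 module", 1, 2.0, 5.0),
    ("Q7 carrier board", 1, 2.0, 7.0),
    ("1U rackable blade", 8, 16.0, 65.0),
    ("2-rack cluster", 256, 512.0, 3400.0),
]

def summarise(levels):
    """Return GFLOPS/W and watts per node for every packaging level."""
    rows = []
    for name, nodes, gflops, watts in levels:
        rows.append({
            "name": name,
            "gflops_per_watt": gflops / watts,
            "watts_per_node": watts / nodes,
        })
    return rows

rows = summarise(levels)

for r in rows:
    print(f'{r["name"]}: {r["gflops_per_watt"]:.2f} GFLOPS/W, '
          f'{r["watts_per_node"]:.2f} W per node')
```

Peak GFLOPS per node stays at 2 throughout, so the whole drop in efficiency comes from the growing power cost per node.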

Tibidabo: scalability and energy efficiency

HPC applications scale out of the box on Tibidabo
– Strong scaling depends on the size of input set

HPL
– Good weak scaling
– 120 MFLOPS/Watt

Specfem3D
– Improvements over x86 cluster in energy efficiency (up to 3x)

D. Göddeke et al., “Energy-efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster”, Journal of Computational Physics

Pedraforca: ARM+GPU cluster

Stage One
– Test cluster of CARMA kits
• Tegra3 SoC
• Quadro 1000M
– 1 GbE interconnect

Stage Two
– ARM multicore SoC (NVIDIA)
– NVIDIA GPU

In progress…


Mont-Blanc project goals

To develop a European Exascale approach

Based on embedded power-efficient technology

Objectives
– Develop a first prototype system, limited by available technology
– Design a Next Generation system, to overcome the limitations
– Develop a set of Exascale applications targeting the new system


Mont-Blanc prototype

Exynos 5 Dual
– Integrated CPU + GPU
– Dual core Cortex-A15 + ARM Mali T604 GPU

Integrated GPU has many advantages
– Shared memory with CPU
• Even cache coherent!
– No power wasted on PCIe bus
– No power wasted on GDDR5 memory
– Higher energy efficiency + lower cost


High density packaging architecture

Standard BullX blade enclosure

Multiple compute nodes per blade
– Additional level of interconnect, on-blade network

Deployment expected later this year

Are we building BlueGene again?

Yes ...
– Exploit Pollack's Rule in presence of abundant parallelism
• Many small cores vs. single fast core

... and No
– Heterogeneous computing
• On-chip GPU
– Commodity vs. special purpose
• Higher volume
• Many vendors
• Lower cost
– Lots of room for improvement
• No SIMD / vectors yet ...
– Build on Europe's embedded strengths
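Pollack's Rule states that single-core performance grows roughly with the square root of the core's complexity (area or transistor budget). Under that rule, and assuming a perfectly parallel workload, splitting a fixed budget into n equal cores raises aggregate throughput by a factor of sqrt(n). The sketch below works through one case in arbitrary units; the function names and the budget of 16 are illustrative assumptions.

```python
import math

def single_core_perf(area: float) -> float:
    """Pollack's Rule: performance ~ sqrt(area), in arbitrary units."""
    return math.sqrt(area)

def many_core_throughput(total_area: float, cores: int) -> float:
    """Aggregate throughput of `cores` equal cores sharing `total_area`.

    Assumes the workload is perfectly parallel across all cores.
    """
    return cores * single_core_perf(total_area / cores)

budget = 16.0
big = many_core_throughput(budget, 1)     # one core using the whole budget
small = many_core_throughput(budget, 16)  # sixteen cores of area 1 each

print(f"1 big core: {big}, 16 small cores: {small}, ratio: {small / big}")
```

The advantage depends on the "abundant parallelism" condition named on the slide: a serial code would only see the performance of one small core.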

Conclusions

Vector vs Commodity

Commodity vs Mobile

Killer mobile processors
– Not yet there, but getting very close

We will see a supercomputer with mobile SoCs soon
– Mont-Blanc prototype @ BSC
– Question is if it will become mainstream

www.montblanc-project.eu
MontBlancEU
@MontBlanc_EU
