www.bsc.es
CPU alternatives for future high-performance systems Nikola Puzović Barcelona Supercomputing Center
Motivation
Custom Pure Vector CPUs
Commodity
???
Server CPUs HPC Accelerators Nikola Puzovic - CPU Alternatives for Future HPC Systems
Outline A little bit of history – From vector CPUs to commodity components
Killer mobile processors – Overview of current trends for mobile CPUs
Our experiences – Low-power prototypes in BSC
Disclaimer: All references to unavailable products are speculative, taken from web sources. No commitment from ARM, Samsung, Intel, or others is implied.
In the beginning ... there were only supercomputers Built to order – Very few of them
Special purpose hardware – Very expensive
Control Data Cray-1 – 1975, 160 MFLOPS • 80 units, 5-8 M$
Cray X-MP – 1982, 800 MFLOPS
Cray-2 – 1985, 1.9 GFLOPS
Cray Y-MP – 1988, 2.6 GFLOPS
Fortran + Vectorizing Compilers
Then, commodity took over special purpose
ASCI Red, Sandia – 1997, 1 TFLOPS (Linpack), 9298 processors at 200 MHz, 1.2 TBytes, 850 kW – Intel Pentium Pro • Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS
ASCI White, Lawrence Livermore Lab. – 2001, 7.3 TFLOPS, 8192 proc. RS6000 at 375 MHz, 6 TBytes, (3+3) MW – IBM Power 3
Message-Passing Programming Models
“Killer microprocessors”
[Figure: peak MFLOPS (log scale, 10 to 10,000) vs. year (1974 to 1999). Vector CPUs: Cray-1, Cray-C90, NEC SX4, SX5. Microprocessors: Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200.]
Microprocessors killed the Vector supercomputers – They were not faster ... – ... but they were significantly cheaper and greener
10 microprocessors approx. 1 Vector CPU – SIMD vs. MIMD programming paradigms
Finally, commodity hardware + commodity software MareNostrum – Nov 2004, #4 Top500 • 20 Tflops, Linpack
– IBM PowerPC 970 FX • Blade enclosure
– Myrinet + 1 GbE network – SuSe Linux
2008 – 1 PFLOPS – IBM RoadRunner Los Alamos National Laboratory (USA) Hybrid architecture – 1 x AMD dual-core Master blade – 2 x PowerXCell 8i Worker blade
Hybrid MPI + Task off-load model 296 racks – 6,480 Opteron processors – 12,960 Cell processors • 128-bit SIMD
Infiniband interconnect – 288-port switches
2.35 MW (425 MFLOPS/W)
2009 - Cray Jaguar (1.8 PFLOPS) Oak Ridge National Laboratory (USA) Multi-core architecture – Hybrid MPI + OpenMP programming
230 racks 224,256 AMD Opteron processor cores – 6 cores / chip
Cray Seastar2+ interconnect – 3D-mesh using AMD Hypertransport
7 MWatt
(257 MFLOPS / W)
2012 – Cray Titan (17.6 PFLOPS) DOE/SC/Oak Ridge National Laboratory – Jaguar GPU upgrade
200 racks 18,688 Cray XK7 nodes – 16-core AMD Opteron – Nvidia Tesla K20X GPU
8.2 MW (2,142 MFLOPS/W)
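The efficiency figures quoted for RoadRunner, Jaguar, and Titan follow from dividing performance by power. The sketch below redoes that arithmetic with the rounded numbers from the slides; the dictionary layout and function name are illustrative choices, not part of the talk, and the results match the quoted figures only to within rounding.

```python
# Performance (FLOPS) and power (watts) as rounded on the slides.
# The names and the tuple layout are illustrative assumptions.
SYSTEMS = {
    "RoadRunner (2008)": (1.0e15, 2.35e6),
    "Jaguar (2009)": (1.8e15, 7.0e6),
    "Titan (2012)": (17.6e15, 8.2e6),
}


def mflops_per_watt(flops: float, watts: float) -> float:
    """Energy efficiency in MFLOPS/W."""
    return flops / watts / 1e6


for name, (flops, watts) in SYSTEMS.items():
    print(f"{name}: {mflops_per_watt(flops, watts):.0f} MFLOPS/W")
```

Jaguar comes out lower than RoadRunner despite being a year newer, and the GPU upgrade to Titan raises efficiency by roughly a factor of eight.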
Outline A little bit of history – From vector CPUs to commodity components
Killer-mobile processors – Overview of current trends for mobile CPUs
Our experiences – Low-power prototypes in BSC
The next step in the commodity chain
Commodity chain: HPC, Servers, Desktop, Mobile
Total cores in Nov '12 Top500 – 14.9M cores
Tablets sold in 2012 – > 100M tablets
Smartphones sold in 2012 – > 712M phones
Current trends in mobile CPUs We want to see how mobile SoCs behave with HPC apps – Current test systems have limited memory – Use a set of HPC-specific micro-kernels
Micro-kernels – Stress different architectural features – Cover a wide range of HPC application domains – Reduce porting effort to new architectures
Single core and multi-core evaluations – Goal is to see how mobile CPUs change through generations… – …and to compare them to modern HPC cores
ARM Cortex-A9 Smartphone CPU OoO superscalar processor – Issue width of 4
VFP for 64-bit Floating Point – DP: 1 FMA every 2 cycles
The first ARM CPU truly usable for testing HPC workloads
NVIDIA Tegra2 Dual-core Cortex-A9 @ 1GHz – VFP for 64-bit Floating Point • 2 GFLOPS (1 FMA / 2 cycles)
Low-power Nvidia GPU – OpenGL only, CUDA not supported
Several (not useful for HPC) accelerators – Video encoder-decoder – Audio processor – Image processor
SECO Q7 board
2 GFLOPS ~ 0.5 Watt
NVIDIA Tegra3 Quad-core Cortex-A9 @ 1.3GHz – VFP for 64-bit Floating Point • 5.2 GFLOPS (1 FMA / 2 cycles)
– NEON for 32-bit floating Point SIMD
Low-power Nvidia GPU – 3x faster than Tegra2 • For graphics only
– CUDA not supported
SECO Q7 board
ARM Cortex-A15 Next generation of Cortex – Improved uArch – Improved performance – Virtualization support – Improved multiprocessing capabilities
Floating point performance – DP: 1 FMA per cycle
Samsung Exynos 5 Dual Dual-core ARM Cortex-A15 @ up to 1.7 GHz – VFP for 64-bit Floating Point • 6.8 GFLOPS (1 FMA / cycle)
– NEON for 32-bit floating Point SIMD
Quad-core ARM Mali T604 – Compute capable • OpenCL 1.1 • 68 GFLOPS (SP)
Shared memory between CPU and GPU
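The peak double-precision figures quoted for Tegra2, Tegra3, and the Exynos 5 Dual all come from the same product: cores, clock frequency, FMA issue rate, and two floating-point operations per FMA. A minimal sketch of that calculation follows; the function and table names are illustrative, not from the talk.

```python
def peak_dp_gflops(cores: int, ghz: float, fma_per_cycle: float) -> float:
    """Peak DP GFLOPS = cores x clock (GHz) x FMAs per cycle x 2 FLOPs per FMA."""
    return cores * ghz * fma_per_cycle * 2


# (cores, clock in GHz, FMAs per cycle per core), as stated on the slides.
SOCS = {
    "Tegra2": (2, 1.0, 0.5),          # Cortex-A9: 1 FMA every 2 cycles
    "Tegra3": (4, 1.3, 0.5),          # Cortex-A9: 1 FMA every 2 cycles
    "Exynos 5 Dual": (2, 1.7, 1.0),   # Cortex-A15: 1 FMA per cycle
}

for name, (cores, ghz, fma) in SOCS.items():
    print(f"{name}: {peak_dp_gflops(cores, ghz, fma):.1f} GFLOPS")
```

The Exynos 5 Dual reaches a higher peak than Tegra3 with half the cores because the Cortex-A15 doubles the per-cycle FMA rate and runs at a higher clock.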
MicroKernels
Benchmark – Properties
Vector Operation (vecop) – Common operation in regular codes
Dense Matrix-Matrix Multiplication (dmmm) – Common operation: measures data reuse and compute performance
3D stencil (3dstc) – Strided memory accesses (7-point 3D stencil)
2D Convolution (2dcon) – Spatial locality
Fast Fourier Transform (fft) – Peak floating-point, variable-stride accesses
Reduction (red) – Varying levels of parallelism (Scalar sum)
Histogram (hist) – Histogram with local privatisation, requires reduction stage
Merge Sort (msort) – Barrier operations
N-Body (nbody) – Irregular memory accesses
Atomic Monte-Carlo Dynamics (amcd) – Embarrassingly parallel: peak compute performance
Sparse Vector-Matrix Multiplication (spvm) – Load imbalance
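To make two of the listed access patterns concrete, here is a rough sketch of a 7-point 3D stencil sweep and a histogram with local privatisation followed by a reduction stage. These are not the benchmark sources used in the talk: the function names, the coefficients, and the serial emulation of worker threads are all assumptions made for illustration.

```python
def stencil7(grid, c0=0.5, c1=1.0 / 12.0):
    """One sweep of a 7-point 3D stencil over the interior of a cubic grid.

    grid is a nested list indexed as grid[z][y][x]; boundary cells are copied
    unchanged. c0 and c1 are arbitrary illustrative coefficients.
    """
    n = len(grid)
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                # In a flat row-major array these three neighbour pairs sit
                # at strides n*n, n, and 1 from the centre cell.
                neighbours = (
                    grid[z - 1][y][x] + grid[z + 1][y][x]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                )
                out[z][y][x] = c0 * grid[z][y][x] + c1 * neighbours
    return out


def histogram_privatised(values, bins, workers=4):
    """Histogram using one private bin array per worker, then a reduction.

    The workers are emulated serially here; in a threaded version each
    loop iteration over `chunks` would run on its own thread.
    """
    chunks = [values[w::workers] for w in range(workers)]
    private = []
    for chunk in chunks:
        local = [0] * bins          # private copy: no atomics needed
        for v in chunk:
            local[v % bins] += 1
        private.append(local)
    # Reduction stage: sum the private histograms bin by bin.
    return [sum(h[b] for h in private) for b in range(bins)]
```

The stencil stresses the memory system through its large-stride neighbour accesses, while the histogram trades extra memory (one bin array per worker) for the absence of contended updates.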
Performance – Double precision FP (single core)
Performance normalized to Tegra2 board (single core) – Benefits due to increased frequency and improved micro-architecture – Better memory technology gives additional performance benefits
Performance – Double precision FP (multi-core)
Performance normalized to Tegra2 board (OpenMP) – Threads used: 2 in Tegra2, 4 in Tegra3, 2 in Exynos – New generation of ARM cores shows benefits despite smaller number of cores being used
Energy Efficiency – Double precision FP (multi-core)
Energy efficiency in GOps/Watt – Gains proportional to improvements in execution time – Increased frequency and complex architecture do add power… – …but “overhead” is too large for this to have an effect
Results – single core summary
[Figure: two bar charts, Double Precision and Single Precision, showing speedup (0 to 5) for Tegra 2, Tegra 3, and Exynos 5 Dual.]
Results as expected – T3 faster than T2 – frequency increase – Exynos faster than T3 – uArch improvements
ARMv8 architecture Cortex-A57 64-bit processor – Improved performance in all workloads • Up to 3x announced at same power budget
– Interoperability with ARM Mali GPUs
Double-precision floating-point performance – DP support in the NEON instruction set • 128-bit words
– Should double performance wrt Cortex-A15
big.LITTLE processing – Cortex-A53: low-performance, low-power companion to the A57 – Can pair A57s with A53s
What if…
[Figure: Double Precision (single core) speedup (0 to 6) for Tegra 2, Tegra 3, Exynos 5 Dual, and projected ARMv8.]
ARMv8 could improve DP FP performance – We keep the same frequency for our projection (pessimistic) – Increase the DP capability 2x (optimistic?) – Just a speculation, but could happen…
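The projection rests on two stated assumptions: the clock stays the same and double-precision capability doubles. A small sketch of that scaling, applied to the per-core peak of the Cortex-A15 in the Exynos 5 Dual (1.7 GHz, 1 FMA per cycle), is shown below. The function name and its parameters are illustrative, and the result is a peak-rate estimate rather than a measured speedup.

```python
def project_peak(base_gflops: float, freq_ratio: float = 1.0,
                 dp_ratio: float = 2.0) -> float:
    """Scale a per-core peak by a frequency ratio and a DP-throughput ratio.

    Defaults mirror the two assumptions of the projection: same frequency
    (freq_ratio = 1) and doubled DP capability (dp_ratio = 2).
    """
    return base_gflops * freq_ratio * dp_ratio


# Cortex-A15 per-core peak: 1.7 GHz x 1 FMA/cycle x 2 FLOPs per FMA.
a15_core_gflops = 1.7 * 1 * 2
armv8_core_gflops = project_peak(a15_core_gflops)

print(f"Cortex-A15 core: {a15_core_gflops:.1f} GFLOPS (peak)")
print(f"Projected ARMv8 core: {armv8_core_gflops:.1f} GFLOPS (peak)")
```

Raising `freq_ratio` above 1 would model the less pessimistic case where the new core also clocks higher.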
What about current HPC CPUs?
[Figure: Double Precision (single core) speedup (0 to 16) for Tegra 2, Tegra 3, Exynos 5 Dual, ARMv8, and Sandy Bridge, with gap annotations of 4.7x and 2.3x.]
Comparison with single core of Intel SandyBridge-EP E5-2670 @ 2.6 GHz
Current “mobile champion” still far away – Next generation mobile CPUs could bring significant improvements – Also, back in time, SX5 was significantly faster than Pentium II…
The Killer Mobile Processors™
[Figure: MFLOPS (log scale, 100 to 1,000,000) vs. year (1990 to 2015). Series: Alpha, Intel, AMD, Nvidia Tegra, Samsung Exynos, and a 4-core ARMv8 at 1.5 GHz.]
History may be about to repeat itself … – Mobile processors are not faster … – … but they are significantly cheaper and greener
Then and now Then: Vector vs Commodity
Now: Commodity vs Mobile
Today's situation looks very familiar – “Mobile vs. Server” similar to “Server vs. Vector” – Significantly lower cost of mobile CPUs (thousands vs hundreds of $) – Same programming model, larger scale • Will need more parallelism (probably less than one order of magnitude)
Of course, this does not prove anything – Mobile CPUs will become a viable alternative, but there's no guarantee that they will make it to mainstream HPC systems
Outline A little bit of history – From vector CPUs to commodity components
Killer-mobile processors – Overview of current trends for mobile CPUs
Our experiences – Low-power prototypes in BSC
BSC ARM-based prototype roadmap
[Figure: GFLOPS/W vs. year (2011 to 2014). Tibidabo: ARM multicore. Pedraforca: ARM + GPU. Integrated ARM + GPU.]
Prototypes are critical to accelerate software development – System software stack + applications
Tibidabo: The first ARM multicore cluster
Q7 module (Tegra 2) – 2 x Cortex-A9 @ 1 GHz – 2 GFLOPS – 5 Watts (?) – 0.4 GFLOPS/W
Q7 carrier board – 2 x Cortex-A9 – 2 GFLOPS – 1 GbE + 100 MbE – 7 Watts – 0.3 GFLOPS/W
1U rackable blade – 8 nodes – 16 GFLOPS – 65 Watts – 0.25 GFLOPS/W
2 racks – 32 blade containers – 256 nodes – 512 cores – 9 x 48-port 1 GbE switch – 512 GFLOPS – 3.4 kW – 0.15 GFLOPS/W
Proof of concept – It is possible to deploy a cluster of smartphone processors
Enable software stack development
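The Tibidabo numbers show efficiency dropping at each packaging level as boards, blades, and switches add power without adding FLOPS. The sketch below recomputes GFLOPS/W and the power per node at each level from the quoted figures. The level labels and variable names are illustrative, and the 5 W module figure is marked as uncertain on the slide itself.

```python
# (label, peak GFLOPS, watts, nodes) at each packaging level, from the slide.
LEVELS = [
    ("Q7 module (Tegra 2)", 2.0, 5.0, 1),       # 5 W is marked "(?)" on the slide
    ("Q7 carrier board", 2.0, 7.0, 1),
    ("1U blade", 16.0, 65.0, 8),
    ("Full cluster", 512.0, 3400.0, 256),
]


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency in GFLOPS/W."""
    return gflops / watts


for label, gflops, watts, nodes in LEVELS:
    eff = gflops_per_watt(gflops, watts)
    per_node = watts / nodes
    print(f"{label}: {eff:.2f} GFLOPS/W, {per_node:.1f} W per node")
```

Read this way, power per node grows from 5 W at the module to about 13 W in the full cluster, so well over half of the cluster's power goes to components other than the compute module.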
Tibidabo: scalability and energy efficiency HPC applications scale out of the box on Tibidabo – Strong scaling depends on the size of input set
HPL – good weak scaling – 120 MFLOPS/Watt
Specfem3D – Improvements over x86 cluster in energy efficiency (up to 3x)
D. Göddeke et al., “Energy-efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster”, Journal of Computational Physics
Pedraforca: ARM+GPU cluster Stage One – Test cluster of CARMA kits • Tegra3 SoC • Quadro 1000M
– 1 GbE interconnect
Stage Two – ARM multicore SoC (NVIDIA) – NVIDIA GPU
In progress…
Mont-Blanc project goals To develop a European Exascale approach Based on embedded power-efficient technology
Objectives – Develop a first prototype system, limited by available technology – Design a Next Generation system, to overcome the limitations – Develop a set of Exascale applications targeting the new system
Mont-Blanc prototype Exynos 5 Dual – Integrated CPU + GPU – Dual core Cortex-A15 + ARM Mali T604 GPU
Integrated GPU has many advantages – Shared memory with CPU • Even cache coherent!
– No power wasted on PCIe bus – No power wasted on GDDR5 memory – Higher energy efficiency + lower cost
High density packaging architecture Standard BullX blade enclosure Multiple compute nodes per blade – Additional level of interconnect, on-blade network
Deployment expected later this year
Are we building BlueGene again? Yes ... – Exploit Pollack's Rule in presence of abundant parallelism • Many small cores vs. Single fast core
... and No – Heterogeneous computing • On-chip GPU
– Commodity vs. Special purpose • Higher volume • Many vendors • Lower cost
– Lots of room for improvement • No SIMD / vectors yet ...
– Build on Europe's embedded strengths
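Pollack's Rule says that single-core performance grows roughly with the square root of core complexity (area). Under that rule, and assuming perfectly parallel work, splitting a fixed area budget into many small cores raises aggregate throughput. The sketch below works through that argument; the function names and the area units are illustrative, and real workloads will fall short of the ideal because of serial sections and communication.

```python
import math


def pollack_perf(area: float) -> float:
    """Single-core performance under Pollack's Rule: perf ~ sqrt(area)."""
    return math.sqrt(area)


def throughput(total_area: float, n_cores: int) -> float:
    """Aggregate throughput of n equal cores sharing total_area.

    Assumes the workload is perfectly parallel, which is the
    'abundant parallelism' condition.
    """
    return n_cores * pollack_perf(total_area / n_cores)


budget = 16.0  # arbitrary area units
one_big = throughput(budget, 1)
many_small = throughput(budget, 16)
print(f"1 big core: {one_big:.1f}, 16 small cores: {many_small:.1f}")
print(f"Gain from splitting: {many_small / one_big:.1f}x")
```

In general the gain from splitting one core into n equal cores is sqrt(n), which is why the approach only pays off when the application can keep all the small cores busy.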
Conclusions Commodity vs Mobile
Vector vs Commodity
Killer mobile processors – Not yet there, but getting very close
We will see a supercomputer with mobile SoCs soon – Mont-Blanc prototype @ BSC – Question is if it will become mainstream
www.montblanc-project.eu
MontBlancEU
@MontBlanc_EU