Multi-Core-Architectures for Numerical Simulation

H. Köstler, J. Habich, J. Götz, M. Stürmer, S. Donath, T. Gradl, D. Ritter, C. Feichtinger, K. Iglberger (LSS Erlangen and RRZE)

U. Rüde (LSS Erlangen, [email protected])
In collaboration with RRZE and many more
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de

Siemens Simulation Center, 11 November 2009

Overview
• Intro: Who we are
• How fast are computers today?
• Technological Trends: GPUs, Cell, and others
• Example: Flow Simulation with Lattice Boltzmann Methods
• Computational Haemodynamics using the PlayStation
• Conclusions


The LSS Mission
Development and Analysis of Computer Methods for Applications in Science and Engineering
[Diagram: LSS at the intersection of Computer Science, Mathematics, and Applications from the Physical and Engineering Sciences]

Who is at LSS (and does what?)

Complex Flows
• K. Iglberger
• C. Feichtinger
• K. Pickl
• S. Donath
• S. Bogner
• C. Mihoubi
• J. Götz
• S. Ganguly

Supercomputing
• J. Götz
• T. Gradl
• B. Gmeiner
• M. Stürmer
• S. Geißelsöder
• F. Deserno
• D. Ritter

Numerical Algorithms
• H. Köstler
• T. Dreher
• D. Bartuschat
• Dr. W. Degen
• S. Strobl
• T. Preclik
• Li Yi

Laser Simulation
• Prof. Dr. C. Pflaum
• B. Berneker
• Kai Hertel
• M. Wohlmuth
• J. Werner
• C. Jandl

Alumni
• Prof. G. Horton (Univ. of Magdeburg)
• Prof. El Mostafa Kalmoun (Cadi Ayyad University, Morocco)
• Dr. M. Kowarschik (Siemens Health Care)
• Dr. M. Mohr (Geophysik, TU München)
• Dr. F. Hülsemann (EDF, Paris)
• Dr. B. Bergen (Los Alamos, USA)
• Dr. N. Thürey (ETH Zürich)
• Dr. J. Härdtlein (Bosch GmbH)
• C. Möller (Navigon)
• Dr. U. Fabricius (Elektrobit)
• Dr. Th. Pohl (Siemens Health Care)
• J. Treibig (RRZE)
• C. Freundl (YAGER Development)

How much is a PetaFlops?
• 10^6 = 1 MegaFlops: Intel 486, 33 MHz PC (~1989)
• 10^9 = 1 GigaFlops: Intel Pentium III, 1 GHz (~2000)
  If every person on earth does one operation every 6 seconds, all humans together have a performance of 1 GigaFlops (less than a current laptop).
• 10^12 = 1 TeraFlops: HLRB-I, 1344 processors, ~2000
• 10^15 = 1 PetaFlops: >100,000 processor cores; Roadrunner/Los Alamos, June 2008
• If every person on earth runs a 486 PC, we all together have an aggregate performance of 6 PetaFlops (see the arithmetic below).
[Photos: HLRB-I: 2 TFlops; HLRB-II: 63 TFlops]
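Both population claims check out with back-of-the-envelope arithmetic (my calculation, not on the slides, assuming roughly 6 x 10^9 people, the approximate 2009 world population):

    $6 \times 10^{9}\ \text{people} \times \tfrac{1\ \text{op}}{6\ \text{s}} = 10^{9}\ \text{ops/s} = 1\ \text{GigaFlops}$

    $6 \times 10^{9}\ \text{people} \times 10^{6}\ \text{Flops each} = 6 \times 10^{15}\ \text{Flops} = 6\ \text{PetaFlops}$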

Where Does Computer Architecture Go?
• Computer architects have capitulated: it may no longer be possible to exploit progress in semiconductor technology for automatic performance improvements.
• Even today a single-core CPU is a highly parallel system: superscalar execution, complex pipelines, and additional tricks.
• This internal parallelism is a major reason for the performance increases until now, but there is only a limited amount of parallelism that can be exploited automatically.
• Multi-core systems concede the architects' defeat: architects can no longer build faster single-core CPUs out of more transistors, and clock rates increase only slowly (due to power considerations).
• Therefore architects have started to put several cores on a chip: programmers must use them directly.


What are the consequences?
• For application developers, "the free lunch is over": without explicitly parallel algorithms, the performance potential can no longer be used.
• It will become increasingly important to use instruction-level parallelism, such as vector units (see the sketch after this list).
• For performance-critical applications: CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores, maybe sooner than we are ready for it.
• At the high end we will have to deal with systems with millions of cores.
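As a minimal illustration of what using a vector unit looks like (not from the slides): the scaling loop below, written with x86 SSE intrinsics, processes four floats per instruction. The function and array names are made up for this sketch.

    #include <xmmintrin.h>  /* SSE intrinsics, available on x86 CPUs since ~1999 */

    /* Scale n floats (n divisible by 4, data 16-byte aligned) by a constant.
       Each _mm_mul_ps multiplies four floats at once: the vector unit at work. */
    void scale_sse(float *data, int n, float factor)
    {
        __m128 v_fac = _mm_set1_ps(factor);               /* broadcast factor to 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(&data[i]);             /* load 4 floats */
            _mm_store_ps(&data[i], _mm_mul_ps(v, v_fac)); /* multiply and store */
        }
    }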

Trends in Computer Architecture
• On-chip parallelism for everyone: instruction level, SIMD-like vectorization, multi-core (with caches or local memory)
• Off-chip parallelism for large-scale parallel systems
• Accelerator hardware: GPUs, Cell processor
• Limits to clock rate, memory bandwidth, and memory latency


Multi-Core Activities at LSS
Architectures:
• IBM Cell
• GPU (Nvidia, AMD/ATI)
• Conventional multi-core architectures (Intel, AMD)
Applications:
• Finite elements / multigrid methods for PDEs
• Flow solvers (LBM methods)
• Image processing, medical engineering applications
• 3D real-time simulation for industrial control
See papers, reports, Master's and Bachelor's theses:
http://www10.informatik.uni-erlangen.de/Publications/

Multi-Core Architectures
• IBM-Sony-Toshiba Cell Processor
• GPU: Nvidia or AMD/ATI


The STI Cell Processor
• Hybrid multi-core processor based on the IBM Power architecture (simplified)
• PowerPC core (PPE): runs the operating system and controls the execution of programs
• Multiple co-processors (SPEs; 8 on the chip, only 6 available on the Sony PS3): operate on fast, private on-chip memory, optimized for computation, vectorization via a "float4" data type
• DMA controller copies data from/to main memory:
  - multi-buffering can hide main memory latencies completely for streaming-like applications (see the sketch below)
  - loading local copies has low and known latencies
• Memory with multiple channels and banks can be exploited if many memory transactions are in flight
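A minimal sketch of the multi-buffering idea, assuming 16 KB chunks and a hypothetical compute() kernel; the DMA calls (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) are the real spu_mfcio.h primitives:

    #include <spu_mfcio.h>

    #define CHUNK 16384                    /* 16 KB = largest single DMA transfer */

    extern void compute(volatile char *);  /* hypothetical per-chunk kernel */

    volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    /* Stream nchunks chunks from main memory address ea through the SPE.
       While chunk i is being processed, the DMA for chunk i+1 is already
       in flight on the other buffer/tag, hiding main memory latency. */
    void process_stream(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);   /* prefetch first chunk, tag 0 */
        for (int i = 0; i < nchunks; ++i) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                   /* issue next DMA before waiting */
                mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);          /* select only the current tag */
            mfc_read_tag_status_all();             /* block until this chunk arrived */
            compute(buf[cur]);                     /* overlapped with the next DMA */
            cur = next;
        }
    }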


IBM Cell Processor
Available Cell systems:
• Roadrunner
• Blades
• PlayStation 3


Cell Architecture: 9 cores on a chip


GPUs
• Massively parallel: SIMD-like execution on several hundred compute units
• Typical performance values (Nvidia Fermi, soon): 2.7 TFlops single precision possible, 630 GFlops double precision
• 4+ GByte memory on board
• 150+ GByte/s memory bandwidth
• Additionally, vectorization in "warps" (16 floats)


ATI Radeon HD 4870
• Cost: 150 €
• Interface: PCI-E 2.0 x16
• Shader clock: 750 MHz
• Memory clock: 900 MHz
• Memory bandwidth: 115 GB/s
• FLOPS: 1200 GFLOPS
• Max power draw: 160 W
• Framebuffer: 1024 MB
• Memory bus: 256 bit
• Shader processors: 800


Nvidia GeForce GTX 295
• Cost: 450 €
• Interface: PCI-E 2.0 x16
• Shader clock: 1242 MHz
• Memory clock: 999 MHz
• Memory bandwidth: 2x112 GB/s
• FLOPS: 2x894 GFLOPS
• Max power draw: 289 W
• Framebuffer: 2x896 MB
• Memory bus: 2x448 bit
• Shader processors: 2x240
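The peak FLOPS figures on these two slides follow from shader count x shader clock x floating-point operations per cycle. A quick consistency check (my arithmetic, not on the slides; the GTX 295's shaders can dual-issue a MADD plus a MUL, i.e. 3 Flops/cycle, while the HD 4870's shaders issue a MADD, i.e. 2 Flops/cycle):

    $\text{HD 4870: } 800 \times 0.750\,\text{GHz} \times 2 = 1200\ \text{GFLOPS}$

    $\text{GTX 295: } 240 \times 1.242\,\text{GHz} \times 3 \approx 894\ \text{GFLOPS per GPU (two GPUs per card)}$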


GPU: AMD Stream Processor


AMD Stream Architecture (cont'd)

ATI Radeon 3870 (RV670) / FireStream 9170


Example 1: Flow Simulation on Cell


LBM Optimized for Cell
Memory layout:
• optimized for DMA transfers
• information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange
Code optimization:
• kernels hand-optimized in assembly language
• SIMD-vectorized streaming and collision
• branch-free handling of bounce-back boundary conditions (see the sketch after this list)
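A minimal sketch of what branch-free bounce-back can look like with SPU intrinsics; the function and argument names are illustrative, not from the actual kernels (which, per the slide, are hand-written assembly):

    #include <spu_intrinsics.h>

    /* For each of the four SIMD lanes, keep the streamed-in distribution value
       where the neighbor cell is fluid, or the bounced-back (opposite-direction)
       value where it is solid: a bitwise select via spu_sel instead of an if/else. */
    vector float stream_or_bounce(vector float f_streamed,   /* value from fluid neighbor */
                                  vector float f_bounced,    /* reflected value */
                                  vector unsigned int solid) /* all-ones lanes mark solid */
    {
        return spu_sel(f_streamed, f_bounced, solid);  /* no branch, no pipeline flush */
    }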


Simulation of Metal Foams
• Free surface flows
• Applications: engineering (metal foam simulations), computer graphics (special effects)
• Based on LBM: mesoscopic approach to solving the Navier-Stokes equations, good for complex boundary conditions
• Details: D3Q19 model, BGK collision (written out below), and grid compression
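For reference, the standard form consistent with the D3Q19/BGK setup named above: each cell stores 19 distribution functions $f_i$, and the BGK collision relaxes them toward a local equilibrium with a single relaxation time $\tau$:

    $f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\ t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left(f_i(\mathbf{x}, t) - f_i^{eq}(\rho, \mathbf{u})\right), \qquad i = 0, \dots, 18$

where the $\mathbf{e}_i$ are the 19 discrete velocities and the equilibrium $f_i^{eq}$ depends only on the local density $\rho$ and velocity $\mathbf{u}$.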


Performance Results
LBM performance on a single core (8x8x8 channel flow)
[Bar chart comparing straight-forward C code vs. SIMD-optimized assembly on Xeon 5160, PPE, and SPE*; bar labels visible in the chart: 2.0, 4.8, 10.4, and 49.0 (axis 0-50)]
*on Local Store without DMA transfers

Performance Results
[Bar chart: performance over x = 1 to 6, presumably the number of active SPEs; bar labels: 42, 81, 93, 94, 94, 95 (axis 30-100), i.e. scaling saturates after about three SPEs]

Performance Results
[Bar chart comparing Xeon 5160* and PlayStation 3, with "1 core" and "1 CPU" groupings; bar labels visible in the chart: 9.1, 11.7, 21.1, and 43.8 (axis 0-50)]
*performance optimized code by LB-DC

Programming the Cell-BE
The hard way (see the PPE-side sketch after this list):
• control SPEs using management libraries
• issue DMAs via language extensions
• do address calculations manually
• exchange main memory addresses, array sizes, etc.
• synchronization using mailboxes, signals, or libraries
Frameworks:
• Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM
• RapidMind SDK
• accelerated libraries
• single-source compiler: IBM's xlc-cbe-sse, uses OpenMP
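A rough illustration of the "hard way" on the PPE side, using IBM's libspe2 management library; the embedded SPU program handle spu_scale and the argp/envp argument passing are assumptions for this sketch, not the slides' actual code:

    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t spu_scale;  /* hypothetical embedded SPU binary */

    float data[1024] __attribute__((aligned(128)));  /* vector in main memory */

    int main(void)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL); /* one SPE context */

        spe_program_load(ctx, &spu_scale);  /* load the SPU executable */

        /* argp/envp are how main memory addresses, sizes etc. reach the SPE:
           here, the vector's address and the number of 32-float chunks. */
        if (spe_context_run(ctx, &entry, 0, data, (void *)(1024 / 32), NULL) < 0)
            perror("spe_context_run");

        spe_context_destroy(ctx);
        return 0;
    }

Note that spe_context_run blocks the calling thread, so real codes typically spawn one pthread per SPE context to keep several SPEs busy at once.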


Naive SPU implementation: A[] = A[]*c

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    volatile vector float ls_buffer[8] __attribute__((aligned(128)));

    void scale(unsigned long long gs_buffer,  /* main memory address of vector */
               int number_of_chunks,          /* number of chunks of 32 floats */
               float factor)                  /* scaling factor */
    {
        vector float v_fac = spu_splats(factor);  /* create SIMD vector with all
                                                     four elements being factor */
        for (int i = 0; i < number_of_chunks; ++i) {
            mfc_get(ls_buffer, gs_buffer, 128, 0, 0, 0); /* DMA reading i-th chunk */
            mfc_write_tag_mask(1 << 0);                  /* select DMA tag 0 ...   */
            mfc_read_tag_status_all();                   /* ... and wait for it    */
            for (int j = 0; j < 8; ++j)                  /* scale 8 vectors = 32 floats */
                ls_buffer[j] = spu_mul(ls_buffer[j], v_fac);
            mfc_put(ls_buffer, gs_buffer, 128, 0, 0, 0); /* DMA writing chunk back */
            mfc_read_tag_status_all();                   /* wait before buffer reuse */
            gs_buffer += 128;                            /* advance 128 B = 32 floats */
        }
    }
