Programming Techniques for Supercomputers

Prof. Dr. G. Wellein(a,b), Dr. G. Hager(a), M. Wittmann(a)

(a) HPC Services – Regionales Rechenzentrum Erlangen
(b) Department für Informatik, University Erlangen-Nürnberg

Sommersemester 2016

Audience & Contact

- Audience
  - Computational Engineering, Computer Science
  - Physics, Engineering, Materials Science, …

- Contact:
  - Gerhard Wellein: [email protected], 09131 85 28136
  - Georg Hager: [email protected], 09131 85 28973
  - Markus Wittmann: [email protected], 09131 85 20104


Organization & Format

- Lecture/tutorial is completely documented in our Moodle LMS:
  - http://goo.gl/7fueKj or
  - http://moodle.rrze.uni-erlangen.de/moodle/course/view.php?id=346
  - Please enroll in the lecture and specify your matriculation number
  - Homework assignments, announcements etc. are all handled via Moodle

- 4 hours of lecture:
  - Monday, 14:15 – 15:45 in E1.12
  - Thursday, 10:15 – 11:45 in E1.12

- Please interrupt and ask questions!


Organization and Format  2 hours of tutorial:  Wed. 10:15 – 11:45 at 0.01-142 OR  Thur. 12:15 – 13:45 at 01.153-113 

Exercise "sheets" (homework) available every Wednesday in Moodle



Exercises start NEXT WEEK



You also need CIP pool accounts (ask CIP admins!)



First tutorials (next week): Intro to systems handling (logging in via SSH, X forwarding, using compilers, batch jobs) of RRZE cluster


Format of course  Schein:  Lecture only: 5 ECTS  Oral Exam of material covered in the lecture  Register in “meincampus”

 Lecture & Exercises: (5 + 2,5) ECTS  Oral Exam of material covered in lecture AND excercises  Register for lecture AND exercise in “meincampus”

 Exam dates: Mid of July / Beginning of October

 Prerequisite for exercises:  Basic programming knowledge in C/C++ or FORTRAN  Using LINUX / UNIX OS environments


Scope of course

Ability to write efficient parallel programs for (super)computers

- Introduction to the architecture of
  - Single core/processor → x86_64 based architectures
  - Multi-core processors → x86_64 multi-cores
  - Shared memory nodes → single node (RRZE)
  - Distributed memory computers → compute clusters (RRZE) and MPP (IBM BlueGene, CRAY series)
  - GPU / accelerator → nVIDIA / Intel Xeon Phi

- Efficient programming and optimization strategies
- Concepts, potentials & pitfalls of parallel computing
- Shared memory parallel programming → OpenMP
- Distributed memory parallel programming → MPI
- Hybrid programming → MPI+OpenMP

Performance analysis & modeling throughout all topics…


Scope of the course

- Single core:
  - Introduction → colored slides, …
  - Performance: measuring & reporting, standard benchmarks: kernels & more
  - Architecture: pipelining, superscalarity, SIMD, memory hierarchy
  - Code transformations and optimization techniques

- Foundations of parallel processing

- Parallel processing (1):
  - Multi-core – parallel processing for the masses
  - Shared-memory system architectures & programming techniques
  - multi-core, multi-socket, multi-everything, UMA, ccNUMA, …

- Parallel processing (2):
  - Distributed-memory system architectures & programming techniques
  - networks, clusters, MPI, …

- Parallel processing (3):
  - Hybrid programming techniques: MPI + OpenMP

- Parallel processing (4):
  - GPU: nVIDIA & CUDA
  - Intel Xeon Phi

Performance analysis and modeling runs throughout all of these topics.

Scope of the course

    !$OMP PARALLEL DO
    do k = 1 , Nk
      do j = 1 , Nj
        do i = 1 , Ni
          y(i,j,k) = b * ( x(i-1,j,k) + x(i+1,j,k)  &
                         + x(i,j-1,k) + x(i,j+1,k)  &
                         + x(i,j,k-1) + x(i,j,k+1) )
        enddo
      enddo
    enddo
    !$OMP END PARALLEL DO

- Parallelize (the OpenMP directives around the outer loop)
- Establish a limit: simple performance model
- Single core performance optimization
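As a hedged illustration of what such a simple performance model might look like (the bandwidth number below is an assumption for illustration, not a measurement from the lecture): each update performs 5 additions and 1 multiplication, i.e. 6 flops, and even with perfect cache reuse it must transfer at least one 8-byte load of x plus an 8-byte store and an 8-byte write-allocate of y, roughly 24 bytes per update. An attainable memory bandwidth $b_S$ then limits the performance to

$$
P \le \frac{b_S}{B_C}, \qquad
B_C \approx \frac{24\ \mathrm{B}}{6\ \mathrm{flops}} = 4\ \mathrm{B/flop}
\;\Rightarrow\;
P \lesssim \frac{40\ \mathrm{GB/s}}{4\ \mathrm{B/flop}} = 10\ \mathrm{GFlop/s}
$$

for an assumed socket bandwidth of $b_S = 40$ GB/s. Models of exactly this kind are developed in detail later in the course.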


Supporting material  Books:  G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series, 2010. ISBN 978-1439811924  see moodle for a very early version  10 copies are available in the library  discounted copies – ask us

 J. Hennessy and D. Patterson: Computer Architecture. A Quantitative Approach. Morgan Kaufmann Publishers, Elsevier, 2003. ISBN 1-55860724-2  W. Schönauer: Scientific Supercomputing. (cf. http://www.rz.uni-karlsruhe.de/~rx03/book/)


Supporting material  Documentation:    

http://www.openmp.org http://www.mpi-forum.org http://developer.intel.com/products/processor/manuals http://developer.amd.com/documentation/guides

 The big ones and more useful HPC related information:  http://www.top500.org


Related teaching activities

- Regular seminar on "Efficient numerical simulation on multicore processors" (MuCoSim)
  - 5 ECTS
  - 2 hrs per week
  - 2 talks + written summary
  - Topics from code optimization, code parallelization and code benchmarking on the latest multicore/manycore CPUs and GPUs
  - This semester: Tuesday 16:00 – 17:30, E-Studio RRZE (2.037)


Introduction (1)
The Big Ones and the workhorses

Supercomputer – a good definition?!

- "Supercomputer is a computer that is only one generation behind what large-scale users want."
  (Neil Lincoln, architect for the CDC Cyber 205 and others)
- A supercomputer does not fit under the desktop!
- Absolute, raw compute power is not a reasonable measure…

- Assume:
  - The computer is being used for numerical simulation
  - The compute power of a system is measured in floating-point operations (MULT, ADD) for a specific numeric benchmark

- → TOP500 list


Most powerful computers in the world: TOP500

- TOP500: survey of the 500 most powerful supercomputers
  - http://www.top500.org
  - Solve a large dense system of linear equations: A x = b ("LINPACK")
  - Published twice a year (ISC in Germany, SC in the USA)
  - Established in 1993 (CM5/1024): 60 GFlop/s (Top 1) – roughly today's laptop
  - Since Nov. 2013 (Tianhe-2): 33,800,000 GFlop/s (Top 1)
  - Performance increase: 95% p.a. over almost two decades (1993 – 2013)!

- Performance measure: MFlop/s, GFlop/s, TFlop/s, PFlop/s
  - Number of FLOATING-POINT operations per second
  - FLOATING-POINT operations: double-precision (64-bit) Add & Mult operations
  - 10^6: MFlop/s; 10^9: GFlop/s; 10^12: TFlop/s; 10^15: PFlop/s; 10^18: EFlop/s
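A quick sanity check of the quoted growth rate (my arithmetic, not part of the original slide): going from 60 GFlop/s (1993) to 33,800,000 GFlop/s (2013) is a factor of roughly $5.6 \times 10^5$ in 20 years,

$$
\left(\frac{3.38 \times 10^{7}}{60}\right)^{1/20} \approx 1.94,
$$

i.e. close to a doubling every year, consistent with the stated ≈95% per year.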


TOP10 as of November 2013

- Rpeak: theoretical peak performance
- Rmax: LINPACK performance

- Challenges:
  - Extreme parallelism: 3.1 × 10^6 cores
  - Power consumption: 17.8 MW (1 MW ≈ 1.5 million € p.a.)
  - Efficiency (Rmax/Rpeak): 0.60, …, 0.93
  - Heterogeneous hardware

Source: www.top500.org
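To put the power number into perspective (simple arithmetic based on the two figures above):

$$
17.8\ \mathrm{MW} \times 1.5\ \frac{\text{million euros}}{\mathrm{MW}\cdot\mathrm{year}} \approx 27\ \text{million euros per year}
$$

for electricity alone, for the top system of that list.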

List 1 (June 1993) to 41 (June 2013) – performance projection

[Figure: TOP500 performance development and projection on a logarithmic scale (100 MFlop/s to 100 EFlop/s) over the years 1994 – 2020, with curves for the sum over all 500 systems (SUM), the #1 system (N=1) and the #500 system (N=500); the N=500 curve trails the N=1 curve by about 6 – 8 years.]

By courtesy of Hans Meuer, ISC '13 in Leipzig

TOP10 as of November 2015

- Performance increase / technology change has slowed down considerably…
- Only 3 new entries in 2 years: CRAY XC40 systems

Source: www.top500.org


Latest “linear” projection – Nov. 2015

Performance increase slows down considerably in all parts of the TOP500 list!


HPC centers in Germany: a view from Erlangen

- FZ Jülich – Jülich Supercomputing Centre: BlueGene/Q, 5.8 PFlop/s
- LRZ München: IBM cluster, 2 × 3 PF/s
- HLRS Stuttgart: 7.4 PF/s (CRAY XC40)
- Erlangen/Nürnberg: 0.2 PF/s
- Further sites on the map: Hannover, Berlin

SuperMUC – LRZ Garching: TOP 4 (June 2012)

- Thin nodes:
  - 18 islands with 512 nodes each
  - 2 Intel Xeon E5-2680 processors (8 cores @ 2.7 GHz baseline) per node
  - 147,456 cores
  - 3.2 PF/s peak
  - 2.9 PF/s LINPACK

- Fat nodes:
  - 1 island: 205 nodes
  - 4 Intel Xeon E7-4870 (10 cores) per node
  - 256 GB/node

- Total power consumption: 2.5 MW – 3 MW
- Upgrade to 6 PF/s (peak) in 2014 (add nodes with Intel Haswell processors)
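The 3.2 PF/s peak figure can be reconstructed with a back-of-the-envelope estimate (assuming 8 double-precision flops per core and cycle for these Sandy Bridge cores; this factor is my assumption, not stated on the slide):

$$
147{,}456\ \text{cores} \times 2.7\ \text{GHz} \times 8\ \frac{\text{flops}}{\text{cycle}} \approx 3.2\ \text{PF/s}.
$$

The LINPACK result of 2.9 PF/s then corresponds to roughly 90% of peak.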

SuperMUC – far more than some islands…


RRZE: "Emmy" cluster

- 544 compute nodes (10,880 cores) with
  - 2 Intel Xeon E5-2660v2 (Ivy Bridge) 2.2 GHz (10 cores) → 20 cores/node + SMT cores
  - 64 GB main memory – NO local disks
- 16 accelerator nodes with the same CPUs
  - 8 nodes with 2 × NVIDIA K20 GPGPUs
  - 8 nodes with 2 × Intel Xeon Phi
- Vendor: NEC (dual-twin – Supermicro)
- Peak performance: 234 TFlop/s (all devices)
- LINPACK: 191 TFlop/s (CPUs) → #210 in TOP500 / Nov. 2013
- Power consumption: ~160 kW (backdoor heat exchanger)
- Full quad-data-rate InfiniBand fat tree, BW ~ 3 GB/s per direction and < 2 µs latency
- Parallel filesystem: 400 TB+ (max. 7 GB/s)
- Operating system: Linux
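To get a feeling for the interconnect numbers, a minimal latency/bandwidth sketch (the 1 MB message size is just an illustrative assumption): the time to transfer a message of $N$ bytes is roughly

$$
T(N) \approx T_{\mathrm{lat}} + \frac{N}{B} \approx 2\ \mu\mathrm{s} + \frac{N}{3\ \mathrm{GB/s}},
$$

so a 1 MB message costs about $2\ \mu\mathrm{s} + 333\ \mu\mathrm{s} \approx 335\ \mu\mathrm{s}$; the latency term only dominates for short messages.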


RRZE: "LiMa" cluster

- 500 compute nodes (6,000 cores) with
  - 2 Intel Westmere 2.66 GHz hexa-cores → 12 cores/node + SMT cores
  - 24 GB main memory – NO local disks
- Vendor: NEC (dual-twin – Supermicro)
- Power consumption: ~160 kW
  - Closed racks with cold-water heat exchanger inside
- Full quad-data-rate InfiniBand fat tree interconnect, BW ~ 3 GB/s per direction and < 2 µs latency
- Parallel filesystem: 130 TB+, accessible with 3 GB/s
- Operating system: Linux
- Peak performance: 63.8 TFlop/s (Rpeak) @ 2.66 GHz
- LINPACK (Rmax): 57.3 TFlop/s (#130 in TOP500 / Nov. 2010)
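The quoted Rpeak can be reproduced directly (assuming 4 double-precision flops per core and cycle for the Westmere cores; again my assumption, not stated on the slide):

$$
6{,}000\ \text{cores} \times 2.66\ \text{GHz} \times 4\ \frac{\text{flops}}{\text{cycle}} \approx 63.8\ \text{TFlop/s},
$$

with Rmax/Rpeak = 57.3/63.8 ≈ 90%.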


Prepare computer access: send an email to [email protected] containing your name, IdM account, and student ID number.

Tour through the computer room.

