Programming Techniques for Supercomputers
Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), M. Wittmann (a)
(a) HPC Services – Regionales Rechenzentrum Erlangen (RRZE)
(b) Department of Computer Science, University Erlangen-Nürnberg
Summer semester 2016
Audience & Contact
Audience
Computational Engineering, Computer Science, Physics, Engineering, Materials Science, …
Contact:
Gerhard Wellein: [email protected], phone 09131 85 28136
Georg Hager: [email protected], phone 09131 85 28973
Markus Wittmann: [email protected], phone 09131 85 20104
Organization & Format
Lecture/Tutorial is completely documented in our Moodle LMS: http://goo.gl/7fueKj or http://moodle.rrze.uni-erlangen.de/moodle/course/view.php?id=346
Please enroll in the lecture and specify your matriculation number. Homework assignments, announcements, etc. are all handled via Moodle.
4 hours of lecture per week:
Monday 14:15 – 15:45 in E1.12 AND Thursday 10:15 – 11:45 in E1.12
Please interrupt and ask questions!
Organization & Format
2 hours of tutorial: Wed. 10:15 – 11:45 in 0.01-142 OR Thu. 12:15 – 13:45 in 01.153-113
Exercise "sheets" (homework) available every Wednesday in Moodle
Exercises start NEXT WEEK
You also need CIP pool accounts (ask CIP admins!)
First tutorials (next week): intro to handling the RRZE cluster systems (logging in via SSH, X forwarding, using compilers, batch jobs)
Format of course / "Schein":
- Lecture only: 5 ECTS; oral exam on the material covered in the lecture; register in "meincampus"
- Lecture & exercises: (5 + 2.5) ECTS; oral exam on the material covered in the lecture AND the exercises; register for lecture AND exercise in "meincampus"
Exam dates: mid-July / beginning of October
Prerequisites for the exercises: basic programming knowledge in C/C++ or Fortran; familiarity with Linux/UNIX environments
Scope of course
Ability to write efficient parallel programs for (super)computers
Introduction to the architecture of:
- Single core/processor: x86_64 based architectures
- Multi-core processors: x86_64 multi-cores
- Shared memory nodes: single node (RRZE)
- Distributed memory computers: compute clusters (RRZE) and MPPs (IBM BlueGene, CRAY series)
- GPU / accelerator: NVIDIA GPUs / Intel Xeon Phi
Efficient programming and optimization strategies:
- Concepts, potentials & pitfalls of parallel computing
- Shared memory parallel programming: OpenMP
- Distributed memory parallel programming: MPI
- Hybrid programming: MPI + OpenMP
Performance Analysis & Modeling throughout all topics…
Scope of the course
Single core:
- Introduction: colored slides, …
- Performance: measuring & reporting; standard benchmarks: kernels & more
- Architecture: pipelining, superscalarity, SIMD, memory hierarchy
- Code transformations and optimization techniques
Foundations of parallel processing:
- Parallel processing (1): multi-core – parallel processing for the masses; shared-memory system architectures & programming techniques (multi-core, multi-socket, multi-everything, UMA, ccNUMA, …)
- Parallel processing (2): distributed-memory system architectures & programming techniques (networks, clusters, MPI, …)
- Parallel processing (3): hybrid programming techniques: MPI + OpenMP
- Parallel processing (4): GPU: NVIDIA & CUDA; Intel Xeon Phi
Performance analysis and modeling throughout all topics
Scope of the course – an example:

!$OMP PARALLEL DO
do k = 1, Nk
  do j = 1, Nj ; do i = 1, Ni
    y(i,j,k) = b * ( x(i-1,j,k) + x(i+1,j,k) &
                   + x(i,j-1,k) + x(i,j+1,k) &
                   + x(i,j,k-1) + x(i,j,k+1) )
  enddo ; enddo
enddo
!$OMP END PARALLEL DO

- Establish limit: simple performance model
- Single core performance optimization
- Parallelize (!$OMP PARALLEL DO … !$OMP END PARALLEL DO)
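To illustrate what such a "simple performance model" looks like, here is a minimal sketch using the balance argument developed later in the course; the 40 GB/s memory bandwidth is an assumed, illustrative number, not taken from the slides:
- Work per lattice site update: 6 Flops (5 additions, 1 multiplication)
- Minimum memory traffic per update (assuming all neighbor accesses of x are served from cache): 8 B load of x + 8 B store of y + 8 B write-allocate of y = 24 B
- Code balance: B_c ≈ 24 B / 6 Flops = 4 B/Flop
- Bandwidth-limited estimate with assumed memory bandwidth b_S = 40 GB/s: P ≈ b_S / B_c = 10 GFlop/s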
Supporting material
Books:
- G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series, 2010. ISBN 978-1439811924 (see Moodle for a very early version; 10 copies are available in the library; discounted copies – ask us)
- J. Hennessy and D. Patterson: Computer Architecture. A Quantitative Approach. Morgan Kaufmann Publishers, Elsevier, 2003. ISBN 1-55860-724-2
- W. Schönauer: Scientific Supercomputing. (cf. http://www.rz.uni-karlsruhe.de/~rx03/book/)
Supporting material
Documentation:
- http://www.openmp.org
- http://www.mpi-forum.org
- http://developer.intel.com/products/processor/manuals
- http://developer.amd.com/documentation/guides
The big ones and more useful HPC-related information: http://www.top500.org
Related teaching activities
Regular seminar on "Efficient numerical simulation on multicore processors" (MuCoSim):
- 5 ECTS, 2 hrs per week, 2 talks + written summary
- Topics from code optimization, code parallelization, and code benchmarking on the latest multicore/manycore CPUs and GPUs
- This semester: Tuesday 16:00 – 17:30, E-studio RRZE (2.037)
Introduction (1): The Big Ones and the workhorses
Supercomputer – a good definition?!
"Supercomputer is a computer that is only one generation behind what large-scale users want." (Neil Lincoln, architect for the CDC Cyber 205 and others)
- A supercomputer does not fit under the desk!
- Absolute, raw compute power is not a reasonable measure…
Assume the computer is used for numerical simulation. The compute power of a system is then measured by the floating-point operations (MULT, ADD) performed for a specific numerical benchmark:
TOP500 list
Most powerful computers in the world: TOP500
- TOP500: survey of the 500 most powerful supercomputers (http://www.top500.org)
- Benchmark: solve a large dense system of linear equations A x = b ("LINPACK")
- Published twice a year (ISC in Germany, SC in the USA)
- Established in 1993: CM-5/1024 at 60 GFlop/s (Top 1) – roughly today's laptop
- Since Nov. 2013: Tianhe-2 at 33,800,000 GFlop/s (Top 1)
- Performance increase: 95% p.a. over almost two decades (1993 – 2013)
- Performance measure: MFlop/s, GFlop/s, TFlop/s, PFlop/s – the number of FLOATING POINT operations per second
- FLOATING POINT operations: double precision (64 bit) Add & Mult ops
- 10^6: MFlop/s; 10^9: GFlop/s; 10^12: TFlop/s; 10^15: PFlop/s; 10^18: EFlop/s
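A quick sanity check of the quoted growth rate, using only the two Top-1 numbers above: 60 GFlop/s × 1.95^20 ≈ 60 GFlop/s × 6.3 × 10^5 ≈ 3.8 × 10^7 GFlop/s ≈ 38 PFlop/s – the same order of magnitude as Tianhe-2's 33.8 PFlop/s.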
TOP10 as of November 2013 (Rpeak: theoretical peak performance; Rmax: LINPACK performance)
Challenges:
- Extreme parallelism: 3.1 × 10^6 cores
- Power consumption: 17.8 MW (1 MW ≈ 1.5 million € per year)
- Efficiency (Rmax/Rpeak): 0.60 … 0.93
- Heterogeneous hardware
Source: www.top500.org
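Spelled out with the cost estimate from the slide: 17.8 MW × 1.5 million €/(MW · year) ≈ 27 million € per year for electricity alone.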
[Figure: "List 1 (Jun 1993) to 41 (Jun 2013) – Performance Projection". Log-scale performance (100 Mflop/s … 100 Eflop/s, with the 1 Eflop/s level marked) vs. year (1994 – 2020) for the curves SUM, N=1, and N=500; the N=500 curve trails N=1 by about 6–8 years. By courtesy of Hans Meuer, ISC ’13 in Leipzig.]
TOP10 as of November 2015
Performance increase / technology change has slowed down considerably…
Only 3 new entries in 2 years: CRAY XC40
Source: www.top500.org
Latest “linear” projection – Nov. 2015
Performance increase slows down considerably in all parts of the TOP500 list!
HPC Centers in Germany – a view from Erlangen
[Map with the following sites:]
- FZ Jülich: Jülich Supercomputing Center, BlueGene/Q, 5.8 PFlop/s
- HLRS Stuttgart: 7.4 PF/s (CRAY XC40)
- LRZ München: IBM cluster, 2 × 3 PF/s
- Erlangen/Nürnberg: 0.2 PF/s
- Hannover
- Berlin
SuperMUC – LRZ Garching: TOP 4 (June 2012)
Thin nodes:
- 18 islands with 512 nodes each
- 2 Intel Xeon E5-2680 processors (8 cores @ 2.7 GHz baseline) per node
- 147,456 cores in total
- 3.2 PF/s peak, 2.9 PF/s LINPACK
Fat nodes:
- 1 island: 205 nodes
- 4 Intel Xeon E7-4870 (10 cores) per node, 256 GB/node
Total power consumption: 2.5 MW – 3 MW
Upgrade to 6 PF/s (peak) in 2014 (add nodes with Intel Haswell processors)
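The thin-node peak can be reproduced from the figures above, assuming 8 double-precision Flops per cycle and core (AVX: 4 adds + 4 mults) for this processor generation:
147,456 cores × 2.7 × 10^9 cycles/s × 8 Flops/cycle ≈ 3.19 × 10^15 Flop/s ≈ 3.2 PF/s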
SuperMUC – far more than some islands..
RRZE: "Emmy" cluster
- 544 compute nodes (10,880 cores) with:
  - 2 Intel Xeon E5-2660 v2 (Ivy Bridge) processors, 2.2 GHz, 10 cores each: 20 cores/node (+ SMT threads)
  - 64 GB main memory – NO local disks
- 16 accelerator nodes with the same CPUs:
  - 8 nodes with 2 x NVIDIA K20 GPGPUs
  - 8 nodes with 2 x Intel Xeon Phi
- Vendor: NEC (Dual-Twin – Supermicro)
- Peak performance: 234 TFlop/s (all devices); LINPACK: 191 TFlop/s (CPUs); #210 in TOP500 / Nov. 2013
- Power consumption: ~160 kW (back-door heat exchanger)
- Full quad-data-rate InfiniBand fat-tree interconnect: BW ~3 GB/s per direction, < 2 µs latency
- Parallel filesystem: 400 TB+ (max. 7 GB/s)
- Operating system: Linux
RRZE: "LiMa" cluster
- 500 compute nodes (6,000 cores) with:
  - 2 Intel Westmere hexa-core processors, 2.66 GHz: 12 cores/node (+ SMT threads)
  - 24 GB main memory – NO local disks
- Vendor: NEC (Dual-Twin – Supermicro)
- Power consumption: ~160 kW; closed racks with cold-water heat exchanger inside
- Full quad-data-rate InfiniBand fat-tree interconnect: BW ~3 GB/s per direction, < 2 µs latency
- Parallel filesystem: 130 TB+, accessible at 3 GB/s
- Operating system: Linux
- Peak performance: 63.8 TFlop/s (Rpeak) @ 2.66 GHz; LINPACK (Rmax): 57.3 TFlop/s (#130 in TOP500 / Nov. 2010)
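As a cross-check (assuming 4 double-precision Flops per cycle and core, i.e. SSE with 2 adds + 2 mults, for the Westmere generation):
6,000 cores × 2.66 × 10^9 cycles/s × 4 Flops/cycle ≈ 63.8 × 10^12 Flop/s = 63.8 TFlop/s (Rpeak)
Efficiency: Rmax/Rpeak = 57.3 / 63.8 ≈ 0.90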
Prepare computer access:
- Send an email to [email protected] containing your name, IDM account, and student ID number
- Tour through the computer room