Multi-core processors and multithreading

Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape
Lecture 2

Paweł Szostek, CERN

Inverted CERN School of Computing, 23-24 February 2015

Multi-core processors and multithreading: part 1

ADVANCED TOPICS IN COMPUTER ARCHITECTURES

CPU evolution
- In the past, manufacturers kept increasing the clock frequency
- Transistors were invested into larger caches and more powerful cores
- Since 2005, transistors have been spent on additional cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
- Thermal Design Power (TDP) has stalled at ~150 W

Why does a higher clock speed increase power consumption?
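
A hedged back-of-the-envelope answer to the question above (the standard CMOS dynamic-power model, not taken from the original slides): switching power grows linearly with frequency, but reaching a higher frequency generally also requires a higher supply voltage, so in practice power rises much faster than linearly:

    P_{\mathrm{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f, \qquad V_{dd} \text{ rising roughly with } f \;\Rightarrow\; P_{\mathrm{dyn}} \sim f^{3}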

Interlude: power dissipation
- In the past, there were no power dissipation issues
- Heat density (W/cm³) in a modern CPU approaches the level found in a nuclear reactor [1]
- "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
- This can lead to caveats, see AVX

[1] David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"

Interlude: manufacturing technology

(Figure: a flu virus, roughly 120 nm across, compared with a 14 nm process transistor.)

Simultaneous Multi-Threading
- Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy
- This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.

Simultaneous Multi-Threading (II)
- Solution: idle execution units can be used by a different thread
- SMT is a hardware feature that can be turned on/off in the BIOS
- Most of the hardware resources (including caches) are shared
- Needs a separate fetch unit
- Can both speed up and slow down execution (see next slide)

Simultaneous Multi-Threading (III)
- Workloads from the HEPSPEC06 benchmark
- Many instances of single-threaded processes run in parallel
- Different workloads show different scalability and different reactions to SMT
- Cache utilization is the most important factor in the impact of SMT

Simultaneous Multi-Threading (IV)
- Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
- Example: list or tree traversal
  - the role of the helper thread is to prefetch the data
  - the helper thread runs ahead of the main thread, touching data before the main thread needs it
- Think of it as an interesting example of exploiting the hardware (a sketch follows below)

Source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"
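
A minimal sketch of the helper-thread idea, not taken from the paper above: one thread sums a linked list while a second thread walks the same list and issues prefetch hints. The node layout and the lack of throttling are illustrative simplifications; __builtin_prefetch is a GCC/Clang builtin.

    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Node { long payload; Node* next; };

    // Helper thread: walks the list and prefetches every node into the cache.
    // Naive: no throttling, so it may run arbitrarily far ahead of the main thread.
    void prefetch_list(const Node* p) {
        while (p) {
            __builtin_prefetch(p->next);   // a hint only, never faults
            p = p->next;
        }
    }

    // Main thread: the actual work on the same list.
    long sum_list(const Node* p) {
        long s = 0;
        for (; p; p = p->next) s += p->payload;
        return s;
    }

    int main() {
        std::vector<Node> nodes(1 << 20);
        for (std::size_t i = 0; i < nodes.size(); ++i)
            nodes[i] = { static_cast<long>(i), i + 1 < nodes.size() ? &nodes[i + 1] : nullptr };

        std::thread helper(prefetch_list, nodes.data());  // ideally pinned to the sibling SMT thread
        long s = sum_list(nodes.data());
        helper.join();
        std::cout << s << "\n";
    }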

Non-Uniform Memory Access
- Multi-processor architecture where memory access time depends on the location of the memory with respect to the processor
- Accesses are fast when the memory is "close" to the processor
- There is a performance hit when accessing "foreign" memory
- Lowers the pressure on the memory bus
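
A minimal sketch, not from the slides, of making memory placement explicit with libnuma; the node number and buffer size are illustrative, and the program assumes the libnuma development headers are installed (link with -lnuma).

    #include <numa.h>      // libnuma; link with -lnuma
    #include <cstddef>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {        // returns -1 when the kernel has no NUMA support
            std::puts("NUMA not available");
            return 1;
        }
        const std::size_t size = 64 * 1024 * 1024;
        numa_run_on_node(0);               // keep this thread on node 0
        char* buf = static_cast<char*>(numa_alloc_onnode(size, 0));  // allocate on node 0 as well
        if (!buf) return 1;
        for (std::size_t i = 0; i < size; i += 4096)
            buf[i] = 0;                    // touch the pages so they are actually placed
        numa_free(buf, size);
        return 0;
    }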

Cluster-on-Die
- Problem: with an increasing number of cores, there are more and more concurrent accesses to the shared memories (LLC and RAM)
- Solution: split the memory on one socket into two NUMA nodes

Intel architectural extensions

Extension | Generation/year | Value added
MMX | Pentium MMX/1997 | 64b registers with packed data types, integer operations only
SSE | Pentium III/1999 | 128b registers (XMM), 32b float only
SSE2 | Pentium 4/2001 | SIMD math on any data type
SSE3 | Prescott/2004 | DSP-oriented math instructions
AVX | Sandy Bridge/2011 | 256b registers (YMM), 3-operand instructions
AVX2 | Haswell/2013 | Integer instructions in YMM registers, FMA
AVX512 | Skylake/2016 | 512b registers

Hardware evolves → programmers and compilers need to adapt
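
A small sketch, not from the slides, of how a program can check at run time which of these extensions the CPU actually supports, using GCC's __builtin_cpu_supports builtin:

    #include <cstdio>

    int main() {
        __builtin_cpu_init();              // initialize the CPU feature cache (GCC builtin)
        std::printf("SSE2:    %d\n", __builtin_cpu_supports("sse2"));
        std::printf("AVX:     %d\n", __builtin_cpu_supports("avx"));
        std::printf("AVX2:    %d\n", __builtin_cpu_supports("avx2"));
        std::printf("AVX512F: %d\n", __builtin_cpu_supports("avx512f"));
    }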

Intel extensions example: AVX2
- AVX2 is the latest extension from Intel (at the time of writing)
- Among others, it introduces FMA3, a multiply-accumulate operation with 3 operands ($0 = $0 × $2 + $1), useful for evaluating a polynomial (remember Horner's method?)
- Creative application: the Padé approximant

    R(x) = \frac{a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n}{1 + b_1 x + b_2 x^2 + \dots + b_m x^m}
         = \frac{a_0 + x(a_1 + x(a_2 + \dots + x\,a_n))}{1 + x(b_1 + x(b_2 + \dots + x\,b_m))}

- VDT is a vector math library using Padé approximants, a plug-and-play libm replacement with speed-ups reaching 10x
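
A sketch, not from the slides, of Horner's method mapped onto FMA: a degree-3 polynomial evaluated for 8 floats at once with AVX2/FMA intrinsics. The coefficients are arbitrary and the code assumes compilation with -mavx2 -mfma (or -march=haswell).

    #include <immintrin.h>   // AVX2 + FMA intrinsics

    // Evaluates c0 + c1*x + c2*x^2 + c3*x^3 for 8 floats at once, Horner style.
    __m256 poly3(__m256 x, float c0, float c1, float c2, float c3) {
        __m256 r = _mm256_set1_ps(c3);
        r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c2));  // r = r*x + c2
        r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c1));  // r = r*x + c1
        r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c0));  // r = r*x + c0
        return r;
    }

Dividing two such Horner evaluations (numerator and denominator) yields the Padé approximant R(x) above.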

CPU improvements summary

Common ways to improve CPU performance:

Technique | Advantages | Disadvantages
Frequency scaling | Immediate scaling | Does not work any more (see: dark silicon)
Hyper-threading | Medium overhead, up to 30% performance improvement | Can double the workload's memory footprint, possible cache pollution
Architectural changes | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years
Microarchitectural changes | Transparent for the users | Huge design overhead
More cores | Low design overhead, easy to implement, great scalability | Requires heavily parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"

Multi-core processors and multithreading: part 2

PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE

Concurrency vs. parallelism

Do concurrent (not parallel) programs need synchronization to access shared resources? Why?

Race conditions

What will the value of n be after both threads finish their work?
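
A minimal sketch, not reproducing the original slide's code, of the race being asked about: two threads increment a shared counter without synchronization, so the final value is usually well below the expected 2,000,000.

    #include <iostream>
    #include <thread>

    int n = 0;   // shared and unprotected

    void work() {
        for (int i = 0; i < 1000000; ++i)
            ++n;                      // read-modify-write, not atomic: increments get lost
    }

    int main() {
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        std::cout << n << "\n";       // typically far less than 2000000
    }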

Race conditions (II)

Thread-level parallelism in Python
- C++ parallelism skipped on purpose, already covered at CSC
- Python is not a performance-oriented language, but it can be made less slow
- We can still use the threading module to benefit from parallel I/O operations via threads, relying on the OS
- The example is deferred to the synchronization slides

But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?

Thread-level parallelism in Python (II)
- We can easily run many processes with the multiprocessing package to leverage parallelism, although not very efficiently:
  - high memory footprint
  - no resource sharing
  - every worker is a separate process

    from multiprocessing import Pool   # or: from multiprocessing.dummy import Pool (thread-based)

    def f(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        result = pool.map(f, xrange(10))

CSC Refresher: vector operations
- Problem: all arithmetic operations are executed one element at a time
- Solution: introduce vector operations and vector registers

What is the maximal speed-up from vectorization? Why is it hard to obtain in practice?

Auto-vectorization in gcc
- Vectorization candidates: (inner) loops
- Only works well with more recent gcc versions (>4.6)
- By default, auto-vectorization in gcc is disabled
- There are tens of optimization flags, but it is worth remembering at least a few (an example follows below):
  - -mtune=ARCH, -march=ARCH
  - -O2, -O3, -Ofast
  - -ftree-vectorize
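
A minimal sketch, not from the slides, of a loop that gcc can typically auto-vectorize; the file and function names are illustrative.

    // saxpy.cc -- compile with e.g.: g++ -O3 -march=native -ftree-vectorize -c saxpy.cc
    // __restrict promises that x and y do not alias, which makes vectorization easier.
    void saxpy(float a, const float* __restrict x, float* __restrict y, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   // independent, unit-stride iterations: vectorizable
    }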

Vectorization reports
- The compiler can tell us which loops were not vectorized and why:
  - gcc: -ftree-vectorizer-verbose=[0-9]
  - icc: -vec-report=[0-7]
- A list of vectorizable loops is available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

    Analyzing loop at vect.cc:14
    vect.cc:14: note: not vectorized: control flow in loop.
    vect.cc:14: note: bad loop form.
    vect.cc:6: note: vectorized 0 loops in function.

Intel architectural extensions (II)
- The compiler can produce different versions of the same function for different architectures (so-called automatic CPU dispatch)
- A run-time check is added to the output code
- In ICC, -axARCH can be used instead

GCC:

    __attribute__((target("default")))
    int foo() { return 0; }

    __attribute__((target("sse4.2")))
    int foo() { return 1; }

ICC:

    __declspec(cpu_specific(generic))
    int foo() { return 0; }

    __declspec(cpu_specific(core_i7_sse4_2))
    int foo() { return 1; }

Vectorization in C++
- Possible to use intrinsics, but very cumbersome and "write-only"
- Many libraries approach vectorization, and the choice is not easy
- Example: Agner Fog's Vector Class (a sketch follows after the next slide)

    float a[8], b[8], c[8];
    ...
    for (int i = 0; i < ...

ARM 64-bit
- 64-bit addressing (>4GB)
- RISC architecture
- Common software ecosystem with x86-64, uses the same management standards
- CISC is also expanding in this direction

(Figure: energy efficiency vs. scalability; source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi")
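
A minimal sketch, not from the slides (which show only a fragment), of the Vector Class approach mentioned on the "Vectorization in C++" slide above; it assumes Agner Fog's vectorclass.h header is available. Eight floats are added with a single vector operation.

    #include "vectorclass.h"   // Agner Fog's Vector Class Library (assumed available)

    void add8(const float* a, const float* b, float* c) {
        Vec8f av, bv;
        av.load(a);            // load 8 floats from a
        bv.load(b);            // load 8 floats from b
        Vec8f cv = av + bv;    // one vector addition instead of 8 scalar additions
        cv.store(c);           // store the 8 results into c
    }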

Take-home messages
- Moore's law is doing fine. Transistors will be invested into more cores, bigger caches and wider vectors (512b)
- NUMA and Cluster-on-Die are more "complex stuff" that a programmer has to keep in mind
- Parallelization is possible not only in C++
- Not everything that looks like an improvement gives you better performance (e.g. AVX)
- Multi-threaded applications always require synchronization to protect shared resources
- Auto-vectorization is a speed-up for free
