Multi-core processors and multithreading

Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape
Lecture 2

Paweł Szostek, CERN

Inverted CERN School of Computing, 23-24 February 2015

Multi-core processors and multithreading: part 1

ADVANCED TOPICS IN COMPUTER ARCHITECTURES

CPU evolution
- In the past, manufacturers kept increasing the clock frequency
- Transistors were invested into larger caches and more powerful cores
- Since 2005, transistors have been spent on additional cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
- Thermal Design Power (TDP) has stalled at ~150 W

Why does a higher clock speed increase power consumption?
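
A hedged back-of-the-envelope answer to the question above (the standard CMOS dynamic-power model, not taken from the original slides): switching power grows linearly with frequency, but reaching a higher frequency generally also requires a higher supply voltage, so in practice power rises much faster than linearly:

    P_{\mathrm{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f, \qquad V_{dd} \text{ rising roughly with } f \;\Rightarrow\; P_{\mathrm{dyn}} \sim f^{3}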

Interlude: power dissipation
- In the past, there were no power dissipation issues
- Heat density (W/cm³) in a modern CPU approaches the level found in a nuclear reactor [1]
- "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
- This can lead to caveats, see AVX

[1] David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"

Interlude: manufacturing technology

(Figure: a flu virus, roughly 120 nm across, compared with a 14 nm process transistor.)

Simultaneous Multi-Threading
- Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy
- This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.

Simultaneous Multi-Threading (II)
- Solution: idle execution units can be used by a different thread
- SMT is a hardware feature that can be turned on/off in the BIOS
- Most of the hardware resources (including caches) are shared
- Needs a separate fetch unit
- Can both speed up and slow down execution (see next slide)

Simultaneous Multi-Threading (III)
- Workloads from the HEPSPEC06 benchmark
- Many instances of single-threaded processes run in parallel
- Different workloads show different scalability and different reactions to SMT
- Cache utilization is the most important factor in the impact of SMT

Simultaneous Multi-Threading (IV)
- Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
- Example: list or tree traversal
  - the role of the helper thread is to prefetch the data
  - the helper thread runs ahead of the main thread, touching data before the main thread needs it
- Think of it as an interesting example of exploiting the hardware (a sketch follows below)

Source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"
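
A minimal sketch of the helper-thread idea, not taken from the paper above: one thread sums a linked list while a second thread walks the same list and issues prefetch hints. The node layout and the lack of throttling are illustrative simplifications; __builtin_prefetch is a GCC/Clang builtin.

    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Node { long payload; Node* next; };

    // Helper thread: walks the list and prefetches every node into the cache.
    // Naive: no throttling, so it may run arbitrarily far ahead of the main thread.
    void prefetch_list(const Node* p) {
        while (p) {
            __builtin_prefetch(p->next);   // a hint only, never faults
            p = p->next;
        }
    }

    // Main thread: the actual work on the same list.
    long sum_list(const Node* p) {
        long s = 0;
        for (; p; p = p->next) s += p->payload;
        return s;
    }

    int main() {
        std::vector<Node> nodes(1 << 20);
        for (std::size_t i = 0; i < nodes.size(); ++i)
            nodes[i] = { static_cast<long>(i), i + 1 < nodes.size() ? &nodes[i + 1] : nullptr };

        std::thread helper(prefetch_list, nodes.data());  // ideally pinned to the sibling SMT thread
        long s = sum_list(nodes.data());
        helper.join();
        std::cout << s << "\n";
    }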

Non-Uniform Memory Access
- Multi-processor architecture where memory access time depends on the location of the memory with respect to the processor
- Accesses are fast when the memory is "close" to the processor
- There is a performance hit when accessing "foreign" memory
- Lowers the pressure on the memory bus
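
A minimal sketch, not from the slides, of making memory placement explicit with libnuma; the node number and buffer size are illustrative, and the program assumes the libnuma development headers are installed (link with -lnuma).

    #include <numa.h>      // libnuma; link with -lnuma
    #include <cstddef>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {        // returns -1 when the kernel has no NUMA support
            std::puts("NUMA not available");
            return 1;
        }
        const std::size_t size = 64 * 1024 * 1024;
        numa_run_on_node(0);               // keep this thread on node 0
        char* buf = static_cast<char*>(numa_alloc_onnode(size, 0));  // allocate on node 0 as well
        if (!buf) return 1;
        for (std::size_t i = 0; i < size; i += 4096)
            buf[i] = 0;                    // touch the pages so they are actually placed
        numa_free(buf, size);
        return 0;
    }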

Cluster-on-Die
- Problem: with an increasing number of cores, there are more and more concurrent accesses to the shared memories (LLC and RAM)
- Solution: split the memory on one socket into two NUMA nodes

Intel architectural extensions

Extension | Generation/year | Value added
MMX | Pentium MMX/1997 | 64b registers with packed data types, integer operations only
SSE | Pentium III/1999 | 128b registers (XMM), 32b float only
SSE2 | Pentium 4/2001 | SIMD math on any data type
SSE3 | Prescott/2004 | DSP-oriented math instructions
AVX | Sandy Bridge/2011 | 256b registers (YMM), 3-operand instructions
AVX2 | Haswell/2013 | Integer instructions in YMM registers, FMA
AVX512 | Skylake/2016 | 512b registers

Hardware evolves → programmers and compilers need to adapt
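
A small sketch, not from the slides, of how a program can check at run time which of these extensions the CPU actually supports, using GCC's __builtin_cpu_supports builtin:

    #include <cstdio>

    int main() {
        __builtin_cpu_init();              // initialize the CPU feature cache (GCC builtin)
        std::printf("SSE2:    %d\n", __builtin_cpu_supports("sse2"));
        std::printf("AVX:     %d\n", __builtin_cpu_supports("avx"));
        std::printf("AVX2:    %d\n", __builtin_cpu_supports("avx2"));
        std::printf("AVX512F: %d\n", __builtin_cpu_supports("avx512f"));
    }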

Intel extensions example: AVX2
- AVX2 is the latest extension from Intel (at the time of writing)
- Among others, it introduces FMA3, a multiply-accumulate operation with 3 operands ($0 = $0 × $2 + $1), useful for evaluating a polynomial (remember Horner's method?)
- Creative application: the Padé approximant

    R(x) = \frac{a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n}{1 + b_1 x + b_2 x^2 + \dots + b_m x^m}
         = \frac{a_0 + x(a_1 + x(a_2 + \dots + x\,a_n))}{1 + x(b_1 + x(b_2 + \dots + x\,b_m))}

- VDT is a vector math library using Padé approximants, a plug-and-play libm replacement with speed-ups reaching 10x
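
A sketch, not from the slides, of Horner's method mapped onto FMA: a degree-3 polynomial evaluated for 8 floats at once with AVX2/FMA intrinsics. The coefficients are arbitrary and the code assumes compilation with -mavx2 -mfma (or -march=haswell).

    #include <immintrin.h>   // AVX2 + FMA intrinsics

    // Evaluates c0 + c1*x + c2*x^2 + c3*x^3 for 8 floats at once, Horner style.
    __m256 poly3(__m256 x, float c0, float c1, float c2, float c3) {
        __m256 r = _mm256_set1_ps(c3);
        r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c2));  // r = r*x + c2
        r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c1));  // r = r*x + c1
        r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c0));  // r = r*x + c0
        return r;
    }

Dividing two such Horner evaluations (numerator and denominator) yields the Padé approximant R(x) above.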

CPU improvements summary

Common ways to improve CPU performance:

Technique | Advantages | Disadvantages
Frequency scaling | Immediate scaling | Does not work any more (see: dark silicon)
Hyper-threading | Medium overhead, up to 30% performance improvement | Can double the workload's memory footprint, possible cache pollution
Architectural changes | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years
Microarchitectural changes | Transparent for the users | Huge design overhead
More cores | Low design overhead, easy to implement, great scalability | Requires heavily parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"

Multi-core processors and multithreading: part 2

PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE

Concurrency vs. parallelism

Do concurrent (not parallel) programs need synchronization to access shared resources? Why?

Race conditions

What will the value of n be after both threads finish their work?
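
A minimal sketch, not reproducing the original slide's code, of the race being asked about: two threads increment a shared counter without synchronization, so the final value is usually well below the expected 2,000,000.

    #include <iostream>
    #include <thread>

    int n = 0;   // shared and unprotected

    void work() {
        for (int i = 0; i < 1000000; ++i)
            ++n;                      // read-modify-write, not atomic: increments get lost
    }

    int main() {
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        std::cout << n << "\n";       // typically far less than 2000000
    }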

Race conditions (II)

Thread-level parallelism in Python
- C++ parallelism skipped on purpose, already covered at CSC
- Python is not a performance-oriented language, but it can be made less slow
- We can still use the threading module to benefit from parallel I/O operations via threads, relying on the OS
- The example is deferred to the synchronization slides

But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?

Thread-level parallelism in Python (II)
- We can easily run many processes with the multiprocessing package to leverage parallelism, although not very efficiently:
  - high memory footprint
  - no resource sharing
  - every worker is a separate process

    from multiprocessing import Pool   # or: from multiprocessing.dummy import Pool (thread-based)

    def f(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        result = pool.map(f, xrange(10))

CSC Refresher: vector operations
- Problem: all arithmetic operations are executed one element at a time
- Solution: introduce vector operations and vector registers

What is the maximal speed-up from vectorization? Why is it hard to obtain in practice?

Auto-vectorization in gcc
- Vectorization candidates: (inner) loops
- Only works well with more recent gcc versions (>4.6)
- By default, auto-vectorization in gcc is disabled
- There are tens of optimization flags, but it is worth remembering at least a few (an example follows below):
  - -mtune=ARCH, -march=ARCH
  - -O2, -O3, -Ofast
  - -ftree-vectorize
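
A minimal sketch, not from the slides, of a loop that gcc can typically auto-vectorize; the file and function names are illustrative.

    // saxpy.cc -- compile with e.g.: g++ -O3 -march=native -ftree-vectorize -c saxpy.cc
    // __restrict promises that x and y do not alias, which makes vectorization easier.
    void saxpy(float a, const float* __restrict x, float* __restrict y, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   // independent, unit-stride iterations: vectorizable
    }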

Vectorization reports
- The compiler can tell us which loops were not vectorized and why:
  - gcc: -ftree-vectorizer-verbose=[0-9]
  - icc: -vec-report=[0-7]
- A list of vectorizable loops is available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

    Analyzing loop at vect.cc:14
    vect.cc:14: note: not vectorized: control flow in loop.
    vect.cc:14: note: bad loop form.
    vect.cc:6: note: vectorized 0 loops in function.

Intel architectural extensions (II)
- The compiler can produce different versions of the same function for different architectures (so-called automatic CPU dispatch)
- A run-time check is added to the output code
- In ICC, -axARCH can be used instead

GCC:

    __attribute__((target("default")))
    int foo() { return 0; }

    __attribute__((target("sse4.2")))
    int foo() { return 1; }

ICC:

    __declspec(cpu_specific(generic))
    int foo() { return 0; }

    __declspec(cpu_specific(core_i7_sse4_2))
    int foo() { return 1; }

Vectorization in C++
- Possible to use intrinsics, but very cumbersome and "write-only"
- Many libraries approach vectorization, and the choice is not easy
- Example: Agner Fog's Vector Class (a sketch follows after the next slide)

    float a[8], b[8], c[8];
    ...
    for (int i = 0; i < ...

ARM 64-bit
- 64-bit addressing (>4GB)
- RISC architecture
- Common software ecosystem with x86-64, uses the same management standards
- CISC is also expanding in this direction

(Figure: energy efficiency vs. scalability; source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi")
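
A minimal sketch, not from the slides (which show only a fragment), of the Vector Class approach mentioned on the "Vectorization in C++" slide above; it assumes Agner Fog's vectorclass.h header is available. Eight floats are added with a single vector operation.

    #include "vectorclass.h"   // Agner Fog's Vector Class Library (assumed available)

    void add8(const float* a, const float* b, float* c) {
        Vec8f av, bv;
        av.load(a);            // load 8 floats from a
        bv.load(b);            // load 8 floats from b
        Vec8f cv = av + bv;    // one vector addition instead of 8 scalar additions
        cv.store(c);           // store the 8 results into c
    }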

Take-home messages
- Moore's law is doing fine. Transistors will be invested into more cores, bigger caches and wider vectors (512b)
- NUMA and Cluster-on-Die are more "complex stuff" that a programmer has to keep in mind
- Parallelization is possible not only in C++
- Not everything that looks like an improvement gives you better performance (e.g. AVX)
- Multi-threaded applications always require synchronization to protect shared resources
- Auto-vectorization is a speed-up for free
