High Performance Computer Architecture

[Slide 1 figure: die photos at a common 10 mm scale, each labeled with core name, process, die area/transistor count (Mtr = millions of transistors), and product/date]

Willamette Core (0.18µm), 217 mm²/42Mtr, Intel-Pentium-4 (11/2000)
Northwood Core (0.13µm), 145 mm²/55Mtr, Intel-Pentium-4 (01/2002)
Dothan Core (0.09µm), 84 mm²/140Mtr, Intel-Pentium-M (05/2004)
Conroe Core (0.065µm), 143 mm²/291Mtr, Intel-Core2-Duo (07/2006)
Penryn Core (0.045µm), 107 mm²/410Mtr, Intel-Core2-Duo (01/2008)
Bloomfield Core (0.045µm), 263 mm²/731Mtr, Intel-Core-i7 (11/2008)
Fermi 512G (0.040µm), 467 mm²/3000Mtr, NVIDIA GF100 (09/2009)
Sandy Bridge (0.032µm), 216 mm²/995Mtr, Intel-Core-i7-2920XM (01/2011)
Llano 4C/400G (0.032µm), 228 mm²/1450Mtr, AMD A8-3850 (06/2011)
BlueGene/Q – 18 cores (0.045µm), 360 mm²/1470Mtr, IBM (08/2011)

[Slide 2 figure: die photos at a common 20 mm scale, same labeling]

Ivy Bridge (0.022µm) tri-gate, 160 mm²/1400Mtr, Intel-Core-i7-3770 (04/2012)
A8X 11 Cores (0.020µm), 128 mm²/3000Mtr, Apple-iPad Air2 (11/2014)
12 Cores – 8SMT (0.022µm), 362 mm²/2100Mtr, IBM Power8 (12/2014)
First Commercial Dual-Core Chip (0.18µm), 412 mm²/174Mtr, IBM Power4 (12/2001)
Haswell-EP – 18 Cores – 2SMT (0.022µm), 661 mm²/5560Mtr, Xeon E5-2600 v3 (06/2015)
Knights Landing – 72 Cores – 4SMT (0.014µm), 7100Mtr, Xeon Phi (06/2015) (estimated, ~7000$)
Fiji XT – 2048 GPU-Cores (0.028µm), 567 mm²/8900Mtr, AMD Radeon R9 Fury-X (06/2015)

Where are High Performance Computers?

• Among users
  Where you need to make this happen: "I have a limited battery and need to… take a picture, share it with my friends, ..."
• In the Internet infrastructure
  Where you need to connect anybody with anything
• In the datacenters
  Where you need to store and retrieve YOUR data
• In electronic devices
  Every electronic device has a computer inside; cars may have as many as 50+ computers (California approved a bill for autonomous vehicles)

Computer Architects
• Computer Architects UNDERSTAND and CAN BUILD the Computing Infrastructure… and almost ALL details of it! :-)

[Following slides: photos of the HP Proliant DL 585 G7 server, the AMD Opteron 6200 ARCHITECTURE, the AMD Opteron 6272, the AMD Opteron 6200 characteristics, the AMD Opteron 6200 CHIP, and the AMD Opteron 6200 CORE ("Bulldozer")]

Objectives of this course
• This course constitutes a deeper study of current computers and aims to provide:
  • Principles of high-performance microprocessors (superscalar, VLIW)
  • An understanding of the basic mechanisms for programming applications that take advantage of the parallelism made available by the system
  • Principles of Multi-Core / Multi-Processor Systems
  • Tools for programming Parallel Machines

Course Administration
• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office hours: Monday 16:30–19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2

• Adopted Textbook:
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN 978-0-521-88675-8

• Other Reference Textbooks:
  • J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hw/Sw Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1-55860-343-3
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, 1995, ISBN 0-86720-204-1

Rules for exams, dates, slides, tools
• Check out the course website:

http://www.dii.unisi.it/~giorgi/teaching/hpca2


Computer Architecture

"The term ARCHITECTURE is used here to describe the attributes of a system as seen by the programmer*, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation." -- Gene Amdahl, IBM Journal of R&D, Apr. 1964

*programmer == system programmer (OS engineer) or the compiler

Architecture: an overloaded term
• In the strict sense: the Hardware/Software interface
  • Set of instructions
  • Memory management and protection
  • Interrupts and exceptions (traps)
  • Data formats (for example, IEEE 754 floating point)

• Organization: also called "Microarchitecture"
  • In this sense, it is "the implementation" of the architecture (the part that Gene Amdahl had excluded)
  • Specifies the functional units and connections
  • Configuration of the pipeline
  • Position and configuration of cache memory

• As a discipline, "Architecture of Computers" also includes the microarchitecture
• To avoid confusion, when it comes to the HW/SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" -- Yale Patt, ISCA, Jun 2006

Levels of Computer Architecture

[Slide figure: layered diagram with numbered interfaces]

Software layers: Application Programs; Libraries; Operating System (Drivers, Memory Manager, Scheduler)
Hardware layers: Execution Hardware; Memory Translation; System Interconnect (bus); Controllers; I/O devices and Networking; Main Memory

Interfaces:
1: User Interface
2: API
3,7: ABI
4,5,6: internal interfaces of the Operating System
7,8: ISA
9: Memory architecture
10: I/O architecture
11,12: RTL architecture
13,14: Bus architecture

API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level

What technology provides: Moore's Law
• "The number of TRANSISTORS doubles every 18 months" (later revised to "24 months"); this is due to:
  - higher density (transistors / area)
  - availability of bigger chips

[Chart: transistor counts (Mtr) of the chips from slide 1, plotted over time]

Moore's Law is purely PSYCHOLOGICAL!
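To make the doubling rate concrete, here is a minimal Python sketch (not from the slides) that projects transistor counts from the Willamette data point on slide 1 and compares them with the actual Sandy Bridge count; the 122-month interval and the helper name projected_mtr are illustrative assumptions.

```python
# Minimal sketch: project transistor counts under Moore's law and
# compare with the die data listed on slide 1.
# Data points (from the slides): Willamette, 42 Mtr (11/2000);
# Sandy Bridge, 995 Mtr (01/2011).

def projected_mtr(start_mtr, months_elapsed, doubling_months):
    """Transistor count after doubling every `doubling_months` months."""
    return start_mtr * 2 ** (months_elapsed / doubling_months)

months = 122  # 11/2000 -> 01/2011
for period in (18, 24):
    print(f"{period}-month doubling: {projected_mtr(42, months, period):6.0f} Mtr")
print("actual (Sandy Bridge):    995 Mtr")
```

The 24-month rule lands within a factor of ~1.4 of the real count, while the 18-month rule overshoots badly, which is one reason the revised figure is used.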

What the market demands: Applications
• Application Trends
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous "small": mainframes → minis → microprocessors → handheld, embedded
  • FROM little TO big memory storage (primary and secondary)
  • FROM single-thread TO multiple-threads
  • FROM standalone TO networked (cloud computing)
  • FROM character-oriented TO multimedia (graphics and sound)
  • FROM personal data TO "BIG DATA"

Main Applications
• Numerical/Scientific
  • Computational Fluid Dynamics, Weather Prediction, ECAD
  • Long word length, floating point arithmetic
• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage
• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important, low power, cost critical
• Home Computing
  • multimedia, entertainment
  • high bandwidth data movement, graphics
  • cryptography, compression/decompression

App. Trends: Multimedia, Networked, Web-servers
• A large choice of multimedia devices with
  • Graphic displays (LCD, etc.)
  • High Definition Audio
  • Large capacity of secondary storage for images, sound, etc.
• Services via the Web and high-performance networks require
  • Many independent threads
  • Wide band communication

MICROPROCESSOR ARCHITECTURE
• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
  • 1970s – serial CPUs, processing integers 1 bit at a time
  • 1980s – 32-bit RISC with a pipeline
    - The ISA simplicity allows the integration of the entire processor on one chip
  • 1990s – bigger CPUs, superscalar
    - Also for CISC
  • 2000s – Multiprocessors on a chip ...

Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processors
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for Parallel programming machines (Cilk, OpenMP, MPI, CUDA, ...)

EVALUATING COMPUTERS


POWER
• TOTAL POWER = DYNAMIC + STATIC (LEAKAGE)

  Pdynamic = αCV²f
  Pstatic = V·Isub ≈ V·e^(−k·Vt/T)

• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATES
  • DYNAMIC POWER IS ROUGHLY PROPORTIONAL TO f³ (SINCE V MUST SCALE ROUGHLY WITH f)
  • TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER
  • TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!
• STATIC POWER MATTERS BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS
• POWER/ENERGY ARE CRITICAL PROBLEMS
  • POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
    • OTHERWISE TEMPERATURE GOES UP (AFFECTING PERFORMANCE AND CORRECTNESS, AND POSSIBLY DESTROYING THE CIRCUIT, SHORT TERM OR LONG TERM)
    • IT ALSO AFFECTS THE SUPPLY OF POWER TO THE CHIP
  • ENERGY (DEPENDS ON POWER AND SPEED)
    • COSTLY; A GLOBAL PROBLEM
    • CRITICAL FOR BATTERY-OPERATED DEVICES
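The 4x-vs-64x claim follows directly from the formula. Below is a small Python sketch of that arithmetic; it assumes, as the slide implies, that the supply voltage must scale roughly linearly with frequency, so Pdynamic ∝ f³. The normalized units are illustrative.

```python
# Dynamic power: P = alpha * C * V^2 * f (normalized units).
def dynamic_power(alpha_c, v, f):
    return alpha_c * v * v * f

base = dynamic_power(1.0, 1.0, 1.0)

# Option 1: replicate the core 4 times at the same clock.
parallel = 4 * dynamic_power(1.0, 1.0, 1.0)

# Option 2: clock one core 4x faster; V must rise ~4x as well.
faster = dynamic_power(1.0, 4.0, 4.0)

print(parallel / base)  # 4.0  -> 4x speedup, 4x power
print(faster / base)    # 64.0 -> 4x speedup, 64x dynamic power
```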

RELIABILITY
• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C × V; IF C AND V DECREASE, IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • THE DEVICE IS STILL OPERATIONAL BUT A VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES
• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER; DUE TO:
    • TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    • INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION
• PERMANENT FAILURES
  • MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY A SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES

PERFORMANCE METRICS (MEASURES)
• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY
  • "X IS N TIMES FASTER THAN Y" MEANS Texe(Y)/Texe(X) = N
  • THE MAJOR METRIC USED IN THIS COURSE
• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS: THROUGHPUT
  • THE THROUGHPUT OF X IS N TIMES HIGHER THAN Y'S IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (E.G., A MULTIPROCESSOR CAN RAISE THROUGHPUT WITHOUT REDUCING THE LATENCY OF ANY SINGLE TASK)
• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING-POINT OPERATIONS PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE → BENCHMARKING

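A tiny worked instance of metric #1, with made-up numbers:

```python
# If Y takes 30 s and X takes 10 s on the same task,
# X is Texe(Y)/Texe(X) = 3 times faster than Y.
texe_x, texe_y = 10.0, 30.0
print(f"X is {texe_y / texe_x:.0f} times faster than Y")
```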

WHICH PROGRAMS TO CHOOSE?
• REAL PROGRAMS:
  • PORTING PROBLEMS; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS
• KERNELS:
  • COMPUTATIONALLY INTENSE PIECES OF REAL PROGRAMS
• TOY BENCHMARKS (E.G., QUICKSORT, MATRIX MULTIPLY)
• SYNTHETIC BENCHMARKS (NOT REAL PROGRAMS)
• BENCHMARK SUITES
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    • SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    • INTEGER AND FLOATING POINT
    • NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS:
    • FOR COMMERCIAL SYSTEMS
    • TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

LET Ti BE THE EXECUTION TIME OF PROGRAM i (OUT OF N PROGRAMS):

1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:

   (1/N) Σi Ti    OR    Σi Wi·Ti

   THE PROBLEM HERE IS THAT THE PROGRAMS WITH THE LONGEST EXECUTION TIMES DOMINATE THE RESULT

2. DEALING WITH SPEEDUPS
• SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (LET TR,i BE THE EXECUTION TIME ON THE REFERENCE MACHINE):

   Si = TR,i / Ti

• ARITHMETIC MEAN OF SPEEDUPS: AM = (1/N) Σi Si
• HARMONIC MEAN OF SPEEDUPS: HM = N / Σi (1/Si)

REPORTING PERFORMANCE FOR A SET OF PROGRAMS
• GEOMETRIC MEAN OF SPEEDUPS: GM = (Πi Si)^(1/N)
  - MEAN-SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
  - EASILY COMPOSABLE
  - USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

In terms of execution times:

              Program A   Program B   Arithmetic Mean   Speedup (ref 1)   Speedup (ref 2)
Machine 1       10 sec      100 sec        55 sec             91.8             10
Machine 2        1 sec      200 sec       100.5 sec           50.2              5.5
Reference 1    100 sec    10000 sec      5050 sec
Reference 2    100 sec     1000 sec       550 sec

In terms of speedups:

                  Program A   Program B   Arithmetic       Harmonic        Geometric
Wrt Reference 1
  Machine 1          10          100       55               18.2            31.6
  Machine 2         100           50       75    (x1.36)    66.7  (x3.66)   70.7  (x2.2)
Wrt Reference 2
  Machine 1          10           10       10               10              10
  Machine 2         100            5       52.5  (x5.25)     9.5  (x0.95)   22.4  (x2.2)

GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME!!
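The table's numbers can be reproduced in a few lines of Python. This is a sketch (the function names am/hm/gm are mine) showing that only the geometric-mean ratio between the two machines is independent of the reference:

```python
from math import prod

def am(s): return sum(s) / len(s)                  # arithmetic mean
def hm(s): return len(s) / sum(1 / x for x in s)   # harmonic mean
def gm(s): return prod(s) ** (1 / len(s))          # geometric mean

# Execution times (seconds) for programs A and B, from the table.
machines = {"Machine 1": (10, 100), "Machine 2": (1, 200)}
references = {"Reference 1": (100, 10000), "Reference 2": (100, 1000)}

for rname, rt in references.items():
    for mname, mt in machines.items():
        s = [r / t for r, t in zip(rt, mt)]  # per-program speedups
        print(f"{rname} / {mname}: "
              f"AM={am(s):5.1f}  HM={hm(s):5.1f}  GM={gm(s):5.1f}")
```

Running it reproduces the rows above: the GM ratio of Machine 2 to Machine 1 is x2.2 under either reference, while the AM ratio swings from x1.36 to x5.25 and the HM ratio from x3.66 to x0.95.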

FUNDAMENTAL PERFORMANCE EQUATION FOR CPUs (ALSO KNOWN AS THE "IRON LAW"):

   Texe = IC × CPI × Tc

• IC: DEPENDS ON PROGRAM, COMPILER, AND ISA
• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI (CLOCKS PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME

• WHEN THE PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK, USE IPC (INSTRUCTIONS PER CLOCK):

   Texe = (IC × Tc) / IPC
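A worked instance of the iron law, with illustrative numbers of my choosing (a 2 GHz clock, one billion instructions, CPI of 1.25):

```python
ic = 1e9      # instruction count
cpi = 1.25    # clocks per instruction
tc = 0.5e-9   # clock period in seconds (2 GHz)

texe = ic * cpi * tc
print(texe)   # 0.625 s

# Equivalent IPC form: IPC = 1/CPI.
ipc = 1 / cpi
print(ic * tc / ipc)  # 0.625 s again
```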

AMDAHL'S LAW

[Diagram: a task of unit time split into (1−F) and F; after applying enhancement E, the F part shrinks to F/S]

• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S:

   Texe(with E) = Texe(without E) × ((1 − F) + F/S)

   Speedup(E) = Texe(without E) / Texe(with E) = 1 / ((1 − F) + F/S)
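The law translates directly into a one-line function; a minimal sketch:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the task is sped up by s."""
    return 1 / ((1 - f) + f / s)

# Enhancing 50% of the execution time by 10x yields only ~1.82x overall.
print(round(amdahl_speedup(0.5, 10), 2))
```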

LESSONS FROM AMDAHL'S LAW

1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED:

   Speedup(E) < 1 / (1 − F)

• LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP
  • The difference between Speedup(k+1) and Speedup(k) gets smaller and smaller as S goes from k to k+1

[Chart for F = 0.5: Amdahl's Law vs. S, its asymptote (the Amdahl maximum 1/(1−F)), the remaining speedup, and the marginal speedup]

2) OPTIMIZE THE COMMON CASE
• EXECUTE THE RARE CASE IN SOFTWARE (E.G., EXCEPTIONS)
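The diminishing returns for F = 0.5 are easy to tabulate; a small sketch (restating the function from the previous example so it runs standalone):

```python
def amdahl_speedup(f, s):
    return 1 / ((1 - f) + f / s)

# F = 0.5: each increase of S buys less and less overall speedup,
# and the 1/(1-F) = 2 bound is never reached.
for s in (1, 2, 4, 8, 16, 1000):
    print(f"S={s:4d}  overall speedup = {amdahl_speedup(0.5, s):.3f}")
```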

PARALLEL SPEEDUP

   SpeedupP = T1 / TP = 1 / ((1 − F) + F/P) < 1 / (1 − F)

   (F = parallel fraction of the task, P = number of processors; 1/(1 − F) is the Amdahl maximum)

[Chart for F = 0.95: ideal speedup (linear in P) vs. Amdahl's Law, which flattens out like a "mortar shot"]

• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE?? (HINT: P PROCESSORS ALSO BRING P TIMES THE CACHE CAPACITY)

OVERALL, NOT VERY HOPEFUL

GUSTAFSON'S LAW
• REDEFINES SPEEDUP
• THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ALSO GROW
• START WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:

   TP = s + p

• s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• THE EXECUTION TIME ON ONE PROCESSOR IS:

   T1 = s + pP

• LET F = p/(s + p). THEN SP = (s + pP)/(s + p) = 1 − F + FP = 1 + F(P − 1)

Gustafson observes that even though a single algorithm/program completes faster only if the parallel portion is dominant (Amdahl), the same algorithm completes faster and faster as we add processors (P), compared to a purely sequential execution that repeats the parallel portion (p) P times.
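A sketch contrasting the two laws for F = 0.95 (the fraction used in the parallel-speedup chart); note that Amdahl's F is measured on the sequential machine and Gustafson's F on the parallel one, so the comparison is qualitative:

```python
def amdahl(f, p):      # fixed workload, F measured on 1 processor
    return 1 / ((1 - f) + f / p)

def gustafson(f, p):   # workload scaled with p, F measured on p processors
    return 1 + f * (p - 1)

for p in (2, 16, 128, 1024):
    print(f"P={p:5d}  Amdahl={amdahl(0.95, p):6.2f}  "
          f"Gustafson={gustafson(0.95, p):7.1f}")
# Amdahl saturates below 20; Gustafson keeps growing with P.
```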

Course Structure
1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processors
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for Parallel programming machines (Cilk, OpenMP, MPI, CUDA, ...)

PIPELINING

Pipelining
• Pipelining principles
• Simple Pipeline
• Structural Hazards
• Data Hazards
• Control Hazards

Pipelining principles
• Consider instructions composed of n phases of equal duration (T/n each, where T is the time to execute one instruction)

[Diagram: successive instructions, each split into phases 1, 2, ..., n, overlapped so that instruction i+1 starts its phase 1 while instruction i is in phase 2]

• Without pipelining
  • Latency = T
  • Throughput_seq = 1 / T
• With an ideal n-stage pipeline
  • Latency = T
  • Throughput_pipe = n / T
• Speedup = Throughput_pipe / Throughput_seq = n

The (ideal) speedup obtainable from an ideal pipeline is equal to n.
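The speedup-equals-n identity in numbers, for an assumed 10 ns instruction split into 5 stages:

```python
T = 10e-9  # total instruction latency in seconds (illustrative)
n = 5      # number of pipeline stages

throughput_seq = 1 / T    # one instruction every T
throughput_pipe = n / T   # one instruction every T/n, once the pipe is full
print(throughput_pipe / throughput_seq)  # 5.0 = n
```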

Implementation of a Simple Pipeline
• Simple 5-stage pipeline
  • F -- Instruction Fetch
  • D -- Instruction Decode + Operand Fetch
  • X -- Execution and Effective Address
  • M -- Memory Access
  • W -- Write-back Results

[Diagram: latch → F → latch → D → latch → X → latch → M → latch → W, with all latches driven by the clock]

5-STAGE PIPELINE

INSTRUCTIONS GO THROUGH EVERY STAGE IN PROCESS ORDER, EVEN IF THEY DON'T USE THE STAGE

• NOTE: CONTROL IMPLEMENTATION
  • THE INSTRUCTION CARRIES ITS CONTROL SIGNALS ALONG
  • THIS IS A GENERAL APPROACH: "THE INSTRUCTION CARRIES ITS BAGGAGE"

Notation

F: inst. fetch    D: inst. decode    X: execute    M: memory access    W: write back

5-stage pipeline:

        1  2  3  4  5  6  7  8  9
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W
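The diagram above can be generated mechanically; here is a small Python sketch (the function name print_pipeline is mine) for an ideal hazard-free pipeline:

```python
STAGES = "FDXMW"

def print_pipeline(n_instr):
    """Print the timing diagram of an ideal 5-stage pipeline."""
    n_cycles = n_instr - 1 + len(STAGES)
    print("      " + " ".join(f"{c:2d}" for c in range(1, n_cycles + 1)))
    for i in range(n_instr):
        slots = ["  "] * n_cycles
        for s, stage in enumerate(STAGES):
            slots[i + s] = f" {stage}"  # instruction i is in stage s at cycle i+s+1
        print(f"i+{i:<4d}" + " ".join(slots))

print_pipeline(5)
```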

Pipeline Hazards
• Conditions that lead to a malfunction if certain countermeasures are not taken:

1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle (a conflict over resources, e.g., instruction memory vs. data memory)

2) Data Hazards
• Two instructions use the same data: the accesses must happen in the order defined by the programmer, even if execution overlaps parts of the instructions (see RAW, WAW, WAR)

3) Control Hazards
• An instruction (branch, jump, call) determines which instructions are executed next, but the pipeline has already fetched the instructions that sequentially follow it, which are the wrong ones if the branch is taken

1) Structural Hazards
• Two instructions want to use the same hardware resource in the same cycle
• Example: a load/store uses the same memory that is used by the instruction fetch

        1  2  3  4  5
i       F  D  X  M  W
i+1        F  D  X  M
i+2           F  D  X
i+3              *  F
i+4

(* = the fetch of i+3 is stalled in cycle 4 because the memory is busy with the M stage of instruction i)
