Multiprocessor Architecture Basics
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Multiprocessor Architecture
• Abstract models are (mostly) fine for understanding algorithm correctness
• To understand how concurrent algorithms perform, you need to understand something about multiprocessor architectures
Art of Multiprocessor Programming
Pieces
• Processors
• Threads
• Interconnect
• Memory
• Caches
Design an Urban Messenger Service in 1980
• Downtown Manhattan
• Should you use
  – Cars (1980 Buicks, 15 MPG)?
  – Bicycles (hire recent graduates)?
• Better use bicycles
Technology Changes
• Since 1980, car technology has changed enormously
  – Better mileage (hybrid cars, 35 MPG)
  – More reliable
• Should you rethink your Manhattan messenger service?
Processors
• Cycle: fetch and execute one instruction
• Cycle times change
  – 1980: 10 million cycles/sec
  – 2005: 3,000 million cycles/sec
Computer Architecture
• Measure time in cycles
  – Absolute cycle times change
• Memory access: ~100s of cycles
  – Changes slowly
  – Mostly gets worse
Threads
• Execution of a sequential program
• Software, not hardware
• A processor can run a thread
• Put it aside
  – Thread does I/O
  – Thread runs out of time
• Run another thread
Interconnect
• Bus
  – Like a tiny Ethernet
  – Broadcast medium
  – Connects
    • Processors to memory
    • Processors to processors
• Network
  – Tiny LAN
  – Mostly used on large machines
Interconnect
• Interconnect is a finite resource
• Processors can be delayed if others are consuming too much
• Avoid algorithms that use too much bandwidth
Analogy
• You work in an office
• When you leave for lunch, someone else takes over your office
• If you don’t take a break, a security guard shows up and escorts you to the cafeteria
• When you return, you may get a different office
Processor and Memory are Far Apart
[Figure: processor and memory connected by the interconnect]

Reading from Memory
• Processor sends the address over the interconnect
• Waits (zzz…) while the request travels and memory looks up the data
• Memory sends the value back

Writing to Memory
• Processor sends the address and value over the interconnect
• Waits (zzz…) while the write completes
• Memory sends back an acknowledgment (ack)
Remote Spinning
• Thread waits for a bit in memory to change
  – Maybe it tried to dequeue from an empty buffer
• Spins
  – Repeatedly rereads flag bit
• Huge waste of interconnect bandwidth
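The spinning pattern can be sketched in Java; the class and method names here are illustrative, not from the book:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of spinning: a consumer repeatedly rereads a flag
// until a producer sets it. On a machine without caches, every reread
// is a trip across the interconnect.
public class SpinDemo {
    public static final AtomicBoolean ready = new AtomicBoolean(false);

    // Spin until the flag changes, then return a (dummy) result.
    public static int awaitReady() {
        while (!ready.get()) { }  // each iteration rereads the flag bit
        return 42;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> ready.set(true));
        producer.start();
        System.out.println(awaitReady());
        producer.join();
    }
}
```

Every iteration of the `while` loop issues another read of the flag, which is what makes remote spinning so expensive.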
Analogy
• In the days before the Internet…
• Alice is writing a paper on aardvarks
• Sources are in the university library
  – Request book by campus mail
  – Book arrives by return mail
  – Send it back when not in use
• She spends a lot of time in the mail room
Analogy II
• Alice buys
  – A desk
    • In her office
    • To keep the books she is using now
  – A bookcase
    • In the hall
    • To keep the books she will need soon
Cache: Reading from Memory
• The processor first sends the address to its cache
• Cache hit: the cache holds that address and answers immediately (“Yes!”)
• Cache miss: the cache does not hold it (“No…”)
  – The request goes out over the interconnect to memory
  – Memory responds with the data, which is installed in the cache
Local Spinning
• With caches, spinning becomes practical
• First time
  – Load flag bit into cache
• As long as it doesn’t change
  – Hit in cache (no interconnect used)
• When it changes
  – One-time cost
  – See cache coherence below
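Local spinning is what makes the classic test-and-test-and-set lock practical; a minimal sketch:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a test-and-test-and-set lock: spin on a plain read of the
// lock state (served from the local cache once loaded) and go to the
// interconnect with an expensive getAndSet only when the lock looks free.
public class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (state.get()) { }        // local spinning: cache hits, no bus traffic
            if (!state.getAndSet(true)) {  // one interconnect transaction per attempt
                return;
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}
```

The inner `while` is the local spin; the one-time cost when the flag changes is paying for a fresh copy of the line before retrying `getAndSet`.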
Granularity
• Caches operate at a larger granularity than a word
• Cache line: fixed-size block containing the address
Locality
• If you use an address now, you will probably use it again soon
  – Fetch from cache, not memory
• If you use an address now, you will probably use a nearby address soon
  – In the same cache line
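The second kind of locality (spatial) is why traversal order matters; a sketch, with an arbitrary array shape:

```java
// Sketch of spatial locality: summing a 2D array row by row visits
// consecutive addresses, so most reads hit in the cache line brought in
// by the previous read; column-by-column strides across lines instead.
public class Locality {
    public static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                sum += a[i][j];   // next element: same cache line, usually a hit
        return sum;
    }

    public static long sumColMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < a.length; i++)
                sum += a[i][j];   // next element: a whole row away, likely a new line
        return sum;
    }
}
```

Both methods compute the same sum; for arrays much larger than the cache, the row-major version typically runs faster.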
Hit Ratio
• Proportion of requests that hit in the cache
• Measure of the effectiveness of the caching mechanism
• Depends on the locality of the application
L1 and L2 Caches
[Figure: L1 sits next to the processor; L2 sits between L1 and memory]
• L1: small & fast
  – 1 or 2 cycles
  – ~16 byte line
• L2: larger and slower
  – 10s of cycles
  – ~1K line size
When a Cache Becomes Full…
• Need to make room for a new entry
• By evicting an existing entry
• Need a replacement policy
  – Usually some kind of least-recently-used heuristic
Fully Associative Cache
• Any line can be anywhere in the cache
  – Advantage: can replace any line
  – Disadvantage: hard to find lines
Direct Mapped Cache
• Every address has exactly one slot
  – Advantage: easy to find a line
  – Disadvantage: must replace the fixed line
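A direct-mapped lookup can be sketched as a toy simulation; the line size and cache size below are made up for illustration:

```java
import java.util.Arrays;

// Toy direct-mapped cache: a line can live in exactly one slot,
// slot = (address / lineSize) % numLines.
public class DirectMapped {
    private final int lineSize;
    private final int numLines;
    private final long[] tags;   // which line each slot holds (-1 = empty)
    private int hits = 0, accesses = 0;

    public DirectMapped(int lineSize, int numLines) {
        this.lineSize = lineSize;
        this.numLines = numLines;
        this.tags = new long[numLines];
        Arrays.fill(tags, -1);
    }

    // Returns true on a hit; on a miss, installs the line, evicting
    // whatever occupied its one possible slot.
    public boolean access(long address) {
        accesses++;
        long line = address / lineSize;
        int slot = (int) (line % numLines);
        if (tags[slot] == line) { hits++; return true; }
        tags[slot] = line;
        return false;
    }

    public double hitRatio() { return (double) hits / accesses; }
}
```

Note how two addresses whose lines map to the same slot keep evicting each other: the conflict misses that full associativity avoids.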
K-way Set Associative Cache
• Each slot holds k lines
  – Advantage: pretty easy to find a line
  – Advantage: some choice in replacing a line
Contention
• Alice and Bob are both writing research papers on aardvarks
• Alice has encyclopedia vol AA–AC
• Bob asks the library for it
  – Library asks Alice to return it
  – Alice returns it & rerequests it
  – Library asks Bob to return it…
Contention
• Good to avoid memory contention
• It leaves processors idle
• It consumes interconnect bandwidth
Contention
• Alice is still writing a research paper on aardvarks
• Carol is writing a tourist guide to the German city of Aachen
• No conflict?
  – Library deals with volumes, not articles
  – Both require the same encyclopedia volume
False Sharing
• Two processors may conflict over disjoint addresses
• If those addresses lie on the same cache line
False Sharing
• Large cache line size
  – Increases locality
  – But also increases the likelihood of false sharing
• Sometimes need to “scatter” data to avoid this problem
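One common way to “scatter” is to pad, so that independently updated counters land on different cache lines. A sketch; the 64-byte line size is our assumption, and note the JVM does not strictly guarantee contiguous array layout:

```java
// Sketch of scattering to avoid false sharing: per-thread counters are
// spaced PAD longs apart so that (assuming a 64-byte cache line and a
// contiguous array layout) no two counters share a line.
public class ScatteredCounters {
    public static final int PAD = 8;   // 8 longs = 64 bytes, an assumed line size

    private final long[] counts;

    public ScatteredCounters(int nThreads) {
        counts = new long[nThreads * PAD];
    }

    public void increment(int threadId) {
        counts[threadId * PAD]++;      // each thread writes only its own line
    }

    public long get(int threadId) {
        return counts[threadId * PAD];
    }
}
```

The cost is space: most of the array is padding that exists only to keep the writers off each other’s cache lines.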
Cache Coherence
• Processors A and B both cache address x
• A writes to x
  – Updates its cache
• How does B find out?
• Many cache coherence protocols in the literature
MESI
• Modified
  – Have modified cached data, must write back to memory
• Exclusive
  – Not modified, I have the only copy
• Shared
  – Not modified, may be cached elsewhere
• Invalid
  – Cache contents not meaningful
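The four states can be encoded as a toy transition function. The events chosen here are a deliberate simplification of a real protocol (for instance, a local write from Invalid would really fetch the line first):

```java
// Toy sketch of how one cache line's MESI state changes in response to
// local and remote events. Real protocols track more events and handle
// write-backs; this only encodes the four states described above.
public class Mesi {
    public enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    public enum Event { LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    public static State next(State s, Event e) {
        switch (e) {
            case LOCAL_WRITE:            // we now own dirty data
                return State.MODIFIED;
            case REMOTE_READ:            // someone else may cache it too
                return s == State.INVALID ? State.INVALID : State.SHARED;
            case REMOTE_WRITE:           // our copy is now stale
                return State.INVALID;
        }
        throw new AssertionError("unreachable");
    }
}
```

For example, a line held Exclusive that another processor reads drops to Shared, and a Shared line that another processor writes drops to Invalid.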
Processor Issues Load Request
[Figure: a processor puts “load x” on the bus]
• Memory responds (“Got it!”) with the data, and the line is cached in state E (exclusive)

Another Processor Issues Load Request
[Figure: a second processor puts “load x” on the bus]
• The first processor responds (“Got it”) with the data; both cached copies move to state S (shared)

Modify Cached Data
[Figure: a processor updates its cached copy of the data]

Write-Through Cache
[Figure: the processor puts “write x!” on the bus, so memory and the other caches see the new value immediately]
Write-Through Caches
• Immediately broadcast changes
• Good
  – Memory and caches always agree
  – More read hits, maybe
• Bad (“show stoppers”)
  – Bus traffic on all writes
  – Most writes are to unshared data
  – For example, loop indexes…
Write-Back Caches
• Accumulate changes in the cache
• Write back when the line is evicted
  – Need the cache for something else
  – Another processor wants it
Invalidate
[Figure: the writer puts “invalidate x” on the bus]
• The writer’s copy moves to state M (modified)
• Other cached copies move to state I (invalid)
Multicore Architectures
• The university president
  – Alarmed by the fall in productivity
• Puts Alice, Bob, and Carol in the same corridor
  – Private desks
  – Shared bookcase
• Contention costs go way down
Old-School Multiprocessor
[Figure: each processor has a private cache; processors connect to memory over a shared bus]

Multicore Architecture
[Figure: multiple cores on one chip with caches and an on-chip bus, connected to memory over the memory bus]
Multicore
• Private L1 caches
• Shared L2 caches
• Communication between same-chip processors is now very fast
• Different-chip processors are still not so fast
NUMA Architectures
• Alice and Bob transfer to NUMA State University
• No centralized library
• Each office basement holds part of the library
Distributed Shared-Memory Architectures
• Alice’s basement has the volumes that start with A
  – Aardvark papers are convenient: run downstairs
  – Zebra papers are inconvenient: run across campus
SMP vs NUMA
[Figure: SMP — all processors share one memory over a common bus; NUMA — each processor has local memory, which others reach through the interconnect]
• SMP: symmetric multiprocessor
• NUMA: non-uniform memory access
• CC-NUMA: cache-coherent …
Spinning Again
• NUMA without caches
  – OK if the variable is local
  – Bad if remote
• CC-NUMA
  – Like SMP
Relaxed Memory
• Remember the flag principle?
  – Alice’s and Bob’s flag variables are false
• Alice writes true to her flag and reads Bob’s
• Bob writes true to his flag and reads Alice’s
• One must see the other’s flag true
Not Necessarily So
• Sometimes the compiler reorders memory operations
• Can improve
  – Cache performance
  – Interconnect use
• But can cause unexpected concurrent interactions
Write Buffers
[Figure: writes go into a buffer between the processor and the interconnect]
• Absorbing: a later write to the same address replaces the buffered one
• Batching: multiple buffered writes are sent together
Volatile
• In Java, if a variable is declared volatile, operations on it won’t be reordered
• Expensive, so use it only when needed
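The flag principle from the relaxed-memory slide can be restored with volatile; a sketch:

```java
// Sketch of the flag principle with volatile: because volatile writes
// and reads are not reordered past one another, if both methods run to
// completion, at least one of them must return true.
public class Flags {
    public static volatile boolean aliceFlag = false;
    public static volatile boolean bobFlag = false;

    public static boolean alice() {
        aliceFlag = true;   // write my flag first...
        return bobFlag;     // ...then read the other's
    }

    public static boolean bob() {
        bobFlag = true;
        return aliceFlag;
    }
}
```

Without volatile, the compiler or a write buffer could effectively move each read before the matching write, letting both methods return false.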
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:
  – to Share — to copy, distribute and transmit the work
  – to Remix — to adapt the work
• Under the following conditions:
  – Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
  – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to
  – http://creativecommons.org/licenses/by-sa/3.0/
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author’s moral rights.