CS 152 Computer Architecture and Engineering
Lecture 18: Multi-Processors - Snoopy Caches
John Wawrzynek
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~johnw
http://inst.cs.berkeley.edu/~cs152
11/3/2016, CS152, Fall 2016
Uniprocessor Performance (SPECint)

[Figure: performance relative to the VAX-11/780, log scale from 1 to 10000, for 1978-2006; a ~3X gap opens between the 52%/year trend and actual performance after 2002. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
Parallel Processing: Déjà vu all over again?

§ "… today's processors … are nearing an impasse as technologies approach the speed of light…" – David Mitchell, The Transputer: The Time Is Now (1989)
§ The Transputer had bad timing (uniprocessor performance kept rising) ⇒ procrastination was rewarded: 2X sequential performance every 1.5 years
§ "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing" – Paul Otellini, President, Intel (2005)
§ All microprocessor companies have since switched to multiprocessors (2+ CPUs every 2 years) ⇒ procrastination is now penalized: 2X sequential performance every 5 years
§ Even handheld systems have moved to multicore – the Nintendo 3DS and the iPhone 6 each have two cores (plus additional specialized cores), Android's Qualcomm Snapdragon 805 has four cores, and the PlayStation Vita has four cores
Symmetric Multiprocessors (SMPs)

[Figure: two processors and memory share a CPU-memory bus; a bridge connects it to an I/O bus carrying I/O controllers for storage, graphics output, and networks.]

"Symmetric" means:
• All memory is equally far away from all processors
• Any processor can do any I/O (e.g., set up a DMA transfer)

Local caches at the processors are what make this organization practical!
Synchronization

The need for synchronization arises whenever there are concurrent processes in a system cooperating on some task (even in a uniprocessor system).

Two classes of synchronization:
• Producer-Consumer: a consumer process must wait until the producer process has produced data
• Mutual Exclusion: ensure that only one process uses a shared resource at a given time

[Figures: a producer feeding a consumer; processes P1 and P2 contending for a shared resource.]
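The producer-consumer class above can be sketched in software. A minimal, hedged sketch using POSIX threads follows; the buffer size and all names are illustrative and not from the lecture:

```c
// Producer-consumer sketch with a POSIX mutex and condition variables.
// Illustrative only: buffer size and names are assumptions, not lecture code.
#include <pthread.h>

#define N 8
static int buf[N];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

void produce(int v) {
    pthread_mutex_lock(&lock);
    while (count == N)                    // producer waits until there is room
        pthread_cond_wait(&not_full, &lock);
    buf[tail] = v;
    tail = (tail + 1) % N;
    count++;
    pthread_cond_signal(&not_empty);      // wake a waiting consumer
    pthread_mutex_unlock(&lock);
}

int consume(void) {
    pthread_mutex_lock(&lock);
    while (count == 0)                    // consumer waits until data is produced
        pthread_cond_wait(&not_empty, &lock);
    int v = buf[head];
    head = (head + 1) % N;
    count--;
    pthread_cond_signal(&not_full);       // wake a waiting producer
    pthread_mutex_unlock(&lock);
    return v;
}
```

The mutex provides the mutual-exclusion half (one process in the critical section at a time); the condition variables provide the producer-consumer half (the consumer blocks until data exists).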
Need for Mutual Exclusion

§ Example (Wikipedia): shared linked-list management
§ If two nodes, i and i+1, are removed simultaneously without synchronization, node i+1 can end up not being removed
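The linked-list race above can be reproduced deterministically by serializing the bad interleaving in one thread. This is a sketch with names of my own choosing, not code from the lecture:

```c
// Deterministic replay of the unsynchronized linked-list removal race:
// thread A removes node i (n1), thread B removes node i+1 (n2), but A's
// final write uses a successor pointer read BEFORE B's write.
#include <stddef.h>

struct node { int key; struct node *next; };

/* Returns the key of the element following the head after both "removals". */
int race_demo(void) {
    struct node n3 = {3, NULL};
    struct node n2 = {2, &n3};
    struct node n1 = {1, &n2};
    struct node n0 = {0, &n1};            // list: n0 -> n1 -> n2 -> n3

    struct node *a_succ = n0.next->next;  // A reads n1's successor: &n2
    n1.next = n1.next->next;              // B unlinks n2: n1 -> n3
    n0.next = a_succ;                     // A unlinks n1 using the stale &n2

    return n0.next->key;                  // list is n0 -> n2 -> n3
}
```

The final list is n0 → n2 → n3: node i (n1) is gone, but node i+1 (n2) was silently relinked, exactly the lost removal the slide describes. A lock around both removals prevents the stale read.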
Memory Coherence in SMPs

[Figure: CPU-1 and CPU-2 sit on the CPU-memory bus; cache-1, cache-2, and memory each hold address A with value 100.]

Suppose CPU-1 updates A to 200:
• with write-back caches, memory and cache-2 have stale values
• with write-through caches, cache-2 still has a stale value

Do these stale values matter? What is the view of shared memory for programming?
Maintaining Cache Coherence

§ A cache coherence protocol ensures that, for each memory address, all writes by one processor are eventually visible to other processors – i.e., updates are not lost
§ Hardware support is required such that:
– only one processor at a time has write permission for a location
– no processor can load a stale copy of the location after a write
⇒ cache coherence protocols
Warmup: Parallel I/O

[Figure: a processor and its cache share the memory bus with physical memory and a disk's DMA engine; address (A), data (D), and R/W lines connect each bus master to the bus.]

• Page transfers occur while the processor is running
• Either the cache or the DMA engine can be the bus master and effect transfers

(DMA stands for "Direct Memory Access": the I/O device can read and write memory autonomously from the CPU.)
Problems with Parallel I/O

[Figure: cached portions of a page sit in the processor's cache while DMA transfers move the page between physical memory and disk.]

• Memory → Disk transfers: physical memory may be stale if the cache copy is dirty
• Disk → Memory transfers: the cache may hold stale data and not see the memory writes
Snoopy Cache (Goodman 1983)

§ Idea: have the cache watch (or "snoop" upon) DMA transfers, and then "do the right thing"
§ Snoopy cache tags are dual-ported:
– the processor-side port (A, D, R/W) is used to drive the memory bus when the cache is bus master
– a snoopy read port on the tags and state is attached to the memory bus

[Figure: the tags-and-state array has both a processor port and a bus-side snoop port; the data lines themselves are single-ported.]
Snoopy Cache Actions for DMA

Observed Bus Cycle          Cache State          Cache Action
--------------------------  -------------------  ----------------------
DMA Read (Memory → Disk)    Address not cached   No action
                            Cached, unmodified   No action
                            Cached, modified     Cache intervenes
DMA Write (Disk → Memory)   Address not cached   No action
                            Cached, unmodified   Cache purges its copy
                            Cached, modified     ???
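The decision table above can be written directly as a function. A hedged sketch follows; the enum names are mine, and the slide's unresolved "???" case is kept explicit:

```c
// Snoop decision for an observed DMA bus cycle, mirroring the table above.
// Names are illustrative; the "???" row is modeled as UNRESOLVED.
enum bus_op     { DMA_READ, DMA_WRITE };
enum line_state { NOT_CACHED, CACHED_CLEAN, CACHED_DIRTY };
enum action     { NO_ACTION, INTERVENE, PURGE, UNRESOLVED };

enum action snoop_action(enum bus_op op, enum line_state st) {
    if (op == DMA_READ)                        /* memory -> disk */
        return st == CACHED_DIRTY ? INTERVENE  /* only the cache has fresh data */
                                  : NO_ACTION;
    /* DMA_WRITE: disk -> memory */
    if (st == CACHED_CLEAN) return PURGE;      /* cached copy is now stale */
    if (st == CACHED_DIRTY) return UNRESOLVED; /* the "???" case on the slide */
    return NO_ACTION;
}
```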
Shared Memory Multiprocessor

[Figure: processors M1, M2, and M3, each behind a snoopy cache, share the memory bus with physical memory and a DMA engine attached to disks.]

Use the snoopy mechanism to keep all processors' view of memory coherent.
Snoopy Cache Coherence Protocols

• Write miss: the address is invalidated in all other caches before the write is performed
• Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
The MSI Protocol

Each cache line has state bits alongside its address tag:
M: Modified, S: Shared, I: Invalid

§ Modified: the block has been modified in the cache, so the data in the cache is inconsistent with the backing store (e.g., memory). A cache with a block in the M state has the responsibility to write the block to the backing store when it is evicted. A block in the Modified state is exclusive: it cannot be in any other cache.
§ Shared: the block is unmodified and exists in read-only state in at least one cache. The cache can evict the data without writing it to the backing store.
§ Invalid: the block is either not present in the current cache or has been invalidated by a bus request; it must be fetched from memory if it is to be stored in this cache.
The MSI Protocol (continued)

§ A read miss to a block in a cache C1 generates a bus transaction:
– if another cache C2 has the block in the M state ("exclusively"), it has to write back the block before memory supplies it; C1 gets the data from the bus, and the block becomes "shared" in both caches
§ A write hit to a shared block in C1 forces a write back:
– all other caches that have the block must invalidate it; the block becomes "exclusive" in C1
§ A write hit to a modified (exclusive) block does not generate a write back or a change of state
§ A write miss (to an invalid block) in C1 generates a bus transaction:
– if a cache C2 has the block as "shared", C2 invalidates its copy
– if a cache C2 has the block as "modified (exclusive)", it writes back the block and changes its state in C2 to "invalid"
– if no cache supplies the block, the memory supplies it
– when C1 gets the block, it sets its state to "modified (exclusive)"
Cache State Transition Diagram: The MSI Protocol

Each cache line has state bits alongside its address tag (M: Modified, S: Shared, I: Invalid).

[Diagram: state machine for a cache line in processor P1, with transitions:]
• I → M: write miss (P1 gets the line from memory)
• I → S: read miss (P1 gets the line from memory)
• S → M: P1 writes (P1 broadcasts its intent to write)
• M → S: other processor reads (P1 writes back)
• M → I: other processor's intent to write (P1 writes back)
• S → I: other processor's intent to write
• M self-loop: P1 reads or writes
• S self-loop: read by any processor
Two-Processor Example

(Reading and writing the same cache line.) Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes.

[Diagram: the MSI state machine instantiated for P1's cache and for P2's cache. For P1: I → M on a write miss; I → S on a read miss; S → M on a P1 write; M → S when P2 reads (P1 writes back); M → I and S → I on P2's intent to write. P2's diagram is symmetric, with the roles of P1 and P2 swapped.]
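The transitions in the walk-through above can be sketched as a per-line next-state function. This is a simplified model of my own (event names are assumptions; it deliberately omits the slides' variant in which a write to an S-state line also triggers a write-back):

```c
// Simplified MSI next-state function for one cache's copy of a line.
// PR_* are this processor's accesses; BUS_* are transactions snooped from
// another processor (BUS_RDX = read-exclusive, i.e., intent to write).
enum msi_state { MSI_I, MSI_S, MSI_M };
enum msi_event { PR_READ, PR_WRITE, BUS_READ, BUS_RDX };

/* *writeback is set to 1 when this cache must flush dirty data to the bus. */
enum msi_state msi_next(enum msi_state st, enum msi_event ev, int *writeback) {
    *writeback = 0;
    switch (st) {
    case MSI_M:
        if (ev == BUS_READ) { *writeback = 1; return MSI_S; } /* demote      */
        if (ev == BUS_RDX)  { *writeback = 1; return MSI_I; } /* invalidate  */
        return MSI_M;                       /* own reads and writes hit      */
    case MSI_S:
        if (ev == PR_WRITE) return MSI_M;   /* broadcast intent to write     */
        if (ev == BUS_RDX)  return MSI_I;   /* other CPU's intent to write   */
        return MSI_S;
    default: /* MSI_I */
        if (ev == PR_READ)  return MSI_S;   /* read miss, fill from bus      */
        if (ev == PR_WRITE) return MSI_M;   /* write miss                    */
        return MSI_I;
    }
}
```

Tracing the example's first four accesses for P1: "P1 reads" takes it I → S, "P1 writes" S → M, "P2 reads" (seen as BUS_READ) M → S with a write-back, and "P2 writes" (seen as BUS_RDX) S → I, matching the diagram.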
Observations

[Diagram: the same MSI state machine as on the previous slide.]

§ If a line is in the M state, then no other cache can have a copy of the line!
§ Memory stays coherent: multiple differing copies cannot exist
§ A write to a line in the S state causes a write-back (even if no other cache has a copy!)
MESI: An Enhanced MSI Protocol

Goal: increased performance for private data. Each cache line has state bits alongside its address tag:
M: Modified (exclusive), E: Exclusive but unmodified, S: Shared, I: Invalid

[Diagram: state machine for a cache line in processor P1, with transitions:]
• I → M: write miss
• I → E: read miss, not shared
• I → S: read miss, shared
• E → M: P1 write (no bus transaction)
• S → M: P1 intent to write
• E → S: other processor reads
• M → S: other processor reads (P1 writes back)
• M → I: other processor's intent to write (P1 writes back)
• E → I, S → I: other processor's intent to write
• M self-loop: P1 write or read; E self-loop: P1 read; S self-loop: read by any processor

A write to an Exclusive line does not cause a write-back.
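The E state can be added to the same style of next-state sketch. In this simplified model of my own, `was_shared` stands for the bus's "shared" signal sampled when a read miss fills the line; all names are assumptions:

```c
// Simplified MESI next-state function for one cache's copy of a line.
// P_* are this processor's accesses; B_* are snooped bus transactions
// (B_RDX = another processor's intent to write).
enum mesi_state { ST_I, ST_S, ST_E, ST_M };
enum mesi_event { P_READ, P_WRITE, B_READ, B_RDX };

/* was_shared: did another cache assert "shared" during this cache's fill?
 * *writeback is set when dirty data must be flushed to the bus. */
enum mesi_state mesi_next(enum mesi_state st, enum mesi_event ev,
                          int was_shared, int *writeback) {
    *writeback = 0;
    switch (st) {
    case ST_M:
        if (ev == B_READ) { *writeback = 1; return ST_S; }
        if (ev == B_RDX)  { *writeback = 1; return ST_I; }
        return ST_M;
    case ST_E:
        if (ev == P_WRITE) return ST_M;  /* silent upgrade: no bus traffic */
        if (ev == B_READ)  return ST_S;
        if (ev == B_RDX)   return ST_I;  /* line is clean: no write-back   */
        return ST_E;
    case ST_S:
        if (ev == P_WRITE) return ST_M;  /* must broadcast intent to write */
        if (ev == B_RDX)   return ST_I;
        return ST_S;
    default: /* ST_I */
        if (ev == P_READ)  return was_shared ? ST_S : ST_E;
        if (ev == P_WRITE) return ST_M;
        return ST_I;
    }
}
```

The benefit for private data is visible in the trace: a read miss with no sharers fills to E, and a later write upgrades E → M with no bus transaction and no write-back.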
Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with an L1 cache backed by an L2 cache; a snooper sits between each L2 cache and the shared memory bus.]

• Processors often have two-level caches: small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must also be in L2, so an invalidation in L2 ⇒ an invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth

What problem could occur?
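The inclusion property implies a back-invalidation: when the snooper invalidates an L2 line, any L1 copy of it must be invalidated too, or the L1 could keep serving stale data the snooper never sees. A minimal sketch, with illustrative sizes and a direct "which L2 line backs this L1 line" bookkeeping of my own:

```c
// Back-invalidation sketch for an inclusive L1/L2 pair. Sizes and the
// l1_backer bookkeeping are illustrative assumptions, not lecture code.
#include <stdbool.h>

#define L1_LINES 4
#define L2_LINES 16

static bool l2_valid[L2_LINES];
static bool l1_valid[L1_LINES];
static int  l1_backer[L1_LINES]; /* L2 line index backing each valid L1 line */

/* A snoop-driven invalidation of an L2 line must also invalidate any L1
 * line it backs, preserving the inclusion property. */
void l2_invalidate(int l2_idx) {
    l2_valid[l2_idx] = false;
    for (int i = 0; i < L1_LINES; i++)
        if (l1_valid[i] && l1_backer[i] == l2_idx)
            l1_valid[i] = false; /* back-invalidate: L1 ⊆ L2 still holds */
}
```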
Acknowledgements

§ These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
§ MIT material derived from course 6.823
§ UCB material derived from course CS252