Shared Memory Bus for Multiprocessor Systems

Mat Laibowitz and Albert Chiou
Group 6

Shared Memory Architecture

[Diagram: several CPUs and a memory block, with "?" marking the unknown interconnect between them]

• We want multiple processors to share memory
• Question: How do we connect them together?

Shared Memory Architecture

[Diagram: the CPUs connected either to a single, large memory or to multiple smaller memories]

Issues
• Scalability
• Access Time
• Cost
• Application: WLAN vs. single-chip multiprocessor

Cache Coherency Problem

[Diagram: CPU 0 and the other CPUs, each with a private cache ($), all sharing one memory]

• Each cache needs to correctly handle memory accesses across multiple processors (illustrated in the sketch below)
• A value written by one processor is eventually visible to the other processors
• When multiple writes happen to the same location by multiple processors, all the processors see the writes in the same order
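A minimal sketch of the problem (plain Python rather than hardware; the two-CPU scenario and names are illustrative, not from the slides): with private caches and no coherence protocol, a write by one CPU is never seen by another CPU that already holds the line.

    # Hypothetical: two CPUs with private caches and no coherence protocol.
    memory = {0x100: 0}

    class Cpu:
        def __init__(self, name):
            self.name = name
            self.cache = {}              # private cache: addr -> value

        def read(self, addr):
            if addr not in self.cache:   # miss: fetch from memory
                self.cache[addr] = memory[addr]
            return self.cache[addr]

        def write(self, addr, value):
            self.cache[addr] = value     # update the local copy...
            memory[addr] = value         # ...and even write through to memory

    cpu0, cpu1 = Cpu("CPU0"), Cpu("CPU1")
    print(cpu1.read(0x100))   # 0  -- CPU1 now caches address 0x100
    cpu0.write(0x100, 42)     # CPU0 updates the location
    print(cpu1.read(0x100))   # still 0 -- CPU1's copy is stale

A coherence protocol (MSI here) adds exactly the missing step: invalidating or updating the other cached copies when a write happens.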

Snooping vs Directory

[Diagram: CPU A, CPU B, CPU C, and memory; one cache holds the line as =M while the others are =I, shown before and after a write (= M !!!)]

MSI State Machine

[State diagram: states M, S, and I, with transitions labeled event/action: CPURd/--, CPUWr/-- (hits), CPURd/RingRd and CPUWr/RingInv (misses and upgrades that send a ring request), RingInv/-- (invalidate on a remote write), and DataMsg (data returning over the ring)]
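A rough behavioral sketch of the state diagram (plain Python, not the project's Bluespec; the event and action names simply mirror the edge labels above, and corner cases handled by the full transition chart are omitted):

    # Hypothetical per-line MSI next-state function.
    # CPURd/CPUWr come from the local CPU; RingRd/RingInv are observed on the ring.
    # Returns (next_state, message_to_put_on_the_ring_or_None).
    def msi_next(state, event):
        if event == "CPURd":
            if state == "I":
                return "S", "RingRd"      # read miss: request the data
            return state, None            # read hit in S or M
        if event == "CPUWr":
            if state == "M":
                return "M", None          # write hit, already exclusive
            return "M", "RingInv"         # invalidate the other copies first
        if event == "RingRd":
            if state == "M":
                return "S", "DataMsg"     # supply dirty data, drop to Shared
            return state, None
        if event == "RingInv":
            return "I", None              # another CPU is writing: invalidate
        raise ValueError(event)

    print(msi_next("I", "CPUWr"))   # ('M', 'RingInv')
    print(msi_next("M", "RingRd"))  # ('S', 'DataMsg')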

MSI Transition Chart

Cache State | Pending State | Incoming Ring Transaction | Incoming Processor Transaction | Actions
I & Miss | 0 | - | Read | Pending->1; SEND Read
I & Miss | 0 | - | Write | Pending->1; SEND Write
I & Miss | 0 | Read | - | PASS
I & Miss | 0 | Write | - | PASS
I & Miss | 0 | WriteBack | - | PASS
I & Miss | 1 | Read | - | DATA/S->Cache; SEND WriteBack(DATA)
I & Miss | 1 | Write (I/S) | - | DATA/M->Cache, Modify Cache; SEND WriteBack(DATA)
I & Miss | 1 | Write (M) | - | DATA/M->Cache, Modify Cache; SEND WriteBack(DATA), SEND WriteBack(data), Pending->2
S | 0 | - | Read(Hit) | -
S | 0 | - | Write | Pending->1; SEND Write
S | 0 | Read(Hit) | - | Add DATA; PASS
S | 0 | Read(Miss) | - | PASS
S | 0 | Write(Hit) | - | Add DATA; Cache->I & PASS
S | 0 | Write(Miss) | - | PASS
S | 0 | WriteBack | - | PASS
S | 1 | Write | - | Modify Cache; Cache->M & Pass Token
S | 1 | WriteBack | - | Pending->0, Pass Token
M | 0 | - | Read(Hit) | -
M | 0 | - | Write(Hit) | -
M | 0 | Read(Hit) | - | Add DATA; Cache->S & PASS
M | 0 | Read(Miss) | - | PASS
M | 0 | Write(Hit) | - | Add DATA; Cache->I & PASS
M | 0 | Write(Miss) | - | PASS
M | 0 | WriteBack | - | PASS
M | 1 | WriteBack | - | Pending->0 & Pass Token
M | 2 | WriteBack | - | Pending->1
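One way to read the chart is as a lookup table keyed by (cache state, pending count, incoming message). A hypothetical sketch encoding a few of the rows above (plain Python, not the actual Bluespec rules; "I" stands for the chart's "I & Miss" state):

    # A few rows of the transition chart as a dispatch table.
    # Key: (cache_state, pending, ring_msg, cpu_msg); value: the chart's action.
    TRANSITIONS = {
        ("I", 0, None, "Read"):       "Pending->1; SEND Read",
        ("I", 0, None, "Write"):      "Pending->1; SEND Write",
        ("I", 1, "Read", None):       "DATA/S->Cache; SEND WriteBack(DATA)",
        ("S", 0, "Write(Hit)", None): "Add DATA; Cache->I & PASS",
        ("M", 0, "Read(Hit)", None):  "Add DATA; Cache->S & PASS",
        ("M", 1, "WriteBack", None):  "Pending->0 & Pass Token",
    }

    def lookup(state, pending, ring_msg=None, cpu_msg=None):
        # Ring messages that do not concern this cache are simply forwarded.
        return TRANSITIONS.get((state, pending, ring_msg, cpu_msg), "PASS")

    print(lookup("I", 0, cpu_msg="Read"))          # Pending->1; SEND Read
    print(lookup("S", 0, ring_msg="Read(Miss)"))   # PASS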

Ring Topology

[Diagram: CPU 1, CPU 2, ..., CPU n, each attached to its own cache (Cache 1 ... Cache n) through a cache controller (Cache Controller 1 ... Cache Controller n); the cache controllers and a memory controller with the memory are connected in a ring]

Ring Implementation
• A ring topology was chosen for speed and its electrical characteristics
  – Only point-to-point connections
  – Behaves like a bus
  – Scalable
• Uses a token to ensure sequential consistency (sketched below)

mkMSICacheController
[Block diagram: each $ Controller sits between its client and the ring, with request/response FIFOs on the client side, ringIn/ringOut FIFOs on the ring side, a waitReg, a pending register, the token, and rules moving messages between them]

mkMultiCache
[Block diagram: the mkMSICache controllers are chained ringOut-to-ringIn around the ring together with mkDataMem, which is reached through the dataReqQ/dataRespQ and toDMem/fromDMem FIFOs]
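A rough sketch of the token idea (plain Python, not the project's Bluespec; the rule that a pending write only commits while its controller holds the token is an interpretation of the chart's "Pass Token" actions): with a single token circulating on the ring, at most one controller can commit a write at a time, so every controller observes the writes in the same order.

    # Hypothetical single-token ring: the token is forwarded hop by hop, and a
    # controller commits its pending write only while it holds the token.
    class Controller:
        def __init__(self, name):
            self.name = name
            self.pending_write = None      # (addr, value) waiting for the token

        def on_token(self, memory, order_log):
            if self.pending_write:
                addr, value = self.pending_write
                memory[addr] = value
                order_log.append((self.name, addr, value))  # globally agreed order
                self.pending_write = None

    ring = [Controller(f"C{i}") for i in range(4)]
    memory, order_log = {}, []
    ring[0].pending_write = (0x10, 1)
    ring[2].pending_write = (0x10, 2)

    for step in range(8):                  # circulate the token around the ring
        ring[step % len(ring)].on_token(memory, order_log)

    print(order_log)   # writes commit one at a time, in token order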

Test Rig

mkMultiCacheTH
[Block diagram: the mkMultiCacheTH test harness wraps mkMultiCache (the mkMSICache controllers) and mkDataMemoryController, with request and response FIFOs on each client port]

• An additional module was implemented that takes a single stream of memory requests and deals them out to the individual CPU data request ports (a rough sketch follows below)
• This module can either send one request at a time, wait for a response, and then go on to the next CPU, or it can deal them out as fast as the memory ports are ready
• This demux allows individual processor verification prior to multiprocessor verification

Test Rig (cont)
• It can then be fed set test routines to exercise all the transitions, or be hooked up to the random request generator
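A hypothetical sketch of that request demux (plain Python; the port model and names are made up for illustration, not the actual Bluespec interface):

    from collections import deque

    class FakePort:
        """Stand-in for one CPU's data request port: a small request FIFO
        with an immediate fake response."""
        def __init__(self, name):
            self.name = name
            self.inbox = deque()

        def ready(self):
            return len(self.inbox) < 2       # pretend the FIFO has depth 2

        def send(self, req):
            self.inbox.append(req)

        def wait_response(self):
            return self.inbox.popleft()      # pretend the response arrives at once

    def demux(requests, ports, lockstep=True):
        """lockstep=True : one request at a time, wait for its response
           (single-processor verification); lockstep=False: deal requests out
           as fast as the ports are ready."""
        requests = deque(requests)
        i = 0
        while requests:
            port = ports[i % len(ports)]
            if lockstep:
                port.send(requests.popleft())
                port.wait_response()
            elif port.ready():
                port.send(requests.popleft())
            i += 1

    ports = [FakePort(f"cpu{i}") for i in range(3)]
    demux([("Ld", 0x100), ("St", 0x104, 7), ("Ld", 0x104)], ports, lockstep=True)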

Trace

=> Cache 2: toknMsg op->Tk8
=> Cache 5: toknMsg op->Tk2
=> Cache 3: ringMsg op->WrBk addr->0000022c data->aaaaaaaa valid->1 cache->1
=> Cache 3: getState I
=> Cache 1: newCpuReq St { addr=00000230, data=ba4f0452 }
=> Cache 1: getState I
=> Cycle = 56
=> Cache 2: toknMsg op->Tk7
=> Cache 6: ringMsg op->Rd addr->00000250 data->aaaaaaaa valid->1 cache->6
=> DataMem: ringMsg op->WrBk addr->00000374 data->aaaaaaaa valid->1 cache->5
=> Cache 6: getState I
=> Cache 8: ringReturn op->Wr addr->000003a8 data->aaaaaaaa valid->1 cache->7
=> Cache 8: getState I
=> Cache 8: writeLine state->M addr->000003a8 data->4ac6efe7
=> Cache 3: ringMsg op->WrBk addr->00000360 data->aaaaaaaa valid->1 cache->4
=> Cache 3: getState I
=> Cycle = 57
=> Cache 6: toknMsg op->Tk2
=> Cache 3: toknMsg op->Tk8
=> Cache 4: ringMsg op->WrBk addr->0000022c data->aaaaaaaa valid->1 cache->1
=> Cache 4: getState I
=> Cycle = 58
=> dMemReq: St { addr=00000374, data=aaaaaaaa }
=> Cache 3: toknMsg op->Tk7
=> Cache 7: ringReturn op->Rd addr->00000250 data->aaaaaaaa valid->1 cache->6
=> Cache 7: writeLine state->S addr->00000250 data->aaaaaaaa
=> Cache 7: getState I
=> Cache 1: ringMsg op->WrBk addr->00000374 data->aaaaaaaa valid->1 cache->5
=> Cache 1: getState I
=> Cache 4: ringMsg op->WrBk addr->00000360 data->aaaaaaaa valid->1 cache->4
=> Cache 4: getState I
=> Cache 9: ringMsg op->WrBk addr->000003a8 data->aaaaaaaa valid->1 cache->7
=> Cache 9: getState I
=> Cycle = 59
=> Cache 5: ringMsg op->WrBk addr->0000022c data->aaaaaaaa valid->1 cache->1
=> Cache 5: getState I
=> Cache 7: toknMsg op->Tk2
=> Cache 3: execCpuReq Ld { addr=000002b8, tag=00 }
=> Cache 3: getState I
=> Cache 4: toknMsg op->Tk8
=> Cycle = 60
=> DataMem: ringMsg op->WrBk addr->000003a8 data->aaaaaaaa valid->1 cache->7
=> Cache 2: ringMsg op->WrBk addr->00000374 data->aaaaaaaa valid->1 cache->5
=> Cache 2: getState I
=> Cache 8: ringMsg op->WrBk addr->00000250 data->aaaaaaaa valid->1 cache->6
=> Cache 8: getState I
=> Cache 5: ringReturn op->WrBk addr->00000360 data->aaaaaaaa valid->1 cache->4
=> Cache 5: getState S
=> Cycle = 61
=> Cache 5: toknMsg op->Tk8

Design Exploration
• Scale up the number of cache controllers
• Add additional tokens to the ring, allowing basic pipelining of memory requests

Example
• Tokens service disjoint memory addresses (e.g. odd or even; see the sketch below)
• Compare average memory access time versus the number of tokens and the number of active CPUs
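A hypothetical sketch of the disjoint-address idea (plain Python; the modulo mapping generalizes the odd/even example above and is an assumption, not the measured design): each token owns a slice of the address space, so writes to different slices can be in flight at the same time.

    # Hypothetical mapping of addresses to tokens (2 tokens = odd/even split).
    def owning_token(addr, num_tokens):
        return addr % num_tokens

    def can_commit(pending_writes, token_positions):
        """A pending write commits only when the token owning its address is
        currently held by the requesting controller."""
        return [
            (ctrl, addr)
            for ctrl, addr in pending_writes
            if token_positions[owning_token(addr, len(token_positions))] == ctrl
        ]

    # Token 0 (even addresses) is at controller 1, token 1 (odd) at controller 3.
    print(can_commit([(1, 0x100), (3, 0x101), (2, 0x102)], {0: 1, 1: 3}))
    # the writes to 0x100 (even) and 0x101 (odd) proceed in parallel; 0x102 waits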

Test Results

[Chart: Number of Controllers vs. Avg. Access Time (2 Tokens) — average access time in clock cycles (0 to 30) for 3, 6, and 9 controllers]
[Chart: Number of Tokens vs. Avg. Access Time (9 Controllers) — average access time in clock cycles (0 to 30) for 2, 4, and 8 tokens]

Placed and Routed

Stats (9 caches, 8 tokens)
• Clock speed: 3.71 ns (~270 MHz; quick check below)
• Area: 1,296,726 µm² with memory
• Average memory access time: ~39 ns
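A quick arithmetic check on those numbers (derived from the figures above, not additional measured data):

    period_ns = 3.71
    print(1000 / period_ns)   # ~269.5 -> roughly a 270 MHz clock
    print(39 / period_ns)     # ~10.5  -> average access time of about 10-11 cycles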