Chapter 7

The Memory Hierarchy – Part II

Note: The slides being presented represent a mix. Some are created by Mark Franklin, Washington University in St. Louis, Dept. of CSE. Many are taken from the Patterson & Hennessy book, “Computer Organization & Design”, Copyright 1998 Morgan Kaufmann Publishers. This material may not be copied or distributed for commercial purposes without express written permission of the copyright holder. The original slides may be found at: http://books.elsevier.com/us//mk/us/subindex.asp?maintarget=companions/defaultindividual.asp&isbn=1558604286&country=United+States&srccode=&ref=&subcode=&head=&pdf=&basiccode=&txtSearch=&SearchField=&operator=&order=&community=mk Additionally, some of the slides are taken from V. Heuring & H. Jordan, “Computer Systems Design and Architecture”, 1997.

SP04


Outline:
• Memory components:
  – RAM memory cells and cell arrays
  – Static RAM—more expensive, but less complex
  – Tree and matrix decoders—needed for large RAM chips
  – Dynamic RAM—less expensive, but needs “refreshing”
    • Chip organization
    • Timing
  – ROM—read-only memory
• Memory boards
  – Arrays of chips give more addresses and/or wider words
  – 2-D and 3-D chip arrays
• Memory modules
  – Large systems can benefit by partitioning memory for
    • separate access by system components
    • fast access to multiple words


Memory Hierarchy Outline (cont.):
• The memory hierarchy: from fast and expensive to slow and cheap:
  – Registers → Cache → Main Memory → Disk
  – Consider two adjacent hierarchy levels: Cache → Main Memory
  – Cache: high speed, expensive (1st level on-chip, 2nd level off-chip)
    • Design types: direct mapped, fully associative, set associative
  – Virtual memory: makes the hierarchy to disk transparent
    • Translate the address from the CPU’s logical address to the physical address where the information is actually stored.
    • Memory management—how to move information back and forth.
    • Multiprogramming—what to do while we wait.
    • The TLB helps speed up the address translation process.
  – Memory as a subsystem: overall performance.


Memory Technology Characteristics

Level  Memory Type    Average Access Time  Typical Size  Unit of Transfer (Block Size)
1      Cache          0.5–20 ns            8 KB–32 MB    Word, 16–32 bits
2      Main Memory    40–200 ns            2 MB–16 GB    Cache line, 8–16 B
3      Disk           5–10 ms              > 100 GB      Page, 4–16 KB
4      Magnetic Tape  1–5 s                > 200 GB      Record, 16 KB

Levels of the Memory Hierarchy

[Figure: the hierarchy pyramid showing capacity, access time, and cost at each level, with the staging/transfer unit between adjacent levels; CPU registers (100s of bytes) at the top.]

Direct Mapping

Since main memory is much larger than the cache, how do we map general addresses into cache addresses?
• Case 1 – Direct mapping: the cache address is the memory block number modulo the number of blocks in the cache.

[Figure: an 8-word direct-mapped cache (indices 000–111) fed from a 32-word main memory. Memory words 00001, 01001, 10001, and 11001 all map to cache index 001; words 00101, 01101, 10101, and 11101 all map to index 101.]
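The modulo mapping above can be written out directly. The helper below is an illustrative sketch (not code from the slides), using the example's 8-block cache:

```python
def direct_mapped_index(block_number: int, cache_blocks: int) -> int:
    """Cache index for a memory block under direct mapping:
    index = block_number mod cache_blocks."""
    return block_number % cache_blocks

# 8-block cache, 32-block memory, as in the figure: memory blocks
# 1, 9, 17, 25 (00001, 01001, 10001, 11001) all compete for index 001.
indices = [direct_mapped_index(b, 8) for b in (1, 9, 17, 25)]
print(indices)  # every one of these blocks lands in cache block 1
```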

Direct-Mapped Cache

[Figure (Heuring & Jordan): a 256-line cache with tag memory (5-bit tag fields), valid bits, and 8-byte cache lines. The 16-bit main memory address is split into a 5-bit tag, an 8-bit group number, and a 3-bit byte offset. Group 0 can hold main memory blocks 0, 256, 512, …, 7936; group 1 can hold blocks 1, 257, 513, …, 7937; and so on. The tag (0–31) identifies which of the 32 candidate blocks currently occupies the line.]

Key idea: all the main memory blocks from a given group can go into only one location in the cache, corresponding to the group number. The cache therefore need only examine the single group that the reference specifies.

Direct-Mapped Cache for MIPS

[Figure (Patterson & Hennessy): a direct-mapped cache for 32-bit MIPS addresses. Bits 1–0 are the byte offset, bits 11–2 form the 10-bit index selecting one of 1024 entries, and bits 31–12 form the 20-bit tag. Each entry holds a valid bit, a 20-bit tag, and 32 bits of data; Hit is asserted when the stored tag matches the address tag and the valid bit is set.]

Direct-Mapped Cache Operation

[Figure (Heuring & Jordan): the datapath with an 8-to-256 group decoder, tag memory, valid bits, a 5-bit tag comparator, and a byte selector; the main memory address is split into tag, group, and byte fields.]

1. Decode the group number of the incoming main memory address to select the group.
2. A hit requires Match AND Valid.
3. Gate out the tag field.
4. Compare the cache tag with the incoming tag (5-bit comparator).
5. If a hit, gate out the cache line,
6. and use the byte field to select the desired word.
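The steps above can be modeled in software. The sketch below is an assumed toy model (not code from the slides) using the figure's field widths: a 5-bit tag, an 8-bit group, and a 3-bit byte offset within a 16-bit address:

```python
# Field widths from the Heuring & Jordan example: 16-bit address,
# 8-byte lines (3 byte bits), 256 groups (8 group bits), 5 tag bits.
TAG_BITS, GROUP_BITS, BYTE_BITS = 5, 8, 3

def split_address(addr: int):
    """Split a 16-bit address into (tag, group, byte) fields."""
    byte = addr & ((1 << BYTE_BITS) - 1)
    group = (addr >> BYTE_BITS) & ((1 << GROUP_BITS) - 1)
    tag = addr >> (BYTE_BITS + GROUP_BITS)
    return tag, group, byte

class DirectMappedCache:
    def __init__(self):
        self.valid = [False] * (1 << GROUP_BITS)
        self.tags = [0] * (1 << GROUP_BITS)

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, install the block."""
        tag, group, _ = split_address(addr)          # step 1: select the group
        if self.valid[group] and self.tags[group] == tag:
            return True                              # steps 2-5: Match AND Valid
        self.valid[group], self.tags[group] = True, tag
        return False

cache = DirectMappedCache()
print(cache.access(0x1234))  # cold miss: the line is installed
print(cache.access(0x1234))  # same address again: hit
```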


Direct-Mapped Cache with Multiword Blocks (64-KB cache, 4-word / 16-B blocks)

• Taking advantage of spatial locality:

[Figure (Patterson & Hennessy): bits 1–0 are the byte offset, bits 3–2 the block offset, bits 15–4 the 12-bit index into 4K entries, and bits 31–16 the 16-bit tag. Each entry holds a valid bit, a 16-bit tag, and 128 bits (four 32-bit words) of data; the block offset drives a 4-to-1 multiplexor that selects the requested 32-bit word.]

Performance as a Function of Block Size

• Increasing the block size tends to decrease the miss rate.

[Figure: miss rate (0–40%) versus block size (4–256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]

• Use split caches because there is more spatial locality in code:

Program  Block size in words  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                    6.1%                   2.1%            5.4%
gcc      4                    2.0%                   1.7%            1.9%
spice    1                    1.2%                   1.3%            1.2%
spice    4                    0.3%                   0.6%            0.4%

Fully Associative Cache

Associative-mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16-bit main memory address.*

[Figure (Heuring & Jordan): a 256-line cache with 8-byte lines. The main memory address is split into a 13-bit tag and a 3-bit byte offset; each tag memory entry holds a full 13-bit tag plus a valid bit. In the example, cache block 0 holds MM block 421 and cache block 2 holds MM block 119; any of the 8192 MM blocks can occupy any cache line.]

*16 bits, while unrealistically small, simplifies the examples.

Fully Associative Cache Mechanism

Because any block can reside anywhere in the cache, an associative (content-addressable) memory is used: all locations are searched simultaneously.

[Figure (Heuring & Jordan): the incoming 13-bit tag is placed in an argument register and compared against every entry of the associative tag memory at once. Match and valid bits flag the winning line, and a selector uses the 3-bit byte field to deliver the requested word to the CPU.]
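In hardware the tag comparison happens in parallel; in a software model, "search all locations simultaneously" becomes a scan over every line. The function below is an illustrative sketch (the example tag values 421 and 119 are taken from the figure):

```python
def associative_lookup(tags, valid, incoming_tag):
    """Return the index of the matching cache line, or None on a miss.
    Models the simultaneous compare of every tag against the
    argument register (match requires tag equality AND valid)."""
    for i, (t, v) in enumerate(zip(tags, valid)):
        if v and t == incoming_tag:
            return i
    return None

# 256-line cache; lines 0 and 2 hold blocks 421 and 119, as in the figure.
tags = [421, 0, 119] + [0] * 253
valid = [True, False, True] + [False] * 253
print(associative_lookup(tags, valid, 119))  # 2
print(associative_lookup(tags, valid, 421))  # 0
print(associative_lookup(tags, valid, 7))    # None (miss)
```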

Associative-Mapped Cache Properties

Advantage
• Most flexible of all—any MM block can go anywhere in the cache.

Disadvantages
• Large tag memory.
• Need to search the entire tag memory simultaneously → high hardware complexity.

Q: How is an associative search conducted at the logic-gate level?

Direct-mapped caches simplify the hardware by allowing each MM block to go into only one place in the cache, based on a simple modulo operation.

Direct-Mapped vs. Fully Associative Designs
• The direct-mapped cache uses less hardware, but is much more restrictive in block placement.
• If two blocks from the same group are frequently referenced, the cache will “thrash”: that is, repeatedly bring the two competing blocks into and out of the cache, causing a performance degradation.
• Block replacement strategy is trivial.
• Compromise—allow several cache blocks in each group: the set-associative cache.
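The thrashing behavior described above is easy to demonstrate with a toy model (an assumed sketch, not code from the slides): two blocks that share a direct-mapped group evict each other on every access.

```python
def direct_index(block: int, cache_blocks: int = 256) -> int:
    """Direct mapping: each block has exactly one possible location."""
    return block % cache_blocks

cache = {}    # index -> block number currently resident
misses = 0
# Blocks 0 and 256 both map to index 0; alternate between them.
for block in [0, 256] * 4:
    idx = direct_index(block)
    if cache.get(idx) != block:
        misses += 1           # miss: the competing block was just evicted
        cache[idx] = block
print(misses)  # 8 accesses, 8 misses: every access evicts the other block
```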

Three Basic Cache Designs

To get the advantages of both the direct-mapped and fully associative designs, we now consider the set-associative design.

Set-Associative Cache (2-Way)

Example shows a 256-line cache organized as 128 groups with a set of two lines per group—sometimes referred to as a 2-way set-associative cache.

[Figure (Heuring & Jordan, modified by W.J. Taffe): the main memory address is split into a 6-bit tag, a 7-bit set (group) number, and a 3-bit byte offset. Each group holds two 8-byte cache lines, each with its own 6-bit tag and valid bit. Group 0 can hold any two of main memory blocks 0, 128, 256, …, with the tags (0–63) identifying which two are currently present.]

Set-Associative Cache (2-Way), Doubled Capacity

Example shows 256 groups with a set of two lines per group. This model doubles the size of the cache memory to 512 lines.

[Figure (Heuring & Jordan): the main memory address is split into a 5-bit tag, an 8-bit set (group) number, and a 3-bit byte offset. Each of the 256 groups holds two 8-byte cache lines, each with its own 5-bit tag and valid bit. Group 0 can hold any two of main memory blocks 0, 256, 512, …, 7936, with tags 0–31 identifying which two are currently present.]

Four-Way Set-Associative Design

[Figure (Patterson & Hennessy): a 4-way set-associative cache with 256 sets. Bits 1–0 are the byte offset, bits 9–2 the 8-bit index, and bits 31–10 the 22-bit tag. Each set holds four (valid, tag, data) entries; four comparators check the tags in parallel, and a 4-to-1 multiplexor selects the data word on a hit.]

N-way set-associative cache vs. direct-mapped cache:
• N comparators vs. 1.
• Extra MUX delay for the data.
• Data comes AFTER Hit/Miss.
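The set-associative organization can be modeled compactly in software. The sketch below is an assumed toy model (not code from the slides); the 256-set, 2-way geometry and the LRU replacement policy are choices made for the demo:

```python
from collections import OrderedDict

class SetAssociativeCache:
    """N-way set-associative cache of block numbers with LRU replacement.
    Each set is an OrderedDict whose key order tracks recency of use."""

    def __init__(self, num_sets: int = 256, ways: int = 2):
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block: int) -> bool:
        """Return True on a hit; on a miss, install the block,
        evicting the least recently used line if the set is full."""
        s = self.sets[block % len(self.sets)]   # set index, as in direct mapping
        tag = block // len(self.sets)
        if tag in s:
            s.move_to_end(tag)                  # refresh LRU position
            return True
        if len(s) == self.ways:
            s.popitem(last=False)               # evict least recently used
        s[tag] = None
        return False

cache = SetAssociativeCache()
# The two blocks that thrash a direct-mapped cache now share one set:
hits = [cache.access(b) for b in [0, 256, 0, 256]]
print(hits)  # [False, False, True, True]
```

With `ways=1` the same model degenerates to a direct-mapped cache and the access pattern above misses every time.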

Set-Associative Performance

Program  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
gcc      1              2.0%                   1.7%            1.9%
gcc      2              1.6%                   1.4%            1.5%
gcc      4              1.6%                   1.4%            1.5%
spice    1              0.3%                   0.6%            0.4%
spice    2              0.3%                   0.6%            0.4%
spice    4              0.3%                   0.6%            0.4%

More Performance

[Figure: miss rate (0–15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB.]

4 Questions for the Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q3: Which block is replaced on a miss?
• Direct mapped: the modulo operation leaves no choice.
• Set associative or fully associative:
  – Random, LRU (Least Recently Used), FIFO

Miss rates by cache size, associativity, and replacement policy:

          2-way             4-way             8-way
Size      LRU     Random    LRU     Random    LRU     Random
16 KB     5.2%    5.7%      4.7%    5.3%      4.4%    5.0%
64 KB     1.9%    2.0%      1.5%    1.7%      1.4%    1.5%
256 KB    1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

Q4: What happens on a write?
• Write through—the information is written to both the block in the cache and the block in the lower-level memory.
• Write back—the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  – Is the block clean or dirty? Use a “dirty” bit.
• Pros and cons of each?
  – WT: PRO—cache consistency. CON—more memory-bus activity.
  – WB: PRO—repeated writes to the same location without tying up the memory bus. CON—memory consistency.
• WT is always combined with write buffers so there is no waiting on lower-level memory.
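The write-back advantage named above (repeated writes absorbed by the cache) can be shown with a toy model. This sketch is an assumption-laden illustration, not code from the slides: the "cache" holds a single one-word line and ignores timing.

```python
class WriteBackCache:
    """One-line write-back cache: writes go only to the cached line,
    and a dirty line is written to memory only when it is replaced."""

    def __init__(self):
        self.line = None        # (addr, value) or None
        self.dirty = False
        self.memory = {}
        self.mem_writes = 0     # counts memory-bus write traffic

    def write(self, addr: int, value: int) -> None:
        if self.line and self.line[0] != addr:
            if self.dirty:      # write the victim back only if modified
                self.memory[self.line[0]] = self.line[1]
                self.mem_writes += 1
        self.line, self.dirty = (addr, value), True

wb = WriteBackCache()
for v in range(5):
    wb.write(0x10, v)           # five writes to the same address
print(wb.mem_writes)  # 0: write-back absorbs them; write-through would do 5
```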

Write Buffer for Write Through

[Figure: Processor → Cache, with a Write Buffer between the Cache and DRAM.]

• A write buffer is needed between the cache and memory:
  – Processor: writes data into the cache and the write buffer.
  – Memory controller: writes the contents of the buffer to memory.
• The write buffer is just a FIFO:
  – Typical number of entries: 4 or 8.
  – Works if the write frequency is well below the rate at which the buffer drains to DRAM; otherwise the buffer fills and the processor must wait.
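The FIFO behavior described above can be sketched as follows (an assumed toy model, not code from the slides; the stall accounting is simplified to one stall per full-buffer enqueue):

```python
from collections import deque

class WriteBuffer:
    """FIFO write buffer between cache and DRAM: the processor enqueues
    writes and stalls only when the buffer is full."""

    def __init__(self, entries: int = 4):   # typical sizes: 4 or 8
        self.buf = deque()
        self.entries = entries
        self.stalls = 0

    def enqueue(self, addr: int, value: int) -> None:
        if len(self.buf) == self.entries:
            self.stalls += 1                # processor waits for a drain
            self.drain_one()
        self.buf.append((addr, value))

    def drain_one(self) -> None:
        if self.buf:
            self.buf.popleft()              # memory controller writes to DRAM

wb = WriteBuffer()
for i in range(6):
    wb.enqueue(i, i)                        # burst of 6 writes, no drains
print(wb.stalls)  # the 4-entry buffer absorbs the first 4; then 2 stalls
```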
