Chapter 7
The Memory Hierarchy – Part II
Note: The slides being presented represent a mix. Some are created by Mark Franklin, Washington University in St. Louis, Dept. of CSE. Many are taken from the Patterson & Hennessy book, “Computer Organization & Design”, Copyright 1998 Morgan Kaufmann Publishers. This material may not be copied or distributed for commercial purposes without express written permission of the copyright holder. The original slides may be found at: http://books.elsevier.com/us//mk/us/subindex.asp?maintarget=companions/defaultindividual.asp&isbn=1558604286&country=United+States&srccode=&ref=&subcode=&head=&pdf=&basiccode=&txtSearch=&SearchField=&operator=&order=&community=mk Additionally, some of the slides are taken from V. Heuring & H. Jordan, “Computer Systems Design and Architecture” 1997.
SP04
Outline:
• Memory components:
  – RAM memory cells and cell arrays
  – Static RAM: more expensive, but less complex
  – Tree and matrix decoders: needed for large RAM chips
  – Dynamic RAM: less expensive, but needs “refreshing”
    • Chip organization
    • Timing
  – ROM: read-only memory
• Memory boards
  – Arrays of chips give more addresses and/or wider words
  – 2-D and 3-D chip arrays
• Memory modules
  – Large systems can benefit by partitioning memory for
    • separate access by system components
    • fast access to multiple words
Memory Hierarchy Outline (cont.):
• The memory hierarchy: from fast and expensive to slow and cheap:
  – Registers → Cache → Main Memory → Disk
  – Consider two adjacent hierarchy levels: Cache → Main Memory
  – Cache: high speed, expensive (1st level on-chip, 2nd level off-chip)
    • Design types: direct mapped, associative, set associative
  – Virtual memory: makes the hierarchy to disk transparent
    • Translate the address from the CPU’s logical address to the physical address where the information is actually stored.
    • Memory management: how to move information back and forth.
    • Multiprogramming: what to do while we wait.
    • The TLB helps speed up the address translation process.
  – Memory as a subsystem: overall performance.
Memory Technology Characteristics

Level  Memory Type     Average Access Time  Typical Size  Unit of Transfer (Block Size)
1      Cache           0.5–20 ns            8 KB–32 MB    Word, 16–32 bits
2      Main Memory     40–200 ns            2 MB–16 GB    Cache line, 8–16 B
3      Disk            5–10 ms              > 100 GB      Page, 4–16 KB
4      Magnetic Tape   1–5 s                > 200 GB      Record, 16 KB
Levels of the Memory Hierarchy: Capacity, Access Time, Cost
[Figure: hierarchy pyramid from CPU registers (100s of bytes) down through cache, main memory, and disk, showing the staging/transfer unit between each pair of levels.]

Direct Mapping
• Since main memory is much larger than the cache, how do we map general memory addresses into cache addresses?
• Case 1 – Direct mapping: the cache address is the memory block address modulo the number of blocks in the cache.
[Figure: an 8-word cache (indices 000–111) and a 32-word main memory. Addresses 00001, 01001, 10001, and 11001 all map to cache index 001; addresses 00101, 01101, 10101, and 11101 all map to cache index 101.]
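The modulo mapping above can be sketched in a few lines of Python (a sketch; the sizes follow the figure’s 8-word cache and 32-word memory):

```python
# Direct mapping: each main-memory word can live in exactly one cache slot,
# chosen by taking its address modulo the number of cache entries.
CACHE_ENTRIES = 8  # the 8-word cache of the figure, indices 000-111

def cache_index(address: int) -> int:
    """Cache index for a memory address under direct mapping."""
    return address % CACHE_ENTRIES  # equivalently: address & 0b111

# Addresses 00001, 01001, 10001, 11001 (low three bits 001) share index 001;
# addresses 00101, 01101, 10101, 11101 (low three bits 101) share index 101.
```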
Direct-Mapped Cache
• Main memory address (16 bits): Tag (5 bits) | Group (8 bits) | Byte (3 bits).
• The cache holds 256 lines (groups 0–255) of 8 bytes each; the tag memory holds a 5-bit tag and a valid bit for each line.
• Key idea: all the main-memory blocks from a given group can go into only one location in the cache, corresponding to the group number. For example, main-memory blocks 0, 256, 512, …, 7936 (tag numbers 0–31) all compete for cache line 0, while blocks 255, 511, …, 8191 compete for cache line 255.
• Now the cache need only examine the single group that its reference specifies.
[Figure: tag memory with valid bits alongside the 256-line cache memory, with the main-memory block numbers that map to each line.]
SP04 (Heuring & Jordan)
Direct Mapped Cache Address (showing bit positions)
• For MIPS (32-bit address, 1024-entry cache of one word per block):
  – Byte offset: bits 1–0
  – Index: bits 11–2 (10 bits, selecting one of 1024 entries)
  – Tag: bits 31–12 (20 bits)
• A hit requires the selected entry to be valid and its stored tag to equal the address tag; the data output is the 32-bit word.
[Figure: address breakdown feeding the valid/tag/data array, with a tag comparator producing Hit and the 32-bit Data output.]
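The bit-field split above can be expressed directly with shifts and masks (a sketch matching the figure’s 20/10/2 split; the function name is illustrative):

```python
# Split a 32-bit address for the 1024-entry direct-mapped cache of the figure:
# bits [1:0] byte offset, bits [11:2] index, bits [31:12] tag.
def split_address(addr: int):
    byte_offset = addr & 0x3            # 2 bits: byte within the word
    index = (addr >> 2) & 0x3FF         # 10 bits: one of 1024 cache entries
    tag = (addr >> 12) & 0xFFFFF        # 20 bits: stored and compared on lookup
    return tag, index, byte_offset
```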
Direct-Mapped Cache Operation
Main memory address: Tag (5 bits) | Group (8 bits) | Byte (3 bits).
1. Decode the group number of the incoming main-memory address (an 8-to-256 decoder) to select the group.
2. Gate out that group’s tag field (5 bits) from the tag memory.
3. Compare the stored tag with the incoming tag using a 5-bit comparator.
4. If they match AND the valid bit is set, it is a cache hit; otherwise it is a cache miss.
5. On a hit, gate out the cache line (8 bytes),
6. and use the byte field to select the desired word through the selector.
[Figure: decoder, tag memory with valid bits, 5-bit comparator producing Hit/Miss, and the selector delivering the word.]
SP04 (Heuring & Jordan)
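The six steps can be modeled in software (a minimal sketch of the 16-bit tag/group/byte scheme above; class and method names are illustrative, not from the slides):

```python
# Direct-mapped lookup for a 16-bit address split as
# tag (5 bits) | group (8 bits) | byte (3 bits), one 8-byte line per group.
TAG_BITS, GROUP_BITS, BYTE_BITS = 5, 8, 3

class DirectMappedCache:
    def __init__(self):
        groups = 1 << GROUP_BITS
        self.valid = [False] * groups
        self.tags = [0] * groups
        self.lines = [bytes(8)] * groups

    def lookup(self, addr: int):
        byte = addr & 0b111                          # byte field
        group = (addr >> BYTE_BITS) & 0xFF           # step 1: select the group
        tag = addr >> (BYTE_BITS + GROUP_BITS)       # incoming tag
        stored = self.tags[group]                    # step 2: gate out stored tag
        if self.valid[group] and stored == tag:      # steps 3-4: compare + valid
            return self.lines[group][byte]           # steps 5-6: gate line, pick byte
        return None                                  # cache miss

    def fill(self, addr: int, line: bytes):
        """Bring an 8-byte line into the cache after a miss."""
        group = (addr >> BYTE_BITS) & 0xFF
        self.tags[group] = addr >> (BYTE_BITS + GROUP_BITS)
        self.valid[group] = True
        self.lines[group] = line
```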
Direct Mapped Cache (64-KB cache with 4-word, 16-byte blocks)
• Taking advantage of spatial locality with multiword blocks:
  – Byte offset: bits 1–0
  – Block offset: bits 3–2 (selects one of the 4 words in a block)
  – Index: bits 15–4 (12 bits, selecting one of 4K entries)
  – Tag: bits 31–16 (16 bits)
[Figure: the 4K-entry valid/tag/data array with 128-bit data entries; the tag comparator produces Hit, and a multiplexor driven by the block offset selects the requested 32-bit word.]
Performance as a function of block size
• Increasing the block size tends to decrease the miss rate.
[Figure: miss rate (0–40%) versus block size (4–256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls as block size grows, then flattens or rises again for the smallest caches.]
• Use split caches because there is more spatial locality in code:

Program  Block size (words)  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                   6.1%                   2.1%            5.4%
gcc      4                   2.0%                   1.7%            1.9%
spice    1                   1.2%                   1.3%            1.2%
spice    4                   0.3%                   0.6%            0.4%
Fully Associative Cache
Associative-mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16-bit main memory address.*
• Main memory address: Tag (13 bits) | Byte (3 bits).
• The cache holds 256 lines of 8 bytes each; the tag memory stores a 13-bit tag and a valid bit (1 bit) per line.
[Figure: cache blocks 0–255 holding arbitrary main-memory blocks, e.g. one cache block holds MM block 421 and another holds MM block 119, while lines with valid bit 0 hold no block; main memory spans MM blocks 0–8191.]
*16 bits, while unrealistically small, simplifies the examples. SP04 (Heuring & Jordan)
Fully Associative Cache Mechanism
Because any block can reside anywhere in the cache, an associative (content-addressable) memory is used: all locations are searched simultaneously.
1. The tag field of the main-memory address (Tag, 13 bits | Byte, 3 bits) is loaded into the argument register.
2. The argument is compared against every entry of the associative tag memory at once.
3. A match bit is set for the line (if any) whose tag matches and whose valid bit is set.
4. The matching cache block (one 8-byte line) is read out,
5. and the byte field drives the selector
6. to deliver the desired word to the CPU.
[Figure: argument register, associative tag memory with match and valid bits, cache blocks 0–255, and the selector feeding the CPU.]
SP04 (Heuring & Jordan)
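A software analogue of the associative search can make the mechanism concrete. The hardware compares the argument register against every tag at once; the sketch below models that with a scan over (tag, valid) pairs (names and sizes are illustrative):

```python
# Fully associative lookup for a 16-bit address split as
# tag (13 bits) | byte (3 bits). In hardware all tag comparisons
# happen simultaneously; the loop here is the software stand-in.
def associative_lookup(tags, valid, lines, addr, byte_bits=3):
    tag = addr >> byte_bits                      # argument register contents
    byte = addr & ((1 << byte_bits) - 1)
    for i, (t, v) in enumerate(zip(tags, valid)):
        if v and t == tag:                       # match bit AND valid bit
            return lines[i][byte]                # selector routes word to CPU
    return None                                  # no match bit set: miss
```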
Associative Mapped Cache Properties
Advantage
• Most flexible of all: any main-memory block can go anywhere in the cache.
Disadvantages
• Large tag memory.
• Need to search the entire tag memory simultaneously → high hardware complexity.
Q: How is an associative search conducted at the logic-gate level?
Direct-mapped caches simplify the hardware by allowing each main-memory block to go into only one place in the cache, based on a simple modulo operation.
Direct-Mapped vs. Fully Associative Designs
• The direct-mapped cache uses less hardware, but is much more restrictive in block placement.
• If two blocks from the same group are frequently referenced, the cache will “thrash”: that is, repeatedly bring the two competing blocks into and out of the cache. This causes performance degradation.
• Block replacement strategy is trivial.
• Compromise: allow several cache blocks in each group, giving the set-associative cache.
SP04 (Heuring & Jordan)
Three Basic Cache Designs
To get the advantages of both the direct-mapped and fully associative designs, we now consider the set-associative design.
Set-Associative Cache (2-Way)
Example shows 128 groups with a set of two lines per group, referred to as a 2-way set-associative cache.
• Main memory address: Tag (6 bits) | Set/Group (7 bits) | Byte (3 bits); the cache group address is the 7-bit group number.
• Each group holds two cache lines of 8 bytes, each with its own 6-bit tag (12 bits of tag per group) and valid bit; tag numbers run 0–63.
• All main-memory blocks with the same group number (e.g. blocks 0, 128, 256, …, 7936 for group 0) compete for the two lines of that group; blocks 127, 255, …, 8191 compete for group 127.
[Figure: tag memory and 2-way cache memory for the 128 groups, with the main-memory block numbers that map to each group.]
SP04 (Heuring & Jordan, modified by W.J. Taffe)
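A 2-way lookup can be sketched as follows, matching the tag/group/byte split above (6 | 7 | 3 bits); the data layout and names are illustrative:

```python
# 2-way set-associative lookup: the group field selects one set, and only
# the two tags in that set are compared (vs. all 256 in a fully
# associative cache, or exactly one in a direct-mapped cache).
WAYS, GROUP_BITS, BYTE_BITS = 2, 7, 3

def set_assoc_lookup(sets, addr):
    """sets: list of groups; each group is a list of (valid, tag, line)."""
    byte = addr & 0b111
    group = (addr >> BYTE_BITS) & ((1 << GROUP_BITS) - 1)
    tag = addr >> (BYTE_BITS + GROUP_BITS)
    for valid, t, line in sets[group]:     # only WAYS comparisons needed
        if valid and t == tag:
            return line[byte]
    return None                            # miss: neither way matched
```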
Set-Associative Cache (2-Way), Larger Model
Example shows 256 groups with a set of two lines per group.
• Main memory address: Tag (5 bits) | Set/Group (8 bits) | Byte (3 bits).
• Each group holds two 8-byte cache lines, each with a 5-bit tag and valid bit; tag numbers run 0–31.
• Main-memory blocks 0, 256, 512, …, 7936 compete for the two lines of group 0; blocks 255, 511, …, 8191 compete for group 255.
• Compared with the previous model, keeping 256 groups at two ways each doubles the size of the cache memory.
[Figure: tag memory and 2-way cache memory for the 256 groups, with the main-memory block numbers that map to each group.]
SP04 (Heuring & Jordan)
Four-way Set Associative Design
• Address (32 bits): Tag = bits 31–10 (22 bits), Index = bits 9–2 (8 bits, selecting one of 256 sets), byte offset = bits 1–0.
• Each set holds four (V, Tag, Data) entries; four comparators check the four tags in parallel, and a 4-to-1 multiplexor selects the data of the matching way.
[Figure: the 256-set, 4-way array with four tag comparators feeding the hit logic and the 4-to-1 multiplexor producing Data.]
N-way Set Associative Cache vs. Direct Mapped Cache:
• N comparators vs. 1
• Extra MUX delay for the data
• Data comes AFTER Hit/Miss
Set Associative Performance

Program  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
gcc      1              2.0%                   1.7%            1.9%
gcc      2              1.6%                   1.4%            1.5%
gcc      4              1.6%                   1.4%            1.5%
spice    1              0.3%                   0.6%            0.4%
spice    2              0.3%                   0.6%            0.4%
spice    4              0.3%                   0.6%            0.4%
More Performance
[Figure: miss rate (0–15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB; higher associativity helps most for the smaller caches.]
4 Questions for the Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Q3: Which block is replaced on a miss?
• Direct mapped: the modulo operation → no choice.
• Set associative or fully associative:
  – Random, LRU (Least Recently Used), FIFO

Miss rates by associativity and replacement policy:

Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB    5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB    1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB   1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
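LRU bookkeeping for a single set can be sketched with Python’s `collections.OrderedDict`, where the key touched least recently sits at the front and is the victim on a miss (a minimal sketch; the class name is illustrative and the real policy is implemented in hardware):

```python
from collections import OrderedDict

# LRU replacement for one set of an N-way set-associative cache.
class LRUSet:
    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> line; insertion order = recency

    def access(self, tag):
        """Touch a block; returns True on hit, False on miss (with eviction)."""
        hit = tag in self.blocks
        if hit:
            self.blocks.move_to_end(tag)         # mark as most recently used
        else:
            if len(self.blocks) == self.ways:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[tag] = None              # bring the new block in
        return hit
```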
Q4: What happens on a write?
• Write through: the information is written to both the block in the cache and the block in the lower-level memory.
• Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.
  – Is the block clean or dirty? Use a “dirty” bit.
• Pros and cons of each?
  – WT: PRO – cache consistency. CON – more memory-bus activity.
  – WB: PRO – repeated writes to the same location without tying up the memory bus. CON – memory consistency.
• WT is always combined with write buffers so there is no waiting on the lower-level memory.
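The two policies contrast neatly in code. The sketch below models main memory as a dict and a single cached block per policy (illustrative names; not a full cache model):

```python
# Write through: every write updates both the cache copy and memory.
class WriteThroughBlock:
    def __init__(self, memory, addr):
        self.memory, self.addr = memory, addr
        self.data = memory[addr]

    def write(self, value):
        self.data = value
        self.memory[self.addr] = value   # write also goes to memory: bus traffic

# Write back: writes touch only the cache; memory is updated at replacement.
class WriteBackBlock:
    def __init__(self, memory, addr):
        self.memory, self.addr = memory, addr
        self.data = memory[addr]
        self.dirty = False               # the "dirty" bit from the slide

    def write(self, value):
        self.data = value
        self.dirty = True                # memory copy is now stale

    def evict(self):
        if self.dirty:                   # write back only if modified
            self.memory[self.addr] = self.data
            self.dirty = False
```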
Write Buffer for Write Through

Processor → Cache → Write Buffer → DRAM

• A write buffer is needed between the cache and memory:
  – Processor: writes data into the cache and the write buffer.
  – Memory controller: writes the contents of the buffer to memory.
• The write buffer is just a FIFO:
  – Typical number of entries: 4, 8.
  – Works if the write frequency stays well below the rate at which DRAM can retire writes; otherwise the buffer fills and the processor must stall.
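The buffer’s behavior, including the stall case when writes arrive faster than DRAM can drain them, can be sketched with a bounded FIFO (sizes and names are illustrative):

```python
from collections import deque

# A small FIFO write buffer between processor and DRAM: the processor
# enqueues stores and continues; the memory controller drains the queue.
class WriteBuffer:
    def __init__(self, entries: int = 4):
        self.entries = entries
        self.fifo = deque()

    def cpu_write(self, addr, value) -> bool:
        """Returns False (processor must stall) if the buffer is full."""
        if len(self.fifo) == self.entries:
            return False              # writes arriving faster than DRAM drains
        self.fifo.append((addr, value))
        return True

    def drain_one(self, dram: dict):
        """Memory controller retires the oldest buffered write."""
        if self.fifo:
            addr, value = self.fifo.popleft()
            dram[addr] = value
```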