ECE4680 Computer Organization and Architecture Memory Hierarchy: Cache System
ECE4680 Cache.1
2002-4-17
The Motivation for Caches

[Diagram: Memory System: Processor - Cache - DRAM]

°Motivation:
 • Large memories (DRAM) are slow
 • Small memories (SRAM) are fast
°Make the average access time small by:
 • Servicing most accesses from a small, fast memory
°Reduce the bandwidth required of the large memory
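The benefit of servicing most accesses from the small, fast memory can be quantified with the standard average memory access time (AMAT) formula. A minimal sketch in Python; the 1 ns hit time, 5% miss rate, and 70 ns miss penalty are illustrative assumptions, not figures from the slides:

```python
# AMAT = hit_time + miss_rate * miss_penalty; all figures below are
# illustrative assumptions.
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

# 1 ns SRAM cache, 5% miss rate, 70 ns DRAM penalty:
with_cache = amat(1.0, 0.05, 70.0)
print(with_cache)   # 4.5 ns average, vs 70 ns if every access went to DRAM
```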
An Expanded View of the Memory System

[Diagram: Processor (Control + Datapath) connected to successive levels of Memory]

°Moving away from the processor:
 • Speed: Fastest to Slowest
 • Size: Smallest to Biggest
 • Cost: Highest to Lowest
Levels of the Memory Hierarchy

[Diagram: the hierarchy levels, each labeled with its Capacity, Access Time, and Cost, with a Staging/Transfer Unit between levels; CPU registers (100s of bytes) sit at the Upper Level]

Cycle Time versus Access Time

°DRAM (Read/Write) Cycle Time is about twice the DRAM (Read/Write) Access Time (2:1); why?
°DRAM (Read/Write) Cycle Time:
 • How frequently can you initiate an access?
 • Analogy: a little kid can only ask his father for money on Saturday
°DRAM (Read/Write) Access Time:
 • How quickly will you get what you want once you initiate an access?
 • Analogy: as soon as he asks, his father gives him the money
°DRAM Bandwidth Limitation analogy:
 • What happens if he runs out of money on Wednesday?
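The distinction matters for bandwidth: the sustained request rate is limited by the cycle time, not the access time. A quick sketch, using the 60 ns / 110 ns figures quoted later in the deck:

```python
# Cycle time limits how often accesses can be INITIATED; access time is
# how long one access takes. Figures are from the DRAM example later in
# the deck (60 ns tRAC, 110 ns tRC).
access_time_ns = 60    # latency of one access, once initiated
cycle_time_ns = 110    # minimum time between initiating row accesses

max_accesses_per_us = 1000 / cycle_time_ns   # set by cycle time
print(round(max_accesses_per_us, 2))         # 9.09, not 1000/60 = 16.67
```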
Increasing Bandwidth - Interleaving

°Access pattern without interleaving (CPU and a single Memory Bank):
 • Start access for D1; only after D1 is available can the access for D2 start
°Access pattern with 4-way interleaving (CPU and Memory Banks 0-3):
 • Access Bank 0, Bank 1, Bank 2, and Bank 3 on successive cycles
 • By the time Bank 3 has been started, we can access Bank 0 again
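The access patterns above can be sketched as a small simulation. The 4-cycle bank busy time and the one-access-started-per-cycle model are illustrative assumptions:

```python
# Toy model of bank interleaving: at most one access starts per cycle,
# and a bank stays busy for `busy` cycles after an access starts.
# The 4-cycle busy time is an illustrative assumption.
def start_times(n_words, n_banks, busy=4):
    """Cycle at which each of n_words sequential word accesses starts."""
    free_at = [0] * n_banks   # cycle at which each bank is next free
    starts, t = [], 0
    for i in range(n_words):
        bank = i % n_banks
        t = max(t, free_at[bank])   # stall if the bank is still busy
        starts.append(t)
        free_at[bank] = t + busy
        t += 1
    return starts

print(start_times(8, 1))   # [0, 4, 8, 12, 16, 20, 24, 28]: no overlap
print(start_times(8, 4))   # [0, 1, 2, 3, 4, 5, 6, 7]: one per cycle
```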
Main Memory Performance

°Timing model:
 • 1 cycle to send the address
 • 6 cycles access time, 1 cycle to send a data word
 • Cache block is 4 words
°Simple: M.P. (miss penalty) = 4 x (1 + 6 + 1) = 32 cycles
°Wide: M.P. = 1 + 6 + 1 = 8 cycles
°Interleaved: M.P. = 1 + 6 + 4 x 1 = 11 cycles
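These three miss penalties follow directly from the stated timing model:

```python
# Miss penalty under the slide's timing model: 1 cycle for the address,
# 6 cycles of access time, 1 cycle per word of data; block = 4 words.
ADDR, ACCESS, XFER, WORDS = 1, 6, 1, 4

simple      = WORDS * (ADDR + ACCESS + XFER)   # one narrow access per word
wide        = ADDR + ACCESS + XFER             # whole block in one access
interleaved = ADDR + ACCESS + WORDS * XFER     # accesses overlap in banks

print(simple, wide, interleaved)   # 32 8 11
```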
Independent Memory Banks

°How many banks?
 • number of banks >= number of clock cycles to access a word in a bank
 • Needed for sequential accesses; otherwise the CPU will return to the original bank before it has the next word ready
°Increasing DRAM density => fewer chips => harder to have many banks
 • Growth in bits/chip of DRAM: 50%-60%/yr
 • Nathan Myhrvold, Microsoft: mature software growth (33%/yr for NT) vs. growth in MB/$ of DRAM (25%-30%/yr)
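A minimal check of the rule above, assuming one sequential access is started per clock cycle (the helper name is hypothetical):

```python
# Rule sketch: with sequential accesses starting one per clock, bank 0
# is revisited after n_banks cycles, so it must have finished its
# access (which takes access_cycles) by then.
def enough_banks(n_banks, access_cycles):
    return n_banks >= access_cycles

print(enough_banks(4, 4))   # True: the bank is ready just in time
print(enough_banks(2, 4))   # False: we return before the word is ready
```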
SPARCstation 20’s Memory System

[Diagram: Memory Modules 0 through 7 on the Memory Bus (SIMM Bus), a 128-bit wide datapath, connected to the Processor Module (Mbus Module) containing the SuperSPARC processor and the External Cache]

SPARCstation 20’s External Cache

°SPARCstation 20’s External Cache:
 • Size and organization: 1 MB, direct mapped
 • Block size: 128 B
 • Sub-block size: 32 B
 • Write policy: write back, write allocate
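Given these parameters, the address breakdown for the direct-mapped lookup works out as follows; the 32-bit address width is an assumption for illustration:

```python
# Address-field sketch for a 1 MB direct-mapped cache with 128 B blocks,
# assuming 32-bit addresses (an illustrative assumption).
import math

CACHE_BYTES = 1 << 20   # 1 MB
BLOCK_BYTES = 128
ADDR_BITS = 32          # assumed address width

offset_bits = int(math.log2(BLOCK_BYTES))                 # within a block
index_bits = int(math.log2(CACHE_BYTES // BLOCK_BYTES))   # 8192 blocks
tag_bits = ADDR_BITS - index_bits - offset_bits

print(offset_bits, index_bits, tag_bits)   # 7 13 12
```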
SPARCstation 20’s Internal Instruction Cache

[Diagram: Processor Module (Mbus Module) with the SuperSPARC processor, its internal 20 KB 5-way Instruction Cache and internal Data Cache, and the external 1 MB direct-mapped, write-back, write-allocate cache]
°SPARCstation 20’s Internal Instruction Cache:
 • Size and organization: 20 KB, 5-way set associative
 • Block size: 64 B
 • Sub-block size: 32 B
 • Write policy: does not apply
°Note: sub-block size is the same as the External (L2) Cache’s

SPARCstation 20’s Internal Data Cache

°SPARCstation 20’s Internal Data Cache:
 • Size and organization: 16 KB, 4-way set associative
 • Block size: 64 B
 • Sub-block size: 32 B
 • Write policy: write through, write not allocate
°Sub-block size is the same as the External (L2) Cache’s
°Why did they use an N-way set associative cache internally?
 • Answer: an N-way set associative cache is like having N direct mapped caches in parallel. They want each of those N direct mapped caches to be 4 KB, the same as the “virtual page size.”
°How many levels of cache does the SPARCstation 20 have?
 • Answer: three levels. (1) Internal I & D caches, (2) External cache, and (3) ...
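The per-way arithmetic behind that answer: each way of an N-way set associative cache holds cache_size / N bytes, and both internal caches were sized so one way equals the 4 KB virtual page (the way_size helper below is a hypothetical illustration, not SPARCstation code):

```python
# Sketch: each way of an N-way set associative cache acts like a
# direct-mapped cache of cache_size / N bytes.
PAGE_BYTES = 4 * 1024   # virtual page size

def way_size(cache_bytes, associativity):
    return cache_bytes // associativity

print(way_size(20 * 1024, 5) == PAGE_BYTES)   # True: 20 KB 5-way I-cache
print(way_size(16 * 1024, 4) == PAGE_BYTES)   # True: 16 KB 4-way D-cache
```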
SPARCstation 20’s Memory Module

°Supports a wide range of sizes:
 • Smallest, 4 MB: 16 2-Mb DRAM chips, 8 KB of Page Mode SRAM
 • Biggest, 64 MB: 32 16-Mb chips, 16 KB of Page Mode SRAM

[Diagram: DRAM Chips 0 through 15, each 256K x 8 = 2 Mb (512 rows x 512 columns), each feeding a 512 x 8 Page Mode SRAM row buffer, 8 bits wide, onto the Memory Bus]
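A quick capacity check of the two configurations (chip sizes are in megabits, module sizes in megabytes):

```python
# Capacity check for the two module configurations: n chips of
# chip_megabits each, converted from megabits to megabytes.
def module_megabytes(n_chips, chip_megabits):
    return n_chips * chip_megabits // 8

print(module_megabytes(16, 2))    # 4: smallest module
print(module_megabytes(32, 16))   # 64: biggest module
```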
DRAM Performance

°A 60 ns (tRAC) DRAM can:
 • perform a row access only every 110 ns (tRC)
 • perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
   - In practice, external address delays and turning the buses around make it 40 to 50 ns
°These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead
 • Drive parallel DRAMs, external memory controller, bus turnaround, SIMM module, pins…
 • 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM
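Page mode is what the memory module’s SRAM buffers exploit: after one full row access, subsequent words from the same row pay only the column cycle time. A sketch using the slide’s figures (the uniform-burst model is a simplifying assumption):

```python
# Reading n_words from a single DRAM row in page mode: the first word
# pays the full row access (tRAC); later words pay only the page-mode
# column cycle time (tPC). Uniform-burst timing is a simplification.
tRAC, tRC, tCAC, tPC = 60, 110, 15, 35   # ns, from the slide

def row_read_ns(n_words):
    return tRAC + (n_words - 1) * tPC

print(row_read_ns(1))   # 60
print(row_read_ns(4))   # 165 = 60 + 3*35
```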
Summary:

°The Principle of Locality: temporal locality vs spatial locality
°Four questions for any cache:
 • Where can a block be placed in the cache?
 • How is a block located in the cache?
 • Which block is replaced on a miss?
 • Write policy: write through vs write back
   - Write miss: write allocate vs write not allocate
°Three major categories of cache misses:
 • Compulsory misses: sad facts of life. Example: cold-start misses
 • Conflict misses: reduced by increasing cache size and/or associativity. Nightmare scenario: the ping-pong effect!
 • Capacity misses: reduced by increasing cache size
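The ping-pong effect named above can be demonstrated with a toy direct-mapped cache model; the 32 B block size and the specific addresses are illustrative assumptions:

```python
# Toy direct-mapped cache: addresses whose blocks share a set index
# evict each other repeatedly (the ping-pong effect). The 32 B block
# size and the addresses are illustrative assumptions.
def count_misses(addresses, n_sets, block_bytes=32):
    cache, misses = {}, 0          # set index -> tag
    for a in addresses:
        block = a // block_bytes
        index, tag = block % n_sets, block // n_sets
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag     # evict whatever was there
    return misses

trace = [0, 8192] * 4              # two blocks, 8 KB apart
print(count_misses(trace, 256))    # 8: conflict misses on every access
print(count_misses(trace, 512))    # 2: larger cache, only cold misses
```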