Cache Memory and Performance
Memory Hierarchy 1
Many of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP) Randal E. Bryant and David R. O'Hallaron http://csapp.cs.cmu.edu/public/lectures.html
The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.
CS@VT
Computer Organization II
©2005-2015 CS:APP & McQuain
An Example Memory Hierarchy
Levels toward the top (L0) are smaller, faster, and costlier per byte; levels toward the bottom (L5) are larger, slower, and cheaper per byte.
L0: Registers. CPU registers hold words retrieved from L1 cache.
L1: L1 cache (SRAM). L1 cache holds cache lines retrieved from L2 cache.
L2: L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).
Random-Access Memory (RAM)
Key features
– RAM is traditionally packaged as a chip.
– Basic storage unit is normally a cell (one bit per cell).
– Multiple RAM chips form a memory.
Static RAM (SRAM)
– Each cell stores a bit with a four- or six-transistor circuit.
– Retains value indefinitely, as long as it is kept powered.
– Relatively insensitive to electrical noise (EMI), radiation, etc.
– Faster and more expensive than DRAM.
Dynamic RAM (DRAM)
– Each cell stores a bit with a capacitor. One transistor is used for access.
– Value must be refreshed every 10-100 ms.
– More sensitive to disturbances (EMI, radiation, ...) than SRAM.
– Slower and cheaper than SRAM.
SRAM vs DRAM Summary
        Trans.    Access   Needs      Needs
        per bit   time     refresh?   EDC?    Cost   Applications
SRAM    4 or 6    1X       No         Maybe   100x   Cache memories
DRAM    1         10X      Yes        Yes     1X     Main memories, frame buffers
Traditional CPU-Memory Bus Structure
A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices.
[Figure: CPU chip (register file, ALU, bus interface) connected by the system bus to the I/O bridge, which connects by the memory bus to main memory.]
Memory Read Transaction (1)
CPU places address A on the memory bus.
Load operation: movl A, %eax
[Figure: register file (%eax), ALU, bus interface, I/O bridge, and main memory holding word x at address A; address A is on the bus.]
Memory Read Transaction (2)
Main memory reads A from the memory bus, retrieves word x, and places it on the bus.
Load operation: movl A, %eax
[Figure: same components; word x from address A in main memory is on the bus.]
Memory Read Transaction (3)
CPU reads word x from the bus and copies it into register %eax.
Load operation: movl A, %eax
[Figure: same components; register %eax now holds x, and main memory still holds x at address A.]
Memory Write Transaction (1)
CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive.
Store operation: movl %eax, A
[Figure: register file (%eax holding y), ALU, bus interface, I/O bridge, and main memory; address A is on the bus.]
Memory Write Transaction (2)
CPU places data word y on the bus.
Store operation: movl %eax, A
[Figure: same components; data word y from %eax is on the bus.]
Memory Write Transaction (3)
Main memory reads data word y from the bus and stores it at address A.
Store operation: movl %eax, A
[Figure: same components; main memory now holds y at address A.]
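The three-step read and write transactions above can be mimicked in a few lines of C. This is only a rough sketch under stated assumptions: the names struct bus, struct memory, bus_read, and bus_write are hypothetical, addresses are treated as word indices into a tiny 16-word memory, and the bus is modeled as two plain fields rather than real hardware signals.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 16   /* hypothetical tiny main memory, in 32-bit words */

/* Hypothetical model: the bus carries one address and one data word. */
struct bus    { uint32_t address; uint32_t data; };
struct memory { uint32_t words[MEM_WORDS]; };

/* Read transaction, mirroring the three steps of "movl A, %eax".
 * The return value plays the role of the value copied into %eax. */
uint32_t bus_read(struct bus *b, const struct memory *m, uint32_t a)
{
    assert(a < MEM_WORDS);
    b->address = a;                  /* (1) CPU places address A on the bus   */
    b->data = m->words[b->address];  /* (2) memory retrieves x, puts it on bus */
    return b->data;                  /* (3) CPU copies x from bus to register  */
}

/* Write transaction, mirroring the three steps of "movl %eax, A". */
void bus_write(struct bus *b, struct memory *m, uint32_t a, uint32_t y)
{
    assert(a < MEM_WORDS);
    b->address = a;                  /* (1) CPU places A; memory reads it and waits */
    b->data = y;                     /* (2) CPU places data word y on the bus       */
    m->words[b->address] = b->data;  /* (3) memory stores y at address A            */
}
```

Writing a word with bus_write and then reading the same address with bus_read should return that word, which is the behavior the slides describe for a store followed by a load.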
The Bigger Picture: I/O Bus
[Figure: CPU chip (register file, ALU, bus interface) connects over the system bus to the I/O bridge, which connects over the memory bus to main memory and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]
Storage Trends
SRAM
Metric              1980     1985    1990   1995   2000   2005   2010   2010:1980
$/MB                19,200   2,900   320    256    100    75     60     320
access (ns)         300      150     35     15     3      2      1.5    200
DRAM
Metric              1980     1985    1990   1995   2000   2005    2010    2010:1980
$/MB                8,000    880     100    30     1      0.1     0.06    130,000
access (ns)         375      200     100    70     60     50      40      9
typical size (MB)   0.064    0.256   4      16     64     2,000   8,000   125,000
Disk
Metric              1980   1985   1990   1995    2000     2005      2010        2010:1980
$/MB                500    100    8      0.30    0.01     0.005     0.0003      1,600,000
access (ms)         87     75     28     10      8        4         3           29
typical size (MB)   1      10     160    1,000   20,000   160,000   1,500,000   1,500,000
The CPU-Memory Gap
The gap between DRAM, disk, and CPU speeds.
[Figure: time in ns (log scale, 0.1 to 100,000,000) versus year (1980 to 2010) for disk seek time, flash SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time.]
Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently.
Temporal locality:
– Recently referenced items are likely to be referenced again in the near future.
Spatial locality:
– Items with nearby addresses tend to be referenced close together in time.
Locality Example
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
Data references
– Reference array elements in succession (stride-1 reference pattern): spatial locality.
– Reference variable sum each iteration: temporal locality.
Instruction references
– Reference instructions in sequence: spatial locality.
– Cycle through loop repeatedly: temporal locality.
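To make the stride-1 idea more concrete, the sketch below contrasts two ways of summing a two-dimensional array. The function names and the ROWS and COLS dimensions are hypothetical and chosen only for illustration; the one fact relied on is that C stores arrays in row-major order. Both functions compute the same sum, but only the first touches adjacent addresses on successive references.

```c
#include <assert.h>

#define ROWS 4   /* hypothetical dimensions, chosen only for illustration */
#define COLS 8

/* Stride-1 pattern: the inner loop walks along a row, so successive
 * references touch adjacent addresses (good spatial locality). */
long sum_by_rows(int a[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-COLS pattern: the inner loop walks down a column, so successive
 * references are COLS elements apart (poorer spatial locality). */
long sum_by_cols(int a[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

In both versions the variable sum is referenced on every iteration, so temporal locality for sum is the same; only the spatial locality of the array references differs.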
Taking Advantage of Locality
Memory hierarchy
– Store everything on disk.
– Copy recently accessed (and nearby) items from disk to smaller DRAM memory.
  – Main memory
– Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory.
  – Cache memory attached to CPU
An Example Memory Hierarchy
Levels toward the top (L0) are smaller, faster, and costlier per byte; levels toward the bottom (L5) are larger, slower, and cheaper per byte.
L0: Registers. CPU registers hold words retrieved from L1 cache.
L1: L1 cache (SRAM). L1 cache holds cache lines retrieved from L2 cache.
L2: L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).
Caches
Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
– For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work?
– Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
– Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
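One rough way to see why the hierarchy can serve data at close to the rate of the fast storage is a two-level weighted average. The model below is an assumption introduced here, not something stated in the slides: a fraction hit_rate of accesses is served by the fast level and the remainder by the slow level. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Simplified two-level model (an assumption, not from the slides):
 * a fraction hit_rate of accesses is served by the fast level and
 * the rest by the slow level. Times are in nanoseconds. */
double effective_access_ns(double hit_rate, double fast_ns, double slow_ns)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * fast_ns + (1.0 - hit_rate) * slow_ns;
}
```

With the 2010 figures from the Storage Trends slide (1.5 ns for SRAM, 40 ns for DRAM) and an assumed hit rate of 0.99, this model gives roughly 1.9 ns per access, much closer to the SRAM time than to the DRAM time.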
General Cache Concepts
– Cache: smaller, faster, more expensive memory that caches a subset of the blocks (blocks 8, 9, 14, and 3 in the figure).
– Data is copied in block-sized transfer units (blocks 4 and 10 in the figure).
– Memory: larger, slower, cheaper memory viewed as partitioned into "blocks" (blocks 0 to 15 in the figure).
General Cache Concepts: Hit
Request: 14
[Figure: cache holding blocks 8, 9, 14, and 3; memory holding blocks 0 to 15.]
– Data in block b is needed.
– Block b is in cache: Hit!
General Cache Concepts: Miss
Request: 12
[Figure: cache initially holding blocks 8, 9, 14, and 3; the request for block 12 is forwarded to memory (blocks 0 to 15), and block 12 is copied into the cache.]
– Data in block b is needed.
– Block b is not in cache: Miss!
– Block b is fetched from memory.
– Block b is stored in cache.
  • Placement policy: determines where b goes
  • Replacement policy: determines which block gets evicted (victim)
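The hit and miss cases above can be sketched with a four-slot cache in C. The placement policy used here (block i goes in slot i mod 4) is an assumption for illustration, and with only one candidate slot per block the replacement policy is trivial: the current occupant is the victim. The names block_cache, cache_init, and cache_access are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4      /* cache positions, as in the four-block figure */
#define EMPTY (-1)       /* hypothetical marker for an unused slot */

struct block_cache {
    int slot[NUM_SLOTS]; /* block number held in each slot, or EMPTY */
};

void cache_init(struct block_cache *c)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        c->slot[i] = EMPTY;
}

/* Returns true on a hit. On a miss, "fetches" the block by overwriting
 * whatever block occupied its slot (the victim) and returns false. */
bool cache_access(struct block_cache *c, int block)
{
    assert(block >= 0);
    int i = block % NUM_SLOTS;   /* placement policy (assumed): i mod 4 */
    if (c->slot[i] == block)
        return true;             /* block b is in cache: hit */
    c->slot[i] = block;          /* miss: evict victim, store block b */
    return false;
}
```

After loading blocks 8, 9, 14, and 3, a request for block 14 hits, while a request for block 12 misses and, under this assumed policy, evicts block 8 because both map to slot 0.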
Types of Cache Misses
Cold (compulsory) miss
– Cold misses occur because the cache is empty.
Conflict miss
– Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
– Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
– Occurs when the set of active cache blocks (working set) is larger than the cache.
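The conflict-miss example (blocks 0, 8, 0, 8, ...) can be checked with a small miss counter. This is a minimal sketch assuming the (i mod 4) placement rule from the example and an initially empty cache; the function name count_misses and the trace arrays in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4   /* level-k block positions, matching the (i mod 4) example */

/* Counts misses for a trace of block numbers on an initially empty
 * cache in which block i may only occupy slot (i mod SLOTS). */
size_t count_misses(const int *trace, size_t n)
{
    int slot[SLOTS];
    size_t misses = 0;

    for (int i = 0; i < SLOTS; i++)
        slot[i] = -1;                 /* -1 marks an empty slot */

    for (size_t t = 0; t < n; t++) {
        assert(trace[t] >= 0);
        int s = trace[t] % SLOTS;
        if (slot[s] != trace[t]) {    /* cold or conflict miss */
            slot[s] = trace[t];
            misses++;
        }
    }
    return misses;
}
```

For the trace 0, 8, 0, 8, 0, 8 every reference misses (six misses), because blocks 0 and 8 both map to slot 0 even though three other slots stay empty. For the trace 0, 1, 0, 1, 0, 1 only the first two references miss, and those are cold misses.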
Examples of Caching in the Hierarchy
Cache Type             What is Cached?        Where is it Cached?   Latency (cycles)   Managed By
Registers              4-8 byte words         CPU core              0                  Compiler
TLB                    Address translations   On-Chip TLB           0                  Hardware
L1 cache               64-byte block          On-Chip L1            1                  Hardware
L2 cache               64-byte block          On/Off-Chip L2        10                 Hardware
Virtual Memory         4-KB page              Main memory           100                Hardware + OS
Buffer cache           Parts of files         Main memory           100                OS
Disk cache             Disk sectors           Disk controller       100,000            Disk firmware
Network buffer cache   Parts of files         Local disk            10,000,000         AFS/NFS client
Browser cache          Web pages              Local disk            10,000,000         Web browser
Web cache              Web pages              Remote server disks   1,000,000,000      Web proxy server
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
– Hold frequently accessed blocks of main memory.
CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
Typical system structure:
[Figure: CPU chip (register file, cache memories, ALU, bus interface) connected by the system bus to the I/O bridge, which connects by the memory bus to main memory.]