Lecture 8: Memory Hierarchy and Cache

"Cache: A safe place for hiding and storing things." - Webster's New World Dictionary (1976)
The first hour of today's class will be in the CS Seminar: Dr. Kirk Sayre will present a seminar on Wednesday, March 8, 2006, at 1:30 p.m. in Claxton Complex 206.
Cache and Its Importance in Performance
Motivation:
- Time to run code = clock cycles running code + clock cycles waiting for memory.
- For many years, CPUs have sped up an average of 50% per year over memory chip speedups.
Hence, memory access is the bottleneck to computing fast.

Definition of a cache:
- Dictionary: a safe place to hide or store things.
- Computer: a level in a memory hierarchy.
Standard Uniprocessor Memory Hierarchy

[Figure: processor with an on-chip Level-1 cache, a Level-2 cache, a bus, and system memory]
Intel Pentium 4 2 GHz processor (P7 Prescott, Socket 478):
- 8 KB of 4-way associative L1 instruction cache with 32-byte lines
- 8 KB of 4-way associative L1 data cache with 32-byte lines
- 256 KB of 8-way associative L2 cache with 32-byte lines
- 400 MB/s bus speed
- SSE2 provides a peak of 4 Gflop/s
What is a cache?

Small, fast storage used to improve the average access time to slow memory. It exploits spatial and temporal locality. In computer architecture, almost everything is a cache!
- Registers are "a cache" on variables (software managed)
- The first-level cache is a cache on the second-level cache
- The second-level cache is a cache on memory
- Memory is a cache on disk (virtual memory)
- The TLB is a cache on the page table
- Branch prediction: a cache on prediction information?

[Figure: hierarchy from processor/registers and L1 cache at the top (faster) down through the L2 cache and memory to disk, tape, etc. (bigger)]
[Figure: Latency in a single system: memory access time (ns), CPU clock period (ns), and memory-to-CPU ratio plotted against year, 1997-2009]

[Figure: Processor-memory performance gap, growing 50% per year: µProc improves 60%/yr (2x per 1.5 years) while DRAM improves 9%/yr (2x per 10 years)]
Commodity Processor Trends

Bandwidth/latency is the critical issue, not FLOPS. Got bandwidth?

                              Annual    Typical value       Typical value        Typical value
                              increase  in 2005             in 2010              in 2020
Single-chip FP performance    59%       4 GFLOP/s           32 GFLOP/s           3300 GFLOP/s
Front-side bus bandwidth      23%       1 GWord/s           3.5 GWord/s          27 GWord/s
                                        = 0.25 word/flop    = 0.11 word/flop     = 0.008 word/flop
DRAM latency                  (5.5%)    70 ns = 280 FP ops  50 ns = 1600 FP ops  28 ns = 94,000 FP ops
                                        = 70 loads          = 170 loads          = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
Traditional Four Questions for Memory Hierarchy Designers

Q1: Where can a block be placed in the upper level? (Block placement)
- Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the upper level? (Block identification)
- Tag/block
Q3: Which block should be replaced on a miss? (Block replacement)
- Random, LRU
Q4: What happens on a write? (Write strategy)
- Write back or write through (with a write buffer)
Cache-Related Terms

ICACHE: instruction cache
DCACHE (L1): data cache closest to registers
SCACHE (L2): secondary data cache
TCACHE (L3): third-level data cache
- Data from SCACHE has to go through DCACHE to reach the registers
- TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE
- Not all processors have a TCACHE
Unified versus Split Caches

This refers to having a single cache or separate caches for data and machine instructions. Split is obviously superior: it reduces thrashing, which we will come to shortly.
Unified vs. Split Caches

[Figure: split design: processor with I-Cache-1 and D-Cache-1 backed by a unified Cache-2; unified design: processor with a single unified Cache-1 backed by a unified Cache-2]

Example:
- 16 KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
- 32 KB unified: aggregate miss rate = 1.99%

Which is better (ignoring the L2 cache)?
- Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
- hit time = 1, miss time = 50
- Note that a data hit has 1 extra stall for the unified cache (only one port)
Simplest Cache: Direct Mapped

[Figure: a 4-byte direct-mapped cache (indices 0-3) alongside memory locations 0-F; each memory location maps to exactly one cache index]

Cache location 0 can be occupied by data from:
- memory locations 0, 4, 8, ... etc.
- in general: any memory location whose 2 LSBs of the address are 0s
- address ⇒ cache index

Which one should we place in the cache? How can we tell which one is in the cache?
Cache Mapping Strategies

There are two common methods in use for determining which cache lines hold copies of memory lines.

Direct: cache address = memory address MODULO cache size.

Set associative: there are N cache banks, and a memory line is assigned to just one set of the banks. There are three algorithmic choices for which line to replace:
- Random: choose any line using an analog random number generator. This is cheap and simple to build.
- LRU (least recently used): preserves temporal locality, but is expensive. It is not much better than random according to (biased) studies.
- FIFO (first in, first out): random is far superior.
Cache Basics

Cache hit: a memory access that is found in the cache (cheap).
Cache miss: a memory access that is not in the cache (expensive, because we need to get the data from elsewhere).

Consider a tiny cache (for illustration only) with addresses X000 through X111: each address splits into fields, tag | line | offset.

Cache line length: the number of bytes loaded together in one entry.
Direct mapped: only one address (line) in a given range may be in the cache.
Associative: 2 or more lines with different addresses may coexist.
Direct-Mapped Cache

Direct-mapped cache: a block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.

[Figure: cache and main memory, with each memory block mapped to one cache slot]
Fully Associative Cache

Fully associative cache: a block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

[Figure: cache and main memory, with each memory block mappable to every cache slot]
Set Associative Cache

Set-associative cache: the middle range of designs between direct-mapped and fully associative caches. In an N-way set-associative cache, a block from main memory can go into any of N (N > 1) locations in the cache.

[Figure: 2-way set-associative cache and main memory]
Here assume the cache has 8 blocks, while memory has 32 blocks. Where can memory block 12 go?
- Fully associative: block 12 can go anywhere.
- Direct mapped: block 12 can go only into block 4 (12 mod 8).
- Set associative (2-way, 4 sets): block 12 can go anywhere in set 0 (12 mod 4).

[Figure: cache blocks 0-7 under each placement policy, above memory blocks 0-31]
Diagrams

[Figure: serial system: CPU with registers, logic, cache, and main memory. Parallel system: shared memory connected through a network to Cache 1 ... Cache p, attached to CPU 1 ... CPU p]
Tuning for Caches

1. Preserve locality.
2. Reduce cache thrashing.
3. Use loop blocking when out of cache.
4. Use software pipelining.
Registers

Registers are the source and destination of most CPU data operations. They hold one element each. They are made of static RAM (SRAM), which is very expensive. The access time is usually 1-1.5 CPU clock cycles. Registers are at the top of the memory subsystem.
The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

For the last 15 years, hardware has relied on locality for speed.
Principles of Locality

Temporal: an item referenced now will be referenced again soon.
Spatial: an item referenced now causes its neighbors to be referenced soon.

Lines, not words, are moved between memory levels, so both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem design.

Cache lines are typically 32-128 bytes, with 1024 bytes being the longest currently.
Cache Thrashing

Thrashing occurs when frequently used cache lines repeatedly replace each other. There are three primary causes for thrashing:
- Instructions and data can conflict, particularly in unified caches.
- Too many variables, or arrays too large to fit into cache, are accessed.
- Indirect addressing, e.g., sparse matrices.

Machine architects can add sets to the associativity. Users can buy another vendor's machine. However, neither solution is realistic.
Cache Coherence for Multiprocessors

All data must be coherent between memory levels. Multiple processors with separate caches must inform the other processors quickly about data modifications (at cache-line granularity). Only hardware is fast enough to do this.

Standard protocols on multiprocessors:
- Snoopy: all processors monitor the memory bus.
- Directory based: cache lines maintain an extra 2 bits per processor to maintain clean/dirty status bits.
Indirect Addressing

    d = 0
    do i = 1,n
       j = ind(i)
       d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
    end do

Change the loop statement to

    d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )

Note that r(1,j)-r(3,j) are in contiguous memory and probably are in the same cache line (d is probably in a register and is irrelevant). The original form uses 3 cache lines at every instance of the loop and can cause cache thrashing.
Cache Thrashing by Memory Allocation

    parameter ( m = 1024*1024 )
    real a(m), b(m)

For a 4 MB direct-mapped cache, a(i) and b(i) are always mapped to the same cache line. This is trivially avoided using padding:

    real a(m), extra(32), b(m)

extra is at least 128 bytes in length, which is longer than a cache line on all but one memory subsystem available today.
Cache Blocking

We want blocks to fit into cache. On parallel computers we have p times the cache, so data may fit into cache on p processors but not on one. This can lead to superlinear speedup! Consider matrix-matrix multiply:

    do k = 1,n
       do j = 1,n
          do i = 1,n
             c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
       end do
    end do

An alternate form is ...
Cache Blocking

    do kk = 1,n,nblk
       do jj = 1,n,nblk
          do ii = 1,n,nblk
             do k = kk,kk+nblk-1
                do j = jj,jj+nblk-1
                   do i = ii,ii+nblk-1
                      c(i,j) = c(i,j) + a(i,k) * b(k,j)
                   end do
                end do
             end do
          end do
       end do
    end do

[Figure: the M-by-N matrix C updated by an M-by-K block of A times a K-by-N block of B, with block size NB]
Summary: The Cache Design Space

Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
- write allocation

[Figure: the design space sketched over cache size, associativity, and block size, with performance ranging from good to bad as less or more of one factor is traded against another]

The optimal choice is a compromise:
- it depends on access characteristics (workload; use as I-cache, D-cache, or TLB)
- it depends on technology and cost

Simplicity often wins.
Lessons

- The actual performance of a simple program can be a complicated function of the architecture.
- Slight changes in the architecture or program can change the performance significantly.
- Since we want to write fast programs, we must take the architecture into account, even on uniprocessors.
- Since the actual performance is so complicated, we need simple models to help us design efficient algorithms.
- We will illustrate with a common technique for improving cache performance, called blocking.
Assignment 6
Optimizing Matrix Addition for Caches

Dimension A(n,n), B(n,n), C(n,n); A, B, C are stored by column (as in Fortran).

Algorithm 1:
- for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)

Algorithm 2:
- for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)

What is the memory access pattern for Algorithms 1 and 2? Which is faster? What if A, B, C were stored by row (as in C)?
Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
      for (j = 0; j < N; j = j+1)
      { a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j]; }

Before fusion there are 2 misses per access to a and c; after, one miss per access. Fusion improves locality.
Optimizing Matrix Multiply for Caches

Several techniques exist for making this faster on modern processors; the problem is heavily studied. Some optimizations are done automatically by the compiler, but we can do much better. In general, you should use optimized libraries (often supplied by the vendor) for this and other very common linear algebra operations (BLAS = Basic Linear Algebra Subroutines). Other algorithms you may want are not going to be supplied by the vendor, so you need to know these techniques.
Warm-up: Matrix-Vector Multiplication y = y + A*x

    for i = 1:n
       for j = 1:n
          y(i) = y(i) + A(i,j)*x(j)

[Figure: y(i) updated as y(i) plus row A(i,:) times x(:)]

With the memory traffic made explicit:

    {read x(1:n) into fast memory}
    {read y(1:n) into fast memory}
    for i = 1:n
       {read row i of A into fast memory}
       for j = 1:n
          y(i) = y(i) + A(i,j)*x(j)
    {write y(1:n) back to slow memory}

m = number of slow memory refs = 3*n + n^2
f = number of arithmetic operations = 2*n^2
q = f/m ~= 2

Matrix-vector multiplication is limited by slow memory speed.
Matrix Multiply C = C + A*B

    for i = 1 to n
       for j = 1 to n
          for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) updated as C(i,j) plus row A(i,:) times column B(:,j)]
Matrix Multiply C = C + A*B (unblocked, or untiled)

    for i = 1 to n
       {read row i of A into fast memory}
       for j = 1 to n
          {read C(i,j) into fast memory}
          {read column j of B into fast memory}
          for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
          {write C(i,j) back to slow memory}
Matrix Multiply (unblocked, or untiled)

q = ops per slow memory reference. Number of slow memory references on unblocked matrix multiply:

    m = n^3     (read each column of B n times)
      + n^2     (read each row of A once for each i)
      + 2*n^2   (read and write each element of C once)
      = n^3 + 3*n^2

So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.
Matrix Multiply (blocked, or tiled)

Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size.

    for i = 1 to N
       for j = 1 to N
          {read block C(i,j) into fast memory}
          for k = 1 to N
             {read block A(i,k) into fast memory}
             {read block B(k,j) into fast memory}
             C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
          {write block C(i,j) back to slow memory}

[Figure: block C(i,j) updated as C(i,j) plus block A(i,k) times block B(k,j)]
Matrix Multiply (blocked or tiled)

Why is this algorithm correct?

q = ops per slow memory reference, with n the matrix size, b the block size, and N = n/b the number of blocks per dimension.

Number of slow memory references on blocked matrix multiply:

    m = N*n^2   (read each block of B N^3 times: N^3 * n/N * n/N)
      + N*n^2   (read each block of A N^3 times)
      + 2*n^2   (read and write each block of C once)
      = (2*N + 2)*n^2

So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n. So we can improve performance by increasing the block size b. This can be much faster than matrix-vector multiply (q = 2).
Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make the blocks arbitrarily large: 3*b^2 is bounded by the fast memory size.

"World's largest MD simulation - 10 gazillion particles!" Such runs use large grid sizes for only a few cycles, because the full run won't finish during this lifetime or because the resolution makes no sense compared with the resolution of the input data.

- Suggested alternate approach (Gustafson): constant-time benchmarks. Run the code for a fixed time and measure the work done.
Example of a Scaled Speedup Experiment

    Processors  NChains  Time    Natoms   Time/Atom/PE  Time/Atom  Efficiency
         1          32   38.40     2368     1.62E-02    1.62E-02      1.000
         2          64   38.40     4736     8.11E-03    1.62E-02      1.000
         4         128   38.50     9472     4.06E-03    1.63E-02      0.997
         8         256   38.60    18944     2.04E-03    1.63E-02      0.995
        16         512   38.70    37888     1.02E-03    1.63E-02      0.992
        32         940   35.70    69560     5.13E-04    1.64E-02      0.987
        64        1700   32.70   125800     2.60E-04    1.66E-02      0.975
       128        2800   27.40   207200     1.32E-04    1.69E-02      0.958
       256        4100   20.75   303400     6.84E-05    1.75E-02      0.926
       512        5300   14.49   392200     3.69E-05    1.89E-02      0.857
[Figure: TBON on ASCI Red: efficiency (roughly 0.44 to 1.04) plotted against number of processors, 0 to 600]