Lecture 11: Memory Hierarchy—Reducing Hit Time, Main Memory, and Examples
Professor David A. Patterson
Computer Science 252
Spring 1998


Review: Reducing Misses

  CPUtime = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time

• 3 Cs: Compulsory, Capacity, Conflict Misses
• Reducing Miss Rate
 1. Reduce Misses via Larger Block Size
 2. Reduce Misses via Higher Associativity
 3. Reduce Misses via Victim Cache
 4. Reduce Misses via Pseudo-Associativity
 5. Reduce Misses by HW Prefetching of Instr, Data
 6. Reduce Misses by SW Prefetching of Data
 7. Reduce Misses by Compiler Optimizations

• Remember the danger of concentrating on just one parameter when evaluating performance
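To make the formula concrete, here is a minimal C sketch that evaluates it; every parameter value below is an illustrative assumption, not a number from the lecture.

  #include <stdio.h>

  int main(void) {
      /* Illustrative parameters (assumptions, not from the lecture) */
      double IC             = 1e9;   /* instruction count */
      double CPI_execution  = 1.0;   /* base CPI with no memory stalls */
      double accesses_instr = 1.3;   /* memory accesses per instruction */
      double miss_rate      = 0.02;  /* 2% of accesses miss */
      double miss_penalty   = 50.0;  /* clocks per miss */
      double clock_cycle    = 5e-9;  /* 200 MHz clock */

      /* CPUtime = IC x (CPI_Execution
                         + accesses/instr x miss rate x miss penalty) x cycle */
      double cpu_time = IC * (CPI_execution
                              + accesses_instr * miss_rate * miss_penalty)
                           * clock_cycle;
      printf("CPU time = %.3f s\n", cpu_time);
      return 0;
  }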

Reducing Miss Penalty Summary

  CPUtime = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time

• Five techniques
 – Read priority over write on miss
 – Subblock placement
 – Early Restart and Critical Word First on miss
 – Non-blocking Caches (Hit under Miss, Miss under Miss)
 – Second Level Cache

• Can be applied recursively to Multilevel Caches
 – Danger is that time to DRAM will grow with multiple levels in between
 – First attempts at L2 caches can make things worse, since they increase the worst-case miss penalty

• Out-of-order CPU can hide an L1 data cache miss (≈3–5 clocks), but stall on an L2 miss (≈40–100 clocks)?

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.


1. Fast Hit Times via Small and Simple Caches
• Why does the Alpha 21164 have 8KB Instruction and 8KB Data caches + a 96KB second-level cache?
 – A small data cache supports a fast clock rate
• Direct Mapped, on chip


2. Fast Hits by Avoiding Address Translation

• Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. Physical Cache
 – Every time the process is switched, logically the cache must be flushed; otherwise get false hits
  » Cost is time to flush + "compulsory" misses from an empty cache
 – Must deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address
 – I/O uses physical addresses, so it must be mapped to virtual addresses to interact with the cache

• Solution to aliases
 – HW guarantees that every cache block has a unique physical address
 – SW guarantee: the lower n bits of virtual and physical address must be the same; as long as they cover the index field of a direct-mapped cache, blocks must be unique; called page coloring

• Solution to cache flush
 – Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit with the wrong process
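A minimal C sketch of the page-coloring check, assuming a 64 KB direct-mapped cache with 4 KB pages; the cache geometry, helper name, and addresses are illustrative assumptions, not the lecture's.

  #include <stdint.h>
  #include <stdio.h>

  /* Assumed geometry: 64 KB direct-mapped cache, 4 KB pages. The cache index
   * uses address bits up to bit 15; the page offset covers bits 0..11, so the
   * 4 "color" bits 12..15 of the page number must match between virtual and
   * physical pages for virtual and physical indexing to agree. */
  #define PAGE_SHIFT  12
  #define COLOR_BITS  4                       /* log2(64 KB) - PAGE_SHIFT */
  #define COLOR_MASK  ((1u << COLOR_BITS) - 1)

  static unsigned page_color(uint64_t addr) {
      return (unsigned)(addr >> PAGE_SHIFT) & COLOR_MASK;
  }

  int main(void) {
      uint64_t va = 0x7f0003000ULL, pa = 0x00042000ULL;  /* hypothetical */
      if (page_color(va) == page_color(pa))
          printf("same color: cache index identical for VA and PA\n");
      else
          printf("color mismatch: OS must re-map to preserve the index\n");
      return 0;
  }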

Virtually Addressed Caches

[Figure: three organizations.
 1. Conventional Organization: CPU issues VA → TB (translation buffer) → PA → cache ($, PA tags) → PA → MEM; translate on every access.
 2. Virtually Addressed Cache: CPU issues VA → cache ($, VA tags); the TB is consulted only on a miss before going to MEM; translate only on miss; synonym problem.
 3. Overlap cache access with VA translation: CPU issues VA to the TB and the cache ($, PA tags) in parallel → PA → L2 $ → MEM; requires the cache index to remain invariant across translation.]

2. Fast Cache Hits by Avoiding Translation: Process ID Impact
• Black is uniprocess
• Light Gray is multiprocess when flushing the cache
• Dark Gray is multiprocess when using a Process ID tag
• Y axis: Miss Rates up to 20%
• X axis: Cache size from 2 KB to 1024 KB


2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
• If the index is in the physical part of the address, tag access can start in parallel with translation, so the comparison is against the physical tag

  | Page Address              | Page Offset            |
  | Address Tag       | Index        | Block Offset    |

• Limits cache to page size: what if we want bigger caches while using the same trick?
 – Higher associativity moves the barrier to the right
 – Page coloring
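A small C sketch of the constraint, assuming 8 KB pages as on the Alpha; the helper name and cache parameters are illustrative.

  #include <stdio.h>

  /* The index can come entirely from the physical part of the address only
   * if block offset + index fit inside the page offset, i.e. if
   * cache size / associativity <= page size. */
  static int index_fits_in_page(unsigned cache_bytes, unsigned assoc,
                                unsigned page_bytes) {
      return cache_bytes / assoc <= page_bytes;
  }

  int main(void) {
      unsigned page = 8 * 1024;                        /* 8 KB pages */
      printf("8 KB direct mapped:  %s\n",
             index_fits_in_page(8 * 1024, 1, page) ? "overlap ok" : "too big");
      printf("16 KB direct mapped: %s\n",
             index_fits_in_page(16 * 1024, 1, page) ? "overlap ok" : "too big");
      printf("16 KB 2-way:         %s\n",   /* associativity moves the barrier */
             index_fits_in_page(16 * 1024, 2, page) ? "overlap ok" : "too big");
      return 0;
  }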


3. Fast Hit Times via Pipelined Writes
• Pipeline Tag Check and Update Cache as separate stages; the current write does its tag check while the previous write updates the cache
• Only STOREs are in the pipeline; the pipeline empties during a miss

  Store r2, (r1)   Check r1 (hold r2 in a delayed write buffer)
  Add
  Sub
  Store r4, (r3)   Check r3 while M[r1]←r2 updates the cache
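A toy C model of the idea, assuming a one-entry delayed write buffer; the structure and names are mine, not the actual hardware design.

  #include <stdio.h>

  /* Toy model: a store checks the tag in one stage and leaves its data in a
   * one-entry delayed write buffer; the buffered update retires into the
   * cache array while the NEXT store does its tag check. Loads must check
   * the buffer, since the newest value may still be sitting there. */
  #define BLOCKS 256

  static int tag[BLOCKS], data[BLOCKS];
  static int buf_valid, buf_index, buf_data;

  static void retire_buffer(void) {          /* previous write updates cache */
      if (buf_valid) { data[buf_index] = buf_data; buf_valid = 0; }
  }

  static void store(int index, int t, int value) {
      retire_buffer();                       /* overlapped with this tag check */
      if (tag[index] == t) {                 /* tag check stage (hit assumed) */
          buf_valid = 1; buf_index = index; buf_data = value;
      }
  }

  static int load(int index) {
      if (buf_valid && buf_index == index)   /* read may hit the delayed write */
          return buf_data;
      return data[index];
  }

  int main(void) {
      store(5, 0, 42);                       /* value waits in the buffer */
      printf("%d\n", load(5));               /* 42, forwarded from the buffer */
      store(9, 0, 7);                        /* retires 42 into block 5 */
      printf("%d\n", load(5));               /* 42, now from the cache array */
      return 0;
  }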

DRAM
• … => computers use any generation DRAM
• Commodity, second-source industry => high volume, low profit, conservative
 – Little organization innovation in 20 years
• Order of importance: 1) Cost/bit 2) Capacity
 – First RAMBUS: 10X BW, +30% cost => little impact

DRAM Future: 1 Gbit DRAM (ISSCC '96; production '02?)

                   Mitsubishi       Samsung
  Blocks           512 × 2 Mbit     1024 × 1 Mbit
  Clock            200 MHz          250 MHz
  Data Pins        64               16
  Die Size         24 × 24 mm       31 × 21 mm
  Metal Layers     3                4
  Technology       0.15 micron      0.16 micron

 – Die sizes will be much smaller in production
• Wish we could do this for Microprocessors!

Main Memory Performance
• Simple:
 – CPU, Cache, Bus, Memory same width (32 or 64 bits)
• Wide:
 – CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
 – CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved


Main Memory Performance
• Timing model (word size is 32 bits)
 – 1 clock to send address
 – 6 clocks access time, 1 clock to send data
 – Cache Block is 4 words
• Simple M.P.      = 4 × (1 + 6 + 1) = 32
• Wide M.P.        = 1 + 6 + 1 = 8
• Interleaved M.P. = 1 + 6 + 4 × 1 = 11
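The same timing model as a short C sketch; the cycle counts come from the slide, while the function names are mine.

  #include <stdio.h>

  /* 1 clock to send the address, 6 clocks access time, 1 clock per word of
   * data; the cache block is 4 words. */
  static int simple_mp(int words)      { return words * (1 + 6 + 1); }
  static int wide_mp(void)             { return 1 + 6 + 1; }         /* one wide transfer */
  static int interleaved_mp(int words) { return 1 + 6 + words * 1; } /* accesses overlap */

  int main(void) {
      int block = 4;
      printf("Simple:      %2d clocks\n", simple_mp(block));      /* 32 */
      printf("Wide:        %2d clocks\n", wide_mp());             /*  8 */
      printf("Interleaved: %2d clocks\n", interleaved_mp(block)); /* 11 */
      return 0;
  }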


Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
 – Multiprocessor
 – I/O
 – CPU with Hit under n Misses, Non-blocking Cache
• Superbank: all memory active on one block transfer (or Bank)
• Bank: portion within a superbank that is word interleaved (or Subbank)

[Figure: memory organized as superbanks, each containing word-interleaved banks.]


Independent Memory Banks
• How many banks? number of banks ≥ number of clocks to access a word in a bank (see the sketch below)
 – For sequential accesses; otherwise we return to the original bank before it has the next word ready
 – (as in the vector case)
• Increasing DRAM capacity => fewer chips => harder to have many banks
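A tiny C simulation of the rule; the access time and word counts are illustrative assumptions.

  #include <stdio.h>

  /* Issue one sequential word per clock; each bank is busy ACCESS clocks.
   * With banks >= ACCESS there are no stalls; with fewer banks the CPU
   * returns to a bank before it has the next word ready. */
  #define ACCESS 6

  static int stalls(int banks, int words) {     /* banks must be <= 64 */
      int busy_until[64] = {0};                 /* per-bank ready time */
      int clock = 0, stall = 0, w;
      for (w = 0; w < words; w++) {
          int b = w % banks;
          if (busy_until[b] > clock) {          /* bank not ready: stall */
              stall += busy_until[b] - clock;
              clock = busy_until[b];
          }
          busy_until[b] = clock + ACCESS;       /* start this access */
          clock++;                              /* next word next clock */
      }
      return stall;
  }

  int main(void) {
      printf("4 banks: %d stall clocks\n", stalls(4, 16)); /* conflicts */
      printf("8 banks: %d stall clocks\n", stalls(8, 16)); /* none */
      return 0;
  }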


DRAMs per PC over Time

  Minimum                        DRAM Generation
  Memory Size   '86     '89     '92     '96     '99     '02
                1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
  4 MB          32      8       2
  8 MB                  16      4       1
  16 MB                         8       2
  32 MB                                 4       1
  64 MB                                 8       2
  128 MB                                        4       1
  256 MB                                        8       2
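The table follows from simple arithmetic (chips = memory bits / DRAM bits per chip); a hedged C helper, with names of my choosing:

  #include <stdio.h>

  /* DRAM chips needed: memory size in megabits divided by chip capacity in
   * megabits (8 bits per byte converts MB to Mb). */
  static unsigned drams_needed(unsigned mem_MB, unsigned dram_Mb) {
      return (mem_MB * 8u) / dram_Mb;
  }

  int main(void) {
      printf("4 MB of 1 Mb DRAMs:   %u chips\n", drams_needed(4, 1));      /* 32 */
      printf("32 MB of 64 Mb DRAMs: %u chips\n", drams_needed(32, 64));    /*  4 */
      printf("256 MB of 1 Gb DRAMs: %u chips\n", drams_needed(256, 1024)); /*  2 */
      return 0;
  }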

Avoiding Bank Conflicts
• Lots of banks

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict in the same bank
• SW: loop interchange or declaring the array size to not be a power of 2 ("array padding"; see the sketch below)
• HW: Prime number of banks
 – bank number = address mod number of banks
 – address within bank = address / number of banks
 – modulo & divide on every memory access with a prime number of banks?
 – address within bank = address mod number of words in bank: easy if 2^N words per bank
 – bank number? (see the Chinese Remainder Theorem on the next slide)
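A short C sketch of the two software fixes (array dimensions from the slide; the padding amount is an illustrative choice):

  /* Fix 1: array padding. A row stride of 513 words is not a multiple of
   * 128 banks, so successive x[i][j] for fixed j land in different banks. */
  #define ROWS 256
  #define COLS 512
  #define PAD  1

  int x[ROWS][COLS + PAD];

  /* Fix 2: loop interchange. The inner loop now walks consecutive words,
   * which spread across successive banks instead of revisiting one bank. */
  void scale(void) {
      int i, j;
      for (i = 0; i < ROWS; i = i + 1)
          for (j = 0; j < COLS; j = j + 1)
              x[i][j] = 2 * x[i][j];
  }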

Fast Bank Number
• Chinese Remainder Theorem: as long as two sets of integers ai and bi follow these rules

  b_i = x mod a_i,  0 ≤ b_i < a_i,  0 ≤ x < a_0 × a_1 × a_2 × …

 and a_i and a_j are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
 – bank number = b_0, number of banks = a_0 (= 3 in example)
 – address within bank = b_1, number of words in bank = a_1 (= 8 in example)
 – N-word address 0 to N−1, prime number of banks, words per bank a power of 2

  Address           Seq. Interleaved      Modulo Interleaved
  within    Bank:   0     1     2         0     1     2
  bank
  0                 0     1     2         0     16    8
  1                 3     4     5         9     1     17
  2                 6     7     8         18    10    2
  3                 9     10    11        3     19    11
  4                 12    13    14        12    4     20
  5                 15    16    17        21    13    5
  6                 18    19    20        6     22    14
  7                 21    22    23        15    7     23
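A small C sketch of the modulo-interleaved mapping for the slide's 3-bank, 8-words-per-bank example; the loop and names are mine.

  #include <stdio.h>

  /* Modulo interleaving per the slide: bank number = addr mod 3 (the one
   * real modulo), while the address within the bank is addr mod 8, which is
   * just the low 3 bits, so no divide is needed. The Chinese Remainder
   * Theorem (3 and 8 co-prime) guarantees each (bank, word) pair is unique. */
  #define BANKS 3   /* prime */
  #define WORDS 8   /* power of 2 */

  int main(void) {
      int addr;
      for (addr = 0; addr < BANKS * WORDS; addr++) {
          int bank   = addr % BANKS;
          int within = addr & (WORDS - 1);   /* addr mod 8 by masking */
          printf("address %2d -> bank %d, word %d\n", addr, bank, within);
      }
      return 0;
  }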

Fast Memory Systems: DRAM Specific
• Multiple CAS accesses: several names (page mode)
 – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
 – RAMBUS: startup company; reinvented the DRAM interface
  » Each chip a module vs. a slice of memory
  » Short bus between CPU and chips
  » Does its own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)
 – Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfer synchronous to the system clock (66–150 MHz)
 – Intel claims RAMBUS Direct (16 bits wide) is the future PC memory
• Niche memory or main memory?
 – e.g., Video RAM for frame buffers: DRAM + fast serial output

DRAM Latency >> BW
• More App Bandwidth => Cache misses => DRAM RAS/CAS
• Application BW => Lower DRAM Latency
• RAMBUS, Synch DRAM increase BW but have higher latency
• EDO DRAM < 5% in PCs

[Figure: Proc with I$ and D$ → L2$ → Bus → four DRAM modules.]

Potential DRAM Crossroads?
• After 20 years of 4X every 3 years, running into a wall? (64 Mb – 1 Gb)
• How can $1B fab lines be kept full if we buy fewer DRAMs per computer?
• Cost/bit –30%/yr if the 4X-every-3-years scaling stops?
• What will happen to the $40B/yr DRAM industry?


Main Memory Summary
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
• DRAM future less rosy?


Cache Cross-Cutting Issues
• Superscalar CPU & number of cache ports must match: how many memory accesses per cycle?
• Speculative execution needs a non-faulting option on memory/TLB accesses
• Parallel execution vs. cache locality
 – Want wide separation to find independent operations vs. want reuse of data accesses to avoid misses
• I/O and consistency of data between cache and memory
 – Caches => multiple copies of data
 – Consistency by HW or by SW?
 – Where to connect I/O to the computer?

Alpha 21064
• Separate Instr & Data TLBs & Caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instr. stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256-bit path to main memory, 4 × 64-bit modules
• Victim Buffer: to give read priority over write
• 4-entry write buffer between D$ & L2$

[Figure: 21064 hierarchy with Instr and Data caches, Write Buffer, instruction Stream Buffer, and Victim Buffer.]

Alpha Memory Performance: Miss Rates of SPEC92

[Figure: miss rate on a log scale (0.01% to 100.00%) for the 8 KB I$, 8 KB D$, and 2 MB L2 across AlphaSort, TPC-B (db1, db2), Espresso, Li, Sc, Gcc, Compress, Eqntott, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d, Spice, Nasa7, and Su2cor. Annotated points:
 – I$ miss = 6%, D$ miss = 32%, L2 miss = 10%
 – I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%
 – I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%]

Alpha CPI Components
• Instruction stall: branch mispredict; Data cache; Instruction cache; L2$; Other: compute + register conflicts, structural conflicts

[Figure: stacked CPI bars (0.00 to 4.50) per benchmark, split into L2, I$, D$, I Stall, and Other, for AlphaSort, TPC-B (db1, db2), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, and Hydro2d.]

Pitfall: Predicting Cache Performance from Different Programs (ISA, compiler, ...)

[Figure: miss rate (0% to 35%) vs. cache size (1 KB to 128 KB) for instruction and data caches on gcc, espresso, and tomcatv; the D$ tomcatv curve is highest, followed by D$ gcc and D$ espresso, with the I$ curves lowest.]

• 4 KB Data cache miss rate: 8%, 12%, or 28%?
• 1 KB Instr cache miss rate: 0%, 3%, or 10%?
• Alpha vs. MIPS miss rate for 8 KB Data $: 17% vs. 10%
• Why 2X Alpha vs. MIPS?

Pitfall: Simulating Too Small an Address Trace

[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (0 to 12 billion). Configuration: I$ = 4 KB, B = 16 B; D$ = 4 KB, B = 16 B; L2 = 512 KB, B = 128 B; MP = 12, 200.]


Cache Optimization Summary

  Technique                            MR    MP    HT    Complexity
  Miss rate:
   Larger Block Size                   +     –           0
   Higher Associativity                +           –     1
   Victim Caches                       +                 2
   Pseudo-Associative Caches           +                 2
   HW Prefetching of Instr/Data        +                 2
   Compiler Controlled Prefetching     +                 3
   Compiler Reduce Misses              +                 0
  Miss penalty:
   Priority to Read Misses                   +           1
   Subblock Placement                        +     +     1
   Early Restart & Critical Word 1st         +           2
   Non-Blocking Caches                       +           3
   Second Level Caches                       +           2
  Hit time:
   Small & Simple Caches               –           +     0
   Avoiding Address Translation                    +     2
   Pipelining Writes                               +     1

Practical Memory Hierarchy
• The issue is NOT inventing new mechanisms
• The issue is taste in selecting between many alternatives, putting together a memory hierarchy whose pieces fit well together
 – e.g., L1 Data cache write through, L2 write back
 – e.g., L1 small for fast hit time/clock cycle
 – e.g., L2 big enough to avoid going to DRAM?
