ECE/CS 552: Cache Performance Instructor: Mikko H Lipasti
Fall 2010, University of Wisconsin-Madison

Memory Hierarchy

Temporal locality
– Keep recently referenced items at higher levels
– Future references satisfied quickly
Spatial locality
– Bring neighbors of recently referenced items to higher levels
– Future references satisfied quickly

[Diagram: memory hierarchy from top to bottom: CPU, I & D L1 Cache, Shared L2 Cache, Main Memory, Disk]

Lecture notes based on notes by Mark Hill; updated by Mikko Lipasti
© Hill, Lipasti
Caches and Performance

Caches
– Enable design for common case: cache hit
  Cycle time, pipeline organization
  Recovery policy
– Uncommon case: cache miss
  Fetch from next level
    – Apply recursively if multiple levels
  What to do in the meantime?
  What is performance impact?
Various optimizations are possible

Performance Impact

Cache hit latency
– Included in “pipeline” portion of CPI
  E.g. IBM study: 1.15 CPI with 100% cache hits
– Typically 1-3 cycles for L1 cache
  Intel/HP McKinley: 1 cycle
    – Heroic array design
    – No address generation: load r1, (r2)
  IBM Power4: 3 cycles
    – Address generation
    – Array access
    – Word select and align
Cache Hit continued

Cycle stealing common
– Address generation < cycle
– Array access > cycle
– Clean FSD cycle boundaries violated
[Diagram: AGEN and CACHE stages overlapping cycle boundaries]
Speculation rampant
– “Predict” cache hit
– Don’t wait for tag check
– Consume fetched word in pipeline
– Recover/flush when miss is detected
  Reportedly 7 (!) cycles later in Pentium-IV

Cache Hits and Performance

Cache hit latency determined by:
– Cache organization
  Block size
    – Word select may be slow (fan-in, wires)
  Number of blocks (sets x associativity)
    – Wire delay across array
    – “Manhattan distance” = width + height
    – Word line delay: width
    – Bit line delay: height
  Associativity
    – Parallel tag checks expensive, slow
    – Way select slow (fan-in, wires)
[Diagram: SRAM array with Word Line running across its width and Bit Line down its height]
Array design is an art form
– Detailed analog circuit/wire delay modeling

ECE 552: Introduction To Computer Architecture
Cache Misses and Performance

Miss penalty
– Detect miss: 1 or more cycles
– Find victim (replace line): 1 or more cycles
  Write back if dirty
– Request line from next level: several cycles
– Transfer line from next level: several cycles
  (block size) / (bus width)
– Fill line into data array, update tag array: 1+ cycles
– Resume execution
In practice: 6 cycles to 100s of cycles

Cache Miss Rate

Determined by:
– Program characteristics
  Temporal locality
  Spatial locality
– Cache organization
  Block size, associativity, number of sets
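The transfer term above, (block size) / (bus width), can be checked with a few lines of arithmetic. A minimal sketch, using the request/transfer/fill numbers from the worked example later in the lecture (10-cycle request, 2 cycles per 16B bus transfer, 64B block, 1-cycle fill) and omitting the detect/victim cycles:

```python
def transfer_cycles(block_bytes, bus_bytes, cycles_per_transfer):
    # Number of bus transfers = (block size) / (bus width)
    return (block_bytes // bus_bytes) * cycles_per_transfer

def miss_penalty(request=10, block_bytes=64, bus_bytes=16,
                 cycles_per_transfer=2, fill=1):
    # request + transfer + fill; detect/victim cycles omitted in this sketch
    return request + transfer_cycles(block_bytes, bus_bytes,
                                     cycles_per_transfer) + fill

penalty = miss_penalty()   # 10 + (64/16)*2 + 1 = 19 cycles
```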
Improving Locality

Instruction text placement
– Profile program, place unreferenced or rarely referenced paths “elsewhere”
  Maximize temporal locality
– Eliminate taken branches
  Fall-through path has spatial locality

Improving Locality

Data placement, access order
– Arrays: “block” loops to access subarray that fits into cache
  Maximize temporal locality
– Structures: pack commonly-accessed fields together
  Maximize spatial, temporal locality
– Trees, linked lists: allocate in usual reference order
  Heap manager usually allocates sequential addresses
  Maximize spatial locality
Hard problem, not easy to automate:
– C/C++ disallows rearranging structure fields
– OK in Java
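The loop "blocking" idea above can be sketched in a few lines. The matrix and block sizes below are arbitrary illustrations; in Python the cache benefit is not observable, but the access pattern (finish each B x B subarray before moving on) is the point:

```python
# Loop blocking ("tiling"): traverse an N x N matrix in B x B subarrays so
# each subarray stays cache-resident while it is being reused.
N, B = 8, 4
A = [[i * N + j for j in range(N)] for i in range(N)]

def blocked_sum(A, N, B):
    total = 0
    for ii in range(0, N, B):                      # step over row blocks
        for jj in range(0, N, B):                  # step over column blocks
            for i in range(ii, min(ii + B, N)):    # sweep within the block
                for j in range(jj, min(jj + B, N)):
                    total += A[i][j]
    return total

total = blocked_sum(A, N, B)
```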
Cache Miss Rates: 3 C’s [Hill]

Compulsory miss
– First-ever reference to a given block of memory
Capacity
– Working set exceeds cache capacity
– Useful blocks (with future references) displaced
Conflict
– Placement restrictions (not fully-associative) cause useful blocks to be displaced
– Think of as capacity within set

Cache Miss Rate Effects

Number of blocks (sets x associativity)
– Bigger is better: fewer conflicts, greater capacity
Block size
– Larger blocks exploit spatial locality
– Usually: miss rates improve until 64B-256B
– 512B or more: miss rates get worse
  Larger blocks less efficient: more capacity misses
  Fewer placement choices: more conflict misses
Associativity
– Higher associativity reduces conflicts
– Very little benefit beyond 8-way set-associative
Cache Miss Rate

Subtle tradeoffs between cache organization parameters
– Large blocks reduce compulsory misses but increase miss penalty
  #compulsory = (working set) / (block size)
  #transfers = (block size) / (bus width)
– Large blocks increase conflict misses
  #blocks = (cache size) / (block size)
– Associativity reduces conflict misses
– Associativity increases access time
Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same size?

Cache Miss Rates: 3 C’s

[Figure: misses per instruction (%) for 8K1W, 8K4W, 16K1W, and 16K4W caches, each bar split into conflict, capacity, and compulsory components]
Vary size and associativity
– Compulsory misses are constant
– Capacity and conflict misses are reduced
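The three relations on this slide are back-of-envelope divisions. A sketch with illustrative sizes (256KB working set, 64B blocks, 16B bus, 8KB cache; all assumptions, not fixed by the lecture):

```python
# Back-of-envelope cache relations, all sizes in bytes.
working_set, block, bus, cache = 256 * 1024, 64, 16, 8 * 1024

compulsory_misses = working_set // block   # one first-ever miss per block touched
transfers_per_miss = block // bus          # bus transfers to move one block
num_blocks = cache // block                # blocks that fit in the cache
```

Doubling the block size halves the compulsory misses but doubles the transfers per miss and halves the number of blocks, which is exactly the tradeoff the slide describes.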
Cache Miss Rates: 3 C’s

[Figure: misses per instruction (%) for 8K32B, 8K64B, 16K32B, and 16K64B caches, each bar split into conflict, capacity, and compulsory components]
Vary size and block size
– Compulsory misses drop with increased block size
– Capacity and conflict can increase with larger blocks

Cache Misses and Performance

How does this affect performance?
Performance = Time / Program
            = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
            = (code size) x (CPI) x (cycle time)
Cache organization affects cycle time
– Hit latency
Cache misses affect CPI
Cache Misses and CPI

CPI = cycles / inst
    = cycles_hit / inst + cycles_miss / inst
    = cycles_hit / inst + (miss / inst) x (cycles_miss / miss)
    = cycles_hit / inst + Miss_rate x Miss_penalty

Cycles spent handling misses are strictly additive
Miss_penalty is recursively defined at next level of cache hierarchy as weighted sum of hit latency and miss latency

Cache Misses and CPI

CPI = cycles_hit / inst + Σ (l = 1..n) P_l x MPI_l

P_l is miss penalty at each of n levels of cache
MPI_l is miss rate per instruction at each of n levels of cache
Miss rate specification:
– Per instruction: easy to incorporate in CPI
– Per reference: must convert to per instruction
  Local: misses per local reference
  Global: misses per ifetch or load or store
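The recursive definition of Miss_penalty can be sketched as an average access time computation: the penalty of a miss at level l is the time to access level l+1, itself a weighted sum of that level's hit and miss times. The hierarchy parameters below are hypothetical:

```python
def access_time(levels, memory_latency):
    """Average access time of the first level in `levels`.

    levels: list of (hit_cycles, local_miss_rate), ordered L1, L2, ...
    A miss at the last level goes to memory."""
    if not levels:
        return memory_latency
    hit, miss_rate = levels[0]
    # Weighted sum: hit latency plus miss rate times next level's access time
    return hit + miss_rate * access_time(levels[1:], memory_latency)

# Hypothetical hierarchy: L1 (1 cycle, 5% miss), L2 (10 cycles, 40% local miss)
t = access_time([(1, 0.05), (10, 0.40)], memory_latency=100)
# L2 sees 10 + 0.40*100 = 50; L1 sees 1 + 0.05*50 = 3.5 cycles on average
```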
Cache Performance Example

Assume the following:
– L1 instruction cache with 98% per instruction hit rate
– L1 data cache with 96% per instruction hit rate
– Shared L2 cache with 40% local miss rate
– L1 miss penalty of 8 cycles
– L2 miss penalty of:
  10 cycles latency to request word from memory
  2 cycles per 16B bus transfer, 4 x 16B = 64B block transferred
  Hence 8 cycles transfer plus 1 cycle to fill L2
  Total penalty 10 + 8 + 1 = 19 cycles

Cache Performance Example

CPI = 1.15 + (8 cycles/miss) x (0.02 miss/inst + 0.04 miss/inst)
          + (19 cycles/miss) x (0.40 miss/ref) x (0.06 ref/inst)
    = 1.15 + (8 cycles/miss) x (0.06 miss/inst) + (19 cycles/miss) x (0.024 miss/inst)
    = 1.15 + 0.48 + 0.456
    = 2.086
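The example maps directly onto the per-level formula CPI = cycles_hit/inst + Σ P_l x MPI_l. This sketch just re-does the slide's arithmetic with the stated assumptions (1.15 base CPI, 8- and 19-cycle penalties, 2% + 4% L1 misses, 40% local L2 miss rate):

```python
def cpi(cpi_hit, levels):
    """levels: list of (miss_penalty_cycles, misses_per_instruction)."""
    return cpi_hit + sum(p * mpi for p, mpi in levels)

mpi_l1 = 0.02 + 0.04      # I-cache + D-cache misses per instruction
mpi_l2 = 0.40 * mpi_l1    # local L2 miss rate converted to per-instruction
result = cpi(1.15, [(8, mpi_l1), (19, mpi_l2)])
print(round(result, 3))   # 2.086
```

Note the conversion on the second line: the 40% L2 miss rate is per L2 reference, so it is multiplied by the L1 misses per instruction to get a per-instruction rate before entering the sum.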
Cache Misses and Performance

CPI equation
– Only holds for misses that cannot be overlapped with other activity
– Store misses often overlapped
  Place store in store queue
  Wait for miss to complete
  Perform store
  Allow subsequent instructions to continue in parallel
– Modern out-of-order processors also do this for loads
Cache performance modeling requires detailed modeling of entire processor core

Caches Summary

Four questions
– Placement
  Direct-mapped, set-associative, fully-associative
– Identification
  Tag array used for tag check
– Replacement
  LRU, FIFO, Random
– Write policy
  Write-through, writeback
Caches: Set-associative

[Diagram: the address is hashed to an index that selects one set; the tags of all a ways in the set are read and compared (?=) in parallel against the address tag; the matching way's block is chosen from the a data blocks, and the offset selects the word for Data Out]

Caches: Direct-Mapped

[Diagram: the address is hashed to an index that selects a single entry in the SRAM cache; its tag is compared (?=) against the address tag, and the offset selects the word from the data block for Data Out]
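The tag/index/offset split these diagrams rely on is a fixed-width bit-field extraction. A sketch, with illustrative parameters (an 8KB direct-mapped cache with 64B blocks: 128 sets, so 7 index bits and 6 offset bits):

```python
def decompose(addr, index_bits, offset_bits):
    # Low bits select the byte within the block
    offset = addr & ((1 << offset_bits) - 1)
    # Middle bits select the set (the "hash" in the diagrams)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    # Remaining high bits are stored in the tag array and checked on access
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = decompose(0x12345, index_bits=7, offset_bits=6)
```

Making the cache 4-way set-associative with the same capacity would quarter the number of sets, shrinking the index by 2 bits and growing the tag by 2 bits.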
Caches: Fully-associative

[Diagram: the address tag is compared (?=) against all a tags in parallel, with no index; the matching entry's block is chosen from the a SRAM data cache blocks, and the offset selects the word for Data Out]

Caches Summary

Hit latency
– Block size, associativity, number of blocks
Miss penalty
– Overhead, fetch latency, transfer, fill
Miss rate
– 3 C’s: compulsory, capacity, conflict
– Determined by locality, cache organization

CPI = cycles_hit / inst + Σ (l = 1..n) P_l x MPI_l