CONTEMPORARY DRAM ARCHITECTURES AND BEYOND Bruce Jacob University of Maryland
Contemporary DRAM Architectures and Beyond Bruce Jacob Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/~blj/
OUTLINE:
• Motivation & Background
• Experiments
• Results
• More Recent Results
Sources
• “A Performance Study of Contemporary DRAM Architectures,” Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge
• Recent experiments by Vinodh Cuppu, Ph.D. student at University of Maryland
• Recent experiments by Brian Davis, Ph.D. student at University of Michigan
Dilemma: THIS ...
STATUS QUO in MEMORY-SYSTEM RESEARCH:
    ...
    if (memory_instruction(INSTR)) {
        if (L1_cache_miss(data_addr(INSTR))) {
            if (L2_cache_miss(data_addr(INSTR))) {
                /* every L2 miss is charged the same fixed latency */
                cycles += DRAM_LATENCY;
            }
        }
    }
    ...
... or THIS
Motivation
HERE’S WHAT YOU MISS:
DRAM LATENCY: DATA TRANSFER, OVERLAP, COLUMN ACCESS, ROW ACCESS, BUS TRANSMISSION
[Figure: two diagrams of a CPU connected through a memory controller (MC) and a bus to several DRAMs]
Goal
PRELIMINARY DRAM STUDY:
• Bus Transmission
• Row Access
• Column Access
• Data Transfer
• Bus Wait/Synch Time
• Stalls Due to Refresh
• The OVERLAP of These Components (with each other) (with CPU execution)
MODEL EXISTING TECHNOLOGY
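The components listed above can be combined into a single per-access latency, in contrast to a fixed DRAM_LATENCY constant. The following is a minimal sketch, assuming the components add serially and that any overlapped time is subtracted once. The struct, its field names, and the function name are hypothetical and are not taken from the study's simulator.

```c
#include <assert.h>

/* Hypothetical per-access latency components, in nanoseconds.
 * Field names mirror the study's component list; values come from the caller. */
struct dram_access {
    int bus_transmission_ns;
    int row_access_ns;
    int column_access_ns;
    int data_transfer_ns;
    int bus_wait_ns;
    int refresh_stall_ns;
    int overlap_ns;   /* time hidden by overlap among components or with the CPU */
};

/* Assumption: components add serially and overlapped time is subtracted once. */
int visible_latency_ns(const struct dram_access *a)
{
    int sum = a->bus_transmission_ns + a->row_access_ns + a->column_access_ns
            + a->data_transfer_ns + a->bus_wait_ns + a->refresh_stall_ns;

    /* The hidden time can never exceed the total being hidden. */
    assert(a->overlap_ns >= 0 && a->overlap_ns <= sum);
    return sum - a->overlap_ns;
}
```

A caller would fill one struct per access and accumulate the returned value, so that two accesses with different row or bus behavior are no longer charged the same cost.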
DRAM Primer
[Figure, repeated on five slides: block diagram of CPU, MEMORY CONTROLLER, BUS, and DRAM (Row Decoder, Memory Array, Bit Lines, Sense Amps, Column Decoder, Data In/Out Buffers). The slides are titled in turn:]
• BUS TRANSMISSION
• ROW ACCESS
• COLUMN ACCESS
• DATA TRANSFER (note: page mode enables overlap with COL)
• BUS TRANSMISSION (note: overlapped component not shown)
DRAM Primer
Read Timing for Conventional DRAM
[Figure: timing diagram. Signals: RAS, CAS, Address, DQ. Address carries Row Address, Column Address, Row Address, Column Address; DQ carries two Valid Dataout. Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Fast Page Mode DRAM
[Figure: timing diagram. Signals: RAS, CAS, Address, DQ. Address carries one Row Address followed by three Column Addresses; DQ carries three Valid Dataout. Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Extended Data Out DRAM
[Figure: timing diagram. Signals: RAS, CAS, Address, DQ. Address carries one Row Address followed by three Column Addresses; DQ carries three Valid Dataout. Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Synchronous DRAM
[Figure: timing diagram. Signals: Clock, RAS, CAS, Address, DQ. Address carries one Row Address and one Column Address; DQ carries three Valid Dataout. Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]
DRAM Primer
Read Timing for Rambus DRAM
[Figure: timing diagram with a "4 cycles" marker. Signals: Command (ACTV/READ, Read Strobe, Read Term), Address (Bank/Row, then three Col Addr), DQ (three Valid Dataout). Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]
Simulator Overview
CPU: SimpleScalar v3.0a
• 8-way out-of-order
• L1 cache: split 64K/64K, lockup free x32
• L2 cache: unified 1MB, lockup free x1
• L2 blocksize: 128 bytes
Main Memory: 8 64Mb DRAMs
• 100MHz/128-bit memory bus
• Optimistic open-page policy (close-immediately can be calculated)
Represents a “typical” workstation
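The difference between an open-page policy and a close-immediately policy can be illustrated with a small per-bank latency model. This is a sketch of one common reading of the two policies, not the simulator's actual code: the struct names, function names, and the three timing parameters are hypothetical, and the slides do not define what "optimistic" means in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical timing parameters in nanoseconds; not taken from the slides. */
struct dram_timing {
    int precharge_ns;
    int row_access_ns;
    int column_access_ns;
};

/* State of one bank under an open-page policy. */
struct bank_state {
    bool row_open;
    int  open_row;
};

/* Open-page: the row stays open after each access. A hit on the open row
 * pays only the column access; a different row pays precharge + row + column. */
int open_page_access_ns(struct bank_state *b, int row, const struct dram_timing *t)
{
    int ns;

    assert(t->precharge_ns >= 0 && t->row_access_ns >= 0 && t->column_access_ns >= 0);
    if (b->row_open && b->open_row == row) {
        ns = t->column_access_ns;                          /* row hit */
    } else if (b->row_open) {
        ns = t->precharge_ns + t->row_access_ns
           + t->column_access_ns;                          /* row conflict */
    } else {
        ns = t->row_access_ns + t->column_access_ns;       /* bank idle */
    }
    b->row_open = true;
    b->open_row = row;
    return ns;
}

/* Close-immediately: assume the precharge happens right after each access,
 * off the critical path, so every access pays row + column. */
int close_immediately_access_ns(const struct dram_timing *t)
{
    return t->row_access_ns + t->column_access_ns;
}
```

Under these assumptions the open-page policy wins whenever consecutive accesses to a bank fall in the same row, and loses by the precharge time when they do not.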
DRAM Configurations
FPM, EDO, SDRAM, ESDRAM:
[Figure: CPU and caches, 128-bit 100MHz bus, Memory Controller, eight x16 DRAMs]
Rambus, Direct Rambus, SLDRAM:
[Figure: CPU and caches, 128-bit 100MHz bus, Memory Controller, Fast, Narrow Channel, DRAMs on a DIMM]
Note: TRANSFER WIDTH of Direct Rambus Channel
• equals that of ganged FPM, EDO, etc.
• is 2x that of Rambus & SLDRAM
DRAM Configurations
Strawman: Rambus, etc.
[Figure: CPU and caches, 128-bit 100MHz bus, Memory Controller, DRAMs on Multiple Parallel Channels]
Overhead: Memory vs. CPU
[Figure: bar chart of Clocks Per Instruction (CPI, 0 to 3) by BENCHMARK (Compress, Go, Ijpeg, Li, Perl, Vortex), with bars for Yesterday’s CPU, Today’s CPU, and Tomorrow’s CPU. Components: Processor Execution (includes caches), Overlap between Execution & Memory, Stalls due to Memory Access Time.]
Variable: speed of processor & caches
Definitions (var. on Burger, et al)
• tPROC: processor with perfect memory
• tREAL: realistic configuration
• tBW: CPU with wide memory paths
• tDRAM: time seen by DRAM system
[Figure: total time tREAL divided into four components, with tDRAM, tBW, and tPROC marked]
• Stalls Due to BANDWIDTH = tREAL - tBW
• Stalls Due to LATENCY = tBW - tPROC
• CPU-Memory OVERLAP = tPROC - (tREAL - tDRAM)
• CPU+L1+L2 Execution = tREAL - tDRAM
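The four definitions translate directly into arithmetic on measured times. The sketch below computes the breakdown; the struct, the function name, and the sanity checks are illustrative assumptions, while the four formulas are the ones given on the slide. The components sum to tREAL by construction.

```c
#include <assert.h>

/* Execution-time breakdown, in cycles, following the slide's four formulas. */
struct time_breakdown {
    long bandwidth_stalls;  /* tREAL - tBW */
    long latency_stalls;    /* tBW - tPROC */
    long overlap;           /* tPROC - (tREAL - tDRAM) */
    long cpu_only;          /* tREAL - tDRAM */
};

struct time_breakdown break_down(long t_proc, long t_bw, long t_real, long t_dram)
{
    struct time_breakdown b;

    /* Assumed ordering: perfect memory <= wide paths <= realistic system. */
    assert(t_proc <= t_bw && t_bw <= t_real);
    /* Assumed: the DRAM system is busy for no longer than the whole run,
     * and the overlap component is not negative. */
    assert(t_dram <= t_real && t_real - t_dram <= t_proc);

    b.bandwidth_stalls = t_real - t_bw;
    b.latency_stalls   = t_bw - t_proc;
    b.cpu_only         = t_real - t_dram;
    b.overlap          = t_proc - b.cpu_only;
    return b;
}
```

For example, with the made-up inputs tPROC = 100, tBW = 130, tREAL = 150, and tDRAM = 80, the breakdown is 20 cycles of bandwidth stalls, 30 of latency stalls, 30 of overlap, and 70 of CPU-only execution, which add up to tREAL.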
Memory & CPU: PERL
[Figure: bar chart of Cycles Per Instruction (CPI, 0 to 5) by DRAM Configuration (FPM, EDO, SDRAM, ESDRAM, DRDRAM), with bars for Yesterday’s CPU, Today’s CPU, and Tomorrow’s CPU. Components: Processor Execution, Overlap between Execution & Memory, Stalls due to Memory Latency, Stalls due to Memory Bandwidth.]
Average Latency of DRAMs
[Figure: bar chart of Time per Access (ns, 0 to 500) by DRAM Configuration (FPM, EDO, SDRAM, ESDRAM, SLDRAM, RDRAM, DRDRAM). Components: Bus Transmission Time, Row Access Time, Column Access Time, Data Transfer Time Overlap, Data Transfer Time, Refresh Time, Bus Wait Time.]
note: SLDRAM & RDRAM 2x data transfers
Cost-Performance
FPM, EDO, SDRAM, ESDRAM:
• Lower Latency => Wide/Fast Bus
• Increase Capacity => Decrease Latency
• Low System Cost
Rambus, Direct Rambus, SLDRAM:
• Lower Latency => Multiple Channels
• Increase Capacity => Increase Capacity
• High System Cost
1 DRDRAM = Multiple SDRAM
Conclusions
100MHz/128-bit Bus is Current Bottleneck
• Solution: Fast Bus/es & MC on CPU (e.g. Compaq Alpha, Sony Emotion, ...)
Current DRAMs Solving Bandwidth Problem (but not Latency Problem)
There is Locality in DRAM Accesses (but how important is this?)
SPECint ’95 Fits in 1MB Cache
Recent (Unfinished) Work
Investigation of Organization-Level Parameters:
• Channel widths & speeds, turnaround
• Independent vs. ganged channels
• Banks per channel, burst widths
Detailed Study of DRDRAM vs. SDRAM in Highly Concurrent Environment
Embedded DRAM+DSP Architectures
Detailed Study of Multiprocessor Buses
Channel/Bank Model
[Figure: three organizations drawn from blocks labeled C and D]
• One independent channel, Banking degrees of 1, 2, 4, ...
• Two independent channels, Banking degrees of 1, 2, 4, ...
• Four independent channels, Banking degrees of 1, 2, 4, ...
Read/Write Request Model
[Figure: timing diagrams, three cases each for reads and writes, every case starting at t0 and showing the ADDRESS BUS, DRAM BANK, and DATA BUS. Intervals marked in each case:]
READ REQUESTS:
• 10ns, 90ns, 70ns, 10ns
• 10ns, 90ns, 70ns, 20ns
• 10ns, 100ns, 70ns, 40ns
WRITE REQUESTS:
• 10ns, 90ns, 40ns, 10ns
• 10ns, 90ns, 40ns, 20ns
• 10ns, 90ns, 40ns, 40ns
Concurrency Model
[Figure: timing diagrams for three pairs of overlapped requests (R = read, W = write), with intervals of 10ns, 20ns, 40ns, 70ns, and 90ns marked]
• Legal if R/R to different banks
• Legal if no turnaround and R/W to different banks
• Legal if turnaround ≤ 10ns and R/W to different banks
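The legality conditions above can be expressed as a check on two scheduled requests. The sketch below is an assumption-laden illustration, not the model used in the experiments: the request record, its half-open intervals, and the function names are hypothetical. It treats two requests as legally overlapped when they target different banks, do not collide on the address or data bus, and, for a read paired with a write, leave at least the bus turnaround time between their data-bus intervals.

```c
#include <assert.h>
#include <stdbool.h>

enum req_kind { REQ_READ, REQ_WRITE };

/* Hypothetical request record: half-open [start, end) intervals in ns. */
struct request {
    enum req_kind kind;
    int bank;
    int addr_start, addr_end;   /* address bus occupancy */
    int data_start, data_end;   /* data bus occupancy */
};

static bool intervals_overlap(int s1, int e1, int s2, int e2)
{
    return s1 < e2 && s2 < e1;
}

/* Returns true if requests a and b may be in flight together. */
bool can_overlap(const struct request *a, const struct request *b, int turnaround_ns)
{
    assert(turnaround_ns >= 0);

    if (a->bank == b->bank)
        return false;   /* same bank: must serialize */
    if (intervals_overlap(a->addr_start, a->addr_end, b->addr_start, b->addr_end))
        return false;   /* address bus collision */
    if (intervals_overlap(a->data_start, a->data_end, b->data_start, b->data_end))
        return false;   /* data bus collision */

    if (a->kind != b->kind) {
        /* Read/write pair: the idle gap on the data bus must cover turnaround. */
        int gap = (a->data_end <= b->data_start) ? b->data_start - a->data_end
                                                 : a->data_start - b->data_end;
        if (gap < turnaround_ns)
            return false;
    }
    return true;
}
```

With this check, a read/write schedule that leaves a 10ns gap on the data bus is accepted for any turnaround of 10ns or less and rejected for a longer one.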
Bandwidth vs. Burst Width
[Figure: Cycles per Instruction (0 to 1.25) vs. System Bandwidth (0.4, 0.8, 1.6, 3.2, 6.4 GB/s), one series per burst width: 8-Byte, 16-Byte, 32-Byte, 64-Byte, 128-Byte]
System Bandwidth (GB/s) = Channels * Width * Speed
PERL: 1 channel, 4 banks, 2GHz CPU
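The bandwidth formula on this slide (Channels * Width * Speed) is simple enough to check by hand, and a small helper makes the units explicit. This is a minimal sketch: the function name is hypothetical, and it assumes one transfer per bus cycle and a bus width that is a whole number of bytes.

```c
#include <assert.h>

/* System bandwidth in MB/s = channels * bus width in bytes * speed in MHz.
 * Assumption: one transfer per cycle, width given in bits (multiple of 8). */
long system_bandwidth_mb_per_s(int channels, int bus_width_bits, int speed_mhz)
{
    assert(channels > 0 && speed_mhz > 0);
    assert(bus_width_bits > 0 && bus_width_bits % 8 == 0);
    return (long)channels * (bus_width_bits / 8) * speed_mhz;
}
```

Under these assumptions, one 8-bit channel at 400 MHz gives 400 MB/s (0.4 GB/s), and a single 100MHz/128-bit bus gives 1600 MB/s (1.6 GB/s).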
Exploiting Concurrency
[Figure: Cycles per Instruction (0 to 1.25) vs. Total Datapath Bitwidth (8, 16, 32, 64, 128, 256 bits), with series for 8-Bit, 16-Bit, 32-Bit, and 64-Bit Data Buses, grouped as 400 MHz x 1 channel, 400 MHz x 2 channels, and 400 MHz x 4 channels]
Total Datapath Bitwidth (bits) = Channels * BusWidth
PERL: 2 banks, 16-byte burst, 2GHz CPU
Conclusions
None yet ... preliminary data
CONTACT INFO:
Prof. Bruce Jacob
Electrical & Computer Engineering
University of Maryland, College Park
http://www.ece.umd.edu/~blj/
[email protected]