Contemporary DRAM Architectures and Beyond

Contemporary DRAM Architectures and Beyond Bruce Jacob Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/~blj/

OUTLINE:

• Motivation & Background
• Experiments
• Results
• More Recent Results

Sources

• “A Performance Study of Contemporary DRAM Architectures,” Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge
• Recent experiments by Vinodh Cuppu, Ph.D. student at University of Maryland
• Recent experiments by Brian Davis, Ph.D. student at University of Michigan

Dilemma: THIS ...

STATUS QUO in MEMORY-SYSTEM RESEARCH:

    ...
    if (memory_instruction(INSTR)) {
        if (L1_cache_miss(data_addr(INSTR))) {
            if (L2_cache_miss(data_addr(INSTR))) {
                /* every L2 miss is charged the same fixed latency */
                cycles += DRAM_LATENCY;
            }
        }
    }
    ...

... or THIS

[Figure: CPU connected over a bus to a memory controller (MC) and several DRAMs]

Motivation

HERE’S WHAT YOU MISS:

DRAM LATENCY: BUS TRANSMISSION, ROW ACCESS, COLUMN ACCESS, DATA TRANSFER OVERLAP, DATA TRANSFER

[Figure: CPU, bus, memory controller (MC), and DRAMs, with the latency components above marked]

Goal

PRELIMINARY DRAM STUDY:

• Bus Transmission
• Row Access
• Column Access
• Data Transfer
• Bus Wait/Synch Time
• Stalls Due to Refresh
• The OVERLAP of These Components (with each other) (with CPU execution)

MODEL EXISTING TECHNOLOGY
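
A minimal sketch of what tracking these components separately could look like, in contrast to adding one fixed DRAM_LATENCY per miss. This is not the simulator's code: the names `dram_access_t` and `dram_access_latency` are hypothetical, and subtracting the overlap once is an assumption about how the components combine.

```c
/* Hypothetical per-access record: one field per latency component
 * named in the list above. All times are in nanoseconds. */
typedef struct {
    double bus_transmission;
    double row_access;
    double column_access;
    double data_transfer;
    double bus_wait;        /* bus wait/synch time */
    double refresh_stall;   /* stall due to refresh */
    double overlap;         /* time in which components overlap each other */
} dram_access_t;

/* Latency seen by the requester: the sum of the components minus the
 * overlapped portion (assumption: overlap is subtracted exactly once). */
double dram_access_latency(const dram_access_t *a)
{
    return a->bus_transmission + a->row_access + a->column_access
         + a->data_transfer + a->bus_wait + a->refresh_stall
         - a->overlap;
}
```

With placeholder values of 10, 30, 30, 40, 5, and 0 ns for the six components and 15 ns of overlap, the function returns 100 ns. Keeping the fields separate is what lets each component be reported on its own.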

DRAM Primer

[Figure, repeated on five slides with one stage highlighted each time: CPU, MEMORY CONTROLLER, and BUS connected to a DRAM made up of a Row Decoder, Memory Array, Bit Lines, Sense Amps, Column Decoder, and Data In/Out Buffers]

Highlighted stages, in slide order:

• BUS TRANSMISSION
• ROW ACCESS
• COLUMN ACCESS
• DATA TRANSFER (note: page mode enables overlap with COL)
• BUS TRANSMISSION (note: overlapped component not shown)

Read Timing for Conventional DRAM

[Figure: timing diagram with signals RAS, CAS, Address (Row Address, Column Address), and DQ (Valid Dataout), shown for two consecutive reads; legend: Row Access, Column Access, Transfer Overlap, Data Transfer]

Read Timing for Fast Page Mode DRAM

[Figure: timing diagram with signals RAS, CAS, Address (one Row Address followed by three Column Addresses), and DQ (three Valid Dataout periods); legend: Row Access, Column Access, Transfer Overlap, Data Transfer]
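
The difference between the conventional and fast-page-mode diagrams can be sketched as simple arithmetic: conventional DRAM supplies a row address for every read, while page mode supplies it once and then issues several column addresses. The names `dram_timing_t`, `conventional_read_ns`, and `page_mode_read_ns` are hypothetical, the model ignores precharge and any overlap between stages, and the timing values are parameters rather than figures from the study.

```c
/* Hypothetical timing parameters, in nanoseconds. */
typedef struct {
    double t_row;   /* row access time */
    double t_col;   /* column access time, including data out */
} dram_timing_t;

/* Conventional DRAM: every read pays a row access and a column access. */
double conventional_read_ns(const dram_timing_t *t, int n_reads)
{
    if (n_reads <= 0) return 0.0;
    return n_reads * (t->t_row + t->t_col);
}

/* Fast page mode: the row stays open, so n reads to the same row pay
 * one row access followed by n column accesses. */
double page_mode_read_ns(const dram_timing_t *t, int n_reads)
{
    if (n_reads <= 0) return 0.0;
    return t->t_row + n_reads * t->t_col;
}
```

With placeholder timings of 40 ns per row access and 20 ns per column access, three reads to the same row cost 180 ns conventionally and 100 ns in page mode under this simplified model.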

Read Timing for Extended Data Out DRAM

[Figure: timing diagram with signals RAS, CAS, Address (one Row Address followed by three Column Addresses), and DQ (three Valid Dataout periods); legend: Row Access, Column Access, Transfer Overlap, Data Transfer]

Read Timing for Synchronous DRAM

[Figure: timing diagram with signals Clock, RAS, CAS, Address (Row Address, Column Address), and DQ (three Valid Dataout periods); legend: Row Access, Column Access, Transfer Overlap, Data Transfer]

Read Timing for Rambus DRAM

[Figure: timing diagram with signals Command (ACTV/READ, Read Strobe, Read Term), Address (Bank/Row followed by three Col Addr), and DQ (three Valid Dataout periods), with an interval labelled “4 cycles”; legend: Row Access, Column Access, Transfer Overlap, Data Transfer]

Simulator Overview

CPU: SimpleScalar v3.0a

• 8-way out-of-order
• L1 cache: split 64K/64K, lockup free x32
• L2 cache: unified 1MB, lockup free x1
• L2 blocksize: 128 bytes

Main Memory: 8 64Mb DRAMs

• 100MHz/128-bit memory bus
• Optimistic open-page policy (close-immediately can be calculated)

Represents a “typical” workstation
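
As a worked example of what this configuration implies, the sketch below computes how long one 128-byte L2 block occupies the 100MHz/128-bit bus. The function names are hypothetical, and the model assumes one transfer per bus clock and a bus width that is a multiple of 8 bits.

```c
/* Number of bus cycles needed to move `block_bytes` over a bus that is
 * `bus_bits` wide, rounding up to whole cycles.
 * Assumption: bus_bits is a nonzero multiple of 8. */
unsigned transfer_cycles(unsigned block_bytes, unsigned bus_bits)
{
    unsigned bytes_per_cycle = bus_bits / 8;
    return (block_bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}

/* Transfer time in nanoseconds at a bus clock of `bus_mhz`.
 * Assumption: one transfer per clock cycle. */
double transfer_ns(unsigned block_bytes, unsigned bus_bits, double bus_mhz)
{
    double cycle_ns = 1000.0 / bus_mhz;
    return transfer_cycles(block_bytes, bus_bits) * cycle_ns;
}
```

For the parameters on this slide, `transfer_cycles(128, 128)` is 8 and `transfer_ns(128, 128, 100.0)` is 80 ns, so the data transfer alone accounts for eight bus cycles per L2 miss.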

DRAM Configurations

FPM, EDO, SDRAM, ESDRAM:

[Figure: CPU and caches connected by a 128-bit 100MHz bus to the Memory Controller, which drives eight x16 DRAMs]

Rambus, Direct Rambus, SLDRAM:

[Figure: CPU and caches connected by a 128-bit 100MHz bus to the Memory Controller, which drives the DRAMs on a DIMM over a Fast, Narrow Channel]

Note: TRANSFER WIDTH of Direct Rambus Channel

• equals that of ganged FPM, EDO, etc.
• is 2x that of Rambus & SLDRAM

DRAM Configurations

Strawman: Rambus, etc.

[Figure: CPU and caches connected by a 128-bit 100MHz bus to the Memory Controller, which drives DRAMs over Multiple Parallel Channels]

Overhead: Memory vs. CPU

[Figure: Clocks Per Instruction (CPI), 0 to 3, for the benchmarks Compress, Go, Ijpeg, Li, Perl, and Vortex, each shown for Yesterday’s CPU, Today’s CPU, and Tomorrow’s CPU; legend: Processor Execution (includes caches), Overlap between Execution & Memory, Stalls due to Memory Access Time]

Variable: speed of processor & caches

Definitions (var. on Burger, et al)

• tPROC: processor with perfect memory
• tREAL: realistic configuration
• tBW: CPU with wide memory paths
• tDRAM: time seen by DRAM system

[Figure: tREAL divided into four stacked components, with tDRAM, tBW, and tPROC marked]

• Stalls Due to BANDWIDTH = tREAL - tBW
• Stalls Due to LATENCY = tBW - tPROC
• CPU-Memory OVERLAP = tPROC - (tREAL - tDRAM)
• CPU+L1+L2 Execution = tREAL - tDRAM
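
The four quantities can be computed directly from the four measured times. The sketch below does that; the names `time_breakdown_t` and `break_down` are hypothetical, and the inputs are whatever units the measurements use (cycles or CPI).

```c
/* The four components that together make up tREAL. */
typedef struct {
    double cpu_execution;     /* CPU+L1+L2 execution:  tREAL - tDRAM            */
    double overlap;           /* CPU-memory overlap:   tPROC - (tREAL - tDRAM)  */
    double latency_stalls;    /* stalls due to latency:   tBW - tPROC           */
    double bandwidth_stalls;  /* stalls due to bandwidth: tREAL - tBW           */
} time_breakdown_t;

/* Apply the definitions from the slide to four measured times. */
time_breakdown_t break_down(double t_proc, double t_real,
                            double t_bw, double t_dram)
{
    time_breakdown_t b;
    b.cpu_execution    = t_real - t_dram;
    b.overlap          = t_proc - b.cpu_execution;
    b.latency_stalls   = t_bw - t_proc;
    b.bandwidth_stalls = t_real - t_bw;
    return b;
}
```

The four components always sum to tREAL: execution plus overlap gives tPROC, adding latency stalls gives tBW, and adding bandwidth stalls gives tREAL. With placeholder inputs tPROC = 100, tREAL = 180, tBW = 150, and tDRAM = 120, the result is 60, 40, 50, and 30.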

Memory & CPU: PERL

[Figure: Cycles Per Instruction (CPI), 0 to 5, by DRAM Configuration (FPM, EDO, SDRAM, ESDRAM, DRDRAM), each shown for Yesterday’s CPU, Today’s CPU, and Tomorrow’s CPU; legend: Processor Execution, Overlap between Execution & Memory, Stalls due to Memory Latency, Stalls due to Memory Bandwidth]

Average Latency of DRAMs

[Figure: Time per Access (ns), 0 to 500, by DRAM Configuration (FPM, EDO, SDRAM, ESDRAM, SLDRAM, RDRAM, DRDRAM); legend: Bus Transmission Time, Row Access Time, Column Access Time, Data Transfer Time Overlap, Data Transfer Time, Refresh Time, Bus Wait Time]

note: SLDRAM & RDRAM 2x data transfers

Cost-Performance

FPM, EDO, SDRAM, ESDRAM:

• Lower Latency => Wide/Fast Bus
• Increase Capacity => Decrease Latency
• Low System Cost

Rambus, Direct Rambus, SLDRAM:

• Lower Latency => Multiple Channels
• Increase Capacity => Increase Capacity
• High System Cost

1 DRDRAM = Multiple SDRAM

Conclusions

100MHz/128-bit Bus is Current Bottleneck

• Solution: Fast Bus/es & MC on CPU (e.g. Compaq Alpha, Sony Emotion, ...)

Current DRAMs Solving Bandwidth Problem (but not Latency Problem)

There is Locality in DRAM Accesses (but how important is this?)

SPECint ’95 Fits in 1MB Cache

Recent (Unfinished) Work

Investigation of Organization-Level Parameters:

• Channel widths & speeds, turnaround
• Independent vs. ganged channels
• Banks per channel, burst widths

Detailed Study of DRDRAM vs. SDRAM in Highly Concurrent Environment

Embedded DRAM+DSP Architectures

Detailed Study of Multiprocessor Buses

Channel/Bank Model

[Figure: DRAM banks (D) attached to channels (C) in three organizations]

• One independent channel, banking degrees of 1, 2, 4, ...
• Two independent channels, banking degrees of 1, 2, 4, ...
• Four independent channels, banking degrees of 1, 2, 4, ...

Read/Write Request Model

[Figure: timing diagrams for three READ REQUESTS and three WRITE REQUESTS, each starting at t0 and occupying the ADDRESS BUS, a DRAM BANK, and the DATA BUS in turn]

• READ REQUESTS, labelled intervals: ADDRESS BUS 10ns; DRAM BANK 90ns (100ns in the third example); 70ns; DATA BUS 10ns, 20ns, and 40ns in the three examples
• WRITE REQUESTS, labelled intervals: ADDRESS BUS 10ns; DRAM BANK 90ns; 40ns; DATA BUS 10ns, 20ns, and 40ns in the three examples

Concurrency Model

[Figure: three pairs of overlapped request timelines (R = read, W = write), labelled with the intervals 10ns, 20ns, 40ns, 70ns, and 90ns, one pair per caption below]

• Legal if R/R to different banks
• Legal if no turnaround and R/W to different banks
• Legal if turnaround ≤ 10ns and R/W to different banks
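
One plausible reading of these three cases is a resource-conflict check: two requests may overlap if they target different banks, do not collide on the address bus or the data bus, and, when a read and a write are mixed, leave at least the turnaround time between their data-bus slots. The sketch below encodes that reading only. The type `request_t`, the function `can_overlap`, and the interval fields are hypothetical, and the exact rule used in the experiments is not stated on the slide.

```c
#include <stdbool.h>

typedef enum { REQ_READ, REQ_WRITE } req_type_t;

/* Hypothetical request record. Times are in ns from a common origin. */
typedef struct {
    req_type_t type;
    int bank;
    double addr_start, addr_end;   /* address-bus occupancy [start, end) */
    double data_start, data_end;   /* data-bus occupancy    [start, end) */
} request_t;

/* True if the half-open intervals [s1, e1) and [s2, e2) overlap. */
static bool intervals_overlap(double s1, double e1, double s2, double e2)
{
    return s1 < e2 && s2 < e1;
}

/* Idle time between two data-bus slots that do not overlap. */
static double data_gap(const request_t *a, const request_t *b)
{
    return (a->data_end <= b->data_start) ? b->data_start - a->data_end
                                          : a->data_start - b->data_end;
}

/* Assumed legality rule: different banks, no bus collisions, and a
 * read/write pair must be separated by the bus turnaround time. */
bool can_overlap(const request_t *a, const request_t *b, double turnaround_ns)
{
    if (a->bank == b->bank)
        return false;
    if (intervals_overlap(a->addr_start, a->addr_end,
                          b->addr_start, b->addr_end))
        return false;
    if (intervals_overlap(a->data_start, a->data_end,
                          b->data_start, b->data_end))
        return false;
    if (a->type != b->type && data_gap(a, b) < turnaround_ns)
        return false;
    return true;
}
```

Under this rule, two reads to different banks with back-to-back data slots are legal at any turnaround, while a read and a write with back-to-back data slots are legal only when the turnaround is zero.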

Bandwidth vs. Burst Width

[Figure: Cycles per Instruction, 0 to 1.25, versus System Bandwidth (GB/s = Channels * Width * Speed) at 0.4, 0.8, 1.6, 3.2, and 6.4, with one series per burst width: 8, 16, 32, 64, and 128 bytes]

PERL: 1 channel, 4 banks, 2GHz CPU
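
The axis formula, System Bandwidth = Channels * Width * Speed, can be sketched as below. The function name is hypothetical and one transfer per clock is assumed. The axis values are consistent with a single 400 MHz channel at widths of 8 to 128 bits, but whether the plotted points were generated that way is an assumption.

```c
/* System bandwidth in GB/s for `channels` independent channels, each
 * `width_bits` wide and clocked at `speed_mhz`.
 * Assumption: one transfer per clock; 1 GB = 1e9 bytes. */
double system_bandwidth_gbs(unsigned channels, unsigned width_bits,
                            double speed_mhz)
{
    double bytes_per_transfer = width_bits / 8.0;
    return channels * bytes_per_transfer * speed_mhz / 1000.0;
}
```

Under these assumptions, one 8-bit channel at 400 MHz gives 0.4 GB/s and one 128-bit channel at 400 MHz gives 6.4 GB/s, matching the two ends of the axis.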

Exploiting Concurrency

[Figure: Cycles per Instruction, 0 to 1.25, versus Total Datapath Bitwidth (bits = Channels * BusWidth) at 8, 16, 32, 64, 128, and 256, for 8-, 16-, 32-, and 64-bit data buses, grouped as 400 MHz x 1 channel, 400 MHz x 2 channels, and 400 MHz x 4 channels]

PERL: 2 banks, 16-byte burst, 2GHz CPU

Conclusions

None yet ... preliminary data

CONTACT INFO: Prof. Bruce Jacob, Electrical & Computer Engineering, University of Maryland, College Park, http://www.ece.umd.edu/~blj/ [email protected]