3D-DRAM Circuit Design, Modeling and Exploration for Computer Memory Hierarchy

Author: Rodger Sullivan
Rakesh Anigu, Hongbin Sun, James J.-Q. Lu, Ken Rose, and Tong Zhang
Electrical, Computer and Systems Engineering Department, Rensselaer Polytechnic Institute

Motivation

But…
• TSV size/pitch
• Thermal
• Yield loss
• EDA tools
• Equipment
• Cost
• …

Motivation

Naturally embraces the immaturity of 3D integration
• TSV size/pitch: ✓ Coarse-grained die-to-die interconnect only
• Thermal: ✓ Inherently low power and less heat
• Yield loss: ✓ Easy to achieve very high defect tolerance
• EDA tools: ✓ Minimal departure from 2D design
• Equipment: ✓ Big $$$ market
• Cost: ✓ Higher-end, definitely not commodity

Why 3D Processor-DRAM Integration

[Figure: overall performance vs. time, showing the memory wall and bandwidth wall (Dr. Phil Emma @ IBM)]

Move more memory closer to processor cores at minimal extra cost!

3D Processor-DRAM Integration

Why 3D Processor-DRAM Integration

[Figure: 3D stack of DRAM dies and a processor die]
• Almost no yield loss
• 2D design know-how
• Coarse-grained TSVs
• Thermal friendly
• Justifiable cost

To break the memory & bandwidth wall! Quantitatively evaluate the potential.

Outline
• Motivation
• 3D DRAM Architecture Design
• 3D Processor-DRAM Integration
• Conclusions

3D DRAM Architecture Design

[Figure: stacked commodity DRAM dies and a processor die]
• L2 cache ⇔ main memory: bandwidth, latency
• CACTI 5 → 1Gb 2D DRAM @ 65nm: area, latency, energy

3D DRAM Architecture Design

Stacked Commodity DRAM → Customized 3D DRAM

At which granularity should we carry out 3D mapping?
• Intra-sub-array 3D mapping: fine-grained TSVs
• Inter-sub-array 3D mapping: coarse-grained TSVs

Inter-Sub-Array 3D Mapping

[Figure: top view, showing the TSV I/Os]

3D Sub-Array Set

[Figure: 2D sub-arrays distributed across dies, connected by a TSV bundle carrying the address bus and data bus]

Multi-layer data access (MLDA)
• All 2D sub-arrays are activated
• Each handles a portion of data

Single-layer data access (SLDA)
• Only one 2D sub-array is activated
• One 2D sub-array handles all data

The two schemes are compared in terms of TSVs and energy.
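The difference between the two access schemes can be sketched in a few lines of Python. This is only an illustration of the bullets above, not the authors' model: the function name `access_plan`, the even split of data across layers in MLDA, and the `target_layer` parameter used to pick the active layer in SLDA are all assumptions made for the example.

```python
def access_plan(mode, num_layers, data_bits, target_layer=0):
    """Return {layer index: bits handled} for one access to a 3D sub-array set.

    Hypothetical helper: the even split (MLDA) and the explicit
    target_layer argument (SLDA) are assumptions, not from the slides.
    """
    if mode == "MLDA":
        # All 2D sub-arrays are activated; each handles a portion of the data.
        if data_bits % num_layers != 0:
            raise ValueError("data_bits must divide evenly across layers")
        share = data_bits // num_layers
        return {layer: share for layer in range(num_layers)}
    if mode == "SLDA":
        # Only one 2D sub-array is activated; it handles all the data.
        if not 0 <= target_layer < num_layers:
            raise ValueError("target_layer out of range")
        return {target_layer: data_bits}
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Illustrative numbers only: 8 layers and a 256-bit access.
    print(access_plan("MLDA", 8, 256))
    print(access_plan("SLDA", 8, 256, target_layer=3))
```

With these illustrative inputs, MLDA spreads the access over all eight layers while SLDA keeps it on a single layer, which is the distinction the slide draws.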

3D DRAM Architecture Design

Inter-sub-array 3D mapping
• Small number of TSVs (1K~10K)
• Intact individual DRAM sub-array design
• Distributed global routing → performance gain

Modified CACTI 5 to support inter-sub-array 3D mapping

Case study: 1Gb with 8 banks and 256-bit I/O @ 65nm
• 2D vs. 3D die packaging (i.e., no TSVs) vs. 3D DRAM with SLDA vs. 3D DRAM with MLDA
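To get a feel for the scale of the case study, the capacity arithmetic can be worked through in Python. Only the 1Gb capacity and the 8 banks come from the case study itself. The 1024x256 sub-array size is borrowed from the redundancy-repair slide and the 8 layers from the evaluation slide, so treating them as part of this case study is an assumption, and the helper `partition` is hypothetical.

```python
GBIT = 2**30  # bits in 1Gb


def partition(capacity_bits, banks, subarray_rows, subarray_cols, layers):
    """Split a DRAM capacity into banks, sub-arrays, and per-layer sub-arrays.

    Hypothetical helper for back-of-the-envelope sizing; it ignores
    redundancy, peripheral circuits, and any uneven partitioning.
    """
    bits_per_bank = capacity_bits // banks
    bits_per_subarray = subarray_rows * subarray_cols
    subarrays_per_bank = bits_per_bank // bits_per_subarray
    return {
        "bits_per_bank": bits_per_bank,
        "subarrays_per_bank": subarrays_per_bank,
        # Inter-sub-array mapping distributes whole sub-arrays across dies.
        "subarrays_per_bank_per_layer": subarrays_per_bank // layers,
    }


if __name__ == "__main__":
    # Assumed: 1024x256 sub-arrays and 8 layers (not stated for the case study).
    print(partition(GBIT, banks=8, subarray_rows=1024, subarray_cols=256, layers=8))
```

Under these assumptions each bank holds 128Mb, or 512 sub-arrays, which an inter-sub-array mapping would spread as 64 whole sub-arrays per layer.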

Defect Tolerance

One more dimension for redundancy repair

[Figure: stacked sub-arrays and their redundancy]

Inter-die inter-sub-array redundancy repair

Inter-Die Inter-Sub-Array Redundancy Repair

1024x256 sub-array, defect density: 0.05%, repair-most algorithm
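The slide names the repair-most algorithm without showing it. A minimal sketch of the usual greedy form is given below: repeatedly spend a spare on the row or column that currently covers the most defects. The pooling of spares across dies in `repair_set` is a deliberately simple stand-in for inter-die repair and is an assumption, as are the function names and the tiny defect maps in the example; none of it reproduces the authors' experiment.

```python
from collections import Counter


def repair_most(defects, spare_rows, spare_cols):
    """Greedy repair-most on a set of (row, col) defects.

    Returns (rows_left, cols_left) if every defect is covered, else None.
    """
    remaining = set(defects)
    while remaining:
        row_counts = Counter(r for r, _ in remaining)
        col_counts = Counter(c for _, c in remaining)
        best_row, row_hits = row_counts.most_common(1)[0]
        best_col, col_hits = col_counts.most_common(1)[0]
        # Prefer the line covering more defects, subject to spare availability.
        use_row = spare_rows > 0 and (spare_cols == 0 or row_hits >= col_hits)
        if use_row:
            remaining = {(r, c) for r, c in remaining if r != best_row}
            spare_rows -= 1
        elif spare_cols > 0:
            remaining = {(r, c) for r, c in remaining if c != best_col}
            spare_cols -= 1
        else:
            return None  # out of spares with defects still uncovered
    return spare_rows, spare_cols


def repair_set(defect_maps, rows_each, cols_each, share_across_dies):
    """Repair one sub-array per die, optionally pooling spares across dies.

    Assumption: inter-die repair is modeled as a single shared spare pool.
    """
    if not share_across_dies:
        return all(repair_most(d, rows_each, cols_each) is not None
                   for d in defect_maps)
    rows = rows_each * len(defect_maps)
    cols = cols_each * len(defect_maps)
    for defects in defect_maps:
        left = repair_most(defects, rows, cols)
        if left is None:
            return False
        rows, cols = left
    return True


if __name__ == "__main__":
    # Hypothetical maps: die 0 has three isolated defects, die 1 has none.
    maps = [{(0, 0), (1, 1), (2, 2)}, set()]
    print(repair_set(maps, 1, 1, share_across_dies=False))
    print(repair_set(maps, 1, 1, share_across_dies=True))
```

In this toy case the die with three isolated defects cannot be repaired with its own single spare row and column, but it can once the unused spares of the clean die are available to it.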


Outline
• Motivation
• 3D DRAM Architecture Design
• 3D Processor-DRAM Integration
• Conclusions

Current Design Practice

[Figure: cores with L1 caches, a shared L2 cache (SRAM), and commodity DRAM reached over a DDRx channel]
• L2 capacity & L1↔L2 bandwidth
• L2 ↔ main memory bandwidth
• 3D integration: high-density DRAM, high-speed DRAM

Heterogeneous 3D DRAM

Stacked Commodity DRAM → Customized 3D DRAM
• Heterogeneous 3D-DRAM L2 cache + main memory structure
• Each core has its private 2D-SRAM L1 cache & 3D-DRAM L2 cache

DRAM density vs. speed trade-off

[Figure: density vs. speed for sub-arrays]

Integrate both high-threshold & low-threshold MOSFETs

Evaluation
• M5 full system simulator with Linux (U. of Mich.)
• Four 4.0GHz cores with 8-layer 3D-DRAM at 45nm node
  - 3D-DRAM L2 cache per core: 2MB
  - 3D-DRAM main memory: 1GB

[Figure: processor die with four cores, each with L1]

Configurations: baseline, without multi-Vt, with multi-Vt

Instructions Per Cycle (IPC) Gain over Baseline

One Step Further

Decentralized distributed main memory structure
• Fastlane between L2 cache and its closest main memory block

Reduced L2 cache miss penalty
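One way to see why a fastlane reduces the L2 miss penalty is to write the average penalty as a weighted mix of fast, nearby accesses and slower, remote ones. The sketch below does only that. The function name, the parameters, and the cycle counts in the example are hypothetical placeholders and are not measurements from this work.

```python
def average_miss_penalty(local_fraction, fastlane_cycles, remote_cycles):
    """Average L2 miss penalty when some misses use the fastlane.

    local_fraction: share of L2 misses served by the closest main memory
    block (hypothetical parameter); the rest go to a remote block.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return (local_fraction * fastlane_cycles
            + (1.0 - local_fraction) * remote_cycles)


if __name__ == "__main__":
    # Placeholder latencies, chosen only to show the trend.
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, average_miss_penalty(fraction, 100, 200))
```

The more misses the closest block can serve, the closer the average penalty moves toward the fastlane latency.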


Conclusions

3D multi-core processor DRAM integration
• 3D DRAM Design
  - Simple but effective inter-sub-array 3D mapping strategy
  - Simple but effective 3D redundancy repair
  - Good memory performance gain
• Integration of processor and 3D DRAM
  - Heterogeneous 3D DRAM architecture
  - Great computing system performance gain
