Photonic On-Chip Networks for Performance-Energy Optimized Off-Chip Memory Access
Gilbert Hendry, Johnnie Chan, Daniel Brunina, Luca Carloni, Keren Bergman
Lightwave Research Laboratory, Columbia University, New York, NY
Motivation
• The memory gap warrants a paradigm shift in how we move information to and from storage and computing elements
[www.OpenSparc.net] Lightwave Research Lab, Columbia University
[Exascale Report, 2008]
10/1/2009
Main Premise
• Current memory subsystem technology and packaging are not well-suited to future trends
  • Networks on chip
  • Growing cache sizes
  • Growing bandwidth requirements
  • Growing pin counts
SDRAM context
• DIMMs controlled fully in parallel, sharing access on data and address busses
• Many wires/pins
• Matched signal paths (for delay)
• DIMMs made for short, random accesses
[Figure: a chip connected through a memory controller to three DIMMs, with the note "Lately, this is on chip". Source: Intel]
Future SDRAM context
• Example: Tilera TILE64
SDRAM DIMM Anatomy
[Figure: hierarchy of an SDRAM device. DRAM_DIMM: ranks of DRAM chips with shared addr/cntrl and data lines. DRAM_Chip: banks (usually 8) with IO and control. DRAM_Bank: DRAM cell arrays with row decoder (row addr/en), column decoder (col addr/en), sense amps, and data lines.]
Memory Access in an Electronic NoC
• Messages are packetized; size of packet determined by router buffers
• Burst length dictated by packet size
[Figure: a message travelling through NoC routers to a memory controller at the chip boundary]
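The link between message size, packet size, and DRAM burst length can be sketched with a little arithmetic. The sketch below assumes the 256 b packet size from the electronic NoC setup in this deck and a 64-bit DIMM data bus; the function name and defaults are illustrative only.

```python
import math

def packetize(msg_bits, packet_bits=256, bus_bits=64):
    """Return (packets, beats_per_packet) for one memory message.

    packet_bits: NoC packet size (assumed 256 b).
    bus_bits: DIMM data bus width (assumed 64 b).
    """
    packets = math.ceil(msg_bits / packet_bits)
    # Each packet becomes its own DRAM access, so a burst can be no
    # longer than the number of bus transfers that one packet fills.
    beats_per_packet = packet_bits // bus_bits
    return packets, beats_per_packet

packets, beats = packetize(4096)
print(packets, beats)
```

Under these assumptions a 4096 b message is served as 16 separate 4-beat accesses rather than one long burst.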
Memory Control
• Complex DRAM control
  • Scheduling accesses around:
    • Open/closed rows
    • Precharging
    • Refreshing
    • Data/Control bus usage
[DRAMsim, UMD]
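Why open and closed rows matter to the scheduler can be seen from the read latency in each row-buffer state. The sketch below uses the DDR3 10-10-10 timing quoted later in this deck, read here as tCL-tRCD-tRP in clock cycles. The function and state names are illustrative, and refresh and bus contention are ignored.

```python
def access_cycles(row_state, tCL=10, tRCD=10, tRP=10):
    """Cycles from command issue to first data for one read."""
    if row_state == "hit":        # target row is already open
        return tCL
    if row_state == "closed":     # bank is precharged, row must be activated
        return tRCD + tCL
    if row_state == "conflict":   # another row is open: precharge, then activate
        return tRP + tRCD + tCL
    raise ValueError(f"unknown row state: {row_state}")

for state in ("hit", "closed", "conflict"):
    print(state, access_cycles(state))
```

With these timings a row conflict costs three times as many cycles as a row hit, which is why a controller reorders accesses around open rows.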
Experimental Setup – Electronic NoC
System:
• 2cm×2cm chip
• 8×8 Electronic Mesh
  • 28 DRAM access points (MCs)
  • 2 DIMMs per DRAM AP
• Routers (5-port electronic router):
  • 1 kb input buffers (per VC)
  • 4 virtual channels
  • 256 b packet size
  • 128 b channels
• 32 nm tech. point (ORION)
  • Normal Vt
  • Vdd = 1.0 V
  • Freq = 2.5 GHz
Traffic:
• Random core-DRAM access point pairs
• Random read/write
• Uniform message sizes
• Poisson arrival at 1µs
DRAM:
• Modeled cycle-accurately with DRAMsim [Univ. MD]
• DDR3 (10-10-10) @ 1333 MT/s
• 8 chips per DIMM, 8 banks per chip, 2 ranks
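The traffic description above can be sketched as a small generator. It assumes that "Poisson arrival at 1µs" means exponentially distributed inter-arrival times with a 1 µs mean, that the rate applies to a single global request stream (the slide does not say whether it is per core), and that the 64 cores come from the 8×8 mesh. The function name, tuple layout, and seed are hypothetical, not the authors' simulator.

```python
import random

def generate_traffic(n, msg_bits, cores=64, access_points=28,
                     mean_gap_us=1.0, seed=0):
    """Yield n requests as (time_us, core, access_point, op, msg_bits)."""
    rng = random.Random(seed)
    t = 0.0
    for _ in range(n):
        # Poisson process: exponential gaps with the given mean.
        t += rng.expovariate(1.0 / mean_gap_us)
        core = rng.randrange(cores)          # random source core
        ap = rng.randrange(access_points)    # random DRAM access point
        op = rng.choice(("read", "write"))   # random read/write
        yield (t, core, ap, op, msg_bits)    # uniform message size

requests = list(generate_traffic(1000, msg_bits=4096))
```

Sweeping `msg_bits` over a range of sizes would reproduce the x-axis used in the result plots.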
Experiment Results
[Figure: two plots versus Msg Size (b), from 1000 to 10000. One shows DRAM Bandwidth (Gb/s), annotated with 269 Gb/s. The other shows Latency (µs) on a log scale, with series Avg Read Latency and Zero Load Latency.]
Current
Goal: Optically Integrated Memory
[Figure: memory module with labels Optical Fiber, Optical Transceiver, and Vdd, Gnd]
Advantages of Photonics
• Decoupled energy-distance relationship
• No long traces to drive and synch with clock
  • DRAM chips can run faster
  • Less power
• Fewer pins on DIMM module and going into chip
  • Eventually required by packaging constraints
  • Waveguides can achieve dramatically higher density due to WDM
• DRAM can be arbitrarily distant – fiber is low loss
Hybrid Circuit-Switched Photonic Network
[Figure: broadband 1×2 switch [Cornell, 2008] and broadband 2×2 switch [Shacham, NOCS ’07], with a plot of transmission versus wavelength (λ)]
Hybrid Circuit-Switched Photonic Network
Hybrid Circuit-Switched Photonic Network
International Symposium on Networks-on-Chip
Hybrid Circuit-Switched Photonic Network
[Bergman, HPEC ’07]
Photonic DRAM Access
[Figure: DIMMs attached by fiber / PCB waveguide to a memory gateway (photonic + electronic) at the chip boundary; a processor gateway (electronic) links the processor / cache to the network. The memory gateway contains a photonic switch, modulators, memory control, and a network interface to/from the network.]
• Modulators needed to send commands to DRAM
• Memory control generates memory control commands
Memory Transaction
[Figure: DIMMs, memory gateway, processor gateway, and processor / cache at the chip boundary, with steps 1 to 3 marked]
1) Read or write request is initiated from local or remote processor, travels on electronic network
2) Processor gateway forwards it to memory gateway
3) Memory gateway receives request
Memory READ Transaction
4) MC receives READ command
5) Switch is set up from modulators to DIMM, and from DIMM to network
6) Path setup travels back to receiving processor. Path ACK returns when path is set up
7) Row/Col addresses sent to DIMM optically
8) Read data returned optically
9) Path torn down; MC knows how long it will take
[Figure: memory gateway with modulators, photonic switch, and control, with steps 4 to 8 marked]
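Step 9 relies on a read having a predictable duration. A rough sketch of how a memory controller could estimate it, assuming DDR3-1600 with 10-10-10 timing and a 64-bit data path (figures quoted elsewhere in this deck). Optical propagation and switch setup delays are ignored because the slides do not quantify them, and the function name is hypothetical.

```python
def read_hold_time_ns(msg_bits, mt_per_s=1600e6, data_bits=64,
                      tRCD=10, tCL=10):
    """Estimate how long the optical path is held for one read."""
    tck_ns = 2e9 / mt_per_s                 # DDR: two transfers per clock
    access_ns = (tRCD + tCL) * tck_ns       # activate, then column read
    transfers = -(-msg_bits // data_bits)   # ceiling division
    data_ns = transfers * 1e9 / mt_per_s    # time to stream the data
    return access_ns + data_ns

print(read_hold_time_ns(4096))
```

For a 4096 b read this gives 25 ns of access latency plus 40 ns of data transfer, so the teardown can be scheduled without waiting for a completion message.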
Memory WRITE Transaction
4) MC receives WRITE command, which is also a path setup from the processor to memory gateway
5) Switch is set up from modulators to DIMM
6) Row/Col addresses sent to DIMM
7) Switch is set up from network to DIMM
8) Path ACK sent back to processor
9) Data transmitted optically to DIMM
10) Path torn down from processor after data transmitted
[Figure: memory gateway with modulators, photonic switch, and control, with steps 4 to 9 marked]
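The READ and WRITE transactions share their first three steps and then diverge in when the switch is configured and which side tears the path down. A minimal sketch that encodes the listed steps as ordered lists, so the two sequences can be compared or replayed in a simulator. The identifiers are shorthand invented here, not names from the authors' tools.

```python
COMMON = [
    "request_on_electronic_network",   # step 1
    "processor_gateway_forwards",      # step 2
    "memory_gateway_receives",         # step 3
]

READ = COMMON + [
    "mc_receives_read",                                # step 4
    "switch_modulators_to_dimm_and_dimm_to_network",   # step 5
    "path_setup_to_processor_and_ack",                 # step 6
    "send_row_col_addresses",                          # step 7
    "read_data_returned",                              # step 8
    "path_torn_down",                                  # step 9
]

WRITE = COMMON + [
    "mc_receives_write_as_path_setup",   # step 4
    "switch_modulators_to_dimm",         # step 5
    "send_row_col_addresses",            # step 6
    "switch_network_to_dimm",            # step 7
    "path_ack_to_processor",             # step 8
    "write_data_transmitted",            # step 9
    "path_torn_down_from_processor",     # step 10
]

def step_number(transaction, name):
    """Return the 1-based slide step number of a named step."""
    return transaction.index(name) + 1

print(step_number(READ, "send_row_col_addresses"),
      step_number(WRITE, "send_row_col_addresses"))
```

In a WRITE the addresses go out one step earlier than in a READ, because the WRITE command itself already acts as the path setup.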
Optical Circuit Memory (OCM) Anatomy
[Figure: DRAM_OpticalTransceiver with a detector bank (Nλ), a modulator bank (Nλ) with drivers, DLL and clk, latches, control, and a mux feeding Addr/cntrl (25) and Data (64) lines; fiber coupling or waveguide coupling; VDD, Gnd]
Packet format fields: burst length, bank ID, col address, row address, data (timing diagram marks tRCD and tCL)
Advantages of Photonics
• Decoupled energy-distance relationship
• No long traces to drive and synch with clock
  • DRAM chips can run faster
  • Less power
• Fewer pins on DIMM module and going into chip
  • Eventually required by packaging constraints
  • Waveguides can achieve dramatically higher density due to WDM
• DRAM can be arbitrarily distant – fiber is low loss
• Simplified memory control logic – no contending accesses, contention handled by path setup
  • Accesses are optimized for large streams of data
Experimental Setup - Photonic
System:
• 2cm×2cm chip
• 8×8 Photonic Torus (figure: photonic torus tile)
  • 28 DRAM access points (MCs)
  • 2 DIMMs per DRAM AP
• Routers:
  • 256 b buffers
  • 32 b packet size
  • 32 b channels
• 32 nm tech. point (ORION)
  • High Vt
  • Vdd = 0.8 V
  • Freq = 1 GHz
• Photonics - 13λ
Traffic:
• Random core-DRAM access point pairs
• Random read/write
• Uniform message sizes
• Poisson arrival at 1µs
DRAM:
• Modeled with our event-driven DRAM model
• DDR3 (10-10-10) @ 1600 MT/s
• 8 chips per DIMM, 8 banks per chip
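For context when comparing the two setups, the peak data rate of one DIMM follows directly from its transfer rate. The sketch assumes a 64-bit data bus, matching the Data (64) label on the OCM slide. It gives an upper bound per DIMM, not a result from the experiments.

```python
def peak_gbps(mt_per_s, data_bits=64):
    """Peak DIMM data rate in Gb/s: transfers per second times bus width."""
    return mt_per_s * data_bits / 1e9

electronic = peak_gbps(1333e6)   # DDR3-1333, electronic NoC setup
photonic = peak_gbps(1600e6)     # DDR3-1600, photonic setup
print(f"{electronic:.1f} Gb/s vs {photonic:.1f} Gb/s")
```

The two setups use different DRAM speed grades, so part of any measured bandwidth difference comes from the memory itself and not only from the network.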
Performance Comparison
[Figure: three plots comparing Electronic Mesh and Photonic Torus versus Msg Size (b): DRAM Bandwidth (Gb/s), Avg Read Latency (µs), and Zero Load Latency (µs)]
Experiment #2
[Figure: two address-mapping schemes compared: Random and Statically Mapped Address Space]
Results
[Figure: two plots versus Msg Size (b), from 1000 to 10000: DRAM Bandwidth (Gb/s) and Avg Read Latency (µs, log scale), each with four series: EM - random, EM - mapped, PT - random, PT - mapped]
Network Energy Comparison
[Figure: pie charts of network power breakdown for the Electronic Mesh and the Photonic Torus. Electronic categories: arbiter, clock tree, crossbar, inport, wire, IO wire. Photonic categories: detector, modulator, PSE1x2, PSE2x2, thermal tuning. Power labels: 0.42 W and 13.3 W; Total Power = 2.53 W (including laser power)]
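A quick ratio makes the comparison concrete. This reads the extracted labels as 13.3 W for the electronic mesh and 2.53 W total (laser included) for the photonic torus. That attribution is an interpretation, since the labels are detached from their charts in this copy.

```python
electronic_mesh_w = 13.3   # assumed: electronic mesh network power
photonic_total_w = 2.53    # assumed: photonic torus total, including laser

ratio = electronic_mesh_w / photonic_total_w
print(f"electronic / photonic = {ratio:.1f}x")
```

Under that reading the photonic network draws roughly a fifth of the electronic mesh's power.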
Summary
• Extending a photonic network to include access to DRAM looks good for many reasons:
  • Circuit-switching allows large burst lengths and simplified memory control, for increased bandwidth
  • Energy-efficient end-to-end transmission
  • Alleviates pin count constraints with high-density waveguides
PhotoMAN