NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level
Myoungsoo Jung (MJ), Ellis H. Wilson III, David Donofrio, John Shalf, Mahmut T. Kandemir
Agenda
• Revisiting NAND flash technology
• Advanced NAND flash operations
• NANDFlashSim
• Evaluation
Intrinsic Latency Variation
• Fowler-Nordheim Tunneling
– Making an electron channel
– Voltage is applied over a certain threshold
[Figure: floating-gate cell showing gate, floating gate, source, channel, and drain; distribution of cells]
• Incremental step pulse programming (ISPP)
Intrinsic Latency Variation
• Each step of ISPP needs a different programming duration (latency)
• Latencies of the NAND flash memory fluctuate depending on the address of the pages in a block
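The page-dependent latency described above can be illustrated with a minimal sketch of an ISPP loop. All constants (step size, pulse and verify times, target voltages) and the function name `ispp_program_us` are hypothetical placeholders chosen for illustration, not measured values or part of NANDFlashSim.

```python
# Hypothetical ISPP model: pulse, verify, repeat until the target is reached.
STEP_MV = 200      # assumed threshold-voltage gain per program pulse
T_PULSE_US = 20    # assumed duration of one program pulse
T_VERIFY_US = 5    # assumed duration of one verify step


def ispp_program_us(target_mv, start_mv=0):
    """Return the program latency for a cell that must reach target_mv."""
    vth, pulses = start_mv, 0
    while vth < target_mv:   # verify: has the cell reached its target yet?
        vth += STEP_MV       # one pulse raises the threshold voltage
        pulses += 1
    return pulses * (T_PULSE_US + T_VERIFY_US)


# Two pages of the same block with different (assumed) target voltages.
low_target_latency = ispp_program_us(1000)
high_target_latency = ispp_program_us(3100)
print(low_target_latency, high_target_latency)
```

Under these assumptions, a page whose cells need a higher target voltage takes more pulse-and-verify rounds, so a single constant latency cannot describe both pages.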
NAND Flash Architecture
• Employing cache and data registers
• Multiple planes (memory arrays)
• Multiple dies
[Figure: memory array (plane)]
Density Trend
• Flash technology
– Each cell is capable of storing multiple bits
– Manufacturing feature size is scaling down
• So far, density has been increasing by two to four times every 2 years
[Figure: flash chip density (Gb) by year, 2002 to 2012]
Density Trend
• Shrinking of the manufacturing feature size might be limited at around 12 nanometers
• Multi-die stack technology
– Flash packages continue to scale up by employing multiple dies and planes
[Figure: density (Gb) by year, 2002 to 2014, for flash chips and MCP; labeled points: 1Gb, 32Gb, 64Gb; example: MOSAID HLNAND (32Gb 16-die NAND stack)]
How does performance behavior change?
Advanced NAND Flash Operations
Legacy Operation
• An I/O operation splits into several operation stages
• Each stage should be appropriately handled by device drivers
[Figure: legacy write sequence; labels: cmd, addr, data, cmd, status check]
Cache Operation
• Cache mode operations use internal registers in an attempt to hide the performance overhead of data movement
[Figure: cache mode sequence; labels: data1, data2]
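A rough way to see what cache mode hides is to treat a page write as two stages, a bus transfer followed by a cell program, and let the transfer of the next page overlap the program of the current one. The sketch below assumes exactly that two-stage pipeline; the timing constants and function names are hypothetical.

```python
# Hypothetical timings for one page (microseconds).
T_XFER_US = 50    # assumed time to move one page over the NAND bus
T_PROG_US = 200   # assumed time to program one page into the cells


def legacy_write_us(pages):
    """Legacy mode: every page is transferred, then programmed, serially."""
    return pages * (T_XFER_US + T_PROG_US)


def cache_write_us(pages):
    """Cache mode (simplified): transfer of page i+1 overlaps program of page i."""
    if pages == 0:
        return 0
    bottleneck = max(T_XFER_US, T_PROG_US)
    return T_XFER_US + (pages - 1) * bottleneck + T_PROG_US


print(legacy_write_us(4), cache_write_us(4))
```

With these placeholder numbers only the first transfer is exposed; every later transfer is hidden behind a program operation.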
Internal Data Move Mode
• Saves space and cycles when copying data
• Source and destination page addresses should be located in the same die
[Figure: internal data move sequence; labels: cmd, src addr, dst addr, cmd, data; no data movement through the NAND interface]
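The saving from an internal data move can be sketched by counting which stages a page copy needs. The model below assumes a copy through the host costs a cell read, two bus transfers, and a cell program, while an internal move skips both transfers; the timings and function names are hypothetical.

```python
# Hypothetical per-page timings (microseconds).
T_READ_US = 25    # assumed cell-to-register read time
T_XFER_US = 50    # assumed bus transfer time for one page
T_PROG_US = 200   # assumed register-to-cell program time


def legacy_copy_us():
    """Copy via the host: read, transfer out, transfer in, program."""
    return T_READ_US + T_XFER_US + T_XFER_US + T_PROG_US


def internal_copy_us(src_die, dst_die):
    """Internal data move: no bus transfers, but both pages must share a die."""
    if src_die != dst_die:
        raise ValueError("source and destination must be in the same die")
    return T_READ_US + T_PROG_US


print(legacy_copy_us(), internal_copy_us(0, 0))
```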
Multi-plane Mode Operation
• Two different pages can be served in parallel
• Addresses should indicate the same page offset in a block and the same die address, and should have different plane addresses (plane addressing rule)
[Figure: multi-plane sequence; labels: cmd, addr1, addr2, cmd, data1, data2, cmd, status check]
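The plane addressing rule stated above can be written as a small predicate. The `PageAddress` type and the function name are illustrative assumptions, not NANDFlashSim's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAddress:
    """Hypothetical physical address used only for this illustration."""
    die: int
    plane: int
    block: int
    page: int  # page offset inside the block


def satisfies_plane_rule(a, b):
    """Same die, same page offset, different planes."""
    return a.die == b.die and a.page == b.page and a.plane != b.plane


ok = satisfies_plane_rule(PageAddress(0, 0, 10, 3), PageAddress(0, 1, 11, 3))
same_plane = satisfies_plane_rule(PageAddress(0, 0, 10, 3), PageAddress(0, 0, 11, 3))
print(ok, same_plane)
```

A scheduler could use such a check to decide whether two pending requests may be merged into one multi-plane command or must be issued separately.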
Interleaved-Die Mode Operation
• Provides a way of taking advantage of internal parallelism by interleaving NAND transactions
• Scheduling of NAND transactions and bus arbitration are critical determinants of memory system performance
[Figure: two interleaved sequences; labels: cmd, addr1, data1, cmd and cmd, addr2, data2, cmd]
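Why scheduling and bus arbitration matter can be seen in a minimal sketch with one shared bus and independent dies. It assumes in-order issue, a fixed transfer time on the bus, and a fixed program time inside each die; all names and numbers are hypothetical.

```python
# Hypothetical timings (microseconds).
T_XFER_US = 50    # assumed bus occupancy per write
T_PROG_US = 200   # assumed cell program time per write


def makespan_us(die_sequence):
    """Finish time of in-order writes sent to the given dies over one bus."""
    bus_free = 0
    die_free = {}
    for die in die_sequence:
        # A write starts once both the shared bus and its target die are idle.
        start = max(bus_free, die_free.get(die, 0))
        bus_free = start + T_XFER_US            # bus is released after the transfer
        die_free[die] = bus_free + T_PROG_US    # die stays busy while programming
    return max(die_free.values(), default=0)


single_die = makespan_us([0, 0, 0, 0])
two_dies = makespan_us([0, 1, 0, 1])
print(single_die, two_dies)
```

In this toy model the same four writes finish much earlier when they alternate between two dies, because one die programs while the bus serves the other.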
Challenges
• Performance varies based on:
– intrinsic latency variation characteristics
– internal parallelism
– advanced flash operation types
• Performance is affected by:
– how diverse advanced flash operations are dealt with
– how effectively NAND transactions are scheduled
Prior Simulation Works
• Flash-based Solid State Disk Simulation
– Tightly coupled to specific flash firmware
• Unaware of latency variation of NAND flash
– Latency approximation model with constants
• Coarse-grain NAND command handling
– In-order execution
[Figure: I/O subsystem, SSD simulator, flash firmware, latency approximation]
NANDFlashSim
• Simulating and modeling NAND flash rather than flash firmware or SSDs
– NANDFlashSim can be applied to diverse applications such as off-chip caches of a multi-core system and I/O subsystems of mobile systems
– Multiple instances can be used for building SATA- or PCI-e-based SSDs
[Figure: I/O subsystem, SSD simulator, flash firmware, latency approximation]
NANDFlashSim
• Detailed timing model
• Awareness of intrinsic latency variation
– Designed to be performance variation-aware and employs different page offsets in a physical block
• Reconfigurable microarchitecture
– Supports highly reconfigurable architectures in terms of multiple dies and planes
• Fine-grain NAND flash command handling
– 16 combinations of advanced flash operations
– Supports out-of-order execution
High-level View
[Figure: high-level view; labels: logical unit, k*j blocks, NAND flash I/O bus]
• Command set architecture and an individual state machine associated with it
• Host and NAND flash clock domains are separate
• All entries (controller, register, die, …) are updated at every cycle
Command Set Architecture
• Multi-stage Operation
– Stages are defined by common operations
– CLE, ALE, TIR, TIN, TOR, TON, etc.
• Command Chains
– Define command sequences
[Figure: labels: Req., latency variation generators]
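One way to picture a command chain driven by a per-cycle state machine is sketched below. The stage names come from the slide, but the composition of `WRITE_CHAIN`, the cycle counts, and the `StageMachine` class are hypothetical and do not reproduce NANDFlashSim's actual data structures.

```python
from collections import deque

# Hypothetical write chain: (stage name, cycles). Every stage needs >= 1 cycle.
WRITE_CHAIN = [("CLE", 1), ("ALE", 5), ("TIR", 20), ("CLE", 1), ("TIN", 200)]


class StageMachine:
    """Walks a command chain one cycle at a time."""

    def __init__(self, chain):
        self.pending = deque(chain)
        self.stage = None
        self.remaining = 0
        self.trace = []   # order in which stages were entered

    def tick(self):
        """Advance one cycle; return False once the chain has finished."""
        if self.remaining == 0:
            if not self.pending:
                return False
            self.stage, self.remaining = self.pending.popleft()
            self.trace.append(self.stage)
        self.remaining -= 1
        return True


def run(chain):
    machine = StageMachine(chain)
    cycles = 0
    while machine.tick():
        cycles += 1
    return cycles, machine.trace


total_cycles, stage_trace = run(WRITE_CHAIN)
print(total_cycles, stage_trace)
```

Because the machine only moves on `tick()`, a simulator loop can update the controller, registers, and dies once per cycle, and a latency generator could substitute a page-dependent cycle count for the constant used here.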
Evaluation Methodology
SAMSUNG
MICRON
SK HYNIX
Validation (Throughput)
[Figure: bandwidth (MB/s) for Legacy, Cache, Two-plane (2x), and 2x Cache operations; panels: Write Performance (Single Die), Read Performance (Single Die), Write Performance (Dual Die); series: Worst-case based, Typical-case based, Variation-aware based, MSIS]
• Deviation annotations in the figure: 12%, 1.6%, 3.8%, 2.1%, 4.1%~11.2%
• Further annotations: 83.1%~170.9% (typical-case), 7.3%~42% (worst-case), 5.3%~9.4% (NANDFlashSim)
• Improvement from deviation: 48.7%~79.6% in the typical-case model, 44.4%~53.5% in the worst-case model
Validation (Latency)
• Latencies of NANDFlashSim almost completely overlap with real product latencies (Hynix)
[Figure: latency comparison; panels: Write-intensive Workload (MSN Server, Financial OLTP), Read-intensive Workload (Websearch, User)]
Performance of Multiple Planes
• Write performance is significantly enhanced as the number of planes increases
– Cell activities (TIN) can be executed in parallel
• Data movement (TOR) is a dominant factor in determining bandwidth
• Larger transfer sizes could not take advantage of multiple planes well
[Figure: bandwidth (MB/s) for Legacy, Cache Mode, Two-plane, Four-plane, Six-plane, and Eight-plane operations with 2048-, 4096-, and 8192-byte transfers; panels: Write Performance (Single Die), Read Performance (Single Die)]
Performance of Multiple Dies
• Similar to multi-plane, write performance is improved by increasing the number of dies
• The multiple-die architecture performs slightly worse than multi-plane
[Figure: bandwidth (MB/s) by package type (1 (SDP), 2 (DDP), 4 (QDP), 8 (ODP)); series: Legacy, Cache Mode, Two-plane, Two-plane Cache Mode]
Multi-plane vs. Multi-die
• Under disk-friendly workloads
– The performance of interleaved-die operation is 54.5% better than multi-plane operation on average
– Interleaved-die operations have fewer restrictions on addressing
[Figure: IOPS for the msnfs, web, fin, usr, and prn workloads; panels: by the number of planes (1, 2, 4, 8, 16), by the number of dies (1, 2, 4, 8, 16)]
Breakdown of Cycles
• For writes, most cycles are spent in the NAND flash itself, whereas reads spend at least 50.5% of the total time on bus activities
[Figure: cycles (ns) per operation mode (Legacy, Cache Mode, Two-plane (2x), 2x Cache), broken down into memory cell activities and bus activities (CLE, ALE, TIR, TIN, TOR, TON); annotated shares: 66.9%, 62.8%, 50.5% in one panel and 7.0%, 6.8%, 3.6%, 3.5% in the other]
Page Migration Test
• Energy consumption depends on the number of active components
[Figure: total cycles (ns) and energy (uJ) by block migration size (2, 4, 8, 16, 32 blocks); series: Legacy, Cache Mode, Internal, 2x, 2x cache mode, 2x internal]
[Figure: cycles (ns) broken down into ALE, CLE, TIR, TOR, TIN, TON, and BER for Legacy, Cache Mode, Internal Data Move, Two-plane (2x), 2x cache, and 2x internal; annotation: saving]
Conclusion & Future Works
• A research vehicle for evaluating parallelism and architecture trends
– Single instance
  • Integrating it into GEM5 and Simics
  • Plan to apply it with Green Flash and Xtensa of CoDEx
– Multiple instances
  • We successfully built a multi-channel SSD framework with 1024 instances (~16384 dies, ~131072 planes)
• Open Source Project
– Static/shared library
– Standalone simulation
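As a quick check of the scale quoted for the multi-instance framework, the per-instance configuration can be derived from the totals. This assumes the approximate figures are exact and evenly distributed, which the slide does not state.

```python
# Totals quoted on the slide (marked as approximate there).
instances = 1024
dies = 16384
planes = 131072

# Assumes an even distribution across instances and dies.
dies_per_instance = dies // instances
planes_per_die = planes // dies
planes_per_instance = dies_per_instance * planes_per_die

print(dies_per_instance, planes_per_die, planes_per_instance)
```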
Q&A
• Download
– http://www.cse.psu.edu/~mqj5086/nfs/
• Mailing list
– [email protected]
• Thanks to
– Dean Klein, Micron Technology, Inc.
– Seung-hwan Song, University of Minnesota
– Michael Kim, Corelinks
– Kurt Lee, Corelinks
– Leonard Ko, Corelinks
– Yulwon Cho, Stanford University