NANDFlashSim : Intrinsic Latency Variation Aware NAND Flash Memory System Modeling

Author: Jerome Cox
NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level

Myoungsoo Jung (MJ), Ellis H. Wilson III, David Donofrio, John Shalf, Mahmut T. Kandemir

Agenda
• Revisiting NAND flash technology
• Advanced NAND flash operations
• NANDFlashSim
• Evaluation

Intrinsic Latency Variation
• Fowler-Nordheim Tunneling
– Making an electron channel
– Voltage is applied over a certain threshold
• Incremental step pulse programming (ISPP)

[Figure: floating-gate cell (gate, floating gate, source, channel, drain) and distribution of cells]

Intrinsic Latency Variation
• Each step of ISPP needs a different programming duration (latency)
• Latencies of the NAND flash memory fluctuate depending on the address of the pages in a block
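To make the ISPP point concrete, here is a minimal sketch in Python. It assumes a toy cell model in which every pulse raises the threshold voltage by a fixed step and is followed by a verify. The function name, the step size, and the pulse and verify times are illustrative assumptions, not values from NANDFlashSim or from any datasheet.

```python
def ispp_program_latency_us(target_v, v_start=0.0, v_step=0.5,
                            t_pulse_us=20.0, t_verify_us=10.0):
    """Return (latency_us, pulses) for one simplified ISPP program operation.

    Assumed toy model: each pulse adds v_step to the cell threshold voltage,
    and a verify follows every pulse until the target level is reached.
    """
    v = v_start
    pulses = 0
    while v < target_v:
        v += v_step          # one program pulse
        pulses += 1          # each pulse is followed by one verify
    return pulses * (t_pulse_us + t_verify_us), pulses


if __name__ == "__main__":
    # Higher target levels need more pulses, so the latency is not a constant.
    for target in (1.0, 2.0, 3.0):
        print(target, ispp_program_latency_us(target))
```

In this sketch the latency grows with the number of pulses needed, which is one way a page's program time could end up depending on what is stored and where.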

NAND Flash Architecture
• Employing cache and data registers
• Multiple planes (memory array)
• Multiple dies

[Figure: memory array (plane)]

Density Trend
• Flash technology
– Each cell is capable of storing multiple bits
– Manufacturing feature size is scaling down
• So far, density has been increasing by two to four times every 2 years

[Figure: flash chip density (Gb, 0 to 70) by year, 2002 to 2012]

Density Trend
• Shrinking of the manufacturing feature size might be limited at around 12 nanometers
• Multi-die stack technology
– Flash packages continue to scale up by employing multiple dies and planes

[Figure: density (Gb) by year, 2002 to 2014, for a single flash chip (0 to 70 Gb) and for MCP (0 to 2000 Gb); labels: 1Gb, 32Gb, 64Gb, MOSAID HLNAND (32Gb 16-DIE NAND STACK)]

How does performance behavior change?

Advanced NAND Flash Operations

Legacy Operation
• An I/O operation splits into several operation stages
• Each stage should be appropriately handled by device drivers

[Figure: legacy write timing diagram: cmd, addr, data, cmd, then status check]

Cache Operation
• Cache mode operations use internal registers in an attempt to hide performance overhead from data movements

[Figure: cache mode timing diagram with data1 and data2]
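The overlap that cache mode aims for can be sketched with a simple timing model. This is a minimal illustration in Python, assuming that the data-in transfer of the next page fully overlaps with the programming of the current page. The function names and the timing values are hypothetical and are not taken from NANDFlashSim.

```python
def legacy_write_time_us(n_pages, t_in_us, t_prog_us):
    """Legacy mode: each page is transferred, then programmed, one after another."""
    return n_pages * (t_in_us + t_prog_us)


def cache_write_time_us(n_pages, t_in_us, t_prog_us):
    """Cache mode (assumed ideal overlap): while page i is being programmed,
    page i+1 is loaded into the cache register."""
    if n_pages == 0:
        return 0
    # First transfer, then (n-1) overlapped steps bounded by the slower
    # activity, then the final program with nothing left to overlap.
    return t_in_us + (n_pages - 1) * max(t_in_us, t_prog_us) + t_prog_us


if __name__ == "__main__":
    print(legacy_write_time_us(4, 50, 200))  # 1000
    print(cache_write_time_us(4, 50, 200))   # 850
```

With a single page there is nothing to overlap, so both functions return the same value; the saving only appears for sequences of pages.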

Internal Data Move Mode
• Saves space and cycles when copying data
• Source and destination page addresses should be located in the same die

[Figure: internal data move timing diagram: cmd, src addr, dst addr, cmd; no data movement through the NAND interface]
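The same-die constraint and the cycle saving can be sketched as follows. This is a minimal Python illustration, assuming a copy that cannot use internal data move has to read the page out over the NAND interface and write it back in. The `PageAddress` type, the function names, and the timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAddress:
    die: int
    plane: int
    block: int
    page: int


def can_internal_move(src, dst):
    """Constraint from the slide: source and destination must be in the same die."""
    return src.die == dst.die


def copy_time_us(src, dst, t_read_us=25.0, t_xfer_us=50.0, t_prog_us=200.0):
    """Assumed cost model: an internal move skips both bus transfers."""
    if can_internal_move(src, dst):
        return t_read_us + t_prog_us
    # Fallback: read, transfer out, transfer back in, program.
    return t_read_us + 2 * t_xfer_us + t_prog_us


if __name__ == "__main__":
    a = PageAddress(die=0, plane=0, block=3, page=7)
    b = PageAddress(die=0, plane=0, block=9, page=0)
    c = PageAddress(die=1, plane=0, block=9, page=0)
    print(copy_time_us(a, b), copy_time_us(a, c))
```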

Multi-plane Mode Operation
• Two different pages can be served in parallel
• Addresses should indicate the same page offset in a block and the same die address, and should have different plane addresses (plane addressing rule)

[Figure: multi-plane timing diagram: cmd, addr1, addr2, cmd, data1, data2, cmd, then status check]
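The plane addressing rule translates directly into a predicate over two addresses. The sketch below is a minimal Python version of the rule as stated on the slide; the `PageAddress` type and the function name are hypothetical and not part of NANDFlashSim's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAddress:
    die: int
    plane: int
    block: int
    page: int  # page offset within the block


def satisfies_plane_addressing_rule(a, b):
    """Same page offset in a block, same die, different planes."""
    return a.page == b.page and a.die == b.die and a.plane != b.plane


if __name__ == "__main__":
    p0 = PageAddress(die=0, plane=0, block=10, page=5)
    p1 = PageAddress(die=0, plane=1, block=42, page=5)
    print(satisfies_plane_addressing_rule(p0, p1))
```

A scheduler that wants to issue a multi-plane command would have to find pairs of pending requests that pass this check, which is why the rule limits how often the mode can be used.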

Interleaved-Die Mode Operation
• Provides a way to take advantage of internal parallelism by interleaving NAND transactions
• Scheduling of NAND transactions and bus arbitration are critical determinants of memory system performance

[Figure: interleaved-die timing diagram: two overlapping sequences, cmd, addr1, data1, cmd and cmd, addr2, data2, cmd]
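Why scheduling and bus arbitration matter can be seen in a small model of a shared bus in front of several dies. The Python sketch below assumes in-order issue, one shared bus, and fixed transfer and program times; all names and numbers are illustrative assumptions rather than NANDFlashSim behavior.

```python
def interleaved_write_time(die_sequence, n_dies, t_xfer, t_prog):
    """Completion time for in-order writes issued over one shared bus.

    die_sequence lists the target die of each request. A transfer needs the
    bus to be free and the target die to be idle; programming then proceeds
    inside the die while the bus serves other dies.
    """
    bus_free = 0.0
    die_free = [0.0] * n_dies
    for die in die_sequence:
        start = max(bus_free, die_free[die])  # wait for bus and for the die
        bus_free = start + t_xfer             # bus is held during data transfer
        die_free[die] = bus_free + t_prog     # die is busy while programming
    return max(die_free)


if __name__ == "__main__":
    print(interleaved_write_time([0, 0, 0, 0], 2, 50, 200))  # one die only
    print(interleaved_write_time([0, 1, 0, 1], 2, 50, 200))  # alternating dies
```

Alternating between dies lets one die program while the other receives data, so the same four writes finish sooner than when they all target a single die.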

Challenges
• Performance varies based on:
– intrinsic latency variation characteristics
– internal parallelism
– advanced flash operation types
• Performance is affected by:
– how to deal with diverse advanced flash operations
– how to effectively schedule NAND transactions

Prior Simulation Works
• Flash-based Solid State Disk Simulation
– Tightly coupled to specific flash firmware
• Unaware of latency variation of NAND flash
– Latency approximation model with constants
• Coarse-grain NAND command handling
– In-order execution

[Figure: SSD simulator stack (I/O subsystem, flash firmware, latency approximation)]

NANDFlashSim
• Simulating and modeling NAND flash rather than flash firmware or SSDs
– NANDFlashSim can be applied to diverse applications, such as off-chip caches of a multi-core system and I/O subsystems of mobile systems
– Multiple instances can be used for building SATA and PCI-e based SSDs

[Figure: SSD simulator stack (I/O subsystem, flash firmware, latency approximation)]

NANDFlashSim
• Detailed Timing Model
• Awareness of intrinsic latency variation
– Designed to be performance variation-aware; employs different page offsets in a physical block
• Reconfigurable Microarchitecture
– Supports highly reconfigurable architectures in terms of multiple dies and planes
• Fine-grain NAND flash command handling
– 16 combinations of advanced flash operations
– Supports out-of-order execution

High-level View

[Figure: high-level view showing k*j blocks, the NAND flash I/O bus, and logical units]

• Command set architecture and an individual state machine associated with it
• Host and NAND flash clock domains are separate
• All entries (controller, register, die, …) are updated at every cycle
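A cycle-driven update loop with two separate clock domains could look roughly like the sketch below. It is a minimal Python illustration under stated assumptions: the `Component` class, the component names, and the clock periods are hypothetical, and a real component would advance its own state machine instead of only counting cycles.

```python
class Component:
    """Placeholder for a controller, register, or die that is updated per cycle."""

    def __init__(self, name):
        self.name = name
        self.cycles = 0

    def update(self):
        # A real component would advance its state machine here.
        self.cycles += 1


def run(duration_ns, host_period_ns=10, nand_period_ns=25):
    """Tick host-side and NAND-side components on their own clock edges."""
    host = [Component("host_controller")]
    nand = [Component("nand_controller"), Component("register"), Component("die")]
    next_host = 0
    next_nand = 0
    while min(next_host, next_nand) < duration_ns:
        now = min(next_host, next_nand)
        if next_host == now:            # host clock edge
            for c in host:
                c.update()
            next_host += host_period_ns
        if next_nand == now:            # NAND clock edge
            for c in nand:
                c.update()
            next_nand += nand_period_ns
    return {c.name: c.cycles for c in host + nand}


if __name__ == "__main__":
    print(run(100))
```

Keeping one "next edge" time per domain lets the two clocks run at unrelated frequencies while every component is still visited on each edge of its own clock.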

Command Set Architecture
• Multi-stage Operation
– Stages are defined by common operations
– CLE, ALE, TIR, TIN, TOR, TON, etc.
• Command Chains
– Define command sequences
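One way to picture stages and command chains is as an ordered list of stages that a small state machine walks through. The Python sketch below is only an illustration: the stage descriptions are an assumed reading of the abbreviations, and the chain contents and latencies are hypothetical values, not NANDFlashSim's actual definitions.

```python
from enum import Enum


class Stage(Enum):
    # Descriptions are assumptions about what each abbreviation stands for.
    CLE = "command latch"
    ALE = "address latch"
    TIR = "data in: bus to register"
    TIN = "data in: register to cell"
    TOR = "data out: register to bus"
    TON = "data out: cell to register"


# Hypothetical command chains: each operation is a sequence of stages.
COMMAND_CHAINS = {
    "legacy_write": [Stage.CLE, Stage.ALE, Stage.TIR, Stage.CLE, Stage.TIN],
    "legacy_read": [Stage.CLE, Stage.ALE, Stage.CLE, Stage.TON, Stage.TOR],
}

# Assumed per-stage latencies in nanoseconds, for illustration only.
STAGE_LATENCY_NS = {
    Stage.CLE: 25, Stage.ALE: 125, Stage.TIR: 51200,
    Stage.TIN: 200000, Stage.TOR: 51200, Stage.TON: 25000,
}


def run_chain(name):
    """Walk a command chain and return (stage name, elapsed ns) per stage."""
    trace = []
    elapsed = 0
    for stage in COMMAND_CHAINS[name]:
        elapsed += STAGE_LATENCY_NS[stage]
        trace.append((stage.name, elapsed))
    return trace


if __name__ == "__main__":
    for name in COMMAND_CHAINS:
        print(name, run_chain(name))
```

Expressing operations as data rather than code means a new operation type only needs a new chain entry, which fits the idea of supporting many combinations of advanced operations.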

Latency Variation Generators
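The difference between constant-latency models and a variation-aware generator can be sketched as three interchangeable functions of the page offset. This Python example is a minimal illustration under loud assumptions: the even/odd fast/slow page pattern and all latency values are hypothetical placeholders, since the real pattern depends on the device.

```python
# Assumed program latencies in microseconds (illustrative, not measured).
FAST_PAGE_US = 250
SLOW_PAGE_US = 2200
TYPICAL_US = 900


def worst_case_latency_us(page_offset):
    """Constant model that always charges the slowest latency."""
    return SLOW_PAGE_US


def typical_case_latency_us(page_offset):
    """Constant model that always charges a single typical latency."""
    return TYPICAL_US


def variation_aware_latency_us(page_offset):
    """Assumed pattern: even page offsets are fast, odd page offsets are slow."""
    return FAST_PAGE_US if page_offset % 2 == 0 else SLOW_PAGE_US


def block_program_time_us(latency_model, pages_per_block=128):
    """Total time to program every page of one block under a given model."""
    return sum(latency_model(p) for p in range(pages_per_block))


if __name__ == "__main__":
    for model in (worst_case_latency_us, typical_case_latency_us,
                  variation_aware_latency_us):
        print(model.__name__, block_program_time_us(model))
```

Under these assumed numbers the two constant models bracket the variation-aware total, one overestimating and one underestimating the time to program a block.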

Evaluation Methodology

SAMSUNG

MICRON

SK HYNIX

Validation (Throughput)

[Figure: bandwidth (MB/s) for Legacy, Cache, Two-plane (2x), and 2x Cache operations; legend: Worst-case based, Typical-case based, Variation-aware based, MSIS. Panels: Write Performance (Single Die), Read Performance (Single Die), Write Performance (Dual Die). Plot labels: 12%, 1.6%, 3.8%, 2.1%; 4.1%~11.2% deviation; 83.1~170.9% (Typical-case), 7.3%~42% (Worst-case), 5.3~9.4% (NANDFlashSim)]

Improvement from deviation: 48.7%~79.6% in typical-case model, 44.4~53.5% in worst-case model

Validation (Latency)

Write-intensive Workload (MSN Server, Financial OLTP)

Latencies of NANDFlashSim almost completely overlap with real product latencies (Hynix)

Read-intensive Workload (Websearch, User)

Performance of Multiple Planes
• Write performance is significantly enhanced as the number of planes increases
– Cell activities (TIN) can be executed in parallel
• Data movement (TOR) is a dominant factor in determining bandwidth
• Larger transfer sizes could not take advantage of multiple planes well

[Figure: bandwidth (MB/s) for 2048-, 4096-, and 8192-byte transfers across Legacy, Cache Mode, Two-plane, Four-plane, Six-plane, and Eight-plane operations. Panels: Write Performance (Single Die), Read Performance (Single Die)]
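The multi-plane observations above can be reproduced qualitatively with a back-of-the-envelope bandwidth model. The Python sketch below assumes that cell activity runs in parallel across planes while all data still crosses one bus serially. The function names and all timing values (bus speed, program time, read time) are illustrative assumptions, not figures from the evaluation.

```python
def write_bandwidth_mbps(page_bytes, n_planes, ns_per_byte=25.0, t_prog_us=800.0):
    """Write: serial data-in for every plane, then one parallel program."""
    xfer_us = n_planes * page_bytes * ns_per_byte / 1000.0
    return n_planes * page_bytes / (xfer_us + t_prog_us)  # bytes/us equals MB/s


def read_bandwidth_mbps(page_bytes, n_planes, ns_per_byte=25.0, t_read_us=25.0):
    """Read: one parallel cell read, then serial data-out for every plane."""
    xfer_us = n_planes * page_bytes * ns_per_byte / 1000.0
    return n_planes * page_bytes / (t_read_us + xfer_us)


if __name__ == "__main__":
    for planes in (1, 2, 4, 8):
        print(planes,
              round(write_bandwidth_mbps(2048, planes), 2),
              round(read_bandwidth_mbps(2048, planes), 2))
```

In this model writes gain a lot from extra planes because the long program time is shared, reads gain little because the bus transfer dominates, and larger pages see a smaller write speedup because their transfer time is already a bigger share of the total.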

Performance of Multiple Dies
• Similar to multi-plane, write performance is improved by increasing the number of dies
• A multiple-die architecture provides slightly worse performance than multi-plane

[Figure: two panels of bandwidth (MB/s) versus package type (1 (SDP), 2 (DDP), 4 (QDP), 8 (ODP)) for Legacy, Cache Mode, Two-plane, and Two-plane Cache Mode operations]

Multi-plane vs. Multi-die
• Under disk-friendly workload
– The performance of interleaved-die operation is 54.5% better than multi-plane operation on average
– Interleaved-die operations have fewer restrictions for addressing

[Figure: IOPS (0 to 35000) for the msnfs, web, fin, usr, and prn workloads versus the number of planes (1, 2, 4, 8, 16) and versus the number of dies (1, 2, 4, 8, 16)]

Breakdown of Cycles
• For writes, most cycles are spent in the NAND flash itself; reads spend at least 50.5% of the total time on bus activities

[Figure: two panels of cycles (ns) broken down by stage (CLE, ALE, TIR, TIN, TOR, TON), grouped into memory cell activities and bus activities, for Legacy, Cache Mode, Two-plane (2x), and 2x Cache operations; labeled shares: 66.9%, 62.8%, 50.5% in one panel and 7.0%, 6.8%, 3.6%, 3.5% in the other]

Page Migration Test

• Energy consumption depends on the number of activated components

[Figure: total cycles (ns) and energy (uJ) versus block migration size (2, 4, 8, 16, and 32 blocks) for Legacy, Cache mode, Internal, 2x, 2x cache mode, and 2x internal operations]

[Figure: cycles (ns) broken down by stage (ALE, CLE, TIR, TOR, TIN, TON, BER) for Legacy, Cache Mode, Internal Data Move, Two-plane (2x), 2x cache, and 2x internal operations, with the saving marked]

Conclusion & Future Works
• A research vehicle for evaluating parallelism and architecture trends
– Single instance
  • Integrating it into GEM5 and Simics
  • Plan to apply it with Green Flash and Xtensa of CoDEx
– Multiple instances
  • We successfully built a multi-channel SSD framework with 1024 instances (~16384 dies, ~131072 planes)
• Open Source Project
– Static/shared library
– Standalone simulation
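The scale quoted for the multi-channel SSD framework can be sanity-checked with a few lines of arithmetic. The Python snippet below treats the approximate totals from the slide as exact; the per-instance and per-die figures it derives are inferred, not stated in the source.

```python
# Totals as quoted on the slide (given there as approximate values).
instances = 1024
total_dies = 16384
total_planes = 131072

# Inferred configuration, assuming every instance is configured identically.
dies_per_instance = total_dies // instances
planes_per_die = total_planes // total_dies

print(dies_per_instance, planes_per_die)
```

If the inference holds, each instance would model a package with 16 dies of 8 planes each.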

Q&A
• Download
– http://www.cse.psu.edu/~mqj5086/nfs/
• Mailing list
– [email protected]
• Thanks to
– Dean Klein, Micron Technology, Inc.
– Seung-hwan Song, University of Minnesota
– Michael Kim, Corelinks
– Kurt Lee, Corelinks
– Leonard Ko, Corelinks
– Yulwon Cho, Stanford University
