NANDFlashSim : Intrinsic Latency Variation Aware NAND Flash Memory System Modeling


Author: Jerome Cox


NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level

Myoungsoo Jung (MJ), Ellis H. Wilson III, David Donofrio, John Shalf, Mahmut T. Kandemir

Agenda
• Revisiting NAND flash technology
• Advanced NAND flash operations
• NANDFlashSim
• Evaluation

Intrinsic Latency Variation
• Fowler-Nordheim tunneling
– Making an electron channel
– Voltage is applied over a certain threshold
• Incremental step pulse programming (ISPP)

[Figure: floating-gate cell (gate, floating gate, source, channel, drain) and distribution of cells]

Intrinsic Latency Variation
• Each step of ISPP needs a different programming duration (latency)
• Latencies of the NAND flash memory fluctuate depending on the address of the pages in a block
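To make the effect concrete, the following Python sketch compares a variation-aware latency model with constant worst-case and typical-case models when programming every page of one block. All timing values, the block size, and the even/odd split between fast and slow pages are illustrative assumptions, not figures from these slides or from any datasheet.

```python
# Illustrative numbers only; real devices publish their own timing tables.
FAST_PROGRAM_US = 250     # assumed latency of a "fast" page
SLOW_PROGRAM_US = 2200    # assumed latency of a "slow" page
TYPICAL_PROGRAM_US = 900  # assumed single "typical" constant
PAGES_PER_BLOCK = 128     # assumed block geometry


def program_latency_us(page_offset: int) -> int:
    """Return an assumed program latency for a page offset within a block.

    Assumption: even offsets are fast pages and odd offsets are slow pages.
    Actual fast/slow page layouts are device specific.
    """
    if not 0 <= page_offset < PAGES_PER_BLOCK:
        raise ValueError("page offset outside the block")
    return FAST_PROGRAM_US if page_offset % 2 == 0 else SLOW_PROGRAM_US


def block_program_time_us(latency_of) -> int:
    """Total time to program every page of one block, one page at a time."""
    return sum(latency_of(page) for page in range(PAGES_PER_BLOCK))


variation_aware = block_program_time_us(program_latency_us)
worst_case = block_program_time_us(lambda page: SLOW_PROGRAM_US)
typical_case = block_program_time_us(lambda page: TYPICAL_PROGRAM_US)

print(variation_aware, worst_case, typical_case)
```

With these assumed values, the constant models land on either side of the variation-aware total, which is the kind of gap a constant-latency approximation cannot capture.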

NAND Flash Architecture
• Employing cache and data registers
• Multiple planes (memory array)
• Multiple dies

[Figure: memory array (plane)]

Density Trend
• Flash technology
– Each cell is capable of storing multiple bits
– Manufacturing feature size is scaling down
• So far, density has been increasing by two to four times every 2 years

[Figure: flash chip density (Gb) by year, 2002 to 2012]

Density Trend
• Shrinking of the manufacturing feature size might be limited around 12 nanometers
• Multi-die stack technology
– Flash packages continue to scale up by employing multiple dies and planes
• How does performance behavior change?

[Figure: density (Gb) by year, 2002 to 2014, for flash chips and MCPs; labels: 1Gb, 32Gb, 64Gb, MOSAID HLNAND (32Gb 16-die NAND stack)]

Advanced NAND Flash Operations

Legacy Operation
• An I/O operation splits into several operation stages
• Each stage should be appropriately handled by device drivers

[Figure: legacy write sequence; labels: cmd, addr, data, cmd, status check]

Cache Operation
• Cache mode operations use internal registers in an attempt to hide the performance overhead of data movements

[Figure: cache mode sequence; labels: data1, data2]

Internal Data Move Mode
• Saves space and cycles when copying data
• Source and destination page addresses should be located in the same die
• No data movement through the NAND interface

[Figure: internal data move sequence; labels: cmd, src addr, dst addr, cmd]

Multi-plane Mode Operation
• Two different pages can be served in parallel
• Addresses should indicate the same page offset in a block and the same die address, and should have different plane addresses (plane addressing rule)

[Figure: multi-plane sequence; labels: cmd, addr1, addr2, cmd, data1, data2, cmd, status check]
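The plane addressing rule can be written as a small predicate. The sketch below is a minimal illustration in Python; the `PageAddress` type and its field names are hypothetical and are not taken from NANDFlashSim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAddress:
    """Hypothetical decomposed NAND page address."""
    die: int
    plane: int
    block: int
    page: int  # page offset within the block


def satisfies_plane_addressing_rule(a: PageAddress, b: PageAddress) -> bool:
    """Check the three conditions stated on the slide for a two-plane pair."""
    same_die = a.die == b.die
    different_plane = a.plane != b.plane
    same_page_offset = a.page == b.page
    return same_die and different_plane and same_page_offset


first = PageAddress(die=0, plane=0, block=10, page=5)
second = PageAddress(die=0, plane=1, block=22, page=5)
print(satisfies_plane_addressing_rule(first, second))
```

A scheduler could use such a check to decide whether two pending requests may be merged into one multi-plane operation or must be issued separately.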

Interleaved-Die Mode Operation
• Provides a way to take advantage of internal parallelism by interleaving NAND transactions
• Scheduling of NAND transactions and bus arbitration are critical determinants of memory system performance

[Figure: two interleaved sequences; labels: cmd, addr1, data1, cmd and cmd, addr2, data2, cmd]
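A rough way to see why interleaving helps: the sketch below assumes a single shared bus that serializes data transfers, while cell programming on different dies can overlap. The function name, the round-robin die selection, and the timing values are hypothetical; this is not NANDFlashSim's scheduler.

```python
def total_write_time_us(num_pages: int, num_dies: int, bus_us: int, cell_us: int) -> int:
    """Time to write num_pages pages, assigning pages to dies round-robin.

    Assumptions: one shared bus (transfers never overlap), and each die
    is busy from the start of its transfer until its cell program ends.
    """
    if num_dies < 1:
        raise ValueError("need at least one die")
    bus_free_at = 0
    die_free_at = [0] * num_dies
    for i in range(num_pages):
        die = i % num_dies
        # The transfer starts once both the bus and the target die are idle.
        start = max(bus_free_at, die_free_at[die])
        bus_free_at = start + bus_us
        die_free_at[die] = bus_free_at + cell_us
    return max(die_free_at)


for dies in (1, 2, 4):
    print(dies, total_write_time_us(num_pages=4, num_dies=dies, bus_us=100, cell_us=900))
```

In this toy model the cell time is hidden behind transfers to other dies, so the remaining cost is increasingly dominated by bus occupancy as dies are added.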

Challenges
• Performance varies based on:
– intrinsic latency variation characteristics
– internal parallelism
– advanced flash operation types
• Performance is affected by:
– how to deal with diverse advanced flash operations
– how to effectively schedule NAND transactions

Prior Simulation Works
• Flash-based solid state disk simulation
– Tightly coupled to specific flash firmware
• Unaware of latency variation of NAND flash
– Latency approximation model with constants
• Coarse-grain NAND command handling
– In-order execution

[Figure: SSD simulator stack; labels: I/O subsystem, flash firmware, latency approximation]

NANDFlashSim
• Simulating and modeling NAND flash rather than flash firmware or SSDs
– NANDFlashSim can be applied to diverse applications, such as off-chip caches of a multi-core system and I/O subsystems of mobile systems
– Multiple instances can be used for building SATA and PCI-e based SSDs

[Figure: SSD simulator stack; labels: I/O subsystem, flash firmware, latency approximation]

NANDFlashSim
• Detailed timing model
• Awareness of intrinsic latency variation
– Designed to be performance variation-aware and employs different page offsets in a physical block
• Reconfigurable microarchitecture
– Supports highly reconfigurable architectures in terms of multiple dies and planes
• Fine-grain NAND flash command handling
– 16 combinations of advanced flash operations
– Supports out-of-order execution

High-level View
• Command set architecture and an individual state machine associated with it
• Host and NAND flash clock domains are separate
• All entries (controller, register, die, …) are updated at every cycle

[Figure: logical unit with k*j blocks on the NAND flash I/O bus]
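The idea of separate clock domains with every entity updated on each cycle can be sketched as a small cycle-driven loop. Everything here is hypothetical: the `Component` class, the `run` function, and the clock periods are illustration only and do not reflect NANDFlashSim's actual interfaces.

```python
class Component:
    """A hypothetical simulated entity that counts how often it is updated."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cycles = 0

    def update(self) -> None:
        # A real model would advance its state machine here.
        self.cycles += 1


def run(duration_ns, host_period_ns, nand_period_ns, host_parts, nand_parts):
    """Advance both clock domains up to (not including) duration_ns."""
    next_host_edge = 0
    next_nand_edge = 0
    while min(next_host_edge, next_nand_edge) < duration_ns:
        now = min(next_host_edge, next_nand_edge)
        if next_host_edge == now:
            for part in host_parts:
                part.update()
            next_host_edge += host_period_ns
        if next_nand_edge == now:
            for part in nand_parts:
                part.update()
            next_nand_edge += nand_period_ns


host = [Component("host controller")]
nand = [Component("controller"), Component("register"), Component("die")]
run(100, 10, 25, host, nand)
print(host[0].cycles, [part.cycles for part in nand])
```

Keeping one edge counter per domain lets the two clocks run at unrelated frequencies while every entity in a domain still sees each of its own cycles.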

Command Set Architecture
• Multi-stage operation
– Stages are defined by common operations
– CLE, ALE, TIR, TIN, TOR, TON, etc.
• Command chains
– Define command sequences
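One way to picture a command chain is as an ordered list of stages that an operation must pass through. The sketch below is an assumption-laden illustration: the glosses on the stage names are this sketch's reading of the acronyms, the two chains are guesses modeled on the cmd, addr, data, cmd order of a legacy write, and `ChainTracker` is a hypothetical helper.

```python
from enum import Enum


class Stage(Enum):
    # Glosses are this sketch's interpretation; the slides list only the acronyms.
    CLE = "command latch"
    ALE = "address latch"
    TIR = "transfer in to register"
    TIN = "transfer in to NAND cells"
    TOR = "transfer out of register"
    TON = "transfer out of NAND cells"


# Assumed chains, not taken from NANDFlashSim's source.
COMMAND_CHAINS = {
    "legacy_write": [Stage.CLE, Stage.ALE, Stage.TIR, Stage.CLE, Stage.TIN],
    "legacy_read": [Stage.CLE, Stage.ALE, Stage.CLE, Stage.TON, Stage.TOR],
}


class ChainTracker:
    """Hypothetical helper that rejects stages arriving out of order."""

    def __init__(self, operation: str) -> None:
        self.chain = COMMAND_CHAINS[operation]
        self.position = 0

    @property
    def done(self) -> bool:
        return self.position == len(self.chain)

    def advance(self, stage: Stage) -> bool:
        """Accept the next stage; return True once the chain is complete."""
        if self.done or self.chain[self.position] is not stage:
            raise ValueError(f"unexpected stage {stage.name}")
        self.position += 1
        return self.done


tracker = ChainTracker("legacy_write")
for stage in COMMAND_CHAINS["legacy_write"]:
    finished = tracker.advance(stage)
print(finished)
```

Describing operations as data rather than code means a new operation type only needs a new chain entry, which is one plausible reason to separate stages from command sequences.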

Latency Variation Generators

Evaluation Methodology

SAMSUNG

MICRON

SK HYNIX

Validation (Throughput)

[Figures: Write Performance (Single Die), Read Performance (Single Die), Write Performance (Dual Die). Bandwidth (MB/s) for Legacy, Cache, Two-plane (2x), and 2x Cache modes. Series: Worst-case based, Typical-case based, Variation-aware based, MSIS.]

Deviation annotations on the plots: 12%, 1.6%, 3.8%, 2.1%; 4.1%~11.2%; 83.1~170.9% (Typical-case); 7.3%~42% (Worst-case); 5.3~9.4% (NANDFlashSim)

Improvement from deviation: 48.7%~79.6% in typical-case model, 44.4~53.5% in worst-case model

Validation (Latency)

Write-intensive Workload (MSN Server, Financial OLTP)

Latencies of NANDFlashSim almost completely overlap with real product latencies (Hynix)

Read-intensive Workload (Websearch, User)

Performance of Multiple Planes
• Write performance is significantly enhanced as the number of planes increases
– Cell activities (TIN) can be executed in parallel
• Data movement (TOR) is a dominant factor in determining bandwidth
• Larger transfer sizes couldn't take advantage of multiple planes well

[Figures: Write Performance (Single Die) and Read Performance (Single Die). Bandwidth (MB/s) for Legacy, Cache Mode, Two-plane, Four-plane, Six-plane, and Eight-plane operation at transfer sizes of 2048, 4096, and 8192 bytes.]

Performance of Multiple Dies
• Similar to multi-plane, write performance is improved by increasing the number of dies
• The multiple-die architecture provides slightly worse performance than multi-plane

[Figures: bandwidth (MB/s) by package type, 1 (SDP), 2 (DDP), 4 (QDP), 8 (ODP). Series: Legacy, Cache Mode, Two-plane, Two-plane Cache Mode.]

Multi-plane vs. Multi-die
• Under disk-friendly workloads
– The performance of interleaved-die operation is 54.5% better than multi-plane operation on average
– Interleaved-die operations have fewer restrictions on addressing

[Figures: IOPS (0 to 35000) versus the number of planes (1, 2, 4, 8, 16) and versus the number of dies (1, 2, 4, 8, 16) for the msnfs, web, fin, usr, and prn workloads.]

Breakdown of Cycles
• For writes, most cycles are used by the NAND flash itself, while reads spend at least 50.5% of the total time doing [...]

[Figures: cycle breakdown (ns), split into memory cell activities and bus activities, with stage legend TOR, CLE, ALE, TON, TIN, TIR. One panel covers Legacy, Cache Mode, Two-plane (2x), and 2x Cache with annotations 66.9%, 62.8%, 50.5%; the other covers Legacy, Cache Mode, and Two-plane (2x) with annotations 7.0%, 6.8%, 3.6%, 3.5%.]

Page Migration Test
• Energy consumption depends on the number of active components

[Figures: total cycles (ns) and energy (uJ) versus migration block size (2, 4, 8, 16, 32 blocks) for Legacy, Cache Mode, Internal, 2x, 2x cache mode, and 2x internal. A further panel shows the cycle breakdown (ns) by stage (ALE, CLE, TIR, TOR, TIN, TON, BER) for Legacy, Cache Mode, Internal Data Move, Two-plane (2x), 2x cache, and 2x internal, with the saving annotated.]

Conclusion & Future Works
• A research vehicle for evaluating parallelism and architecture trends
– Single instance
  • Integrating it into GEM5 and Simics
  • Plan to apply it with Green Flash and Xtensa of CoDEx
– Multiple instances
  • We successfully built a multi-channel SSD framework with 1024 instances (~16384 dies, ~131072 planes)
• Open Source Project
– Static/shared library
– Standalone simulation
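As a quick sanity check on the multi-instance figures, the approximate totals quoted above imply a regular per-instance geometry. The split computed below is an inference from those rounded totals, not a configuration stated in the slides.

```python
# Approximate totals quoted for the multi-channel SSD framework.
INSTANCES = 1024
TOTAL_DIES = 16384
TOTAL_PLANES = 131072

# Inferred geometry, assuming dies and planes are spread evenly.
dies_per_instance = TOTAL_DIES // INSTANCES
planes_per_die = TOTAL_PLANES // TOTAL_DIES

print(dies_per_instance, planes_per_die)
```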

Q&A
• Download
– http://www.cse.psu.edu/~mqj5086/nfs/
• Mailing list
– [email protected]
• Thanks to
– Dean Klein, Micron Technology, Inc.
– Seung-hwan Song, University of Minnesota
– Michael Kim, Corelinks
– Kurt Lee, Corelinks
– Leonard Ko, Corelinks
– Yulwon Cho, Stanford University