Memory Efficient Sanitization of a Deduplicated Storage System

Fabiano C. Botelho, Philip Shilane, Nitin Garg, Windsor Hsu

FAST 2013, San Jose, February 12-15

© Copyright 2013 EMC Corporation. All rights reserved.


What’s Sanitization All About?

[Figure: diagram showing classified storage, a network, and unclassified storage, with a classified message incident marked.]


How Do We Define Sanitization?

A process to restore the storage system to a state as if the classified message incident had never occurred


Threat Models

1. Casual Attacks
   – Access through regular file system interfaces
   – NFS, CIFS, etc.
2. Robust Keyboard Attack
   – Access through non-regular interfaces
   – Reading blocks directly from disk, swap areas, or unallocated blocks
3. Laboratory Attack
   – Access through exotic laboratory techniques
   – Requires specific disk format knowledge and specialized hardware
   – Even after overwrites, the disk may retain magnetic values indicating a previous state


NIST, DoD Guidelines For Sanitization

1. Clearing Level
   – A single overwrite of affected areas is enough to protect against casual and keyboard attacks.
2. Purging Level
   – Devices must be either degaussed or destroyed to protect against laboratory attacks.


Why Not Crypto Sanitization

1. Crypto sanitization
   – Encrypt each file with a different key and throw away the keys of the affected files.
2. Deduplication is challenging because blocks are shared
3. Key management is a huge hassle
4. NIST and DoD guidelines explicitly say that encryption is not acceptable
5. Sacrifices performance of normal FS operations
   – Read, write, replication


Sanitization Of A System vs A Device

1. Device
   – Either overwrite each sector or degauss the device.
2. In-place System
   – Follow metadata references to the physical location within the storage system, overwrite the values one or more times, and erase the metadata as well as other locations that have become unreferenced.
3. Deduplicated System
   – Usually log-structured with large units of writes
   – No in-place erasure of sub-units
   – Copy forward live data and then erase an earlier region


Bulk Sanitization

1. Sanitizing individual files is challenging
   1. Need to track blocks/chunks that only belong to the file
   2. What if the file has already been deleted?
2. We sanitize the entire system


Sanitization Requirements

1. All deleted data are erased
2. All live data are available
3. Sanitization is efficient
4. The storage system is usable while sanitization runs


Deduplicated Storage

[Figure: architecture of the deduplicated storage system, in four layers.]

– Access protocols: NFS, CIFS, VTL
– Files represented with fingerprints: File 0 … File m, each a sequence of fingerprints (Afp, Bfp, Cfp, …)
– Fingerprint to container index: Afp 0, Bfp 0, Cfp 0, Dfp 0, Efp 1, …, Yfp n, Zfp n
– Containers holding data chunks: Container 0 (A, B, C, D), Container 1 (E, …), …, Container n (…, Y, Z)
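The layout on this slide can be sketched in a few lines of Python. This is an illustrative toy, not the system's implementation: the class name `DedupStore`, the use of SHA-1 as the fingerprint function, and the tiny container capacity are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy model: files -> fingerprints -> container index -> containers."""

    def __init__(self, chunks_per_container=4):
        # chunks_per_container is an arbitrary small value for illustration.
        self.chunks_per_container = chunks_per_container
        self.files = {}         # file name -> list of fingerprints
        self.index = {}         # fingerprint -> container id
        self.containers = [{}]  # container id -> {fingerprint: chunk bytes}

    @staticmethod
    def fingerprint(chunk):
        # SHA-1 is an assumption; the slides do not name the hash function.
        return hashlib.sha1(chunk).hexdigest()

    def write_file(self, name, chunks):
        recipe = []
        for chunk in chunks:
            fp = self.fingerprint(chunk)
            if fp not in self.index:  # new chunk: append to the open container
                if len(self.containers[-1]) >= self.chunks_per_container:
                    self.containers.append({})
                self.containers[-1][fp] = chunk
                self.index[fp] = len(self.containers) - 1
            recipe.append(fp)  # a duplicate chunk only adds a reference
        self.files[name] = recipe

    def read_file(self, name):
        return [self.containers[self.index[fp]][fp] for fp in self.files[name]]


store = DedupStore()
store.write_file("file0", [b"A", b"B", b"C", b"D"])
store.write_file("file_m", [b"A", b"B", b"C", b"E"])
```

After both writes the toy store holds five unique chunks in two containers, even though the two files reference eight chunks in total. Because chunks A, B, and C are shared, deleting one file does not by itself say which chunks can be erased.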

Challenge With Chunk References

• Huge Fingerprint Set In High-End Systems
   – Physical capacity: 560 TiB
   – Avg chunk size: 8 KiB, 4 KiB after compression
   – Number of chunks: 140 billion
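The chunk count follows from dividing the physical capacity by the stored (compressed) chunk size. The short check below is only a sanity calculation. The variable names are illustrative, and reading the units with decimal prefixes to reproduce the rounded figure on the slide is an assumption.

```python
# Sanity check of the chunk count; the unit interpretation is an assumption.
capacity_decimal = 560 * 10**12   # 560 TB read with decimal prefixes
chunk_decimal = 4 * 10**3         # 4 KB per chunk after compression
chunks_decimal = capacity_decimal // chunk_decimal

capacity_binary = 560 * 2**40     # 560 TiB read with binary prefixes
chunk_binary = 4 * 2**10          # 4 KiB per chunk after compression
chunks_binary = capacity_binary // chunk_binary

print(f"decimal units: {chunks_decimal / 1e9:.1f} billion chunks")
print(f"binary units:  {chunks_binary / 1e9:.1f} billion chunks")
```

Either reading gives on the order of 10^11 fingerprints, which is the scale any liveness-tracking structure has to handle.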


Memory Usage Of Each Approach

[Figure: memory requirements (1 MiB to 32 TiB) versus number of chunks (2^20 to 2^40), one line per approach.]

   Reference Counts: 160 bits/chunk
   Bloom Filter: 28.76 bits/chunk
   Perfect Hash: 2.54 bits/chunk
   Bit Vector: 1 bit/chunk
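To make the per-chunk costs concrete, the sketch below multiplies each figure from this chart by the 140 billion chunks quoted on the previous slide. The dictionary and function names are illustrative, and the pairing of each approach with its bits/chunk value follows the order in which they appear on the slide.

```python
# Bits per chunk as listed on the chart; the mapping is read off the slide order.
BITS_PER_CHUNK = {
    "reference counts": 160.0,
    "Bloom filter": 28.76,
    "perfect hash": 2.54,
    "bit vector": 1.0,
}


def memory_gib(num_chunks, bits_per_chunk):
    """Total memory in GiB for num_chunks entries at the given cost per entry."""
    return num_chunks * bits_per_chunk / 8 / 2**30


NUM_CHUNKS = 140 * 10**9  # chunk count from the high-end system example

for name, bits in BITS_PER_CHUNK.items():
    print(f"{name:>16}: {memory_gib(NUM_CHUNKS, bits):8.1f} GiB")
```

At this scale the reference-count approach needs terabytes of memory, while the perfect hash and bit vector figures come out in the tens of GiB.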

Perfect Hashing Vector (PHvec)

[Figure: a fingerprint set S = {s1, s2, …, sn} is mapped by a perfect hash function (PHF), a collision free hash function for the fingerprints in S, into a PH vector of bits with positions 0 … m-1, where m ≥ n.]

|PHvec| = |PHF| + |PH vector|
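A minimal sketch of the idea, assuming a brute-force seed search as a stand-in for a real perfect hash construction (the slides do not spell out which construction is used). The names `slot`, `build_phf`, and `PHvec`, the SHA-256 based hashing, and the choice m = 2n are all assumptions for illustration.

```python
import hashlib


def slot(fp, seed, m):
    """Hash fingerprint bytes to a slot in [0, m) under a given seed."""
    digest = hashlib.sha256(seed.to_bytes(4, "little") + fp).digest()
    return int.from_bytes(digest[:8], "little") % m


def build_phf(fingerprints, m):
    """Brute-force search for a seed that is collision free on this set.

    Only practical for small sets; it stands in for a real perfect hash
    construction.
    """
    assert m >= len(fingerprints)
    for seed in range(1_000_000):
        if len({slot(fp, seed, m) for fp in fingerprints}) == len(fingerprints):
            return seed
    raise RuntimeError("no collision-free seed found")


class PHvec:
    """PHF (here just a seed) plus an m-bit vector of liveness bits."""

    def __init__(self, fingerprints, m):
        self.m = m
        self.seed = build_phf(fingerprints, m)
        self.bits = bytearray((m + 7) // 8)

    def mark_live(self, fp):
        i = slot(fp, self.seed, self.m)
        self.bits[i // 8] |= 1 << (i % 8)

    def is_live(self, fp):
        i = slot(fp, self.seed, self.m)
        return bool(self.bits[i // 8] & (1 << (i % 8)))


# Seven synthetic fingerprints; SHA-1 is an assumption for the example.
S = [hashlib.sha1(bytes([c])).digest() for c in b"ABCDEYZ"]
vec = PHvec(S, m=2 * len(S))
vec.mark_live(S[0])
vec.mark_live(S[2])
```

Because the function is collision free on S, each fingerprint owns its bit, so no fingerprint itself has to be stored in memory. The flip side is that a fingerprint outside S still hashes to some slot, so the structure is only meaningful for the set it was built for.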


Bucketizing The Huge Fingerprint Set

– Buckets are variable size
– 16 K fingerprints per bucket on average
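One way to picture bucketizing is to partition fingerprints by a hash of their leading bytes, presumably so that each bucket can be processed on its own. The sketch below is a hedged illustration: the function name `bucketize`, the SHA-1 fingerprints, and the scaled-down sizes (16 per bucket on average instead of 16 K) are assumptions.

```python
import hashlib
from collections import defaultdict


def bucketize(fingerprints, num_buckets):
    """Assign each fingerprint to a bucket using its leading bytes."""
    buckets = defaultdict(list)
    for fp in fingerprints:
        buckets[int.from_bytes(fp[:8], "big") % num_buckets].append(fp)
    return buckets


# 4096 synthetic fingerprints spread over 256 buckets: 16 per bucket on average.
fps = [hashlib.sha1(i.to_bytes(4, "big")).digest() for i in range(4096)]
buckets = bucketize(fps, num_buckets=256)
sizes = [len(b) for b in buckets.values()]
print(min(sizes), sum(sizes) / 256, max(sizes))
```

Since the assignment depends only on the fingerprint, bucket sizes vary around the average rather than being fixed, which matches the "variable size" remark on the slide.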


Sanitization Process – Merge (1)

[Figure: disk holds files represented with fingerprints (File 0: Afp, Bfp, Cfp, …), Containers 0 … n+1, and the fingerprint to container index (Afp 0, Bfp 0, Cfp 0, Dfp 0, Efp 1, …). Step 1: take consistency point CP0 and merge the in-memory fingerprint to container index with the on-disk index.]

Sanitization Process – Analysis (2)

[Figure: disk holds files represented with fingerprints, Containers 0 … n+1, and the fingerprint to container index; memory holds the Perfect Hash Vector with slots 1 … #fps. Step 2: build the Perfect Hash Vector from the fingerprint to container index. Container range covered by PHvec: {0, …, n}.]

Sanitization Process – Enumeration (3)

[Figure: disk holds files represented with fingerprints, Containers 0 … n+1, and the fingerprint to container index; memory holds the Perfect Hash Vector with slots 1 … #fps. Step 3: mark live fingerprints, those referenced by files, in the Perfect Hash Vector. Container range covered by PHvec: {0, …, n}.]

Sanitization Process – Copy (4)

[Figure: disk holds files represented with fingerprints, Containers 0 … n+1, and the fingerprint to container index; memory holds the Perfect Hash Vector with live fingerprints marked. Step 4: copy live data forward. Container range covered by PHvec: {0, …, n}.]

Sanitization Process – Zero (5)

[Figure: disk holds files represented with fingerprints, Containers 0 … n+1, and the fingerprint to container index; memory holds the Perfect Hash Vector with live fingerprints marked. Step 4: copy live data forward. Step 5: zero free blocks. Container range covered by PHvec: {0, …, n}.]
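The five phases (merge, analysis, enumeration, copy, zero) can be walked through on toy data. The sketch below is a hedged simplification rather than the authors' implementation: plain dicts stand in for the on-disk structures, a dict of positions stands in for the perfect hash function, and the function name `sanitize`, the container granularity, and the order of index updates are all assumptions.

```python
def sanitize(files, mem_index, disk_index, containers):
    """Toy walk through the five phases on plain dicts.

    files:      file name -> list of fingerprints (live namespace)
    mem_index:  fingerprint -> container id (recent entries, in memory)
    disk_index: fingerprint -> container id (on disk)
    containers: container id -> {fingerprint: chunk bytes}
    """
    # 1. Merge: fold the in-memory index into the on-disk index.
    disk_index.update(mem_index)
    mem_index.clear()

    # 2. Analysis: build a liveness vector over all indexed fingerprints.
    #    A dict of positions stands in for the perfect hash function.
    position = {fp: i for i, fp in enumerate(sorted(disk_index))}
    live = [False] * len(position)

    # 3. Enumeration: mark every fingerprint still referenced by a file.
    for recipe in files.values():
        for fp in recipe:
            live[position[fp]] = True

    # 4. Copy: move live chunks out of containers that hold dead chunks.
    freed = []
    next_id = max(containers) + 1
    for cid in sorted(containers):
        chunks = containers[cid]
        dead = [fp for fp in chunks if not live[position[fp]]]
        if not dead:
            continue  # nothing to erase in this container
        survivors = {fp: c for fp, c in chunks.items() if fp not in dead}
        if survivors:
            containers[next_id] = survivors
            for fp in survivors:
                disk_index[fp] = next_id
            next_id += 1
        for fp in dead:
            del disk_index[fp]
        freed.append(cid)

    # 5. Zero: overwrite the contents of every freed container.
    for cid in freed:
        containers[cid] = {fp: bytes(len(c)) for fp, c in containers[cid].items()}
    return freed


files = {"file0": ["Afp", "Bfp", "Cfp", "Efp"]}  # Dfp is no longer referenced
mem_index = {"Efp": 1}
disk_index = {"Afp": 0, "Bfp": 0, "Cfp": 0, "Dfp": 0}
containers = {
    0: {"Afp": b"A", "Bfp": b"B", "Cfp": b"C", "Dfp": b"secret"},
    1: {"Efp": b"E"},
}
freed = sanitize(files, mem_index, disk_index, containers)
```

In this run container 0 holds the dead chunk, so its three live chunks are copied forward into a new container and the old one is zeroed; container 1 is untouched.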

Issues to Support Read-Write Mode

1. How do we handle resurrections?
2. How do we update the PHvec structure for fingerprints that came in after CP0 was taken but before the PHvec structure was constructed in the analysis phase?


Handling Resurrections

Notify Mechanism

[Figure: all incoming chunks pass through the dedupe engine, which notifies sanitization so that their fingerprints are marked live in the in-memory Perfect Hash Vector (slots 1 … #fps). Container range covered by PHvec: {0, …, n}.]
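The notify mechanism can be sketched as a callback on the ingest path. This assumes a resurrection means that a chunk which would otherwise be treated as dead becomes referenced again because newly written data deduplicates against it; the slides do not define the term. The class `DedupeEngine`, the callback signature, and the use of a set in place of the PH vector are hypothetical.

```python
class DedupeEngine:
    """Toy ingest path that reports every incoming fingerprint to listeners."""

    def __init__(self, index):
        self.index = index    # fingerprint -> container id
        self.listeners = []   # callbacks registered by sanitization

    def register(self, callback):
        self.listeners.append(callback)

    def ingest(self, fp, next_container):
        duplicate = fp in self.index
        if not duplicate:
            self.index[fp] = next_container  # new chunk goes to a new container
        for notify in self.listeners:
            notify(fp, self.index[fp])
        return duplicate


covered = range(0, 2)  # container range covered by the PHvec: {0, 1}
live = set()           # stands in for bits set in the PH vector


def mark_live(fp, container_id):
    # Only fingerprints inside the covered range belong to the PHvec.
    if container_id in covered:
        live.add(fp)


engine = DedupeEngine({"Afp": 0, "Dfp": 0, "Efp": 1})
engine.register(mark_live)
engine.ingest("Dfp", next_container=2)  # deduplicates against an existing chunk
engine.ingest("Ffp", next_container=2)  # new chunk, outside the covered range
```

Here `Dfp` is marked live as soon as new data references it, so a concurrent copy phase would not discard it, while `Ffp` lands in a container outside the range being sanitized and is ignored.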


Handling Fingerprints Not Referenced by First Consistency Point (CP0)

[Figure: two namespace snapshots over files F1, F2, F3, …, Fn. Namespace CP0 is the first consistency point; namespace CP1 is taken after creating the PHvec and setting up the notify mechanism.]

Enumerate CP1

[Figure: namespace CP1 with files F1, F2, F3, …, Fn. Fingerprints from F1 and Fn are looked up in the fingerprint to container index (Afp 0, Bfp 0, Cfp 0, Dfp 0, Efp 1, …), and live fingerprints are marked in the in-memory Perfect Hash Vector (slots 1 … #fps). Container range covered by PHvec: {0, …, n}.]

Enumerate CP0

[Figure: namespace (labeled CP1 on the slide) with files F1, F2, F3, …, Fn. Fingerprints from F1 to Fn are passed to "Add fp to PHvec", and live fingerprints are marked in the in-memory Perfect Hash Vector (slots 1 … #fps). Container range covered by PHvec: {0, …, n}.]

Experimental Setup and Synthetic Tool

• 16-core Intel Xeon, 2.53 GHz, 8 MiB cache
• 6 RAID6 groups with 2 TiB drives
   – 129.4 TiB of usable capacity
• 72 GiB of RAM
• Synthetic tool that mimics backup workload by leveraging our prior knowledge of such workloads


No Deduplication: Deleted Space vs Sanitization Time

[Figure: time (seconds, 0 to 120000) versus space (TiB, 0 to 80). Series: Merge, Analysis, Enumeration, Copy, Zero, Sanitization.]

731 MiB/second on average

With Deduplication: Deleted Space vs Sanitization Time

[Figure: time (seconds, 0 to 40000) versus deleted logical bytes (TiB, 0 to 160). Series: Merge, Analysis, Enumeration, Copy, Zero, Sanitization.]

5.06 GiB/second on average

No Deduplication vs With Deduplication

• Deduplication factor: 7.38X
• Throughput (no dedup): 731 MiB/second
• Throughput (with dedup): 5.06 GiB/second
• Speedup factor: 7.1X
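The speedup factor is the ratio of the two throughput figures once they are expressed in the same unit. A quick check, with illustrative variable names:

```python
no_dedup_mib_s = 731.0
with_dedup_mib_s = 5.06 * 1024   # 5.06 GiB/second expressed in MiB/second
speedup = with_dedup_mib_s / no_dedup_mib_s
print(f"speedup: {speedup:.1f}X")
```

The ratio comes out at about 7.1, close to the 7.38X deduplication factor.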


Concurrent Data Ingest and Sanitization with Deduplication

• Data ingest runs at 70% of its peak throughput
• Sanitization runs at 59% or more of its peak throughput
• Data ingest: CPU intensive
• Sanitization: IO intensive


Concurrent Data Ingest and Sanitization – No Deduplication

• Data ingest runs at 70% of its peak throughput
• Sanitization runs at 45% or more of its peak throughput
• Data ingest: CPU and IO intensive
• Sanitization: IO intensive


Conclusions

• Sanitization is a critical security feature
• We have proposed a process to carry out bulk sanitization of a storage system rather than of individual devices
• We use perfect hashing to minimize memory and I/O requirements
• Nearly linear performance as storage grows
• Effective throughput scales with the deduplication factor
• Sanitization without the zero phase can be used as a process to reclaim dead space


Q&A
