Distributed File Systems

CMPS 128, UC Santa Cruz, Winter 2006

Distributed file systems

What are they good for?
- Sharing information with others
- Accessing information remotely
- Remote backup

Why are they difficult?
- Consistency
- Transparency
- Replication

Techniques and examples
- NFS (Network File System)
- Andrew File System (later DFS)
- Ceph

What does a file system do?

- A file system stores data and allows users to retrieve it
- Must support several features
  - Naming: relate file names to file IDs
  - Storage management: relate file IDs to storage
  - Access control
  - Handle low-level storage access
- Goal: allow users to associate names with chunks of data
- Pretty straightforward on a single computer…

Why distributed file systems?

- Users want to access files on multiple systems
  - Network of computers
  - Access from anywhere in the world
- Users want to share their data
  - Use on multiple computers
  - One copy of a file used by multiple people
- Users want better performance
  - A single file server can be a bottleneck
  - Distribute the storage across servers

Distributed file system requirements: transparency

- Access transparency
  - Client doesn’t know if the file is local
  - Methods are the same for local & remote files
- Location transparency
  - Client doesn’t know where the file is stored
  - Relocating files doesn’t force lots of changes
- Performance transparency
  - Acceptable performance regardless of load & location
- Scaling transparency
  - Service can be expanded without loss of performance
  - May be relatively dynamic or static

More distributed file system requirements

- Concurrent file updates
  - Files can be modified by multiple clients at the same time
  - Requires good concurrency control
  - Provide one-copy file update semantics
- File replication: keep multiple copies of files
- Fault tolerance
  - Replicate servers
  - Guard against client and server failures
- Security
  - Must be at least as good as a single-computer system
  - Can’t make assumptions about the user of the client system
- Efficiency: not much good if it’s slow!

File service architecture

Three basic components:
- Flat file service
  - Uses unique file IDs (UFIDs) for all requests
  - Typically 32–128 bit integers
  - Unique for all files in the distributed file system
  - Reads and writes use UFIDs to identify files
- Directory service
  - Translates human-readable names into unique file IDs
  - Often is a client of the flat file service
- Client module
  - Interacts with the other two
  - Provides files to user programs

Flat file service interface

- Six basic functions for flat files
- All use unique file IDs to identify files
- Most operations are repeatable
  - Create() isn’t repeatable
- Can be done by a stateless server
  - Restart by replaying calls to which there was no reply
- Differs from Unix
  - No “current position” pointer
  - No open or close: done by the directory service

Operation                   Function
Read(fid, k, n) -> data     Reads n bytes of data starting at position k from file fid
Write(fid, k, data)         Writes data to file fid starting at position k
Create() -> fid             Creates a file and returns its unique ID
Delete(fid)                 Deletes a file from the flat file store
GetAttr(fid) -> attr        Gets the attributes of file fid
SetAttr(fid, attr)          Sets the attributes of file fid
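To make the interface concrete, here is a minimal in-memory sketch of the six flat file operations in Python. The class and its dictionary-backed store are purely illustrative, not drawn from any real implementation, but they show why Read and Write are repeatable (explicit offsets, no per-client state) while Create is not.

```python
# Hypothetical sketch of the flat file service interface described above.
import itertools

class FlatFileService:
    """Stores file contents keyed only by unique file IDs (UFIDs)."""

    def __init__(self):
        self._store = {}                 # fid -> bytearray of file contents
        self._attrs = {}                 # fid -> dict of attributes
        self._next_fid = itertools.count(1)

    def create(self):
        """Not repeatable: every call makes a brand-new file."""
        fid = next(self._next_fid)
        self._store[fid] = bytearray()
        self._attrs[fid] = {}
        return fid

    def read(self, fid, k, n):
        """Read n bytes starting at position k; repeatable (idempotent)."""
        return bytes(self._store[fid][k:k + n])

    def write(self, fid, k, data):
        """Write data at explicit position k; repeatable for the same arguments."""
        buf = self._store[fid]
        if len(buf) < k:
            buf.extend(b"\0" * (k - len(buf)))
        buf[k:k + len(data)] = data

    def delete(self, fid):
        self._store.pop(fid, None)
        self._attrs.pop(fid, None)

    def get_attr(self, fid):
        return dict(self._attrs[fid])

    def set_attr(self, fid, attr):
        self._attrs[fid] = dict(attr)
```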

Access control

- Need to ensure that the user is allowed to perform the operation
  - May need to check on every flat file access!
- Two possible approaches
  - Set up a capability when a file is opened (using the directory service)
    - Client caches it and provides it with each access
  - Client provides identity with each access
    - Server checks permissions on each request
- Both are stateless
- The second is more common (NFS, AFS)

Directory service interface

- Lookup(name) -> fid
  - Looks up name and returns the unique file ID
  - Error if the name isn’t found
- AddName(name, fid)
  - Adds a name to the directory service
  - Error if the name already exists
  - Note: no check for validity of fid!
- UnName(name)
  - Removes a name from the directory service
  - Error if the name isn’t found
- GetNames(pattern) -> nameList
  - Gets a list of names that match a pattern
  - Example: directory listing
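A corresponding sketch of these directory service semantics, assuming a toy in-memory name table; the exception style is an assumption, but the error cases and the unchecked fid follow the list above.

```python
# Minimal sketch of the directory service interface described above.
import fnmatch

class DirectoryService:
    def __init__(self):
        self._names = {}                 # name -> fid

    def lookup(self, name):
        if name not in self._names:
            raise KeyError("name not found")
        return self._names[name]

    def add_name(self, name, fid):
        if name in self._names:
            raise KeyError("name already exists")
        # Note: fid is not checked for validity, matching the slide.
        self._names[name] = fid

    def un_name(self, name):
        if name not in self._names:
            raise KeyError("name not found")
        del self._names[name]

    def get_names(self, pattern):
        return [n for n in self._names if fnmatch.fnmatch(n, pattern)]
```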

Hierarchical file systems

- Unix uses a hierarchical name scheme
  - Directories contain other directories and files
- This can be done
  - By the client
    - Translate the directory into a unique ID
    - Look up the next component of the name
    - Use the flat file service to read directories as needed
  - By the server
    - Full name passed to the server
    - Server does the lookup
- Flexibility in attaching new directory trees
  - Client knows which trees are attached at which names
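A hedged sketch of the client-side approach: resolve one pathname component at a time, reading each directory through the flat file service. The parse_directory helper and the fixed read size are hypothetical stand-ins for the client's directory-decoding logic.

```python
# Client-side iterative pathname resolution over a flat file service.
def resolve(path, root_fid, flat_files, parse_directory):
    fid = root_fid
    for component in path.strip("/").split("/"):
        if not component:
            continue
        # Read the directory's contents (up to 1 MB here, a simplification)
        # and decode it into a name -> UFID mapping.
        directory = parse_directory(flat_files.read(fid, 0, 1 << 20))
        if component not in directory:
            raise FileNotFoundError(path)
        fid = directory[component]
    return fid
```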

File groups

- Collect files into file groups
  - A file group is located on a single server
  - A server can hold multiple groups
  - Groups can be moved between servers
- The identifier often contains a group ID as well as a file ID
  - The group ID is mapped to a machine address by the naming system
  - This allows transparent moves of file groups
- File group IDs are unique throughout the distributed FS
- Replication and other functions can be done on a file group basis
- This is the foundation for UCSC’s data distribution algorithms RUSH & CRUSH
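One way to picture this, under assumed field sizes: pack a group ID and a per-group file ID into the UFID, and map only the group ID to a server, so moving a whole group just updates one table entry.

```python
# Illustrative sketch: UFID = (group ID, file ID); the bit split and the
# group_map table are assumptions, not a real system's layout.
GROUP_BITS = 32

def make_ufid(group_id, file_id):
    return (group_id << GROUP_BITS) | file_id

def server_for(ufid, group_map):
    group_id = ufid >> GROUP_BITS
    return group_map[group_id]        # moving a group only updates this table

group_map = {7: "serverA", 8: "serverB"}
ufid = make_ufid(7, 42)
assert server_for(ufid, group_map) == "serverA"
group_map[7] = "serverC"              # transparent move of the whole group
assert server_for(ufid, group_map) == "serverC"
```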

Distributing data across servers

- A single server isn’t scalable
- Use multiple servers to store the data
- Approaches to scaling
  - Distribute file groups: parallelism between file groups, but not within them
  - Distribute files: more parallelism, but individual file bandwidth is limited
  - Decluster files: spread a file across multiple servers
    - Use the naming service to map different offsets in the file to the appropriate server
    - This can be used to gain redundancy as well
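A small sketch of declustering with round-robin striping; the stripe size and server list are illustrative assumptions, but they show how a byte offset maps to a server and a local offset on that server.

```python
# Toy declustering: stripe a file across servers in fixed-size units.
STRIPE = 64 * 1024                     # 64 KB stripe units (illustrative)

def locate(offset, servers):
    stripe_index = offset // STRIPE
    server = servers[stripe_index % len(servers)]
    offset_on_server = (stripe_index // len(servers)) * STRIPE + offset % STRIPE
    return server, offset_on_server

servers = ["s0", "s1", "s2", "s3"]
print(locate(0, servers))                  # ('s0', 0)
print(locate(3 * STRIPE + 10, servers))    # ('s3', 10)
```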

Sun NFS (Network File System)

- Client & server modules
- Exist for Unix (including MacOS X) & Windows
- Uses a standard Unix file system for underlying storage
- Implemented as a user-level server with kernel help

[Figure: NFS architecture. On the client, applications call into the virtual file system, which dispatches to either the local Unix file system or the NFS client; the NFS client talks to the NFS server over the NFS protocol. On the server, the NFS server (with NFS kernel code) sits above the virtual file system and the local Unix file system.]

Virtual file system

- The kernel has a “switch” that allows the use of multiple file systems in the same way
  - Local file systems
  - Remote file systems
  - Things that resemble file systems (/proc)
- NFS uses the switch in two ways
  - On the client, to send requests to the NFS server
  - On the server, to allow the use of different local file systems

NFS details

- NFS file system is “mounted” at a location in the local file system
  - The local file system traverses the directory tree
  - If it hits a “mount point”, it switches to the NFS client
- The VFS has to track which file systems are mounted where
- The VFS uses v-nodes for open files
  - Indicate local or remote
  - Contain file-specific information
- NFS identifiers (file handles) contain
  - Local file identifier (inode number)
  - File system ID
  - Inode generation number (needed to deal with inode reuse)
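The file handle contents can be pictured as a small record like the following sketch. The field names are illustrative; real NFS treats the handle as an opaque blob that only the server interprets.

```python
# Sketch of the information an NFS file handle carries, per the slide.
from dataclasses import dataclass

@dataclass(frozen=True)
class FileHandle:
    filesystem_id: int       # which exported file system on the server
    inode_number: int        # local file identifier on the server
    generation: int          # bumped when an inode number is reused

    def matches(self, current_generation):
        # A stale handle is detected when the generation no longer matches.
        return self.generation == current_generation
```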

NFS client

- Integrated into the kernel
  - No difference between local and remote files
  - Single instance of the FS for all applications
  - Authentication done in the kernel
- Transfers blocks to and from the server
  - Shares the client block cache with local file systems
  - Potential cache consistency issue: what if a file is in two client caches at the same time?
- Coordinates lookup of names with the server

Access control & authentication

- Server is stateless: must check identity on every call!
- RPC includes user authentication on each request
  - User ID and group ID (easy to fake)
    - The kernel can include any UID and GID
    - Assumes that the kernel is secure (not a good assumption!)
  - DES encryption (better)
  - Kerberos (best)

NFS server

- Integrates lookup and flat file service
- Lookup is done iteratively (by the VFS!)
  - Look up one pathname component at a time
  - May be local or remote: the VFS decides
  - Caching can make this faster…
- Lookup translates names into file handles
- The flat file service uses file handles
  - Read and write include explicit offsets
  - Directory operations include the file handle of the directory (must be looked up first)

Mounting & automounting

- NFS file systems are mounted at points in the directory tree
  - Specified in a configuration file
  - Helpful to have the same mount point for all clients, but not required…
  - The server makes directory trees available to clients based on a server config file
- Automounter: automatically mount file systems as needed
  - Specify an empty directory
  - Requests for subdirectories are handled as mount requests to the server
  - Supports read-only replication if multiple servers are listed

Client caching in NFS

- Reads
  - Caches files and directories
  - Uses timestamps to ensure freshness
    - If the age is sufficiently short, reuse without asking the server
    - Files: typically 3–30 seconds
    - Directories: typically 30–60 seconds
  - Shorter time -> closer to one-copy semantics
- Writes
  - Blocks are cached locally and flushed periodically
    - Similar to behavior for local file systems
  - Writes are asynchronous
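The freshness check can be sketched as follows, in the usual textbook form: reuse the cached entry if it was validated within the freshness interval t, otherwise compare the cached modification time against the server's. The names are illustrative, not a specific implementation.

```python
# Sketch of timestamp-based cache freshness checking in an NFS-like client.
import time

def entry_is_valid(time_validated, t_fresh, modified_client, get_server_mtime):
    if time.time() - time_validated < t_fresh:   # e.g. 3-30 s for files
        return True                              # reuse without asking the server
    # Otherwise ask the server whether the file changed since it was cached.
    return modified_client == get_server_mtime()
```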

Server caching in NFS

- The NFS server can do readahead (prefetching)
  - Keeps results in the server’s cache
- Two options for writes
  - Keep data in cache and write immediately (write-through)
    - Safer, especially since the write was already delayed at the client
  - Keep data in cache and do a delayed write (on commit): the default for most NFS clients
    - Faster: no need to wait for the disk write
    - Commit is typically done on file close
- Performance is critical for servers because they may serve many clients

NFS & Kerberos

- Default: trust the kernel to do proper authentication
  - Easy to circumvent!
- One option: full Kerberos ticket on every request
  - Very secure
  - Potentially slow
  - Required a lot of changes!
- Hybrid approach
  - The mount server gets full Kerberos authentication when the home directory is mounted
  - Further requests are trusted
  - This is more secure, assuming each computer has at most one user
- NFSv4 has more complete authentication & security

Andrew File System (AFS)

- Transparent access to remote shared files
- More scalable than NFS
  - Large numbers of users
  - Wide-area access
- Serves whole files rather than blocks
- Caches whole files (or large chunks) on local disk
  - Reduces traffic
  - Allows clients to cache lots of read-only or read-mostly files (like binaries)
  - Clients can also cache files that are read-write but only accessed locally (a user’s personal files)
  - Not good for databases!

AFS operation

- User issues an open call
  - The client fetches the file and stores it on local disk if a copy isn’t already there
  - The file is opened locally and a handle is returned to the user
- Subsequent read & write requests go to the local copy
- User issues a close call
  - If the file has been updated, its contents are sent back to the server
  - The server stores the data and updates timestamps
  - The client keeps a cached copy of the file in case it’s needed later
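A rough sketch of this open/close flow, with hypothetical fetch_file and store_file calls standing in for the real client-server protocol and a made-up cache directory:

```python
# Illustrative whole-file caching on open, write-back on close (not Venus itself).
import os

CACHE_DIR = "/var/cache/afs-sketch"        # hypothetical local cache location

def afs_open(path, mode, fetch_file):
    local = os.path.join(CACHE_DIR, path.strip("/").replace("/", "_"))
    if not os.path.exists(local):          # whole-file fetch on first open
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(local, "wb") as f:
            f.write(fetch_file(path))
    return open(local, mode)               # reads and writes hit the local copy

def afs_close(handle, path, was_modified, store_file):
    handle.close()
    if was_modified:                       # write the whole file back on close
        with open(handle.name, "rb") as f:
            store_file(path, f.read())
    # The local copy stays in the cache in case it is needed again.
```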

AFS implementation

- Two AFS code modules
  - Server: Vice (user-level process)
  - Client: Venus (user-level process)
- Non-local operations on the client are pushed up to the Venus process and sent to the server
- No local files except for /tmp and similar
  - AFS is good for distributing binaries that might normally be stored locally
  - User directories are in shared space

AFS: Venus client module

- Venus caches files using a local disk partition
  - Manages disk space
  - Handles callbacks (more on that in a bit)
  - Returns (local) file IDs to user processes
- Venus manages translation from names to 96-bit file identifiers
  - Iterative, similar to what NFS does
- Venus flushes changed local files to the server on close

AFS cache consistency

- AFS uses “open to close” consistency
  - Changes to files are seen by other clients only when the file is closed (local processes see changes immediately)
  - Changes are written back on close
- Uses callbacks to ensure consistency
  - Vice hands out “callback promises” with each file
  - Vice calls back the clients caching a file when the file changes
    - Those clients then invalidate their copies
    - A client must get a new copy if the file is needed again
  - A client must check validity after a crash (it may have missed a callback!)
  - Callbacks expire after a fixed time
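Callback handling on the client can be pictured roughly as follows; the CallbackPromise class and the expiry value are assumptions used only to illustrate the invalidate-and-refetch behavior described above.

```python
# Sketch of callback-promise bookkeeping on a Venus-like client.
import time

EXPIRY = 600   # callbacks expire after a fixed time (value is illustrative)

class CallbackPromise:
    def __init__(self):
        self.issued_at = time.time()
        self.cancelled = False       # set when the server calls back about an update

    def usable(self):
        return not self.cancelled and time.time() - self.issued_at < EXPIRY

def open_cached(fid, cache, promises, revalidate):
    promise = promises.get(fid)
    if promise is None or not promise.usable():
        cache.pop(fid, None)         # invalidate, then refetch from the server
        cache[fid], promises[fid] = revalidate(fid), CallbackPromise()
    return cache[fid]
```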

Updates in AFS

- Local processes see all changes immediately
- Remote clients see changes on close
- If multiple clients write, the latest close “sticks”
  - Even if the changes are to different parts of the file!
- For the common case, this is OK
  - Single user on a single system: updates are done to a single (local) copy
  - Single user serially using multiple systems: files follow her to the new system
  - Commonly used binaries: updated infrequently and only on one system
- Issue: update semantics are different for local and remote files!

AFS summary

- Caches at a coarser granularity than NFS
  - Whole-file caching on disk (also large chunks)
  - Updates on close, not block writes
- Better scalability than NFS
  - Better for wide-area access
  - Better for thousands of users
- Requires callbacks for consistency
  - The server notifies clients that files are no longer valid
  - Reduces traffic but requires more server state

Peta-scale Data Storage: Ceph Goals

Performance
- 2 PB of data
  - 2000–5000 hard drives
- 100 GB/sec aggregate throughput
  - 1–5000 hard drives pumping out data as fast as they can
- Billions of files
  - 1–10,000+ files/directory
  - Files ranging from bytes to terabytes
  - ~1000 times larger than the current largest
- 50 µsec metadata access times

Usage
- High-performance direct access from up to 10,000 clients, to
  - Different files in different directories
  - Different files in the same directory
  - The same file
- Mid-performance local access by visualization workstations
  - QoS requirements
- Wide-area general-purpose access

Peta-scale Data Storage Challenges

- Massive scale of everything
  - Huge files, directories, data transfers, etc.
- Managing the data
  - Coordinating the activity of 1000 disks
- Managing the metadata
  - Massive parallelism required
- Workload
  - Handling both scientific and general-purpose workloads
- Scalability
  - Must be able to grow (or shrink) dynamically
- Reliability
  - 1000 hard drives ⇒ frequent failures
- Security
  - Authentication, encryption, etc.
- Performance
  - Complex system ⇒ many possible bottlenecks
- Human interface
  - Finding anything among all of that data

First Key Idea: Object-based Storage

[Figure: traditional vs. object-based storage stacks. Traditional storage: applications call through the system call interface into the operating system's file system, which uses a logical block interface to the block I/O manager on the hard drive. Object-based storage: the file system is split into a client component in the operating system and a storage component that runs on the object-based storage device (OSD); the two communicate through an object interface, and the OSD's own block I/O manager handles low-level layout.]

2nd Key Idea: Manage Data and Metadata Separately

[Figure: the file system client component in the operating system sends metadata requests to a metadata server (MDS), which runs the file system metadata manager and handles metadata storage and system management; data requests go directly to object-based storage devices (OSDs), each running the file system data manager and a block I/O manager.]

Peta-scale Object-based Storage System Architecture

[Figure: system architecture with a cluster of metadata servers (1–10), object-based storage devices (2000–5000), and clients (10,000+).]

Challenges and Solutions

Client SW
1. Interface
2. Cache Mgmt
3. Workload

MDS Cluster SW
1. Lazy Hybrid
2. Dynamic Subtree Partitioning

OSD SW
1. OBFS
2. EBOFS

Other
1. Reliability
2. Data Distribution
3. Quality of Service
4. Network
5. Security
6. Locking/Leasing
7. Performance
8. Scalability
9. Simulation
10. Analysis

Ceph features we’ll discuss

1. MDS SW: Dynamic Subtree Partitioning
2. Data Distribution: RUSH & CRUSH
3. Reliability: FaRMs
4. Security

Why is Metadata Management Hard?

- File data storage is trivially parallelizable
  - File I/O occurs independent of other files/objects
  - Scalability of the OSD array is limited only by the network architecture
- Metadata semantics are more complex
  - Hierarchical directory structure defines object interdependency
  - Metadata location & POSIX permissions depend on parent directories
  - The MDS must maintain file system consistency
- Heavy workload
  - Metadata is small, but there are lots of objects and lots of transactions
  - 30–80% of all file system operations involve metadata
  - Variety of usage patterns: scientific and general purpose
- Hot spots
  - Popular files and directories are common, and concurrent accesses can overwhelm many schemes
- Good metadata performance is critical to overall system performance

Metadata Management Goals

- POSIX-compliant API
  - Standard UNIX-style file and directory semantics
- High performance
  - Efficient metadata access, directory operations, and access control, with a high degree of parallelism
- Scalability
  - Performance scales with the number of metadata servers
  - Easy addition and removal of metadata servers
- Uniform namespace
  - Load balancing among metadata servers under various conditions

Metadata Partitioning Alternatives

From coarse to fine partitioning:
- Static Subtree Partitioning: portions of the file hierarchy are statically assigned to MDS nodes (a la NFS, AFS, etc.)
- Directory Hashing: hash on the directory portion of the path only
- File Hashing: metadata is distributed based on a hash of the full path (or inode number)

Tradeoffs:
- Coarse distribution (static subtree partitioning)
  - Preserves locality
  - Leads to an imbalanced distribution as the file system and workload change
- Finer distribution (hash-based partitioning)
  - Destroys locality (ignores the hierarchical structure)
  - Probabilistically less vulnerable to “hot spots” and workload change

Dynamic Subtree Partitioning

Dynamic Subtree Partitioning sits between the coarse (static subtree) and fine (directory and file hashing) ends of the partitioning spectrum.

- Distribute subtrees of the directory hierarchy
  - Somewhat coarse distribution of variably-sized subtrees
  - Preserves locality within entire branches of the directory hierarchy
- Must intelligently manage distribution based on workload demands
  - Keep the MDS cluster load balanced
  - Actively repartition as the workload and file system change instead of relying on a (fixed) probabilistic distribution

A Sample Partition

[Figure: the directory tree below the root is divided among MDS 0 through MDS 4, each responsible for a different subtree.]

- The system dynamically and intelligently redelegates responsibility for arbitrary subtrees based on usage patterns
- A coarser, subtree-based partition means higher efficiency
  - Fewer prefixes need to be replicated for path traversal
- The granularity of distribution can range from large subtrees to individual directories
- Directories or files can be selectively replicated based on workload demands

DSP Design Details

- Metadata storage
  - Metadata updates are logged to local storage
  - Later committed to shared metadata storage
- Consistency
  - Primary-copy replication
  - A single MDS acts as the authority for each metadata object
- Leverages locality of reference
  - Subtrees are collocated
  - Inodes are collocated with directory entries
- Workload partitioning
  - Subtrees are dynamically redelegated
- Traffic management
  - Dynamic replication as needed

Distributing Directory Contents

If an individual directory is large or busy, its contents are selectively distributed across the cluster, moving that directory toward the finer (hashed) end of the partitioning spectrum.

- For read-dominated workloads, replication is sufficient
- For workloads involving creates, hashing distributes the updates
  - Directory entry/inode distribution is based on a hash of the parent directory ID and the file name
- Whether a directory is replicated or hashed is dynamically determined

[Figure: subtrees such as /*, /usr/*, /var/*, /tmp/*, /home/*, /home/foo/*, /home/bar/*, and /home/baz/* are spread across the MDS cluster; a directory that becomes “so busy!” has its contents distributed.]
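The hashed case can be sketched in one function: pick the MDS for a directory entry from a hash of the parent directory ID and the file name. The hash choice and node list are illustrative, not Ceph's actual function.

```python
# Sketch of hashing a busy directory's entries across an MDS cluster.
import hashlib

def mds_for_dentry(parent_dir_id, name, mds_nodes):
    digest = hashlib.sha1(f"{parent_dir_id}/{name}".encode()).digest()
    return mds_nodes[int.from_bytes(digest[:8], "big") % len(mds_nodes)]

nodes = ["mds0", "mds1", "mds2", "mds3"]
print(mds_for_dentry(1042, "results.dat", nodes))
```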

Simulation: Metadata Scaling

- Simulate partitioning strategies to evaluate relative scalability
  - Identical MDS nodes
  - Scale system size (# nodes, disks, clients, and FS size)
- Simulation parameters
  - 20,000 cached records/MDS
  - 1000 clients/MDS
  - 80,000 files on 1 disk per MDS
  - Collective cache always ~20% of file system metadata
- Workload
  - Static: a collection of user home directories
  - Semi-localized requests
- Maintaining localized directories is good
- Hashing scales poorly

[Figure: per-MDS throughput (ops/sec) vs. MDS cluster size for the different partitioning strategies.]

Scalable Data Placement

- Storage systems start relatively small and grow over time
- Clients must be able to quickly locate any object in the system
- Storage must remain balanced over time
  - Using new disks only for new data creates hot spots
  - New disks may be larger or faster
- Goals
  - Copy as little data as possible to keep the system in balance
  - Keep data lookup as decentralized as possible: no centralized lookup!

RUSH: Replication Under Scalable Hashing

- RUSH is a family of algorithms that map an object ID to the set of storage devices on which the object is stored
- When a new cluster of disks is added
  - Data is relocated to rebalance the system
  - A (computational) step is added to the lookup process
- RUSH properties include
  - Support for replication: replicas are stored on different devices
  - Balanced data distribution: objects and replicas are distributed across disks according to weights
  - Decentralized lookup: algorithmic
  - Fast: typically a few microseconds to do a lookup
- Variants trade off speed and flexibility in system organization and replica identification

RUSH: The Basic Idea

- Disks are added in sub-clusters
  - Rebalance the system by randomly selecting the needed volume of objects and moving them to the new disks
- The mapping function is recursive:
  - Divide the system into the most recently added cluster and the rest
  - Decide whether the object is in the most recently added cluster
  - If not, recursively run the function on the system without the latest cluster
  - If it is, compute the location within the cluster
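A toy sketch of that recursive structure follows. It captures only the control flow, an object lands in the newest sub-cluster with probability proportional to that cluster's share of the total weight, otherwise the same decision is made recursively over the older clusters; it does not reproduce the actual hash constructions used by the RUSH variants.

```python
# Toy sketch of the recursive RUSH placement idea (control flow only).
import hashlib

def _hash_fraction(obj_id, cluster_index):
    """Deterministic pseudo-random value in [0, 1) per (object, cluster)."""
    h = hashlib.sha1(f"{obj_id}:{cluster_index}".encode()).digest()
    return int.from_bytes(h[:8], "big") / float(1 << 64)

def place(obj_id, clusters):
    """clusters: list of (name, weight), oldest first. Returns a cluster name."""
    if len(clusters) == 1:
        return clusters[0][0]
    total = sum(w for _, w in clusters)
    newest_name, newest_weight = clusters[-1]
    # The object maps to the newest cluster with probability weight/total,
    # so adding a cluster moves only the expected fraction of objects to it.
    if _hash_fraction(obj_id, len(clusters) - 1) < newest_weight / total:
        return newest_name
    return place(obj_id, clusters[:-1])

clusters = [("cluster0", 4), ("cluster1", 4), ("cluster2", 8)]
print(place("obj-42", clusters))
```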

Three RUSH Variants

- RUSH_P
  - Optimal reorganization when servers are added
  - Distribution within a cluster uses a prime number-based heuristic
  - Time complexity is linear in the number of clusters added
  - Fast: 0.25 microseconds per object-replica per cluster added
- RUSH_R
  - Added design goal: allow more flexibility in reorganization
  - Uses “card dealing” to map replicas to servers in a sub-cluster
  - Locates all of the replicas of an object at once
  - Can’t distinguish the replicas
- RUSH_T
  - Added design goal: bound the lookup time to O(log n)
  - Reorganizations are slightly sub-optimal but flexible
  - Every cluster must have as many disks as objects have replicas

RUSH Performance

- Experiment: repeat 100 times (using the same object IDs each time)
  - Look up 100,000 IDs, with 4 replicas each
  - Add a new cluster (perhaps with higher weight) and rebalance
- Lookup time per replica increases with the number of reorganizations
  - Linear when the weight of each sub-cluster is the same
  - Logarithmic when each sub-cluster added to the system has a (multiplicatively) higher weight

Distribution Accuracy in RUSH

- Calculate the normalized mean square error
  - A measure of the accuracy of the distribution (distance from the expected value)
- Distribution “error” is bounded
  - About 2% if the weight remains even
  - Less if the weight increases with each new cluster

Reorganization Effectiveness

- 40,000 object replicas total
- 6 clusters, 4 servers each
- RUSH_R does the best overall
- RUSH_T is not optimal, but close

CRUSH: Controlled RUSH

- RUSH is nice, but can be unwieldy for large installations
  - Multiple racks, shelves within a rack, even data centers
  - Want to ensure that replicas are stored in separate racks (for example) so that they are in different failure domains
- The solution: CRUSH
  - Use RUSH to distribute data within domains
  - Enforce flexible constraints on replica distribution, enhancing reliability by distributing replicas across failure domains

CRUSH Design

- Cluster map composed of devices and buckets
  - Buckets are of one of four types and contain devices or other buckets
    - Uniform, List, Tree, Straw
    - Bucket types trade off performance against data reorganization efficiency
  - Devices are leaves in the hierarchy
- The hierarchy reflects the underlying storage organization in terms of physical placement or infrastructure
  - Shelves of disks, cabinets of shelves, rows of cabinets, etc.

[Figure: an example hierarchy with a straw bucket at the root whose weighted contents are tree and list buckets, which in turn contain uniform buckets of devices.]

Placing Data with CRUSH

- Placement is computed with a working variable, a vector
  - Input for each command
  - Output is stored back into the variable
- Simple command set
  - take(a) – set the working variable to bucket a
  - select(n,t) – choose n distinct items of type t beneath the specified point(s) in the overall storage hierarchy; the resulting n items are placed back in the working variable
  - emit – move the working value into the final result vector
- Collisions result in backtracking until a suitable place can be found

[Figure: an example placement that does choose(1,row) over rows row1–row4, then choose(3,cabinet) among cabinets cab21–cab24 within the chosen row, then choose(1,disk) within each chosen cabinet.]
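A toy rendering of a take / select / emit rule over a two-level hierarchy follows. The choose_child function stands in for CRUSH's per-bucket selection functions (uniform/list/tree/straw), which are not reproduced here, and the collision handling is simplified to a redraw loop.

```python
# Toy sketch of CRUSH-style rule evaluation (not the real algorithm).
import hashlib

def choose_child(bucket, obj_id, draw):
    """Stand-in for CRUSH's uniform/list/tree/straw selection functions."""
    children = bucket["children"]
    h = hashlib.sha1(f"{obj_id}:{draw}:{bucket['name']}".encode()).digest()
    return children[int.from_bytes(h[:8], "big") % len(children)]

def select(bucket, obj_id, n):
    """Choose n distinct children of bucket; collisions trigger a redraw."""
    chosen, draw = [], 0
    while len(chosen) < n:               # assumes the bucket has >= n children
        cand = choose_child(bucket, obj_id, draw)
        draw += 1
        if cand not in chosen:
            chosen.append(cand)
    return chosen

def place(obj_id, root, n_replicas):
    working = [root]                                          # take(root)
    cabinets = select(working[0], obj_id, n_replicas)         # select(n, cabinet)
    disks = [select(cab, obj_id, 1)[0] for cab in cabinets]   # select(1, disk)
    return disks                                              # emit

root = {"name": "root", "children": [
    {"name": f"cab{i}", "children": [f"cab{i}-disk{j}" for j in range(4)]}
    for i in range(4)]}
print(place("obj-1234", root, n_replicas=3))   # three disks in distinct cabinets
```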

CRUSH Performance

[Figure: combined data movement during reorganization in a 3-level hierarchy.]

Reliability Challenges in Large Storage Systems

- Storage technology advances quickly; reliability has improved slowly
- The scale of large storage systems creates new reliability problems:
  - Huge data capacity
  - Disk failures will be common
  - Long recovery times
  - Capacity is increasing faster than bandwidth
  - RAID alone cannot guarantee enough reliability

System MTTDL (mean time to data loss):

Data capacity    2-way mirroring    RAID 5
100 TB           100 yrs            60 yrs
2 PB             5 yrs              3 yrs
1 EB             90 hrs             55 hrs

Redundancy Mechanisms

- Basic redundancy mechanisms
  - Mirroring (simple, high storage overhead)
  - Parity (low storage overhead, small-write problem)
  - Erasure coding (low storage overhead, high failure tolerance, complex data update)
- Goals for additional redundancy
  - Fast, distributed data recovery
    - Reduce the window of vulnerability
  - Redundancy only when needed
  - Reduced chance of data loss
- Fast Recovery Mechanisms (FaRMs)
  - Fast Mirroring Copy – quickly recover lost objects
  - Lazy Parity Backup – mirror, then lazily store object parity

FaRM: Fast Mirror Copy

[Figure: objects from a failed disk are re-mirrored in parallel across OSD 0 – OSD 4.]

- Quickly make copies of the objects that were on failed disks
- Fast, distributed recovery
  - Use RUSH to distribute the data
  - No need to rebuild the original disk
- Narrows the window of vulnerability

System Reliability

[Figure: probability of data loss in a 2 PB system over 6 years for several redundancy configurations — 1/2: 2-way mirroring, 1/3: 3-way mirroring; 2/3, 4/5: RAID 5; 4/6, 8/10: m-out-of-n erasure coding.]

FaRMs greatly reduce the probability of data loss under various redundancy configurations.

Performance During Recovery

- Bandwidth usage is high for a short period of time
- Many disks are involved in the recovery
- Recovery occurs quickly
- The system remains in balance (but slower) during recovery

Simulation parameters: 2-way mirroring, 2 PB data capacity, group size 25G, recovery rate 28 MB/sec, detection latency 100 sec, snapshot period 24 hours.

[Figures: disk bandwidth usage of a target disk; aggregate disk bandwidth usage for data recovery.]

Disk Failure Detection Latency

- Systems with smaller groups need to detect failures more quickly
- The ratio of detection latency to recovery time is the key factor!

Simulation parameters: 2-way mirroring, 2 PB data capacity, recovery rate 28 MB/sec, bathtub vintage: HIM.

Securing Petabyte Scale Storage

[Figure: the basic, unprotected protocol. The client (U) opens a file by sending open(path, mode) to the metadata server (M), which returns a handle H. The client then writes a data block by sending write(oid, bno, data) to the disk (Di), which confirms success ("Okay"). Threats: an attacker may load malware on the client (must limit the damage), or may add, change, or destroy messages (must prevent damage); the disk itself has strong safeguards.]

Desired Characteristics

- Integrity and confidentiality
  - Other mechanisms provide reliability/availability
- Low overhead
  - Limit the use of public-key cryptography
  - Piggyback security on existing messages
- No per-client state on the disk
  - Caching is tolerable
- No per-object state on the client
  - Per-file state is acceptable
  - Per-disk state is tolerable

Handling Vulnerable Clients

[Figure: the client (U) opens a file with open(path, mode); the metadata server (M) returns a handle H and a capability C. The client writes a data block with write(oid, bno, data) accompanied by C, and the disk (Di) confirms success. The disks all share the blue (symmetric) key.]

- Integrity, confidentiality
- No new messages
- Symmetric keys only
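One way such a capability can work, sketched under assumed details (an HMAC-SHA256 over the user, object, and rights, with the key shared by the metadata server and the disks): the disk can verify every request statelessly, with no per-client state.

```python
# Illustrative capability scheme of the kind described above, not the actual protocol.
import hmac, hashlib

SHARED_KEY = b"key shared by MDS and OSDs"     # the "blue key" in the figure

def make_capability(user, oid, rights):
    msg = f"{user}:{oid}:{rights}".encode()
    return hmac.new(SHARED_KEY, msg, hashlib.sha256).digest()

def osd_check(user, oid, rights, capability):
    expected = make_capability(user, oid, rights)
    return hmac.compare_digest(expected, capability)   # no per-client state needed

cap = make_capability("alice", oid=7, rights="rw")     # issued by the MDS at open()
assert osd_check("alice", 7, "rw", cap)                # presented with each write
assert not osd_check("alice", 7, "rwx", cap)           # rights can't be escalated
```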

Handling Vulnerable Disks and Clients

[Figure: the client (U) opens a file with open(path, mode); the metadata server (M) returns a handle H and a capability C. The client then requests a ticket for disk Di, and the server returns a ticket T. The client reads a data block with read(oid, bno) accompanied by C and T, and the disk returns the data.]

- Additional messages, but the ticket is valid for many files
- Public-key verification, but a capability covers many clients
- One symmetric key per disk

Public Key Usage Isn’t Too Expensive!

[Figure: throughput comparison of the basic protocol and the public-key protocol; both add about 80% overhead (block encryption).]

Peta-Scale Object-based Storage Summary

- Efficient high-performance object storage is relatively straightforward
  - OBFS and EBOFS appear to work similarly well
  - Both work much better than ext2/3 and often better than XFS (with much less code)
- Metadata management is somewhat more difficult
  - Dynamic Subtree Partitioning looks promising
- There are good techniques for data distribution
  - Decentralize lookup
  - Accommodate different failure domains!
- Reliability is critical
  - Standard techniques begin to fail at this scale
  - Fast Recovery Mechanisms appear to provide acceptable reliability
- Security can be made scalable