Distributed File Systems
Distributed file systems
What are they good for?
- Sharing information with others
- Accessing information remotely
- Remote backup
Why are they difficult?
- Consistency
- Transparency
- Replication
Techniques and examples
- NFS (Network File System)
- Andrew File System (later DFS)
- Ceph
What does a file system do?
A file system stores data and allows users to retrieve it. It must support several features:
- Naming: relate file names to file IDs
- Storage management: relate file IDs to storage
- Access control
- Handle low-level storage access
Goal: allow users to associate names with chunks of data
- Pretty straightforward on a single computer…
Why distributed file systems?
- Users want to access files on multiple systems
  - Network of computers
  - Access from anywhere in the world
  - Use on multiple computers
- Users want to share their data
  - One copy of a file used by multiple people
- Users want better performance
  - Single file server can be a bottleneck
  - Distribute the storage across servers
Distributed file system requirements: transparency
- Access transparency
  - Client doesn't know if the file is local
  - Methods are the same for local & remote
- Location transparency
  - Client doesn't know where the file is stored
  - Relocating files doesn't force lots of changes
- Performance transparency
  - Acceptable performance regardless of load & location
- Scaling transparency
  - Service can be expanded without loss of performance
  - May be relatively dynamic or static
More distributed file system requirements
- Concurrent file updates
  - Files can be modified by multiple clients at the same time
  - Requires good concurrency control
  - Provide one-copy file update semantics
- File replication: keep multiple copies of files
- Fault tolerance
  - Replicate servers
  - Guard against client and server failures
- Security
  - Must be at least as good as a single-computer system
  - Can't make assumptions about user of client system
- Efficiency: not much good if it's slow!
File service architecture
Three basic components:
- Flat file service
  - Uses unique file IDs (UFIDs) for all requests
  - Typically 32–128 bit integers
  - Unique for all files in the distributed file system
  - Reads/writes data using UFIDs to identify files
- Directory service
  - Translates human-readable names into unique file IDs
  - Often is a client of the flat file service
- Client module
  - Interacts with the other two
  - Provides files to user programs
Flat file service interface
- Six basic functions for flat files
  - All use unique file IDs to identify files
- Most operations repeatable
  - Create() isn't repeatable
  - Can be done by a stateless server
  - Restart by replaying calls to which there was no reply
- Differs from Unix
  - No "current position" pointer
  - No open or close: done by directory service

Operation                   Function
Read(fid, k, n) -> data     Reads n bytes of data starting at position k from file fid
Write(fid, k, data)         Writes data to file fid starting at position k
Create() -> fid             Creates a file and returns its unique ID
Delete(fid)                 Deletes a file from the flat file store
GetAttr(fid) -> attr        Gets the attributes of file fid
SetAttr(fid, attr)          Sets the attributes of file fid
Access control
- Need to ensure that the user is allowed to perform the operations
  - May need to check on every flat file access!
- Two possible approaches
  - Set up a capability when a file is opened (using the directory service)
    - Client caches the capability and provides it with each access
  - Client provides identity with each access
    - Server checks permissions on each request
- Both are stateless
- Second is more common (NFS, AFS)
Directory service interface
- Lookup(name) -> fid
  - Looks up name and returns the unique file ID
  - Error if the name isn't found
- AddName(name, fid)
  - Adds a name to the directory service
  - Error if the name already exists
  - Note: no check for validity of fid!
- UnName(name)
  - Removes a name from the directory service
  - Error if name not found
- GetNames(pattern) -> nameList
  - Gets a list of names that matches a pattern
  - Example: directory listing
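A matching sketch of the directory service (again illustrative; a real implementation would store each directory's contents as a file held by the flat file service rather than in one dictionary):

    import fnmatch

    class DirectoryService:
        """Toy directory service: maps human-readable names to UFIDs."""
        def __init__(self):
            self.names = {}                      # name -> ufid

        def lookup(self, name):
            if name not in self.names:
                raise KeyError("name not found")
            return self.names[name]

        def add_name(self, name, ufid):
            if name in self.names:
                raise KeyError("name already exists")
            self.names[name] = ufid              # note: the fid is not validated

        def un_name(self, name):
            if name not in self.names:
                raise KeyError("name not found")
            del self.names[name]

        def get_names(self, pattern):
            return [n for n in self.names if fnmatch.fnmatch(n, pattern)]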
Hierarchical file systems
- Unix uses a hierarchical name scheme
  - Directories contain other directories and files
- This can be done
  - By the client (sketched below)
    - Translate directory into unique ID
    - Look up next component of name
    - Use flat file service to read directories as needed
  - By the server
    - Full name passed to server
    - Server does the lookup
- Flexibility in attaching new directory trees
  - Client knows which trees are attached at which names
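A sketch of the client-side approach, assuming a helper read_directory(ufid) that fetches a directory's contents through the flat file service and decodes them into a name-to-UFID mapping (the directory encoding itself is left out):

    def resolve(path, root_ufid, read_directory):
        """Iteratively translate a path like /a/b/c into a UFID."""
        ufid = root_ufid
        for component in path.strip("/").split("/"):
            entries = read_directory(ufid)       # one flat-file read per level
            if component not in entries:
                raise FileNotFoundError(component)
            ufid = entries[component]            # descend one component
        return ufid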
File groups
- Collect files into file groups
  - Identifier often contains group ID as well as file ID
  - File group IDs are unique throughout the distributed FS
- File group is located on a single server
  - Server can hold multiple groups
  - Groups can be moved between servers
  - Group ID is mapped to machine address by the naming system (sketched below)
  - This allows transparent moves of file groups
- Replication and other functions can be done on a file group basis
- This is the foundation for UCSC's data distribution algorithms: RUSH & CRUSH
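A tiny sketch of that mapping, assuming (purely for illustration) a UFID whose high bits carry the group ID and a table maintained by the naming system:

    GROUP_BITS = 32                              # assumed split of the UFID

    def server_for(ufid, group_to_server):
        """Route a request: extract the group ID, then look up its current server."""
        group_id = ufid >> GROUP_BITS
        return group_to_server[group_id]

    # Moving a whole group is a single table update; clients need no other changes:
    # group_to_server[7] = "server-b.example.com"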
Distributing data across servers
- Single server isn't scalable
  - Use multiple servers to store the data
- Approaches to scaling
  - Distribute file groups: parallelism between file groups, but not within them
  - Distribute files: more parallelism, but individual file bandwidth is limited
  - Decluster files: spread a file across multiple servers
    - Use naming service to map different offsets in the file to the appropriate server
    - This can be used to gain redundancy as well
Sun NFS (Network File System)
- Client & server modules
  - Exist for Unix (including MacOS X) & Windows
- Uses standard Unix file system for underlying storage
- Implemented as a user-level server with kernel help
[Diagram: on the client, applications go through the virtual file system to the NFS client (kernel code); on the server, the NFS server sits above the virtual file system and the local Unix file system; client and server communicate via the NFS protocol]
Virtual file system
- Kernel has a "switch" that allows the use of multiple file systems in the same way
  - Local file systems
  - Remote file systems
  - Things that resemble file systems (/proc)
- NFS uses the switch in two ways
  - On the client, to send requests to the NFS server
  - On the server, to allow the use of different local file systems
NFS details
- NFS file system "mounted" at a location in the local file system
  - VFS has to track which file systems are mounted where
  - Local file system traverses the directory tree
  - If it hits a "mount point", it switches to the NFS client
- VFS uses v-nodes for open files
  - Indicate local or remote
  - Contain file-specific information
- NFS identifiers (file handles) contain
  - File system ID
  - Local file identifier (inode number)
  - Inode generation number (needed to deal with inode reuse)
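A sketch of what such a handle might carry (field names are mine; to clients a real NFS file handle is just an opaque token):

    from typing import NamedTuple

    class FileHandle(NamedTuple):
        fs_id: int        # which exported file system on the server
        inode: int        # local file identifier on that file system
        generation: int   # bumped whenever the inode number is reused

    def is_stale(handle, current_generation):
        """Server-side check that rejects handles referring to a reused inode."""
        return handle.generation != current_generation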
NFS client
- Integrated into the kernel
  - No difference between local and remote files
  - Single instance of the FS for all applications
  - Authentication done in kernel
- Transfers blocks to and from server
  - Shares client block cache with local file systems
  - Potential cache consistency issue: what if a file is in two client caches at the same time?
- Coordinates lookup of names with server
Access control & authentication
- Server is stateless: must check identity on every call!
- RPC includes user authentication on each request
  - User ID and group ID (easy to fake)
    - Kernel can include any UID and GID
    - Assumes that the kernel is secure (not a good assumption!)
  - DES encryption (better)
  - Kerberos (best)
NFS server
- Integrates lookup and flat file service
- Lookup is done iteratively (by VFS!)
  - Look up one pathname component at a time
  - May be local or remote: VFS decides
  - Caching can make this faster…
- Lookup translates names into file handles
- Flat file service uses file handles
  - Read and write include explicit offsets
  - Directory operations include the file handle of the directory (must be looked up first)
Mounting & automounting
- NFS file systems are mounted at points in the directory tree
  - Specified in a configuration file
  - Helpful to have the same mount point for all clients, but not required…
  - Server makes directory trees available to clients based on a server config file
- Automounter: automatically mount file systems as needed
  - Specify an empty directory
  - Requests for subdirectories are handled as mount requests to the server
  - Supports read-only replication if multiple servers are listed
Client caching in NFS
- Reads
  - Caches files and directories
  - Uses timestamps to ensure freshness (sketched below)
    - If age is sufficiently short, reuse without asking server
    - Files: typically 3–30 seconds
    - Directories: typically 30–60 seconds
    - Shorter time -> closer to one-copy semantics
- Writes
  - Blocks cached locally and flushed periodically
  - Similar to behavior for local file systems
  - Writes are asynchronous
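A sketch of the freshness check, using the 3-second and 30-second values above as example thresholds; the cache-entry fields (handle, mtime, data, validated) are invented for illustration:

    import time

    FRESHNESS = {"file": 3.0, "directory": 30.0}     # seconds (example values)

    def read_cached(entry, fetch_attrs, kind="file"):
        """Reuse a cached entry if it is fresh enough; otherwise revalidate."""
        if time.time() - entry.validated >= FRESHNESS[kind]:
            attrs = fetch_attrs(entry.handle)        # GETATTR round trip to the server
            if attrs.mtime != entry.mtime:
                entry.data = None                    # cached blocks are no longer valid
                entry.mtime = attrs.mtime
            entry.validated = time.time()
        return entry.data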
Server caching in NFS
- NFS server can do readahead (prefetching)
  - Keeps results in the server's cache
- Two options for writes
  - Keep data in cache and write immediately (write-through)
    - Safer, especially since the write was already delayed at the client
  - Keep data in cache and do delayed write (on commit): default for most NFS clients
    - Faster: no need to wait for disk write
    - Commit typically done on file close
- Performance is critical for servers because they may serve many clients
NFS & Kerberos
- Default: trust the kernel to do proper authentication
  - Easy to circumvent!
- One option: full Kerberos ticket on every request
  - Very secure
  - Potentially slow
  - Required a lot of changes!
- Hybrid approach
  - Mount server gets full Kerberos authentication when the home directory is mounted
  - Further requests are trusted
  - This is more secure, assuming each computer has at most one user
- NFSv4 has more complete authentication & security
Andrew File System (AFS)
- Transparent access to remote shared files
- More scalable than NFS
  - Large numbers of users
  - Wide-area access
- Serves whole files rather than blocks
- Caches whole files (or chunks) on local disk
  - Reduces traffic
  - Allows clients to cache lots of read-only or read-mostly files (like binaries)
  - Clients can also cache files that are read-write but only accessed locally (user's personal files)
  - Not good for databases!
AFS operation
- User issues open call
  - Client fetches the file and stores it on local disk if a copy isn't already there
  - File is opened locally and a handle returned to the user
- Subsequent read & write requests go to the local copy
- User issues close call
  - If the file has been updated, its contents are sent back to the server
  - Server stores the data and updates timestamps
  - Client keeps a cached copy of the file in case it's needed later
AFS implementation
- Two AFS code modules
  - Server: Vice (user-level process)
  - Client: Venus (user-level process)
  - Non-local operations on the client are pushed up to the Venus process and sent to the server
  - No local files except for /tmp and similar
- AFS is good for distributing binaries that might normally be stored locally
- User directories in shared space
AFS: Venus client module
- Venus caches files using a local disk partition
  - Manages disk space
  - Handles callbacks (more on that in a bit)
  - Returns (local) file IDs to user processes
- Venus manages translation from names to 96-bit file identifiers
  - Iterative, similar to what NFS does
- Venus flushes changed local files to the server on close
AFS cache consistency
- AFS uses "open to close" consistency
  - Changes to files are seen by other clients only when the file is closed (local processes see changes immediately)
  - Changes written back on close
- Uses callbacks to ensure consistency (sketched below)
  - Vice hands out "callback promises" with each file
  - Vice calls back the clients caching a file when the file changes
  - Clients can then invalidate their copies
  - A client must get a new copy if the file is needed again
  - A client must check validity after a crash (it may have missed a callback!)
  - Callbacks expire after a fixed time
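A server-side sketch of callback promises (the data structures and the notify hook are assumptions; real AFS also versions and batches the callback breaks):

    class CallbackTable:
        """Tracks which clients hold a callback promise for which file."""
        def __init__(self):
            self.promises = {}                   # fid -> set of client IDs

        def grant(self, fid, client):
            """Called when a client fetches a file: promise to notify it on change."""
            self.promises.setdefault(fid, set()).add(client)

        def file_updated(self, fid, writer, notify):
            """Called when a close stores new contents for fid."""
            for client in self.promises.pop(fid, set()):
                if client != writer:
                    notify(client, fid)          # client invalidates its cached copy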
Updates in AFS
- Local processes see all changes immediately
- Remote clients see changes on close
- If multiple clients write, the latest close "sticks"
  - Even if the changes are to different parts of the file!
- For the common case, this is OK
  - Single user on a single system: updates done to a single (local) copy
  - Single user serially using multiple systems: files follow her to a new system
  - Commonly used binaries: updated infrequently and only on one system
- Issue: update semantics are different for local and remote files!
AFS summary
- Caches at a coarser granularity than NFS
  - Whole-file caching on disk (also large chunks)
  - Updates on close, not block writes
- Better scalability than NFS
  - Better for wide-area access
  - Better for thousands of users
- Requires callbacks for consistency
  - Server notifies clients that files are no longer valid
  - Reduces traffic but requires more server state
Peta-scale Data Storage: Ceph Goals
- Performance
  - 2 PB of data on 2000–5000 hard drives
  - 100 GB/sec aggregate throughput
  - 1–5000 hard drives pumping out data as fast as they can
  - Billions of files; 1–10,000+ files/directory
  - Files ranging from bytes to terabytes (~1000 times larger than the current largest)
  - 50 µsec metadata access times
- Usage
  - High-performance direct access from up to 10,000 clients, to
    - Different files in different directories
    - Different files in the same directory
    - The same file
  - Mid-performance local access by visualization workstations
  - QoS requirements
  - Wide-area general-purpose access
Peta-scale Data Storage Challenges
- Massive scale of everything
  - Huge files, directories, data transfers, etc.
- Managing the data
  - Coordinating the activity of 1000 disks
- Managing the metadata
  - Massive parallelism required
- Workload
  - Handling both scientific and general-purpose workloads
- Scalability
  - Must be able to grow (or shrink) dynamically
- Reliability
  - 1000 hard drives ⇒ frequent failures
- Security
  - Authentication, encryption, etc.
- Performance
  - Complex system ⇒ many possible bottlenecks
- Human interface
  - Finding anything among all of that data
First Key Idea: Object-based Storage
[Diagram: In traditional storage, applications call through the system call interface into the operating system's file system, which uses a logical block interface to talk to the drive's block I/O manager. In object-based storage, the operating system keeps only a file system client component, which uses an object interface to talk to an Object-based Storage Device (OSD); the file system storage component and block I/O manager live on the OSD itself.]
2nd Key Idea: Manage Data and Metadata Separately
[Diagram: Applications call through the system call interface into the file system client component in the operating system. Data requests go over the object interface to the Object-based Storage Device (OSD), which contains the file system data manager and block I/O manager. Metadata requests go over the metadata interface to the Metadata Server (MDS), which runs the file system metadata manager; metadata storage and system-management traffic flows between the MDS and the OSDs.]
Peta-scale Object-based Storage System Architecture
[Diagram: clients (10,000+), a cluster of metadata servers (1–10), and object-based storage devices (2000–5000), all connected by the network]
Challenges and Solutions
- Client SW: 1. Interface 2. Cache Mgmt 3. Workload
- MDS Cluster SW: 1. Lazy Hybrid 2. Dynamic Subtree Partitioning
- OSD SW: 1. OBFS 2. EBOFS
- Other: 1. Reliability 2. Data Distribution 3. Quality of Service 4. Network 5. Security 6. Locking/Leasing 7. Performance 8. Scalability 9. Simulation 10. Analysis
Ceph features we'll discuss
1. MDS SW: Dynamic Subtree Partitioning
2. Data Distribution: RUSH & CRUSH
3. Reliability: FaRMs
4. Security
Why is Metadata Management Hard?
- File data storage is trivially parallelizable
  - File I/O occurs independently of other files/objects
  - Scalability of the OSD array is limited only by network architecture
- Metadata semantics are more complex
  - Hierarchical directory structure defines object interdependency
  - Metadata location & POSIX permissions depend on parent directories
  - MDS must maintain file system consistency
- Heavy workload
  - Metadata is small, but there are lots of objects and lots of transactions
  - 30–80% of all file system operations involve metadata
  - Variety of usage patterns: scientific and general purpose
- Hot spots
  - Popular files and directories are common, and concurrent accesses can overwhelm many schemes
- Good metadata performance is critical to overall system performance
Metadata Management Goals
- POSIX-compliant API
  - Standard UNIX-style file and directory semantics
- High performance
  - Efficient metadata access, directory operations, access control, and a high degree of parallelism
- Scalability
  - Performance scales with the number of metadata servers
  - Easy addition and removal of metadata servers
- Uniform namespace
  - Load balancing among metadata servers under various conditions
Metadata Partitioning Alternatives
- Coarse partition: Static Subtree Partitioning
  - Portions of the file hierarchy are statically assigned to MDS nodes (a la NFS, AFS, etc.)
  - Preserves locality
  - Leads to an imbalanced distribution as the file system and workload change
- Fine partition: hash-based partitioning
  - Directory Hashing: hash on the directory portion of the path only
  - File Hashing: metadata distributed based on a hash of the full path (or inode number)
  - Destroys locality (ignores hierarchical structure)
  - Probabilistically less vulnerable to "hot spots" and workload change
Dynamic Subtree Partitioning
[Spectrum, coarse to fine partition: Static Subtree, Dynamic Subtree Partitioning, Directory Hashing, File Hashing]
- Distribute subtrees of the directory hierarchy
  - Somewhat coarse distribution of variably-sized subtrees
  - Preserves locality within entire branches of the directory hierarchy
- Must intelligently manage distribution based on workload demands
  - Keep the MDS cluster load balanced
  - Actively repartition as the workload and file system change, instead of relying on a (fixed) probabilistic distribution
A Sample Partition
[Diagram: a directory tree rooted at Root, with different subtrees delegated to MDS 0 through MDS 4]
- The system dynamically and intelligently redelegates responsibility for arbitrary subtrees based on usage patterns
- A coarser, subtree-based partition means higher efficiency
  - Fewer prefixes need to be replicated for path traversal
- Granularity of distribution can range from large subtrees to individual directories
- Directories or files can be selectively replicated based on workload demands
DSP Design Details
- Metadata storage
  - Metadata updates logged to local storage
  - Later committed to shared metadata storage
- Consistency
  - Primary-copy replication
  - A single MDS acts as the authority for each metadata object
- Leverages locality of reference
  - Subtrees collocated
  - Inodes collocated with directory entries
- Workload partitioning
  - Subtrees dynamically redelegated
- Traffic management
  - Dynamic replication as needed
Distributing Directory Contents
[Spectrum, coarse to fine partition: Static Subtree, Fully Dynamic Partitioning, Directory Hashing, File Hashing]
- If an individual directory is large or busy, its contents are selectively distributed across the cluster
  - For read-dominated workloads, replication is sufficient
  - For workloads involving creates, hashing distributes updates
  - Directory entry/inode distribution based on a hash of the parent directory ID and file name (sketched below)
- Whether a directory is replicated or hashed is dynamically determined
[Diagram: subtrees such as /*, /var/*, /usr/*, /tmp/*, /home/*, /home/foo/*, /home/bar/*, /home/baz/* assigned across the MDS cluster; a very busy directory has its contents spread over multiple servers]
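A one-function sketch of that hashing step (the hash choice and the MDS count are placeholders):

    import hashlib

    def mds_for_dentry(parent_dir_id, name, num_mds):
        """Spread a busy directory's entries across the MDS cluster."""
        digest = hashlib.sha256(f"{parent_dir_id}/{name}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_mds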
Simulation: Metadata Scaling
- Simulate partitioning strategies to evaluate relative scalability
  - Identical MDS nodes
  - Scale system size (# nodes, disks, clients, and FS size)
- Simulation parameters
  - 20,000 cached records/MDS
  - 1000 clients/MDS
  - 80,000 files on 1 disk per MDS
  - Collective cache always ~20% of file system metadata
- Workload
  - Static: collection of user home directories
  - Semi-localized requests
- Results: maintaining localized directories is good; hashing scales poorly
[Plot: per-MDS throughput (ops/sec) vs. MDS cluster size]
Scalable Data Placement
- Storage systems start relatively small and grow over time
- Clients must be able to quickly locate any object in the system
- Storage must remain balanced over time
  - Using new disks only for new data creates hot spots
  - New disks may be larger or faster
- Goals
  - Copy as little data as possible to keep the system in balance
  - Keep data lookup as decentralized as possible: no centralized lookup!
RUSH: Replication Under Scalable Hashing
- RUSH is a family of algorithms that map an object ID to the set of storage devices on which the object is stored
- When a new cluster of disks is added
  - Data is relocated to rebalance the system
  - A (computational) step is added to the lookup process
- RUSH properties include
  - Support for replication: replicas stored on different devices
  - Balanced data distribution: objects and replicas distributed across disks according to weights
  - Decentralized lookup: algorithmic
  - Fast: typically a few microseconds to do a lookup
- Variants trade off speed and flexibility in system organization and replica identification
RUSH: The Basic Idea
- Disks are added in sub-clusters
  - Rebalance the system by randomly selecting the needed volume of objects and moving them to the new disks
- Mapping function is recursive (a sketch follows below):
  - Divide the system into the most-recently added cluster and the rest
  - Decide whether the object is in the most-recently added cluster
  - If not, recursively run the function on the system without the latest cluster
  - If it is, compute the location within the cluster
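A minimal sketch of that recursion (not the published RUSH_P/RUSH_R/RUSH_T algorithms; the hash function and the within-cluster placement are illustrative assumptions):

    import hashlib

    def h(*parts):
        """Deterministic pseudo-random value in [0, 1) derived from the inputs."""
        digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def place(obj_id, clusters):
        """Map an object to (cluster_index, disk_index).

        clusters: list of (num_disks, weight_per_disk), oldest sub-cluster first.
        """
        if len(clusters) == 1:
            n, _ = clusters[0]
            return 0, int(h(obj_id, 0, "disk") * n)
        last = len(clusters) - 1
        n_last, w_last = clusters[last]
        total = sum(n * w for n, w in clusters)
        # The object lands in the newest sub-cluster with probability equal to
        # that cluster's share of the total weight, which keeps the system balanced.
        if h(obj_id, last) < (n_last * w_last) / total:
            return last, int(h(obj_id, last, "disk") * n_last)
        return place(obj_id, clusters[:-1])      # otherwise, recurse on the rest

Because the decision for older sub-clusters never changes, adding a new sub-cluster relocates only the objects that now hash into it.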
Three RUSH Variants
- RUSH_P
  - Optimal reorganization when servers are added
  - Distribution within a cluster uses a prime number-based heuristic
  - Time complexity is linear in the number of clusters added
  - Fast: 0.25 microseconds per object-replica per cluster added
- RUSH_R
  - Design goal: allow more flexibility in reorganization
  - Uses "card dealing" to map replicas to servers in a sub-cluster
  - Locates all of the replicas of an object at once
  - Can't distinguish the replicas
- RUSH_T
  - Design goal: bound the lookup time to O(log n)
  - Reorganizations slightly sub-optimal but flexible
  - Every cluster must have as many disks as objects have replicas
RUSH Performance
- Repeat 100 times (using the same object IDs each time)
  - Look up 100,000 IDs, with 4 replicas each
  - Add a new cluster (perhaps with higher weight) and rebalance
- Lookup time per replica increases with the number of reorganizations
  - Linear when the weight of each sub-cluster is the same
  - Logarithmic when each sub-cluster added to the system has a higher (multiplicatively increasing) weight
Distribution Accuracy in RUSH
- Calculate Normalized Mean Square Error
  - A measure of the accuracy of the distribution (distance from the expected value)
- Distribution "error" is bounded
  - About 2% if weight remains even
  - Less if weight increases with each new cluster
Reorganization Effectiveness
- 40,000 object replicas total
- 6 clusters, 4 servers each
- RUSH_R does the best overall
- RUSH_T is not optimal, but close
CRUSH: Controlled RUSH
- RUSH is nice, but can be unwieldy for large installations
  - Multiple racks, shelves within a rack, even data centers
  - Want to ensure that replicas are stored in separate racks (for example) so that they sit in different failure domains
- The solution: CRUSH
  - Use RUSH to distribute data within domains
  - Enforce flexible constraints on replica distribution, enhancing reliability by distributing replicas across failure domains
CRUSH Design
- Cluster map composed of devices and buckets
  - Buckets are one of four types (Uniform, List, Tree, Straw) and contain devices or other buckets
  - Bucket types have tradeoffs in performance and data-reorganization efficiency
  - Buckets have weighted contents
  - Devices are leaves in the hierarchy
- Hierarchy reflects the underlying storage organization in terms of physical placement or infrastructure
  - Shelves of disks, cabinets of shelves, rows of cabinets, etc.
[Diagram: an example cluster map with a straw bucket at the root containing tree buckets, which contain list buckets, which contain uniform buckets of devices]
Placing Data with CRUSH
- Working variable: a vector
  - Input for each command; output stored back into the variable
- Simple command set
  - take(a) – set the working variable to bucket a
  - select(n,t) – choose n distinct items of type t beneath the specified point(s) in the overall storage hierarchy; the resulting n items are placed back in the working variable
  - emit – move the working value into the final result vector
- Collisions result in backtracking until a suitable place can be found
[Diagram: an example rule; choose(1,row) picks one row, choose(3,cabinet) picks three cabinets within it (e.g., among cab21–cab24), and choose(1,disk) picks one disk in each cabinet]
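A simplified sketch of interpreting such a rule over a bucket hierarchy (the bucket classes, the hash, and the uniform child selection are assumptions; real CRUSH uses specialized, weight-aware selection per bucket type):

    import hashlib

    def h(*parts):
        digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    class Bucket:
        def __init__(self, name, btype, children=()):
            self.name, self.type, self.children = name, btype, list(children)

    def select(obj_id, bucket, n, want_type):
        """Choose n distinct descendants of `bucket` that have type `want_type`."""
        chosen, r = [], 0
        while len(chosen) < n:                   # assumes enough items exist
            node = bucket
            while node.type != want_type:        # descend until the wanted type
                kids = node.children
                node = kids[h(obj_id, node.name, r) % len(kids)]
            if node in chosen:                   # collision: retry with a new seed
                r += 1
                continue
            chosen.append(node)
            r += 1
        return chosen

    def place(obj_id, root):
        """Rule: take(root); select(1,row); select(3,cabinet); select(1,disk); emit."""
        working = [root]                                     # take(root)
        for n, t in [(1, "row"), (3, "cabinet"), (1, "disk")]:
            working = [x for b in working for x in select(obj_id, b, n, t)]
        return [d.name for d in working]                     # emit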
CRUSH Performance
[Plot: combined data movement in a 3-level hierarchy]
Reliability Challenges in Large Storage Systems
- Storage technology advances quickly; reliability has improved slowly
- The scale of large storage systems creates new reliability problems:
  - Huge data capacity: disk failures will be common
  - Long recovery times: capacity is increasing faster than bandwidth
  - RAID alone cannot guarantee enough reliability

System MTTDL:
Data capacity    2-way mirroring    RAID 5
100 TB           100 yrs            60 yrs
2 PB             5 yrs              3 yrs
1 EB             90 hrs             55 hrs
Redundancy Mechanisms
- Basic redundancy mechanisms
  - Mirroring (simple, high storage overhead)
  - Parity (low storage overhead, small-write problem)
  - Erasure coding (low storage overhead, high failure tolerance, complex data update)
- Goals for additional redundancy
  - Fast, distributed data recovery
  - Redundancy only when needed
  - Reduced chance of data loss
  - Reduced window of vulnerability
- Fast Recovery Mechanisms (FaRMs)
  - Fast Mirroring Copy – quickly recover lost objects
  - Lazy Parity Backup – mirror, then lazily store object parity
FaRM: Fast Mirror Copy
[Diagram: objects from a failed disk (e.g., OSD 1) are re-replicated in parallel onto the surviving OSDs 0, 2, 3, and 4]
- Quickly make copies of objects that were on failed disks
  - Fast, distributed recovery
  - Use RUSH to distribute data
  - No need to rebuild the original disk
- Narrows the window of vulnerability
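A sketch of the idea: every object that lived on the failed disk gets a new replica on some surviving OSD, chosen pseudo-randomly so the copy work is spread across the cluster (the plain hash below stands in for RUSH, and the replica map is an assumed data structure):

    import hashlib

    def pick_new_home(obj_id, osds, exclude):
        """Deterministically pick a surviving OSD for a lost replica."""
        candidates = [o for o in osds if o not in exclude]
        v = int.from_bytes(hashlib.sha256(str(obj_id).encode()).digest()[:8], "big")
        return candidates[v % len(candidates)]

    def fast_mirror_copy(failed_osd, replica_map, osds):
        """replica_map: obj_id -> set of OSDs currently holding a replica."""
        copy_jobs = []                           # (source, target, obj_id); run in parallel
        for obj_id, holders in replica_map.items():
            if failed_osd not in holders:
                continue
            holders.discard(failed_osd)
            source = next(iter(holders))         # assumes at least one surviving replica
            target = pick_new_home(obj_id, osds, holders | {failed_osd})
            holders.add(target)
            copy_jobs.append((source, target, obj_id))
        return copy_jobs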
System Reliability
[Plot: probability of data loss in a 2 PB system over 6 years; 1/2: 2-way mirroring, 1/3: 3-way mirroring; 2/3 and 4/5: RAID 5; 4/6 and 8/10: m-out-of-n erasure coding]
- FaRMs greatly reduce the probability of data loss under various redundancy configurations
Performance During Recovery
- Bandwidth usage is high for a short period of time
  - Many disks are involved in the recovery
  - Recovery occurs quickly
- System remains in balance (but slower) during recovery
- Parameters: 2-way mirroring; data capacity: 2 PB; group size: 25G; recovery rate: 28 MB/sec; detection latency: 100 sec; snapshot period: 24 hours
[Plots: (a) disk bandwidth usage of a target disk; (b) aggregate disk bandwidth usage for data recovery]
Disk Failure Detection Latency
- Systems with smaller groups need to detect failures more quickly
- The ratio of detection latency to recovery time is the key factor!
- Parameters: 2-way mirroring; data capacity: 2 PB; recovery rate: 28 MB/sec; bathtub vintage: HIM
Securing Petabyte Scale Storage
[Diagram: baseline protocol between client U, metadata server M, and disk Di]
- Client opens file: U sends open(path, mode) to M
- Server returns handle H
- Client writes a data block: U sends write(oid, bno, data) to Di
- Disk confirms success: Okay
Threats:
- An attacker may load malware on the client: must limit the damage
- An attacker may add, change, or destroy messages: must prevent damage (strong safeguards needed)
Desired Characteristics
- Integrity and confidentiality
  - Other mechanisms provide reliability/availability
- Low overhead
  - Limit the use of public key cryptography
  - Piggyback security on existing messages
  - Caching tolerable
- No per-client state on the disk
  - Per-disk state tolerable
- No per-object state on the client
  - Per-file state acceptable
Handling Vulnerable Clients
[Diagram: protocol between client U, metadata server M, and disk Di]
- Client opens file: U sends open(path, mode) to M
- Server returns handle H and capability C
- Client writes a data block: U sends write(oid, bno, data) together with C to Di
- Disk confirms success: Okay
Properties: integrity and confidentiality, no new messages, symmetric keys only; the disks share the key used to check capabilities
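A minimal sketch of such a capability check, assuming (not stated on the slides) that the capability is an HMAC over the object ID and the allowed mode, keyed with a secret shared by the metadata server and the disks:

    import hmac, hashlib

    SHARED_KEY = b"secret shared by the MDS and the disks"   # assumed provisioning

    def make_capability(oid, mode):
        """Issued by the metadata server along with the handle at open() time."""
        return hmac.new(SHARED_KEY, f"{oid}:{mode}".encode(), hashlib.sha256).digest()

    def disk_check(oid, mode, capability):
        """Run by the disk on every request; needs no per-client state."""
        expected = make_capability(oid, mode)
        return hmac.compare_digest(expected, capability)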
Handling Vulnerable Disks and Clients
[Diagram: protocol between client U, metadata server M, and disk Di]
- Client opens file: U sends open(path, mode) to M
- Server returns handle and capability: H, C
- Client requests a ticket for the disk: U, Di
  - Additional messages, but the ticket is valid for many files
- Server returns ticket T
  - Public key verification, but one capability serves many clients
  - Symmetric key per disk
- Client reads a data block: sends C, read(oid, bno), T to Di
- Disk returns the data
Public Key Usage Isn’t Too Expensive!
[Plot: overhead of the basic (symmetric-key) protocol vs. the public key protocol]
- Both add about 80% overhead (block encryption)
Peta-Scale Object-based Storage Summary
- Efficient high-performance object storage is relatively straightforward
  - OBFS and EBOFS appear to work similarly well
  - Both work much better than ext2/3, and often better than XFS (with much less code)
- Metadata management is somewhat more difficult
  - Dynamic Subtree Partitioning looks promising
- There are good techniques for data distribution
  - Decentralize lookup
  - Accommodate different failure domains!
- Reliability is critical
  - Standard techniques begin to fail at this scale
  - Fast Recovery Mechanisms appear to provide acceptable reliability
- Security can be made scalable