The Write Anywhere File Layout System (WAFL)

Written by Sebastian Scholz, based on “File System Design for an NFS File Server Appliance” by Dave Hitz, James Lau & Michael Malcolm, Network Appliance, Inc.

1. Abstract

The patented Write Anywhere File Layout (WAFL) file system is a fully NFS-compatible file system whose strengths are high-performance NFS processing, support for large disks by using RAID, and quick restart after an unclean shutdown without checking the file system for consistency. Data and metadata are arranged in a tree of blocks, which allows backups to be created rapidly with negligible effort. Non-volatile RAM is used to log NFS requests in order to minimize the file server's response time to an NFS request.

2. Introduction

As file servers become more and more important, a robust, reliable, and fast file system is necessary. WAFL is a UNIX-compatible file system optimized for network file access that was designed to meet four requirements:

• It should provide fast NFS service.
• It should support large file systems that grow dynamically as disks are added.
• It should provide high performance while supporting RAID.
• It should restart quickly without checking the file system for consistency.

The need for fast NFS service is obvious, given WAFL's intended use in an NFS appliance. RAID strains write performance because of the read-modify-write sequence it uses to maintain parity. Since a small write on such a system typically consumes four disk I/O operations (data read, parity read, data write, and parity write), one design goal was to minimize the number of small writes and instead to maximize the number of writes that cover entire stripes of the RAID array [1, 2]. WAFL uses a special treatment of data blocks (discussed in 3.1) and non-volatile RAM (NVRAM, discussed in 3.3) to optimize write performance. Because data is treated differently than in other file systems, snapshots (consistent backups of the file system) can be created quickly with negligible effort (discussed in 3.2). Large file systems also require special techniques for fast restart, because checking the file system for consistency at startup becomes unacceptably slow as the file system grows. We will see that WAFL uses a mechanism similar to a log-structured file system [4] to avoid checking the file system at startup (discussed in 3.3).
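To make the write penalty concrete, here is a minimal Python sketch of a read-modify-write parity update versus a full-stripe write. It assumes a simple RAID-4 layout with one dedicated parity disk; the Disk class and function names are illustrative only and are not part of WAFL or of any real RAID implementation.

    # Why a small RAID write costs four disk operations while a
    # full-stripe write needs no reads at all (RAID-4, one parity disk).

    BLOCK = 4096  # 4 KB blocks, as used by WAFL

    class Disk:
        def __init__(self):
            self.blocks = {}
        def read(self, stripe):
            return self.blocks.get(stripe, bytes(BLOCK))
        def write(self, stripe, data):
            self.blocks[stripe] = data

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(disks, stripe, i, new_data):
        """Update a single data block: 2 reads + 2 writes = 4 disk I/Os."""
        old_data = disks[i].read(stripe)       # 1. read old data
        old_parity = disks[-1].read(stripe)    # 2. read old parity
        new_parity = xor(xor(old_parity, old_data), new_data)
        disks[i].write(stripe, new_data)       # 3. write new data
        disks[-1].write(stripe, new_parity)    # 4. write new parity

    def full_stripe_write(disks, stripe, blocks):
        """Write every data block of a stripe: parity is computed in memory,
        so no reads are needed -- the pattern WAFL tries to maximize."""
        parity = blocks[0]
        for b in blocks[1:]:
            parity = xor(parity, b)
        for disk, b in zip(disks[:-1], blocks):
            disk.write(stripe, b)
        disks[-1].write(stripe, parity)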

3. WAFL Implementation

3.1. Data Block Arrangement

WAFL stores data in 4 KB blocks without fragments. Each inode contains 16 block pointers, so a single inode can address a file of up to 64 KB. If a file exceeds this limit, indirect blocks are used to point to the actual data, while very small files are stored directly in the inode. The way data blocks are arranged is quite different from other file systems: all blocks, including metadata, are nodes in a tree rooted at the root inode. Metadata includes files such as the inode file, the block map file, and the inode map file. Figure 2 shows that files are made up of individual blocks and that large files have additional layers of indirection between the inode and the actual data. Because of this dynamic block arrangement, WAFL has far more freedom to optimize writes. The whole point of the "write anywhere" in WAFL is that it can update, say, 10 leaf blocks of the tree and choose to write them to physical blocks X through X+9 on disk; if 3 parent blocks must also be updated as part of the transaction, it can write them to X+10 through X+12 and combine this write with the prior one, thereby using more disks of the RAID array at once and speeding up the transaction. This is where the name of the file system comes from. If the inode file had a fixed location, as it does in FFS [3], such an update would result in two separate writes and poor performance. There is one exception to WAFL's write-anywhere rule: the root inode must stay at a fixed position, because otherwise the file system could not find the tree at startup.
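As a rough illustration of this freedom, the following Python sketch (with made-up names, not WAFL's actual data structures) gathers all dirty blocks of the tree, children before parents, and assigns them consecutive free disk addresses, so that one large, ideally full-stripe, write covers data and metadata alike.

    # Sketch of "write anywhere" allocation: dirty leaf and parent blocks are
    # given consecutive free block numbers so they can be flushed together.

    from dataclasses import dataclass, field

    @dataclass
    class Block:
        data: bytes = b""
        children: list = field(default_factory=list)  # parents point to children
        dirty: bool = False
        disk_addr: int = -1       # -1 means "not yet placed on disk"

    def collect_dirty(block):
        """Return dirty blocks bottom-up, so children are placed before parents."""
        found = []
        for child in block.children:
            found.extend(collect_dirty(child))
        if block.dirty:
            found.append(block)
        return found

    def allocate_writes(root, next_free):
        """Assign consecutive free on-disk addresses to every dirty block."""
        writes = []
        for block in collect_dirty(root):
            block.disk_addr = next_free
            writes.append((next_free, block))
            block.dirty = False
            next_free += 1
        return writes  # one contiguous run of blocks, e.g. X .. X+12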

3.2. Snapshots

A snapshot is basically a read-only copy of the file system at a given time t. Since the data blocks are arranged in a tree, a snapshot can be created simply by duplicating the root inode. The newly created snapshot initially matches the active file system exactly. When a data block later needs to be updated, the modified block and its parents are written to unused blocks and referenced by the active file system's root inode, while the snapshot keeps pointing to the old blocks. Figure 2 visualizes this mechanism, also known as "copy on write".
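A minimal sketch of this mechanism, reusing the Block structure from the previous sketch (again illustrative, not WAFL code): the snapshot is simply a second reference to the current root, and an update copies only the blocks on the path from the modified leaf up to the root, marking the copies dirty so that they land in new, unused blocks.

    import copy

    def create_snapshot(root):
        """A snapshot is just another reference to the current root block;
        the whole tree below it is shared, so nothing is copied."""
        return root

    def cow_update(root, path, new_data):
        """Copy-on-write update of the leaf reached by `path` (child indices).
        Every block on the path is copied and marked dirty; untouched
        subtrees remain shared with existing snapshots."""
        new_root = copy.copy(root)
        new_root.children = list(root.children)
        new_root.dirty = True
        node, orig = new_root, root
        for i in path:
            child = copy.copy(orig.children[i])
            child.children = list(orig.children[i].children)
            child.dirty = True
            node.children[i] = child
            node, orig = child, orig.children[i]
        node.data = new_data      # the copied leaf holds the new contents
        return new_root           # becomes the active file system's root

    # Usage: snap = create_snapshot(active_root)
    #        active_root = cow_update(active_root, [3, 0], b"new 4 KB block")
    # snap still sees the old data; active_root sees the new data.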

WAFL can be configured to create and delete snapshots automatically at scheduled times. Every client machine can access snapshot data through the .snapshot directory, which is available in every directory. A conventional file system keeps track of used blocks with a bit map containing one bit per block; a set bit indicates that the block is in use. Since WAFL introduces snapshots, the block map file must be extended accordingly: each entry is 32 bits long and belongs to one data block. If bit 0 is set, the active file system references the block; if bit i (for i > 0) is set, snapshot i references the block. A block may be reused only when its entire entry is zero. The following table shows how a sequence of operations changes the block map entry of one data block:

Time   Block map entry (low byte)   Description
T1     00000000                     Block is unused
T2     00000001                     Active file system allocates the block
T3     00000011                     Snapshot 1 is created
T4     00000010                     Active file system deletes the block
T5     00000010                     Snapshot 2 is created (the block is no longer in the active file system, so no new bit is set)
T6     00000000                     Snapshot 1 is deleted; the block is unused again
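The block map bookkeeping can be expressed in a few lines of Python. This is only a sketch of the bit rules described above (the function names are invented), replaying the table row by row:

    # One 32-bit word per 4 KB block: bit 0 tracks the active file system,
    # bit i (i > 0) tracks snapshot i.

    ACTIVE_FS = 0  # bit index of the active file system

    def allocate(entry):
        return entry | (1 << ACTIVE_FS)

    def free(entry):
        return entry & ~(1 << ACTIVE_FS)

    def take_snapshot(entry, snap_id):
        """A snapshot references exactly the blocks the active FS uses right now."""
        if entry & (1 << ACTIVE_FS):
            entry |= 1 << snap_id
        return entry

    def delete_snapshot(entry, snap_id):
        return entry & ~(1 << snap_id)

    def is_reusable(entry):
        return entry == 0  # free only when no file system version needs it

    # Replaying the table above for one block:
    e = 0                          # T1: 00000000, block unused
    e = allocate(e)                # T2: 00000001
    e = take_snapshot(e, 1)        # T3: 00000011
    e = free(e)                    # T4: 00000010
    e = take_snapshot(e, 2)        # T5: 00000010 (block not in the active FS)
    e = delete_snapshot(e, 1)      # T6: 00000000, unused again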

3.3. Consistency Points and NVRAM

As mentioned in the introduction, two important design goals were to speed up the response time of NFS requests and to avoid a file system consistency check at startup. WAFL avoids the need for consistency checking after an unclean shutdown by creating a special snapshot called a consistency point every few seconds. After such a consistency point (which is not visible to the user) has been created, the file system is self-consistent, which means that in this state it can be restarted without any checks. Between consistency points, WAFL does write data to disk, but only to blocks that are not in use, so the tree of blocks on disk remains unchanged. After a certain number of NFS requests, a new consistency point is written which reflects all changes made by these requests. WAFL uses non-volatile RAM (NVRAM) to keep a log of the NFS requests it has processed since the last consistency point. (NVRAM is special memory with batteries that allow it to store data even when system power is off.) After an unclean shutdown, WAFL restores the last consistency point and replays any requests in the log to prevent them from being lost. When a file server shuts down normally, it creates one last consistency point after suspending NFS service.

Thus, on a clean shutdown the NVRAM does not contain any unprocessed NFS requests and can be turned off to increase its battery life. WAFL actually divides the NVRAM into two separate logs. When one log gets full, WAFL switches to the other log and starts writing a consistency point to store the changes from the first log safely on disk. WAFL also schedules a consistency point every 10 seconds, even if the log is not full, to prevent the on-disk image of the file system from getting too far out of date. Processing an NFS request and caching the resulting disk writes generally takes much more NVRAM than simply logging the information required to replay the request. For instance, to move a file from one directory to another, the file system must update the contents and inodes of both the source and target directories. In FFS, where blocks are 8 KB each, this uses 32 KB of cache space, whereas WAFL uses about 150 bytes to log the information needed to replay the rename operation. Rename, with its factor of roughly 200 difference in NVRAM usage, is an extreme case, but even for a simple 8 KB write, caching disk blocks consumes 8 KB for the data, 8 KB for the inode update, and, for large files, another 8 KB for the indirect block, while WAFL logs just the 8 KB of data along with about 120 bytes of header information. With a typical mix of NFS operations, WAFL can store more than 1000 operations per megabyte of NVRAM. Using NVRAM as a cache of unwritten disk blocks would turn it into an integral part of the disk subsystem: an NVRAM failure could then corrupt the file system in ways that fsck cannot detect or repair. If something goes wrong with WAFL's NVRAM, WAFL may lose a few NFS requests, but the on-disk image of the file system remains completely self-consistent. This matters because NVRAM is reliable, but not as reliable as a RAID disk array. A final advantage of logging NFS requests is that it improves NFS response times. To reply to an NFS request, a file system without NVRAM must update its in-memory data structures, allocate disk space for new data, and wait for all modified data to reach disk. A file system with an NVRAM write cache does all the same steps, except that it copies modified data into NVRAM instead of waiting for it to reach disk. WAFL can reply much more quickly because it need only update its in-memory data structures and log the request; it neither allocates disk space for new data nor copies modified data blocks to NVRAM.
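The interplay of the two NVRAM logs and consistency points can be sketched as follows. The class, the capacity constant, and the callback functions are all assumptions made for illustration, not WAFL's real interfaces.

    import time

    LOG_CAPACITY = 1000   # requests per log half (illustrative figure)
    CP_INTERVAL = 10.0    # seconds between forced consistency points

    class NvramLog:
        """Toy model of WAFL's split NVRAM request log."""

        def __init__(self, write_consistency_point):
            self.halves = [[], []]           # the two halves of the NVRAM
            self.active = 0                  # half currently receiving requests
            self.last_cp = time.monotonic()
            self.write_consistency_point = write_consistency_point

        def handle_request(self, request):
            # Reply as soon as the request is safely logged; the actual disk
            # writes happen later, as part of the next consistency point.
            self.halves[self.active].append(request)
            if (len(self.halves[self.active]) >= LOG_CAPACITY
                    or time.monotonic() - self.last_cp >= CP_INTERVAL):
                self.start_consistency_point()
            return "reply sent"

        def start_consistency_point(self):
            full = self.active
            self.active = 1 - self.active                    # switch to the other half
            self.write_consistency_point(self.halves[full])  # flush changes to disk
            self.halves[full].clear()                        # entries no longer needed
            self.last_cp = time.monotonic()

        def replay_after_crash(self, restore_last_cp, apply_request):
            """Restore the last consistency point, then replay logged requests."""
            restore_last_cp()
            for half in self.halves:
                for request in half:
                    apply_request(request)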

4. Performance & Conclusion

To measure the speed of WAFL, the LADDIS benchmark, the standard NFS benchmark at the time, can be used. The following table compares different NFS file servers:

Server                 Best response time   Best throughput   Response time at best throughput
FAServer 8x Cluster    3.1 ms               3189 ops/sec      18.2 ms
Auspex NS 6000         11 ms                2050 ops/sec      47 ms
Sun SPARCcluster 1     16.7 ms              3069 ops/sec      49.7 ms
Sun SPARCcenter 2000   15.9 ms              2575 ops/sec      49.9 ms
Sun SPARCserver 1000   16.6 ms              2106 ops/sec      49.8 ms

WAFL was developed, and became stable, surprisingly quickly for a new file system. It has been in use as a production file system for over a year, and no case is known in which it has lost user data. Processing file system requests is simple because WAFL updates only its in-memory data structures and the NVRAM log. Consistency points eliminate ordering constraints for disk writes, which are a significant source of bugs in most file systems.

5. References

[1] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Patterson. "RAID: High-Performance, Reliable Secondary Storage." ACM Computing Surveys.
[2] Peter M. Chen and David A. Patterson. "Maximizing Performance in a Striped Disk Array."
[3] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. "A Fast File System for UNIX." ACM Transactions on Computer Systems, 2(3):181-197, August 1984.
[4] M. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin. "An Implementation of a Log-Structured File System for UNIX." Proceedings of the Winter 1993 USENIX Conference, San Diego, pp. 201-218, January 1993.