Build Object-Based Filesystem into Linux

Technical Report LCX-SSRC-2002-09

Caixue Lin
[email protected]
Storage Systems Research Center
Jack Baskin School of Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
http://ssrc.cse.ucsc.edu/

Abstract

OBFS (Object-Based Filesystem) is a new filesystem designed for OBSDs (Object-Based Storage Devices). When OBFS runs as a standalone filesystem, outside the VFS (Virtual Filesystem) layer of Linux, its performance is roughly twice that of EXT2 and better than that of other filesystems such as XFS [3]. To allow a fair comparison with EXT2 and other filesystems under Linux, we want to build OBFS into Linux itself. This paper describes the design and implementation of OBFS in Linux 2.4.9 as a specific filesystem under VFS. The OBFS source consists of only 4 source files, 5 header files, and about 2000 lines of code. Our future work includes rewriting the Linux kernel page cache to better support OBFS.

1 Introduction

OBFS (Object-Based Filesystem) is a new filesystem designed for the large object files stored on an OBSD (Object-Based Storage Device) [3]. In order to test the performance of OBFS in a real environment and make a fair comparison with EXT2, we want to build OBFS into Linux. Since Linux 2.4.9 was the latest stable release at the time of this work, we chose it for our experiment. Building a new filesystem with support for large files and large filesystem sizes into Linux is not trivial; for example, the SGI developers spent at least a year porting XFS to Linux [2]. However, Linux already provides a VFS layer that accepts a set of low-level (filesystem-dependent) interfaces. If we can provide these interfaces to VFS, OBFS can be incorporated into Linux with much less effort.

2 Creating an OBFS Device

To build OBFS and test its performance, we need a disk device on which to place the OBFS filesystem data. There are two ways to do this:

• Use a loopback device to simulate a real disk:
  1. Install a loopback device, such as /dev/loop0.
  2. Create an OBFS test file under any filesystem, e.g. dd if=/dev/zero of=/obfs_test_device bs=512 count=1048576.
  3. Use losetup to associate /dev/loop0 with the regular file, e.g. losetup /dev/loop0 /obfs_test_device.

• Install a new hard disk, e.g. /dev/hdb1, and format it with our filesystem, OBFS.

3 The File System Interface: VFS

The Virtual File System (VFS) is a kernel software layer that handles all system calls related to standard UNIX filesystems, including those of Linux. Its main strength is that it provides a common interface to several kinds of filesystems. Figure 1 shows the common file model of VFS [1].

The key idea of VFS is to provide a common file model capable of representing all supported filesystems. Each specific filesystem implementation must map its physical structures onto this common file model. For example, a directory under VFS is treated as a normal file that contains a list of files and sub-directories. OBFS, however, has no directories: every file represents an object, so no directory tree structure is needed on disk. In order to conform to the VFS common file model, the OBFS implementation must be able to construct the files corresponding to directories; such files exist only as objects in memory, not on disk.

The VFS common file model consists of the superblock object, the inode object, the file object, and the dentry object. Each VFS object is stored in a suitable data structure, which includes both the object's attributes and a pointer to a table of object methods. The kernel may dynamically set the methods of an object, and hence may install specialized behavior for it [1].

Figure 1: The VFS model. VFS sits between the system calls and the specific filesystems (OBFS, EXT2, UMSDOS), together with the inode cache, the directory cache, and the buffer cache, on top of the disk drivers.

4 OBFS Implementation

4.1 OBFS Disk Layout

In order to use the functions and interfaces provided by VFS, we need to construct the following disk layout for the OBFS filesystem (Figure 2) [3]:

• Boot block: a single block reserved for booting.
• Super block: holds general information about the OBFS filesystem, such as disk space, region count, free region count, free large-region count, and so on.
• Root inode: the only directory inode in OBFS.
• Root directory data blocks: hold the directory entry of each object file (each entry contains only the object id, which serves as the file name, and the corresponding inode number).
• Regions: each region consists of the following data structures:
  1. Region head: keeps the general information of the region, such as the region type (large-block or small-block region), free inode count, free block count, and so on.
  2. Inode bitmap (onode bitmap): indicates which onodes are free and which are in use. If an onode is free the bit is set to 0, otherwise to 1.
  3. Block bitmap: indicates which data blocks are free and which are in use. If a block is free the bit is set to 0, otherwise to 1. This structure exists only in small-block regions.
  4. Inode table (onode table): a series of blocks that store the actual onodes. This structure also exists only in small-block regions.
  5. Data blocks: the actual data of the file objects.

Figure 2: OBFS disk layout. The disk contains a boot block, super block, root onode, and root directory, followed by the regions; a large-block region holds a region head, an onode bitmap, and data blocks, while a small-block region additionally holds a block bitmap and an onode table.

In OBFS, the large block size is 512 KB in the large-block regions, while the small block size is 4 KB in the small-block regions. However, all operations on large blocks are still performed in terms of small blocks because of the requirements of VFS. A new region is allocated when no data block is available for a new object file, and the region type is determined by the size of the object file: if the object is larger than 512 KB, a large-block region is allocated, otherwise a small-block region is allocated just behind the already allocated regions [3].

We could also divide the OBFS layout into a group of regions, each carrying a copy of the super block. These redundant super block copies could store enough information to recover the OBFS partition to a consistent state after corruption. However, we have not done this yet; it should be implemented in a future version of OBFS. A minimal sketch of possible on-disk structures for the super block and the region head follows.
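The report does not list the exact field layout of these on-disk structures, so the following is only a sketch of what the super block and region head might look like. All names (obfs_super_block, obfs_region_head, and every field) are assumptions made for illustration, not the actual OBFS definitions.

/* Hypothetical on-disk layout sketch; names and fields are illustrative only. */
#include <linux/types.h>

#define OBFS_REGION_LARGE 1
#define OBFS_REGION_SMALL 2

struct obfs_super_block {
    __u32 s_magic;              /* filesystem magic number                  */
    __u32 s_disk_blocks;        /* total disk space, in small blocks        */
    __u32 s_region_count;       /* total number of regions                  */
    __u32 s_free_region_count;  /* regions not yet allocated                */
    __u32 s_free_large_regions; /* free large-block regions                 */
};

struct obfs_region_head {
    __u32 r_type;               /* OBFS_REGION_LARGE or OBFS_REGION_SMALL   */
    __u32 r_free_onode_count;   /* free onodes in this region               */
    __u32 r_free_block_count;   /* free data blocks in this region          */
    __u32 r_onode_bitmap_block; /* first block of the onode bitmap          */
    __u32 r_data_block_start;   /* first data block of the region           */
};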


4.2 Format the Device

We need to format the device so that it can be mounted as an OBFS filesystem. First we have to obtain the disk profile, such as the disk sector size, the disk capacity, and so on. With this information, we can initialize the data on the disk according to the OBFS disk layout:

• Calculate how many regions fit on the disk.
• Fill in the super block and write it out to the disk.
• Fill in the root inode and write it out to the disk.
• Initialize the root directory data blocks to zero.
• Initialize all the region headers and other data blocks to zero.

So the OBFS format utility simply fills in the needed information for the OBFS on-disk data structures and writes them out to the device; a minimal sketch of this sequence is shown below.
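As a rough illustration of that sequence, the sketch below writes zero-filled structures to the device from user space. It reuses the hypothetical obfs_super_block and obfs_region_head above, and the helper names and block offsets (write_block, block 0 for booting, region headers at the start of each region) are assumptions, not the actual obfs_format.c.

/* User-space format sketch; not the actual obfs_format.c. */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define OBFS_BLOCK_SIZE  4096UL          /* small block size (assumed)       */
#define OBFS_REGION_SIZE (128UL << 20)   /* 128 MB per region (Table 1)      */

/* Write one zero-padded block at the given small-block number. */
static int write_block(int fd, unsigned long blk, const void *data, size_t len)
{
    char buf[OBFS_BLOCK_SIZE];

    memset(buf, 0, sizeof(buf));
    if (data)
        memcpy(buf, data, len);
    if (lseek(fd, (off_t)(blk * OBFS_BLOCK_SIZE), SEEK_SET) < 0)
        return -1;
    return write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf) ? 0 : -1;
}

int obfs_format(const char *dev, unsigned long long capacity_bytes)
{
    struct obfs_super_block sb;
    struct obfs_region_head rh;
    unsigned long regions = capacity_bytes / OBFS_REGION_SIZE;
    unsigned long r;
    int fd = open(dev, O_WRONLY);

    if (fd < 0)
        return -1;

    memset(&sb, 0, sizeof(sb));          /* fill in the super block          */
    sb.s_region_count = regions;
    sb.s_free_region_count = regions;
    write_block(fd, 1, &sb, sizeof(sb)); /* block 0 is the boot block        */

    write_block(fd, 2, NULL, 0);         /* zeroed root inode block          */
    write_block(fd, 3, NULL, 0);         /* zeroed root directory blocks     */

    memset(&rh, 0, sizeof(rh));          /* zeroed region headers            */
    for (r = 0; r < regions; r++)
        write_block(fd, 4 + r * (OBFS_REGION_SIZE / OBFS_BLOCK_SIZE),
                    &rh, sizeof(rh));

    close(fd);
    return 0;
}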

4.3 Register the Filesystem: OBFS

The first thing we have to do in the kernel code is to declare the OBFS filesystem, and have it registered when the OBFS module is initialized and unregistered when the module exits. To declare the OBFS filesystem we need to fill in a structure called file_system_type. This structure has four fields: the name of the filesystem, a function pointer that reads the superblock, miscellaneous flags, and an indication of which module supports this filesystem (if any). The macro DECLARE_FSTYPE_DEV helps fill in this structure:

static DECLARE_FSTYPE_DEV(obfs_fs_type, "obfs", obfs_read_super);

We also need to write an initialization function and an exit function, which register and unregister the OBFS filesystem respectively, as well as a read_super function that is called when the OBFS filesystem is mounted. read_super does the following:

• Read the superblock from the disk and copy it into the in-memory superblock.
• Read the root inode from the disk into memory.
• Read all the region headers from the disk into memory.

Finally, we wrap the initialization and exit functions in module_init and module_exit calls so that the OBFS filesystem can be loaded as a module. A minimal sketch of this registration code follows.
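For illustration, here is a minimal sketch of what that registration glue might look like on Linux 2.4. The function bodies are reduced to stubs, and obfs_read_super is only assumed to perform the three steps above; this is not the actual fs/obfs/inode.c code.

/* Sketch of the 2.4-style module registration; not the actual OBFS source. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/init.h>

static struct super_block *obfs_read_super(struct super_block *sb,
                                           void *data, int silent)
{
    /* Read the on-disk superblock, the root inode, and all region
     * headers into memory here (details omitted in this sketch). */
    return sb;
}

static DECLARE_FSTYPE_DEV(obfs_fs_type, "obfs", obfs_read_super);

static int __init init_obfs_fs(void)
{
    return register_filesystem(&obfs_fs_type);
}

static void __exit exit_obfs_fs(void)
{
    unregister_filesystem(&obfs_fs_type);
}

module_init(init_obfs_fs);
module_exit(exit_obfs_fs);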

4.4 Build the OBFS Superblock

To build a valid superblock structure we have to do a couple of things. First, declare a super_operations structure in fs/obfs/inode.c:

struct super_operations obfs_sops = {
    read_inode:   obfs_read_inode,
    write_inode:  obfs_write_inode,
    delete_inode: obfs_delete_inode,
    put_super:    obfs_put_super,
    write_super:  obfs_write_super,
    statfs:       obfs_statfs,
};

Second, we have to fill in several critical methods:

• read_inode - tells the VFS layer how to read a specific onode from the OBFS filesystem.
• write_inode - tells the VFS layer how to write a specific onode back to disk.
• delete_inode - tells the VFS layer how to delete an onode from disk. This is called when the last hard link is removed.
• put_super - tells the VFS layer how to save the superblock back to disk. This is called when the OBFS filesystem is unmounted.

As soon as we have read the superblock from disk while mounting the OBFS filesystem, we read the root inode into memory using the read_inode operation. read_inode reads the inode from the buffer cache if the inode is already there, and otherwise reads it from disk; write_inode works the same way. A sketch of such a read_inode path is given below.
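As an illustration of that buffer-cache path, the following is a minimal sketch of how a 2.4-style read_inode method could look. The block computation (obfs_onode_block), struct obfs_onode, and its fields are invented here for the example and do not reflect the real obfs_read_inode.

/* Sketch only; the on-disk onode layout and block arithmetic are hypothetical. */
#include <linux/fs.h>

static void obfs_read_inode(struct inode *inode)
{
    struct buffer_head *bh;
    struct obfs_onode *raw;      /* hypothetical on-disk onode structure     */
    int block = obfs_onode_block(inode->i_sb, inode->i_ino);

    /* bread() returns the block from the buffer cache if it is already
     * cached, and only goes to disk otherwise. */
    bh = bread(inode->i_dev, block, inode->i_sb->s_blocksize);
    if (!bh) {
        make_bad_inode(inode);
        return;
    }

    raw = (struct obfs_onode *) bh->b_data;   /* onode assumed at offset 0   */
    inode->i_mode   = raw->o_mode;
    inode->i_size   = raw->o_size;
    inode->i_blocks = raw->o_blocks;

    brelse(bh);                               /* release the buffer head     */
}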

4.5 Root Directory Support

Although OBFS does not need directories, we are required to maintain a root directory because our filesystem has to conform to the common file model of the VFS, which does not work without directory support. To do this we implement a file_operations structure for the i_fop field of directory inodes. Only three fields need to be filled in: read, readdir, and fsync.

For the read field, we can simply use the generic_read_dir function, which just returns an error, because we do not want anyone to read the directory with the read system call; the readdir library call should be used instead. The readdir field is the one we have to implement ourselves. We are passed an open file object and a directory entry to fill in; we simply start searching at the current position in the open file object (the f_pos field), find the corresponding entry, use the filldir function to pass its information to the VFS layer, and update the access time. For the fsync field, we can simply use the file_fsync function that is already implemented for us.

For now we do not use a directory cache to improve performance, because of limited coding time. Since all object files live under the root directory, finding the entry for an object file in the root directory is very slow whenever we perform a file operation. In the future we may need to incorporate the directory cache into OBFS. An alternative is to discard the root directory altogether and use an in-memory object instead, since OBFS does not need any directory organization at all. A sketch of the three directory file operations fields is shown below.
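As a small sketch of those three fields, using the 2.4 initializer style shown elsewhere in this report; the structure name obfs_dir_operations and the obfs_readdir implementation are only assumed here.

/* fs/obfs/dir.c (sketch): file operations for the root directory inode. */
static struct file_operations obfs_dir_operations = {
    read:    generic_read_dir,   /* reject read(2) on a directory           */
    readdir: obfs_readdir,       /* walk the root directory entries         */
    fsync:   file_fsync,         /* generic helper provided by the VFS      */
};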


4.6 Implementing File Objects

There are three operation structures that we have to implement to get files working. Fortunately, because most filesystems handle files in a similar way, the VFS layer has done a lot of the work for us.

• Fill in the inode_operations structure for a normal file, in fs/obfs/dir.c:

struct inode_operations obfs_inode_operations = {
    create: obfs_create,
    lookup: obfs_lookup,
    link:   obfs_link,
    unlink: obfs_unlink,
    mkdir:  obfs_mkdir,
    mknod:  obfs_mknod,
    rename: obfs_rename,
};

• Fill in the file_operations structure, in fs/obfs/file.c:

struct file_operations obfs_file_operations = {
    llseek: generic_file_llseek,
    read:   generic_file_read,
    write:  obfs_file_write,
    mmap:   generic_file_mmap,
    open:   generic_file_open,
};

Here we illustrate the VFS common file model by showing how the file read() works. The application's call to read() makes the kernel invoke sys_read(), just like any other system call. The file is represented by a file data structure in kernel memory. This data structure contains a field called f_op that points to the functions in the file_operations structure (in OBFS, obfs_file_operations), including a function that reads the file. sys_read() finds the pointer to this function and invokes it. Thus the application's read() is translated into the following indirect call:

file->f_op->read(...);

Similarly, other file-related system calls trigger the corresponding function calls in the file_operations structure, so we need to fill in its fields. For all the fields except write we can simply use the generic versions provided by VFS. The implementation of the generic read (generic_file_read) in VFS contains a line that looks something like this:

mapping->a_ops->readpage(filp, page);

So when we issue a read system call and the filesystem uses the generic read, the read is actually performed through memory-mapped I/O. The same mechanism applies to the other generic function calls in the file_operations structure.

• Implement get_block(). The final issue is implementing the memory-mapped I/O so that the generic reads and writes will work. To do this we have to fill out an address_space_operations structure; again, the VFS layer does most of the work for us.

struct address_space_operations obfs_aops = {
    readpage:      obfs_readpage,
    writepage:     obfs_writepage,
    sync_page:     block_sync_page,
    prepare_write: obfs_prepare_write,
    commit_write:  generic_commit_write,
    bmap:          obfs_bmap,
    direct_IO:     obfs_direct_IO,
};

In fact, all we have to implement ourselves is a "get_block" function; after that we can call other VFS helpers (such as block_read_full_page) to do everything else, which lets us reuse the page cache operations provided by VFS. Please refer to obfs_get_block in fs/obfs/file.c for the detailed algorithm. The basic idea of get_block() in OBFS is very simple:

1. Large-block allocation: simply return the next available block, since the requested file block number is at most 512K/4K = 128.
2. Small-block allocation: this is more complicated, but we can likewise hand out the next available preallocated block.

In either case get_block() does not really allocate a data block; it returns the block number that was preallocated at the first write to the object, since the required size of the object is known at that time. This preallocation algorithm is one of the critical techniques of OBFS; a detailed description is given in [3]. A minimal sketch of such a get_block() mapping is shown below.
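To make the mapping step concrete, here is a minimal 2.4-style sketch that assumes a hypothetical in-core onode (struct obfs_inode_info with start_block and block_count fields). The real preallocation logic described in [3] is omitted, so this is not the actual obfs_get_block from fs/obfs/file.c.

/* Sketch of a get_block() that only maps already preallocated blocks. */
#include <linux/fs.h>
#include <linux/errno.h>

static int obfs_get_block(struct inode *inode, long iblock,
                          struct buffer_head *bh_result, int create)
{
    /* Hypothetical in-core onode carrying the preallocated start block;
     * one large block corresponds to 512K/4K = 128 small blocks. */
    struct obfs_inode_info *oi = (struct obfs_inode_info *) inode->u.generic_ip;

    if (iblock < 0 || iblock >= oi->block_count)
        return -EIO;

    /* Blocks were preallocated contiguously at the first write, so the
     * logical-to-physical mapping is a simple offset; nothing is allocated
     * here even when create is set. */
    bh_result->b_dev = inode->i_dev;
    bh_result->b_blocknr = oi->start_block + iblock;
    bh_result->b_state |= (1UL << BH_Mapped);
    return 0;
}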

5 OBFS Source Code Description

The OBFS code includes the format utility, the OBFS filesystem itself, and the OBFS test code. All of these add up to only about 2000 lines of C code.

• OBFS format utility: obfs_format.c, obfs_diskio.c
• OBFS filesystem:
  1. header files: obfs_fs.h, obfs_fs_i.h, obfs_fs_sb.h, obfs_defs.h, obfs_global.h
  2. source files: dir.c, file.c, inode.c, sballoc.c
• OBFS performance test:
  1. test code: obfs_test.c
  2. test workload: obfs.workload
  3. test help file: obfs_test.txt

6 Performance Analysis

To test the performance, I used the OBFS characteristics described in Table 1. I tested OBFS with workload1 and workload2; please refer to cvs/srt/lcx/doc/obsd/test/obfs.workload for details (workload2 is obfs.workload, while workload1 is half the load of obfs.workload).

Table 1: OBFS Characteristics

  region size        128 MB
  large-block size   512 KB
  small-block size   4 KB

I also used both synchronous and asynchronous file access modes to test file performance. It appears that OBFS does not outperform EXT2 after being built into Linux (Table 2 and Table 3).

Table 2: OBFS & EXT2 performance with workload1 (throughput in MB/s)

                                       OBFS                   EXT2
                                  Async      Sync        Async      Sync
  average throughput             4.737324   6.090714    5.925337   5.191512
  read average throughput        4.162468   6.813434    4.851452   6.414849
  small read average throughput  3.011786   5.511641    3.493870   4.889431
  large read average throughput  4.278934   6.930042    4.989571   6.560666
  write average throughput       6.838092   4.903216   11.728132   3.622164
  small write average throughput 2.440962   3.568708   26.140828   3.149225
  large write average throughput 7.796309   5.031591   11.302941   3.659662

The disk I used is an IBM-DAQA-33240 ATA disk with a capacity of only 3.2 GB; please refer to cvs/srt/lcx/doc/obsd/doc/daqa_sp.pdf for its detailed specification. The main reason that EXT2 beats OBFS is that VFS splits a large block access into a series of small block accesses: a large access of 512 KB actually requires 512K/4K = 128 accesses when the block size in VFS is 4 KB.

Table 3: OBFS & EXT2 performance with workload2 (throughput in MB/s)

                                       OBFS                   EXT2
                                  Async      Sync        Async      Sync
  average throughput             4.256734   4.405239    5.241939   4.699337
  read average throughput        3.586226   4.741487    4.069672   5.476231
  small read average throughput  2.611790   3.651273    2.783451   4.103893
  large read average throughput  3.684608   4.845018    4.208861   5.610499
  write average throughput       7.271926   3.695182   14.872777   3.471388
  small write average throughput 2.609508   3.148728   14.935754   2.966830
  large write average throughput 8.298571   3.740125   14.868436   3.512753

7 Future Work

It took us about three weeks to port OBFS to Linux 2.4.9, but the current version is only a partial implementation of OBFS. To improve the performance of OBFS, future work should include the following:

• Support a directory cache.
• Provide a dynamic inode cache.
• Provide other functionality, such as file expansion, recovery support, and so on.
• Support real large-block access: avoid dividing a large-block access into many small accesses by rewriting the page cache and parts of the disk drivers in Linux [2].

8 Acknowledgments

This project was carried out with the support of the Storage Systems Research Center and its faculty: Scott A. Brandt, Ethan L. Miller, and Darrell D. E. Long. I also thank Feng Wang and Lan Xue for helping me work out ideas for building OBFS into Linux.

References

[1] D. P. Bovet and M. Cesati, Understanding the Linux Kernel. O'Reilly & Associates, Inc., Oct. 2000.

[2] J. Mostek, B. Earl, S. Levine, S. Lord, R. Cattelan, K. McDonell, T. Kline, B. Gaffey, and R. Ananthanarayanan, "Porting the SGI XFS file system to Linux," in Proceedings of the FREENIX Track: 2000 USENIX Annual Technical Conference, 2000.

[3] F. Wang, S. A. Brandt, E. L. Miller, and D. D. E. Long, "OBFS: An effective and lightweight file system for OBSD," research report, Storage Systems Research Center, 2002.

