Design and Implementation of Parallel File Aggregation Mechanism
Jun Kato* and Yutaka Ishikawa, The University of Tokyo
* Currently affiliated with Fujitsu Laboratories Limited
International Workshop on Runtime and Operating Systems for Supercomputers 2011
Agenda
File organization trend of HPC applications: use of millions of small files
Problem of the single shared file approach for reducing the number of files: low I/O performance, shown with a benchmark program
PFA (Parallel File Aggregation) Mechanism: single shared file APIs for high I/O performance
Evaluation result on a real HPC application: 3.8 times faster than the original while reducing the number of files by about 100,000
Conclusion
Q&A
File Organization Trend of HPC Applications
Use of millions of several-MB-sized files
Examples of real HPC applications
Integrated Microbial Genomes System: 65 million files, average file size < 1KB
Nearby Supernova Factory [Cecilia 2009]: over 100 million files, maximum file size 8MB
Statistics on HPC file systems [Rockville 2009][Shobhit 2008]:
60% of files < 1MB
80% of files < 8MB
99% of files < 64MB
Design of Current HPC Applications
N-N pattern
N processes utilize N independent files (process A writes file A, process B writes file B, process C writes file C)
Each process accesses its own independent file
Millions of processes utilize millions of files on millions of CPU cores
Result: hard file management and a heavy metadata workload
Goal of This Research
N-1 pattern
N processes utilize 1 shared file
[Figure: change of application pattern from the N-N pattern, which current applications employ, to the N-1 pattern, in which processes A, B, and C all write to one shared file and which current applications do not employ.]
Why do current HPC applications not employ the N-1 pattern?
Problem of the N-1 pattern (1/2)
Low I/O Performance
Benchmark program: MPI-IO Test
File system: Lustre Parallel File System
[Chart: write and read bandwidth (MB/sec) vs. number of processes for the N-N and N-1 patterns. The N-1 pattern's write bandwidth is over 3 times lower, and its read bandwidth over 5 times lower, than the N-N pattern's.]
Problem of the N-1 pattern (2/2)
File lock contention [Richard 2005]
Each process must acquire a file lock on every stripe block before accessing data, to keep the shared file consistent
[Figure: stripe blocks of the shared file are locked by processes A and D, so the writes of processes B and C are blocked.]
Processes B and C must wait until the locks are released
Performance degradation
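Lustre handles this locking in its distributed lock manager, but a rough user-level analogy is a POSIX byte-range write lock taken with fcntl on one stripe block, as in the sketch below; the stripe size, file name, and helper name are illustrative assumptions, and F_SETLKW blocks exactly the way processes B and C block in the figure.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Take an exclusive write lock on one stripe block of the shared file.
     * F_SETLKW sleeps until the lock is free, so a process whose stripe
     * block is locked by another writer simply waits. */
    static int lock_stripe_block(int fd, off_t stripe_size, off_t block_index)
    {
        struct flock fl;
        memset(&fl, 0, sizeof fl);
        fl.l_type   = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = block_index * stripe_size;   /* start of the block */
        fl.l_len    = stripe_size;                 /* lock one block     */
        return fcntl(fd, F_SETLKW, &fl);
    }

    int main(void)
    {
        int fd = open("shared_file", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (lock_stripe_block(fd, 1 << 20, 0) != 0) return 1;  /* block 0 */
        /* ... write into the locked stripe block, then unlock ... */
        close(fd);
        return 0;
    }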
Proposed Mechanism
PFA (Parallel File Aggregation) Mechanism
provides N-1 pattern APIs based on memory-map
reduces I/O contention by aggregating I/Os and does not need file locks
reduces the amount of data with an incremental logging feature
As a result, it improves the write bandwidth of the N-1 pattern and reduces the number of files through the use of the N-1 pattern
APIs of the PFA Mechanism
Data are read and written sequentially through the APIs based on memory-map
Write data:

    const size_t buf_size = 272383;
    /* allocate a memory region for write */
    char* buf = pfa_mmap( "foo.txt", buf_size, rank, ... );
    while ( condition ) {
        buf[ ... ] = ...;        /* edit data */
        pfa_append( buf, ... );  /* append data */
    }
    /* free the memory region */
    pfa_munmap( buf );

Read data:

    const size_t buf_size = 272383;
    /* allocate a memory region for read */
    char* buf = pfa_mmap( "foo.txt", buf_size, rank, ... );
    while ( condition && ! pfa_eof( buf ) ) {
        ... = buf[ ... ];        /* read data */
        pfa_seek( buf, ... );    /* read data */
    }
    /* free the memory region */
    pfa_munmap( buf );
Overview of the PFA mechanism
The PFA mechanism works on the file system client
It does not require modifying the file system server
[Figure: application processes A and B in the user address space write through memory-map-based APIs into per-process chunks in the file system client in the kernel address space; I/O aggregation and the incremental logging feature operate on the chunks, which are then written with direct I/O to the shared file on the file system server using a stripe-aware data layout.]
Memory-map
The memory-map-based APIs transfer data directly from the user address space
fwrite and MPI_File_write (MPI-IO [Rajeev 1999]) copy the data from the user address space into the file system client in the kernel address space before sending them to the file system server (1 copy), whereas pfa_append hands the data to the file system server straight from the user address space (0 copies).
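A minimal sketch of the zero-copy idea using only standard POSIX mmap/msync; this is an analogy for what a memory-map-based API gains over write()/fwrite (no separate user-to-kernel buffer copy), not the implementation of pfa_mmap, and the file name and region size are arbitrary.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;                 /* 1 MiB region      */
        int fd = open("foo.txt", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, len) != 0) return 1;      /* size the file     */

        /* Map the file; the stores below modify the mapped pages in
         * place, so no extra copy of a user buffer into the kernel is
         * made, unlike write()/fwrite. */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) return 1;

        memset(buf, 'a', len);                      /* edit data         */
        msync(buf, len, MS_SYNC);                   /* flush to storage  */

        munmap(buf, len);
        close(fd);
        return 0;
    }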
I/O Aggregation
Data are aggregated into a chunk on the file system client
[Figure: without aggregation, three data items written by a process generate three I/O requests to the file system server; with aggregation, the same three items are collected into one chunk on the file system client and sent as a single I/O request.]
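A minimal sketch of this aggregation, assuming a hypothetical chunk_append/chunk_flush pair and a 4 MiB chunk size; it is not the PFA code, but it shows how buffering small records in a chunk turns many appends into one write() call, i.e. one I/O request.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK_SIZE (4u << 20)   /* 4 MiB chunk size (an assumption) */

    struct chunk { char data[CHUNK_SIZE]; size_t used; };
    static struct chunk c;          /* static: too large for the stack  */

    /* Send the whole chunk to the file with a single write(). */
    static int chunk_flush(int fd)
    {
        if (c.used == 0) return 0;
        ssize_t n = write(fd, c.data, c.used);
        c.used = 0;
        return n < 0 ? -1 : 0;
    }

    /* Buffer one small record; an I/O request is issued only when the
     * chunk is full. */
    static int chunk_append(int fd, const void *buf, size_t len)
    {
        if (c.used + len > CHUNK_SIZE && chunk_flush(fd) != 0) return -1;
        memcpy(c.data + c.used, buf, len);
        c.used += len;
        return 0;
    }

    int main(void)
    {
        int fd = open("aggregated.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return 1;
        for (int i = 0; i < 3; i++)
            chunk_append(fd, "small record\n", 13);  /* three appends...   */
        chunk_flush(fd);                             /* ...one I/O request */
        close(fd);
        return 0;
    }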
Incremental Logging Feature - Overview
Data left unmodified since the previous store are not stored again
[Figure: the process stores the same data twice. Without the incremental logging feature, the second data are written into the chunk and sent to the file system server again even though they are identical to the first. With the feature, only metadata recording that the second data equal the first are written, de-duplicating the data.]
Incremental Logging Feature - Detection of Modified Data
A page protection fault is used to detect modified data:
1. After storing data, turn off the write bit of all pages (e.g. pages 0, 1, and 2).
2. When the application writes to page 0, handle the page protection fault and turn the write bit of page 0 back on.
3. At the next store, store only the modified pages (here, page 0).
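A minimal sketch of this dirty-page detection with standard mprotect and a SIGSEGV handler; the region size, page count, and function names are illustrative assumptions, not the PFA source. Pages are write-protected after a store; the first write to a page faults, and the handler marks that page dirty and re-enables its write bit so the store can proceed.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 4                  /* small region, for illustration */

    static char *region;              /* page-aligned data region       */
    static long  page_size;
    static int   dirty[NPAGES];       /* one dirty flag per page        */

    /* First write to a protected page lands here: mark the page dirty
     * and turn its write bit back on so the faulting store is retried. */
    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        size_t page = (size_t)((char *)si->si_addr - region) / (size_t)page_size;
        dirty[page] = 1;
        mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
    }

    /* After storing data: clear the flags and write-protect every page
     * so the next modifications are caught. */
    static void protect_all(void)
    {
        memset(dirty, 0, sizeof dirty);
        mprotect(region, NPAGES * page_size, PROT_READ);
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) return 1;

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_fault;
        sigaction(SIGSEGV, &sa, NULL);

        protect_all();
        region[10] = 'x';             /* modifies page 0 only           */
        for (int i = 0; i < NPAGES; i++)
            printf("page %d dirty: %d\n", i, dirty[i]);   /* 1 0 0 0    */
        return 0;
    }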
Direct I/O
Direct I/O avoids cache duplication between the file system cache and the PFA mechanism's chunk
[Figure: without direct I/O, data written from the chunk are duplicated into the file system cache even though the chunk already acts the same as the file system cache; with direct I/O, the chunk's data go to the file system server without entering the file system cache.]
Direct I/O bypasses the file system cache
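A minimal sketch of issuing a direct I/O write with the Linux O_DIRECT flag; the file name, chunk size, and 4096-byte alignment are assumptions, and it only illustrates how a chunk can bypass the file system cache, not how the PFA mechanism issues I/O inside the Lustre client.

    #define _GNU_SOURCE              /* for O_DIRECT on Linux           */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;  /* one 1 MiB chunk                 */
        void *chunk;

        /* O_DIRECT requires the buffer, offset, and length to be
         * aligned; 4096-byte alignment is assumed here. */
        if (posix_memalign(&chunk, 4096, len) != 0) return 1;
        memset(chunk, 'a', len);

        int fd = open("shared_file", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) return 1;

        /* The chunk goes straight toward the storage, so its data are
         * not duplicated in the file system cache. */
        if (write(fd, chunk, len) != (ssize_t)len) return 1;

        close(fd);
        free(chunk);
        return 0;
    }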
Data Layout on Shared File
Each chunk is aligned to a stripe block
[Figure: the shared file is divided into stripe blocks; the first chunks of processes A, B, C, and D occupy consecutive stripe blocks, followed by the second chunks of A and B, so no two processes ever write to the same stripe block.]
Each process does not need to acquire a file lock
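A minimal sketch of such a round-robin, stripe-aligned placement; the offset formula, the 1 MiB stripe size, and the one-chunk-per-stripe-block assumption are illustrative, not necessarily the exact layout the PFA mechanism uses.

    #include <stdint.h>
    #include <stdio.h>

    /* Offset of a process's n-th chunk in the shared file: chunks are laid
     * out round-robin over the processes, one per stripe block, so writes
     * from different processes never share a stripe block and no file lock
     * is contended. */
    static uint64_t chunk_offset(uint64_t stripe_size, int rank, int nprocs,
                                 uint64_t nth)
    {
        return (nth * (uint64_t)nprocs + (uint64_t)rank) * stripe_size;
    }

    int main(void)
    {
        const uint64_t stripe = 1 << 20;   /* 1 MiB stripe size (example) */
        for (int rank = 0; rank < 4; rank++)
            printf("process %c: 1st chunk at %llu, 2nd chunk at %llu\n",
                   'A' + rank,
                   (unsigned long long)chunk_offset(stripe, rank, 4, 0),
                   (unsigned long long)chunk_offset(stripe, rank, 4, 1));
        return 0;
    }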
Evaluation Environment
Evaluated on the Lustre Parallel File System
Lustre clients: 128 cores (= 4 cores * 2 sockets * 16 nodes)
Lustre servers: 1 MDS (Metadata Server) on VMware vSphere 4, plus 4 OSS (Object Storage Servers) and 6 OSTs (Object Storage Targets)

                Client                       MDS                               OSS
  CPU           Intel Xeon X5550 2.67GHz,    Intel Xeon L5640 2.26GHz,         Intel Xeon L5640 2.26GHz,
                8 cores                      4 of 12 cores                     12 cores
  Memory        DDR3 24GB                    DDR3 16,008MB of 48GB             DDR3 48GB
  Disk          160GB SATA                   6Gbps 7,200rpm SAS 500GB x 4      6Gbps 7,200rpm SAS 500GB x 2
  Interconnect  InfiniBand 4x QDR            InfiniBand 4x QDR                 InfiniBand 4x QDR
  OS            RHEL5 (2.6.18-194)           RHEL5 (2.6.18-164)                RHEL5 (2.6.18-164)
  Lustre        1.8.4                        1.8.3                             1.8.3
MPI-IO Test Benchmark

Test configuration (patterns compared: N-N, N-1, and N-1 with the PFA):
1. Write 272,383 bytes for a minute
2. Read the written data

Result: N-N > N-1 with the PFA > N-1; the plain N-1 pattern's bandwidth is far too low, while the N-N pattern generates at most 128 files here.
[Chart: write and read bandwidth (MB/sec) vs. number of processes for the three patterns. With the PFA, the N-1 pattern achieves over 2 times the write bandwidth and over 5 times the read bandwidth of the plain N-1 pattern.]
Athena Application [Stone 2008]

Simulating Rayleigh-Taylor instability with 128 processes
Total of 99,584 files in the original run:
49,792 simulation data files, average file size 272,383 bytes, stored without incremental logging
49,792 checkpoint data files, average file size 737,534 bytes, stored with incremental logging

[Chart: elapsed time (sec) vs. stripe size (1 MB to 128 MB) for the original (99,584 files) and for the run with the PFA (2 files); the elapsed time without I/O is shown as a limit value; lower is better.]

With the PFA, the I/O part runs 3.8 times faster than the original, 30.8% of the data is saved, and only 2 files are generated instead of 99,584.
Related Work & Comparison
MPI-IO [Rajeev 1999]: provides N-1 pattern APIs based on files; requires a copy between the user and kernel address spaces
SIONlib [Frings 2009]: converts the N-N pattern into the N-1 pattern inside the library; incurs performance degradation due to file lock contention
PLFS [Bent 2009]: provides a virtual view of the shared file on the file system server; incurs metadata stress because it actually employs the N-N pattern
The PFA mechanism: provides N-1 pattern APIs based on memory-map and works on the file system client
Conclusion
The N-1 pattern exhibits poor I/O performance
Most applications employ the N-N pattern and generate millions of small files
PFA (Parallel File Aggregation) Mechanism
It improves the I/O performance of the N-1 pattern by:
providing N-1 pattern APIs based on memory-map
reducing I/O contention by aggregating I/Os, with no file lock
reducing the amount of data with the incremental logging feature
The Athena application's I/O part runs 3.8 times faster than the original while the number of files is reduced by about 100,000