Design and Implementation of Parallel File Aggregation Mechanism

Jun Kato* and Yutaka Ishikawa, The University of Tokyo
* Currently affiliated with Fujitsu Laboratories Limited
International Workshop on Runtime and Operating Systems for Supercomputers, May 31, 2011

Agenda

- File organization trend of HPC applications
  - use of millions of small files
- Problem of the single-shared-file approach for reducing the number of files
  - exhibiting low I/O performance through a benchmark program
- PFA (Parallel File Aggregation) Mechanism
  - providing single-shared-file APIs with high I/O performance
- Evaluation result on a real HPC application
  - 3.8 times faster than the original while reducing the number of files by about 100,000
- Conclusion
- Q&A


File Organization Trend of HPC Applications

- Use of millions of several-MB-sized files
- Examples of real HPC applications
  - Integrated Microbial Genomes System [Rockville 2009]: 65 million files, average file size < 1 KB
  - Nearby Supernova Factory [Cecilia 2009]: over 100 million files, maximum file size 8 MB
- Statistics on HPC file systems [Shobhit 2008]
  - 60% of files are < 1 MB
  - 80% of files are < 8 MB
  - 99% of files are < 64 MB

Design of Current HPC Applications

- N-N pattern: N processes utilize N independent files
  [Figure: processes A, B, and C each access their own independent file (File A, File B, File C).]
- Millions of processes utilize millions of files on millions of CPU cores
  - hard file management
  - heavy metadata workload

Goal of This Research

- N-1 pattern: N processes utilize 1 shared file
  [Figure: applications today employ the N-N pattern (each process with its own file) rather than the N-1 pattern, in which processes A, B, and C share a single file; the goal is to change the application pattern from N-N to N-1.]
- Why do current HPC applications not employ the N-1 pattern?
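For concreteness, the following is a minimal sketch, not taken from the slides, of what the N-1 pattern looks like with conventional MPI-IO: every rank writes one fixed-size record into a single shared file at its own offset. The file name is illustrative, and the record size simply reuses the 272,383-byte record that appears later in the evaluation.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const MPI_Offset record = 272383;   /* illustrative record size */
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(record);
        memset(buf, 'a' + rank % 26, record);   /* dummy payload */

        /* all ranks open the same shared file (N-1 pattern) */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* each rank writes its record at a rank-specific offset; because the
           record size is not stripe-aligned, neighbouring ranks touch the
           same stripe blocks, which is the lock-contention problem discussed
           on the next slides */
        MPI_File_write_at(fh, (MPI_Offset)rank * record, buf, (int)record,
                          MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }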

Problem of the N-1 pattern (1/2)

- Low I/O performance
  - benchmark program: MPI-IO Test
  - file system: Lustre parallel file system
  [Charts: write and read bandwidth (MB/sec) versus the number of processes for the N-N and N-1 patterns; the N-1 pattern's write bandwidth is over 3 times lower and its read bandwidth over 5 times lower than the N-N pattern's.]

Problem of the N-1 pattern (2/2)

- File lock contention [Richard 2005]
  - For consistency, each process must acquire a file lock on every stripe block before accessing the data.
  [Figure: processes A and D hold locks on stripe blocks of the shared file, so processes B and C are blocked and must wait until the locks are released.]
  - Result: performance degradation
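To see why unaligned records cause this contention, the small sketch below computes which stripe blocks each rank's record touches, assuming for illustration a 1 MB stripe block and the 272,383-byte record used later in the evaluation. The first few ranks all land in stripe block 0 and therefore compete for the same lock.

    #include <stdio.h>

    int main(void)
    {
        const long stripe = 1L << 20;   /* assumed 1 MB stripe block */
        const long record = 272383;     /* record size from the evaluation */

        for (int rank = 0; rank < 4; rank++) {
            long off = rank * record;   /* contiguous N-1 layout */
            printf("rank %d writes [%ld, %ld) -> stripe blocks %ld..%ld\n",
                   rank, off, off + record,
                   off / stripe, (off + record - 1) / stripe);
        }
        return 0;
    }

With the PFA mechanism's stripe-aligned chunks, described on the data-layout slide below, each process's data start on their own stripe block, so this overlap does not occur.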


Proposed Mechanism

- PFA (Parallel File Aggregation) Mechanism
  - provides N-1 pattern APIs based on memory-map
  - reduces I/O contention by aggregating I/Os
  - does not need file locks
  - reduces the amount of data with an incremental logging feature
- As a result, it improves the write bandwidth of the N-1 pattern and reduces the number of files through the use of the N-1 pattern.

APIs of the PFA Mechanism

- Data are read and written sequentially through APIs based on memory-map.

Write data:

    const size_t buf_size = 272383;
    /* allocate a memory region for write */
    char* buf = pfa_mmap( "foo.txt", buf_size, rank, … );

    while ( condition ) {
        buf[ … ] = …;            /* edit data */
        pfa_append( buf, … );    /* append data */
    }
    /* free the memory region */
    pfa_munmap( buf );

Read data:

    const size_t buf_size = 272383;
    /* allocate a memory region for read */
    char* buf = pfa_mmap( "foo.txt", buf_size, rank, … );

    while ( condition && ! pfa_eof( buf ) ) {
        … = buf[ … ];            /* read data */
        pfa_seek( buf, … );      /* read data */
    }
    /* free the memory region */
    pfa_munmap( buf );

Overview of the PFA mechanism

- The PFA mechanism works on the file system client.
  - It does not need to modify the file system server.
  [Figure: on each node, application processes A and B use the memory-map-based APIs in the user address space; in the kernel address space, the file system client aggregates I/O into per-process chunks and applies the incremental logging feature, then issues direct I/O to the file system server, where the chunks are placed in the shared file with a stripe-aware data layout.]

Memory-map

- The memory-map-based APIs transfer data directly from the user address space.
  [Figure: with fwrite or MPI_File_write (MPI-IO [Rajeev 1999]), the data are copied once from the process's buffer in the user address space into the file system client in the kernel address space before going to the file system server; with pfa_append, the file system client forwards the data with zero copies.]
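The one-copy-versus-zero-copy distinction the slide draws can be seen with plain POSIX calls. The sketch below uses ordinary read and mmap, not the PFA client, and shows it on the read path for simplicity: read() copies file data into a user buffer, while a memory map lets the process access the mapped pages directly. The file name is only an example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("foo.txt", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);

        /* one copy: the kernel copies file data into buf */
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));

        /* zero copy: the process reads the mapped pages directly */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            if (n > 0)
                printf("first byte via read(): %c, via mmap: %c\n", buf[0], p[0]);
            munmap(p, st.st_size);
        }
        close(fd);
        return 0;
    }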

I/O Aggregation

- Data are aggregated into a chunk on the file system client.
  [Figure: without aggregation, three pieces of data written by a process are sent to the file system server as three I/O requests; with aggregation, they are collected into a single chunk in the kernel address space and sent as one I/O request.]
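A minimal user-space sketch of the aggregation idea, with an assumed 1 MB chunk and a simple flush-when-full policy. The real mechanism does this inside the file system client and, unlike this sketch, avoids the memcpy by working on the memory-mapped region.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK_SIZE (1 << 20)            /* assumed 1 MB chunk */

    struct chunk {
        char   buf[CHUNK_SIZE];
        size_t used;
    };

    /* one write() per accumulated chunk instead of one per record */
    static void chunk_flush(struct chunk *c, int fd)
    {
        if (c->used > 0) {
            write(fd, c->buf, c->used);
            c->used = 0;
        }
    }

    static void chunk_append(struct chunk *c, int fd,
                             const void *data, size_t len)
    {
        if (c->used + len > CHUNK_SIZE)
            chunk_flush(c, fd);             /* chunk full: issue one I/O */
        memcpy(c->buf + c->used, data, len);
        c->used += len;
    }

    int main(void)
    {
        struct chunk *c = calloc(1, sizeof(*c));   /* 1 MB buffer on the heap */
        char *record = malloc(272383);             /* record size from the evaluation */
        memset(record, 'x', 272383);

        int fd = open("aggregated.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (int i = 0; i < 3; i++)                /* three records ... */
            chunk_append(c, fd, record, 272383);
        chunk_flush(c, fd);                        /* ... one write() */
        close(fd);

        free(record);
        free(c);
        return 0;
    }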

Incremental Logging Feature - Overview

- Data that have not been modified since the previous store are not stored again.
  [Figure: without the incremental logging feature, a second store whose data are identical to the first store's is sent to the file system server again; with the feature, the chunk records only metadata noting that the second data equal the first, de-duplicating the data.]

Incremental Logging Feature - Detection of Modified Data

- A page protection fault is used to detect modified data.

Sample code:

    char* buff = pfa_mmap( … );
    …
    pfa_append( buff, … );
    …
    buff[ 0 ] = …;      /* modifies data on Page 0 only */
    …
    pfa_append( buff, … );

Steps (buff is backed by Pages 0-2):
1. pfa_mmap allocates writable pages for buff.
2. The first pfa_append stores the data and then turns off the write bit of all pages.
3. The store to buff[ 0 ] raises a page protection fault on Page 0; handling the fault turns the write bit of Page 0 back on.
4. The second pfa_append stores only the modified pages (= Page 0).
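A generic sketch of this technique with POSIX mprotect and a SIGSEGV handler, not the PFA implementation itself: all pages are write-protected after a store, the first write to a page faults, and the handler marks that page dirty and makes it writable again, so a later store only needs to consider the dirty pages.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 3

    static char  *region;
    static size_t pagesz;
    static int    dirty[NPAGES];

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        size_t page = (size_t)((char *)si->si_addr - region) / pagesz;
        dirty[page] = 1;                          /* remember the modified page */
        mprotect(region + page * pagesz, pagesz,
                 PROT_READ | PROT_WRITE);         /* let the faulting write retry */
    }

    int main(void)
    {
        pagesz = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* after storing the data, turn off the write bit of all pages */
        mprotect(region, NPAGES * pagesz, PROT_READ);
        memset(dirty, 0, sizeof(dirty));

        region[10] = 'x';   /* touches Page 0 only: fault -> dirty[0] = 1 */

        for (int i = 0; i < NPAGES; i++)
            printf("Page %d: %s\n", i, dirty[i] ? "modified" : "clean");
        return 0;
    }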

Direct I/O

- Direct I/O avoids cache duplication between the file system cache and the PFA mechanism's chunk.
  - The chunk plays the same role as the file system cache, so caching the data in both would be redundant.
  [Figure: without direct I/O, data travel from the chunk through the file system cache on the client before reaching the file system server; with direct I/O, the chunk's data bypass the file system cache.]
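On Linux this kind of cache bypass is what the O_DIRECT open flag provides. The sketch below uses ordinary POSIX calls, not the PFA client code: it writes the same aligned buffer once through the page cache and once with O_DIRECT, which skips the in-kernel cache copy. The file names and the 1 MB transfer size are only examples.

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)      /* 1 MB, a multiple of the block size */

    int main(void)
    {
        char *buf;
        /* O_DIRECT needs an aligned buffer and transfer size */
        if (posix_memalign((void **)&buf, 4096, CHUNK) != 0)
            return 1;
        memset(buf, 'x', CHUNK);

        /* buffered path: the data are copied into the kernel's page cache */
        int fd1 = open("buffered.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd1, buf, CHUNK);
        close(fd1);

        /* direct path: the kernel transfers straight from the user pages,
           bypassing the page cache, as the PFA client does for its chunks */
        int fd2 = open("direct.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        write(fd2, buf, CHUNK);
        close(fd2);

        free(buf);
        return 0;
    }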

Data Layout on Shared File

- Each chunk is aligned on a stripe block.
  [Figure: the chunks of processes A-D on two nodes are written to consecutive stripe blocks of the shared file in the order Chunk A 1st, Chunk B 1st, Chunk C 1st, Chunk D 1st, Chunk A 2nd, Chunk B 2nd, ...]
- Because no two processes share a stripe block, each process does not need to acquire a file lock.
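One simple placement rule consistent with the figure, sketched below as an assumption rather than the mechanism's actual formula, is to put the k-th chunk of rank r at stripe block k * nprocs + r. Every chunk then starts on its own stripe block, so no two processes ever touch the same block, unlike the unaligned layout in the contention example earlier.

    #include <stdio.h>

    int main(void)
    {
        const long stripe = 1L << 20;   /* assumed 1 MB stripe block */
        const int  nprocs = 4;          /* processes A-D in the figure */

        for (int k = 0; k < 2; k++)                 /* 1st and 2nd chunks */
            for (int r = 0; r < nprocs; r++)
                printf("rank %c, chunk %d -> file offset %ld\n",
                       'A' + r, k + 1,
                       (long)(k * nprocs + r) * stripe);
        return 0;
    }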


Evaluation Environment

- Evaluated on the Lustre parallel file system
- Lustre client: 128 cores (4 cores x 2 sockets x 16 nodes)
- Lustre server: 1 MDS (metadata server) on VMware vSphere 4; 4 OSSes (object storage servers) with 6 OSTs (object storage targets)

                  Client                      MDS                             OSS
    CPU           Intel Xeon X5550 2.67GHz,   Intel Xeon L5640 2.26GHz,       Intel Xeon L5640 2.26GHz,
                  8 cores                     4 of 12 cores                   12 cores
    Memory        DDR3 24GB                   DDR3 16,008MB of 48GB           DDR3 48GB
    Disk          160GB SATA                  6Gbps 7,200rpm SAS 500GB x 4    6Gbps 7,200rpm SAS 500GB x 2
    Interconnect  InfiniBand 4x QDR           InfiniBand 4x QDR               InfiniBand 4x QDR
    OS            RHEL5 (2.6.18-194)          RHEL5 (2.6.18-164)              RHEL5 (2.6.18-164)
    Lustre        1.8.4                       1.8.3                           1.8.3

MPI-IO Test Benchmark

- Test configuration
  1. Write 272,383-byte records for one minute
  2. Read the written data
- Result
  - N-N > N-1 with the PFA > N-1
  - The N-N pattern generates at most 128 files
  [Charts: write and read bandwidth (MB/sec) versus the number of processes for the N-N pattern, the N-1 pattern, and the N-1 pattern with the PFA; the annotations mark an over-2-times gap on the write chart and an over-5-times gap on the read chart.]

Athena Application

- Simulating a Rayleigh-Taylor instability with 128 processes [Stone 2008]
- The original generates 99,584 files in total:
  - 49,792 simulation data files (average file size 272,383 bytes), stored without the incremental logging feature
  - 49,792 checkpoint data files (average file size 737,534 bytes), stored with the incremental logging feature
- With the PFA mechanism, only 2 files are generated
- The incremental logging feature saves 30.8% of the data
- The I/O part is 3.8 times faster than the original
  [Chart: elapsed time (sec) of the original (99,584 files) and of the PFA version (2 files) for stripe sizes of 1-128 MB, alongside "Without I/O" and "Limit Value" references; lower is better.]

Related Work & Comparison

- MPI-IO [Rajeev 1999]
  - provides N-1 pattern APIs based on files
  - requires a copy between the user and the kernel address spaces
- SIONlib [Frings 2009]
  - converts the N-N pattern into the N-1 pattern in the library
  - incurs performance degradation due to file lock contention
- PLFS [Bent 2009]
  - provides a virtual view of the shared file on the file system server
  - incurs metadata stress because it actually employs the N-N pattern
- The PFA mechanism
  - provides N-1 pattern APIs based on memory-map
  - works on the file system client

Conclusion

- The N-1 pattern exhibits poor I/O performance, so most applications employ the N-N pattern and generate millions of small files.
- The PFA (Parallel File Aggregation) Mechanism improves the I/O performance of the N-1 pattern by
  - providing N-1 pattern APIs based on memory-map,
  - reducing I/O contention by aggregating I/Os,
  - requiring no file locks, and
  - reducing the amount of data with the incremental logging feature.
- With the PFA mechanism, the Athena application's I/O runs 3.8 times faster than the original while the number of files is reduced by about 100,000.
