SOLVING THE HPC I/O BOTTLENECK: SUN™ LUSTRE™ STORAGE SYSTEM

Sean Cochrane, Global HPC Sales
Ken Kutzer, HPC Marketing
Lawrence McIntosh, Engineering Solutions Group

Sun BluePrints™ Online
Part No 820-7664-20, Revision 2.0, 11/12/09
Sun Microsystems, Inc.

Table of Contents

Solving the HPC I/O Bottleneck: Sun Lustre Storage System ............. 1
  Target Environments ................................................. 1
  The Lustre File System .............................................. 2
    Lustre File System Design ......................................... 3
    Sun and Open Storage .............................................. 4
  Sun Lustre Storage System Overview .................................. 5
    Design Considerations ............................................. 6
    Hardware Components ............................................... 8
      HA MDS Module ................................................... 8
      Standard OSS Module ............................................. 9
      HA OSS Module ................................................... 11
    Software Components ............................................... 14
  Performance Evaluation .............................................. 16
    HA OSS Testing and Results ........................................ 17
      HA OSS Benchmark Configuration .................................. 17
      RAID and Disk Configuration ..................................... 18
      IOzone Benchmark Runs ........................................... 18
      Sample IOzone Benchmark Output .................................. 20
    Standard OSS Testing and Results .................................. 22
      Standard OSS Benchmark Configuration ............................ 22
      IOzone Benchmark Runs ........................................... 23
      IOzone Benchmark Output ......................................... 24
  Proven Scalability .................................................. 26
    CLUMEQ Supercomputing Consortium .................................. 26
    Texas Advanced Computing Center (TACC) ............................ 27
  Summary ............................................................. 28
  About the Authors ................................................... 28
  Acknowledgements .................................................... 29
  References .......................................................... 30
  Ordering Sun Documents .............................................. 30
  Accessing Sun Documentation Online .................................. 30

Solving the HPC I/O Bottleneck: Sun™ Lustre™ Storage System

Much of the focus of high performance computing (HPC) has traditionally centered on CPU performance. However, as computing requirements have grown, HPC clusters are demanding increasingly higher rates of aggregate data throughput. With ongoing increases in CPU performance and the availability of multiple cores per socket, many clusters can now generate I/O loads that a few years ago were observed only in very large systems. Traditional shared file systems, such as NFS, were not designed to scale to the performance levels that today's clusters require.

Note: This Sun BluePrints™ article is an updated version of an article by the same title originally published in April 2009. Specifically, this article contains updated performance results for the High Availability Object Storage Server (HA OSS) module used in the Lustre™ file system implementation. This new HA OSS module uses two Sun Fire™ X4270 servers, each with two quad-core Intel® Xeon® 5500 series (Nehalem) processors and 24 GB of RAM; Quad Data Rate (QDR) InfiniBand; and Lustre 1.8 file system software.

As a parallel, clustered file system, the Lustre™ file system aggregates I/O across a number of individual storage devices and provides parallel data access that far exceeds the performance of monolithic storage devices. By providing shared file system access for hundreds or even thousands of nodes, the Lustre file system enables the creation of a storage solution that can deliver the high aggregate I/O bandwidth required by HPC applications in areas such as manufacturing, electronic design, government, and research. This paper describes the Sun™ Lustre Storage System, a simple-to-deploy storage environment based on the Lustre file system, Sun Fire™ servers, and Sun Open Storage platforms:

• "Target Environments" on page 1 introduces target environments for the Sun Lustre Storage System.
• "The Lustre File System" on page 2 provides an overview of the Lustre file system.
• "Sun Lustre Storage System Overview" on page 5 introduces the Sun Lustre Storage System, including design considerations and hardware and software components.
• "Performance Evaluation" on page 16 details data obtained from a performance evaluation of the Sun Lustre Storage System.

Target Environments

High performance computing covers a diverse set of markets, including education, research, weather and climate forecasting, financial modeling, biosciences, seismic processing, computer-aided engineering, and digital content creation, to name a few. The focus of this paper is deploying very high bandwidth storage solutions with the Sun Lustre Storage System.

The Sun Lustre Storage System is a very high performance and extremely scalable storage solution for serving compute clusters or grids that require high aggregate I/O bandwidth. It combines the open source Lustre parallel file system, Sun Fire servers, and Sun Open Storage products. The result is a simple-to-deploy parallel storage solution that delivers sustained performance ranging from a few gigabytes per second to over 200 GB/sec, capacity scaling to tens of petabytes, and a compelling price/performance ratio. This storage solution is generally deployed using InfiniBand interconnects, but can also be deployed using Gigabit Ethernet or 10 Gigabit Ethernet infrastructure.

While not covered in this paper, readers should be aware of the following Sun solutions that may be well suited for data sets that are not subject to the high-bandwidth I/O needs outlined later in this document.

• Sun Storage 7000 Unified Storage System
  Sun Storage 7000 Unified Storage Systems are simple-to-use storage appliances designed to deliver leading performance via traditional file sharing protocols such as NFS and CIFS at a radically new price point. Developed using open source software and industry-standard components, the Unified Storage family installs in minutes and provides simple-to-use yet very powerful analytic capabilities that allow sophisticated performance management. These products are often accessed via NFS or CIFS, but incorporate flash technology to provide a performance profile that exceeds typical NFS server products. The Unified Storage product family can be used with Gigabit Ethernet, 10 Gigabit Ethernet, or InfiniBand interconnects. For more on the Sun Storage 7000 Unified Storage Systems, see http://www.sun.com/storage/disk_systems/unified_storage.

• Sun Archive
  Many sites need to retain very large volumes of data in the most economical fashion, and to facilitate both storing that data and recalling it for future projects. Sun provides a full set of solutions to address the massive data problem that many sites are facing. Sun provides archiving products to over 48% of the top 50 supercomputers as ranked by top500.org on the June 2009 listing. For more information on Sun's archiving solutions, see http://www.sun.com/storage/hpc/ and http://www.sun.com/storage/archive.

The Lustre File System

The Lustre file system is an open source shared file system designed to address the I/O needs of compute clusters containing up to thousands of nodes. It is best known for powering the largest HPC clusters in the world, with tens of thousands of client systems, petabytes (PB) of storage, and hundreds of gigabytes per second (GB/sec) of I/O throughput. A number of HPC sites use the Lustre file system as a site-wide global file system, servicing clusters on an unprecedented scale. The Lustre file system is used by 62% of the top 50 supercomputers as ranked by top500.org on the June 2009 listing. Additionally, IDC lists the Lustre file system as the file system with the largest market share in HPC. (Source: IDC's HPC User Forum Survey, 2007 HPC Storage and Data Management: User/Vendor Perspectives and Survey Results)


With the mass adoption of clusters and the explosive growth of data storage needs, I/O bandwidth challenges are becoming common in a variety of public and private sector environments. The Lustre file system is a natural fit for environments where traditional shared file systems, such as NFS, do not scale to the required aggregate throughput. Sectors struggling with this challenge include oil and gas, manufacturing, government, scientific research, and digital content creation (DCC). The Sun Lustre Storage System leverages technologies developed for large-scale sites and makes these technologies easier to deploy and use for moderate-sized cluster environments.

Lustre File System Design

The Lustre file system is a software-only architecture that allows a number of different hardware implementations. The main components of a Lustre architecture are Lustre file system clients (Lustre clients), Metadata Servers (MDS), and Object Storage Servers (OSS).

Lustre clients are typically compute nodes in HPC clusters. These nodes run Lustre client software and access the Lustre file system via InfiniBand, Gigabit Ethernet, or 10 Gigabit Ethernet connections. The Lustre client software presents a native POSIX file interface to the client nodes it runs on, and the Lustre file system is then mounted like any other file system. Metadata Servers and Object Storage Servers implement the file system and communicate with the Lustre clients.

The Lustre file system uses an object-based storage model and provides several abstractions designed to improve both performance and scalability. At the file system level, the Lustre technology treats files as objects that are located through Metadata Servers (MDS). Metadata Servers support all file system name space operations, such as file lookups, file creation, and file and directory attribute manipulation. File data is stored in objects on the OSSs. The MDS directs actual file I/O requests from Lustre clients to OSSs, which manage the storage that is physically located on underlying storage devices. Once the MDS identifies the storage location of a file, all subsequent file I/O is performed between the client and the OSS.

This design divides file system updates into two distinct types of operations: file system metadata updates on the MDS, and actual file data updates on the OSS. Separating file system metadata operations from actual file data operations not only improves immediate performance, but also improves long-term aspects of the file system such as recoverability and availability.
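As an illustration of how simple the client side is, the sketch below shows how a Lustre file system is typically mounted on a compute node. This is a minimal sketch under stated assumptions: the file system name (lustre), the server host names, the o2ib (InfiniBand) network type, and the mount point are illustrative and differ per installation.

    # Minimal sketch of mounting a Lustre file system on a client node
    # (host names, network type, and mount point are assumptions).

    # Mount against a single MDS/MGS reachable over InfiniBand (o2ib):
    mount -t lustre mds1@o2ib:/lustre /mnt/lustre

    # With an HA MDS pair, both metadata server NIDs can be listed so the
    # client can fail over between them:
    mount -t lustre mds1@o2ib:mds2@o2ib:/lustre /mnt/lustre

    # Once mounted, the file system behaves like any other POSIX file system:
    df -h /mnt/lustre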

As shown in Figure 1, the Lustre file system can support a variety of configuration options including InfiniBand or Ethernet interconnects, redundant Metadata Servers, and a choice of commodity storage, high availability direct connect storage, or enterprise SANs for use on the Object Storage Servers. To provide the best combination of performance and value, the Sun Lustre Storage System offers both InfiniBand and 1/10 Gigabit Ethernet interconnects and provides options for low-cost storage or high-availability direct connect storage. For more information on the Lustre file system, see http://wiki.lustre.org and http://www.sun.com/software/products/lustre/.

Figure 1. Lustre file system high-level architecture. (Diagram: Lustre clients connect over InfiniBand and Ethernet, with multiple networks supported simultaneously, to active and standby Metadata Servers (MDS) with file system failover, and to Object Storage Servers (OSS) backed by commodity storage, direct-connect storage arrays, or enterprise storage arrays and SAN fabrics.)
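The ability to serve clients over more than one network type at the same time comes from Lustre's LNET layer, which is configured through kernel module options on servers and clients. A minimal sketch is shown below; the interface names (ib0, eth0) and network labels are assumptions that vary by site.

    # Hypothetical LNET module options (for example in /etc/modprobe.d/lustre.conf)
    # enabling both an InfiniBand network (o2ib0 on ib0) and an Ethernet
    # network (tcp0 on eth0) at the same time:
    options lnet networks="o2ib0(ib0),tcp0(eth0)"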

Sun and Open Storage

The Sun Lustre Storage System is architected using Sun Open Storage systems that deliver exceptional performance and value. Almost all modern disk arrays and NAS devices are constructed from proprietary designs and specialized software that sell for significant premiums over the market price of the underlying components. In contrast, Sun Open Storage products are built with high-volume, industry-standard hardware and open source software. This allows Sun Open Storage products to significantly change the price, performance, and value of storage. Just as open source software and operating systems have radically changed the computing landscape, open storage will change the storage landscape.

The specific Sun hardware platforms and software components used vary by configuration. See "Hardware Components" on page 8 and "Software Components" on page 14 for more details. See also http://www.sun.com/openstorage/ for more information on Sun Open Storage products.

Sun Lustre Storage System Overview

Two fundamental design goals of the Sun Lustre Storage System are simplified configuration and implementation, and maintaining a price/performance lead over competitive designs. These goals are primarily achieved by standardizing the hardware architecture and choosing leading hardware components. In addition, a modular approach is used to provide scalability, a further design goal. Datacenters can start with the number of OSS modules required for their cluster application and can easily grow capacity and throughput by adding OSS modules. The design can use either InfiniBand or Ethernet interconnects, providing flexibility to meet various deployment requirements.

The following modules are used in Sun Lustre Storage System configurations (see Figure 2):
• High Availability Metadata Server (HA MDS) module
• Object Storage Server (OSS) modules:
  • Standard OSS module
  • HA OSS module

Figure 2. Overview of Sun Lustre Storage System design. (Diagram: clients connect through the interconnect to an HA MDS module, consisting of an active MDS and a standby MDS sharing storage with failover, and to one or more OSS modules; a Standard OSS module is a single server with internal storage, while an HA OSS module pairs two active OSS servers, each with shared storage, that fail over to each other.)

All Sun Lustre Storage System configurations include a High Availability Lustre Metadata Server (HA MDS) module. This module manages and stores metadata, such as file names, directories, permissions and file layout. Configurations also require one or more Lustre Object Storage Server (OSS) modules, which provide scalable I/O performance and storage capacity. For maximum flexibility, the architecture defines two OSS modules: a Standard OSS module for greatest density and economy, and an HA OSS module that provides OSS failover for environments where automated recovery from OSS failure is important.

Design Considerations

Price, performance, scalability, and availability are all important concerns for HPC environments, and these concerns are reflected in the Sun Lustre Storage System design:

• Price
  The Sun Lustre Storage System utilizes Sun Open Storage products. These components are low-cost, general-purpose hardware such as SAS/SATA JBOD disks and x86-based servers. Using these components instead of proprietary components such as RAID controllers helps keep costs low without sacrificing performance.

• Performance
  I/O and data performance were key criteria when selecting servers and storage, and a number of different server and storage combinations were tested before deciding on the Sun Lustre Storage System design. For the HA MDS module, the storage configuration consists of RAID 10 [1] with 15K rpm enterprise SAS drives. This storage provides the high random I/O performance that can be required by metadata operations. Since metadata is a very small portion of the overall data, the extra cost associated with SAS drives and RAID 10 was determined to be an appropriate trade-off. For the OSS modules, the storage configuration consists of RAID 6 and high-density SATA drives. The OSSs store the actual user data and can grow very large, and RAID 6 provides a much better cost profile than the RAID 10 used on the HA MDS module. Recent benchmarks have shown that RAID 6 imposes only a small performance penalty compared to RAID 5, providing a better balance of price/performance, capacity, and availability.

  [1] RAID 10, mirroring plus striping, is also referred to as RAID 1+0 in some documentation.

• Scalability
  The minimum recommended configuration for a Sun Lustre Storage System includes either two Standard OSS modules or one HA OSS module. The Sun Lustre Storage System can scale to petabytes of storage and over two hundred gigabytes per second of data delivered to clusters with thousands of clients, simply by adding OSS modules and switch infrastructure. Since the Lustre file system has been shown to scale in a near-linear fashion (see Figure 3), adding OSS modules directly adds throughput and capacity.

  Figure 3. Lustre file system scalability. (Chart: performance versus cluster size, showing near-linear Lustre file system scaling compared to NFS scaling, which levels off as cluster size grows.)

• Availability
  The design contains several features that increase the availability of the Sun Lustre Storage System. Redundant Metadata Servers connected to a shared disk array provide metadata storage. One server can be taken down for maintenance, and operation can continue without interruption. The metadata storage uses mirrored pairs of SAS drives to optimize metadata reliability. The disk drives in both the Standard OSS and the HA OSS modules use RAID 6 configurations, enabling file system operations to continue in the event of a double disk failure. Additionally, the HA OSS configuration allows for OSS failover. The system can continue to serve data even if one OSS server should fail. The OSS failover can also be used to take an OSS out of service temporarily for upgrades or other maintenance. The Sun Fire servers and the Sun Storage J4400 arrays also include redundant, hot-swap components for increased system availability.

Hardware Components

Sun Fire servers provide a fast, energy-efficient, and reliable foundation for the Sun Lustre Storage System:
• Sun Fire X4270 servers are used in the HA MDS and HA OSS modules
• Sun Fire X4540 servers are used in Standard OSS modules

Sun Fire servers provide the foundation of the Sun Lustre Storage System. Specifically, Sun Fire X4270 servers and a Sun Storage J4200 array are used in the High Availability Metadata Server (HA MDS) module. For the High Availability Object Storage Server (HA OSS) modules, Sun Fire X4270 servers are used in combination with Sun Storage J4400 arrays. Sun Fire X4540 servers are used in the Standard Object Storage Server (Standard OSS) module. These Sun Fire servers were chosen primarily for their data and I/O performance characteristics, and they provide a fast, energy-efficient, and reliable foundation for the Sun Lustre Storage System.

HA MDS Module

The HA MDS module, designed to meet the critical requirement of high availability, is common to all Sun Lustre Storage System configurations. This module includes a pair of Sun Fire X4270 servers with an attached Sun Storage J4200 array acting as shared storage (see Figure 4 and Figure 5 on page 9). Internal boot drives in the Sun Fire X4270 servers are mirrored for added protection.

HA MDS module:
• Two Sun Fire X4270 servers
• One Sun Storage J4200 array

The HA MDS module storage consists of a single Sun Storage J4200 array with twelve 300 GB/15,000 rpm SAS drives in a RAID 10 configuration. Two SAS I/O modules are installed in the Sun Storage J4200 array so that both metadata servers can access the shared storage. With a typical metadata allocation of 4 KB per file, the HA MDS module can support a file system with over 250 million files. The MDS should sustain transaction rates sufficient to support most Lustre file system installations where large files (tens of megabytes to terabytes) constitute the primary data sets. Separate MDS modules exist for InfiniBand and 10 Gigabit Ethernet infrastructures. Either version can be ordered to meet customer requirements for their network infrastructure.

Figure 4. HA MDS module configuration. (Diagram: two Sun Fire X4270 servers, each with 2x Intel Xeon X5570 quad-core processors at 2.93 GHz, 24 GB memory, 4x 300 GB internal SAS HDDs, a PCIe SAS HBA, and an InfiniBand or 10 GbE I/O card, are both cabled to the two SAS I/O modules (SIM A and SIM B) of a Sun Storage J4200 array holding twelve 300 GB 15K rpm SAS drives.)

Figure 5. HA MDS module configuration (rear view). (Diagram: the active and standby HA MDS servers each use an internal SAS connection, a 2-port SAS HBA, and an InfiniBand/10 GbE host adapter in their PCIe slots; active and failover SAS paths from both servers connect to the shared Sun Storage J4200 array.)

Standard OSS Module

Standard OSS module:
• One Sun Fire X4540 server
• 48 internal 1 TB 7200 rpm SATA drives

The Sun Fire X4540 server, the successor to the world’s first hybrid data server (the Sun Fire X4500 server, also called Thumper), was chosen for use as the Standard OSS module (Figure 6). The Sun Fire X4540 server features an innovative architecture that combines a high-performance server, high I/O bandwidth, and very high-density storage in a single integrated system. Six SATA controllers, each connecting to eight high-performance SATA disk drives, combined with three PCIe slots deliver extremely high data throughput rates.

Figure 6. Standard OSS module configuration. (Diagram: a single Sun Fire X4540 server with 2x six-core AMD processors at 2.6 GHz, 32 GB memory, InfiniBand or 10 GbE I/O cards, and 48 internal 1 TB 7200 rpm SATA drives.)

A standard OSS module contains a single Sun Fire X4540 server. The Sun Fire X4540 server has redundant power supplies and fans for increased reliability, and protection against disk failures is provided by RAID 6. Although Sun Lustre Storage System solutions built around the Standard OSS module are designed to be reliable and available, OSS failover is not possible because the Sun Fire X4540 storage is internal to the server. Table 1 shows the physical disk layout of the Sun Fire X4540 server.

Table 1. Sun Fire X4540 server physical disk layout (top view).

        Controller 0     Controller 1     Controller 2     Controller 3     Controller 4     Controller 5
Rear    c0t3d0  c0t7d0   c1t3d0  c1t7d0   c2t3d0  c2t7d0   c3t3d0  c3t7d0   c4t3d0  c4t7d0   c5t3d0  c5t7d0
        c0t2d0  c0t6d0   c1t2d0  c1t6d0   c2t2d0  c2t6d0   c3t2d0  c3t6d0   c4t2d0  c4t6d0   c5t2d0  c5t6d0
        c0t1d0  c0t5d0   c1t1d0  c1t5d0   c2t1d0  c2t5d0   c3t1d0  c3t5d0   c4t1d0  c4t5d0   c5t1d0  c5t5d0
Front   c0t0d0  c0t4d0   c1t0d0  c1t4d0   c2t0d0  c2t4d0   c3t0d0  c3t4d0   c4t0d0  c4t4d0   c5t0d0  c5t4d0

Each Sun Fire X4540 server in the Standard OSS module uses the logical disk configuration shown in Table 2.

Table 2. Logical disk configuration of each Sun Fire X4540 server in the Standard OSS module.

Controller 0:  c0t0d0  c0t1d0  c0t2d0  c0t3d0  c0t4d0  c0t5d0  c0t6d0  c0t7d0
Controller 1:  c1t0d0  c1t1d0  c1t2d0  c1t3d0  c1t4d0  c1t5d0  c1t6d0  c1t7d0
Controller 2:  c2t0d0  c2t1d0  c2t2d0  c2t3d0  c2t4d0  c2t5d0  c2t6d0  c2t7d0
Controller 3:  c3t0d0  c3t1d0  c3t2d0  c3t3d0  c3t4d0  c3t5d0  c3t6d0  c3t7d0
Controller 4:  c4t0d0  c4t1d0  c4t2d0  c4t3d0  c4t4d0  c4t5d0  c4t6d0  c4t7d0
Controller 5:  c5t0d0  c5t1d0  c5t2d0  c5t3d0  c5t4d0  c5t5d0  c5t6d0  c5t7d0

Key: in the original table, shading marks the drives that form the four mirrored external journal pairs and the drives that make up the data portion of the four RAID 6 volumes presented as OST 1 through OST 4; the assignment is described in the text that follows.

As shown in Table 2, mirrored pairs are used for external journals. For example, c0t0d0 and c1t1d0 form the mirrored journal for OST 1. The remaining 40 drives are configured as RAID 6 8+2 volumes [2]. Each of these four volumes is presented as an 8 TB object storage target (OST) to the Lustre file system. The Sun Fire X4540 servers are installed using the internal compact flash (CF) card as a boot drive.

[2] RAID 6 8+2 refers to a RAID 6 configuration using 8 data disks and 2 parity disks.
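To make the journal and RAID layout concrete, the following sketch shows how one such OST could be assembled with Linux software RAID (mdadm) and formatted for Lustre. The device names, md numbering, and file system name are illustrative assumptions, not the exact commands generated by the Sun installation DVD.

    # Hypothetical sketch: build one OST following the Standard OSS layout.

    # Mirrored pair for the external journal (for example c0t0d0 and c1t1d0
    # under their Linux device names, assumed here to be /dev/sdb and /dev/sdc):
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mke2fs -O journal_dev -b 4096 /dev/md10   # initialize the pair as a journal device

    # RAID 6 8+2 data volume built from ten drives (assumed /dev/sd[d-m]):
    mdadm --create /dev/md20 --level=6 --raid-devices=10 /dev/sd[d-m]

    # Format the data volume as a Lustre OST that uses the external journal
    # and registers with the management/metadata server:
    mkfs.lustre --fsname=lustre --ost --mgsnode=mds1@o2ib \
        --mkfsoptions="-J device=/dev/md10" /dev/md20

    # Mounting the OST starts serving it from this OSS:
    mount -t lustre /dev/md20 /mnt/lustre/ost1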

HA OSS Module

HA OSS module:
• Two Sun Fire X4270 servers
• Four Sun Storage J4400 arrays

Each HA OSS module includes two Sun Fire X4270 servers and four Sun Storage J4400 arrays (see Figure 7).

Figure 7. HA OSS module configuration overview. (Diagram: two Sun Fire X4270 servers, each with 2x Intel Xeon X5570 quad-core processors at 2.93 GHz, connect to four Sun Storage J4400 arrays of 1 TB SATA drives; each array has two SAS I/O modules (SIMs), with primary paths connecting to SIM A and secondary paths to SIM B.)

Sun Fire X4270 servers were chosen for the HA OSS module, because with six PCI Express Gen 2 slots, the Sun Fire X4270 server has the ability to drive the high throughput required in Sun Lustre Storage System environments. Each Sun Fire X4270 server has two quad-core Intel® Xeon® 5500 series (Nehalem) processors and is configured with 24 GB of RAM. The Sun Storage J4400 array was chosen for the HA OSS storage because it offers double the storage density, twice the connectivity, and higher availability at half the price per gigabyte of the leading competitor. With redundant SAS I/O Modules and front-serviceable disk drives, the Sun Storage J4400 array is the right choice to give the Sun Lustre Storage System the best price/performance without sacrificing RAS features.


Figure 8 depicts a block diagram of the Sun Fire X4270 server with the factory-installed internal SAS HBA placement, and provides information on the PCIe slot configuration.

Figure 8. Block diagram of Sun Fire X4270 server with SAS HBA. (Diagram: two Intel Xeon Processor 5500 Series sockets with DDR3 memory are linked by QPI to the Intel 5520 IOH, which feeds PCIe Gen 2 risers and switches providing six x8 slots; the internal PCIe SAS/RAID controller connects through a x28 SAS expander to the internal SAS/SATA drives, while the ICH10R provides SATA, USB, compact flash boot, gigabit Ethernet, and service processor management interfaces.)

Table 3 describes the Sun Fire X4270 server card layout for the HA OSS module. A view of the HA OSS module configuration, showing PCIe slot assignments and connections to storage, is included in Figure 9 on page 13.

Table 3. Sun Fire X4270 server card layout for the HA OSS module.

Slot Number   Type                        Card
Slot 0        Shares PCIe2 x16 8 GB/sec   Factory default with internal SAS HBA
Slot 1        Shares PCIe2 x16 4 GB/sec   Not used
Slot 2        Shares PCIe2 x16 4 GB/sec   External SAS HBA
Slot 3        Shares PCIe2 x16 8 GB/sec   QDR InfiniBand HCA or 10 GbE NIC
Slot 4        Shares PCIe2 x16 4 GB/sec   Not used
Slot 5        Shares PCIe2 x16 4 GB/sec   External SAS HBA

Figure 9. HA OSS module configuration (rear view). (Diagram: two Sun Fire X4270 servers, each with an InfiniBand or 10 GbE host adapter in slot 3 and 2-port external SAS HBAs in slots 2 and 5, connect over active and failover SAS paths to four Sun Storage J4400 arrays.)

Object Storage Server failover maintains availability of data. Each HA OSS module contains four Sun Storage J4400 arrays with a total of 96 spindles of 1 TB SATA drives. To ensure HA connectivity to storage, each Sun Storage J4400 array must be equipped with two SAS I/O Modules (SIMs), and each Sun Fire X4270 server is configured with two dual-port SAS HBAs. This design allows either OSS to drive all of the volumes on the Sun Storage J4400 arrays. During normal operation, each Sun Fire X4270 server owns and services requests on 48 disk drives. In the event of a server failure, the remaining server in the pair takes over the drives from the failed server and then services requests on all 96 drives. HA OSS modules can utilize InfiniBand or 1/10 Gigabit Ethernet infrastructures. The internal boot drives of the Sun Fire X4270 servers are mirrored for added protection. Each Sun Storage J4400 array is configured with two RAID 6 8+2 OSTs and four drives used for external journals, as shown in Table 4.

Table 4. Disk configuration of each Sun Storage J4400 array in the HA OSS module.

c0t20d0  c0t21d0  c0t22d0  c0t23d0
c0t16d0  c0t17d0  c0t18d0  c0t19d0
c0t12d0  c0t13d0  c0t14d0  c0t15d0
c0t8d0   c0t9d0   c0t10d0  c0t11d0
c0t4d0   c0t5d0   c0t6d0   c0t7d0
c0t0d0   c0t1d0   c0t2d0   c0t3d0

Key: drives 0 through 9 (c0t0d0 through c0t9d0) form the RAID 6 8+2 volume for OST 1, drives 10 through 19 (c0t10d0 through c0t19d0) form the RAID 6 8+2 volume for OST 2, and drives 20 through 23 (c0t20d0 through c0t23d0) hold the mirrored external journals for OST 1 and OST 2.
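As a sketch of how the shared OSTs support failover between the two servers in an HA OSS module, a Lustre target on the shared J4400 storage can be formatted with both servers registered as service nodes. The server names and device paths below are assumptions, and the failover itself is normally driven by the heartbeat software described in the next section.

    # Hypothetical example: an OST on shared storage, normally served by oss1,
    # with oss2 declared as its failover node:
    mkfs.lustre --fsname=lustre --ost --mgsnode=mds1@o2ib \
        --failnode=oss2@o2ib /dev/md20

    # During normal operation oss1 mounts and serves this OST:
    mount -t lustre /dev/md20 /mnt/lustre/ost0

    # If oss1 fails, the failover software mounts the same shared device on
    # oss2, and clients reconnect to the failover node.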

Software Components

The Sun Lustre Storage System software stack that resides on the Metadata Servers and Object Storage Servers can include the following components:
• Linux operating system
• Lustre file system
• Heartbeat failover tool
• System logging tool


These software components, in combination with the Sun Lustre Storage System hardware, have been subjected to iterative testing by Sun engineers to help determine the configuration that delivers the best performance, integrity, and reliability for Lustre file system deployments.

The software stack is deployed via a DVD image that contains all necessary software, including the operating system, and installs the software components in a standard configuration. The installation process also validates that the correct Sun Lustre Storage System hardware is used and is configured according to specifications. RAID configuration, failover setup, and Lustre file system configuration are all done through the installation process. The administrator installing the system needs to answer only a few questions about the network addresses for key components in the system. The Sun Lustre Storage System DVD prompts for vital configuration information, such as:
• Lustre file system namespace
• MDS and OSS primary and failover server addresses
• IP addresses for servers and service processors
• External server for heartbeat
• Heartbeat private network interface

The DVD then sets up the Linux devices according to udev name mapping. RAID volumes are created based on the type of Lustre server being configured (RAID 10 for the MDS, RAID 6 for the OSSs). The mdadm and heartbeat configuration files are installed and configured, and the RAID volumes are started. The final step formats the RAID volumes as Lustre object storage targets (OSTs). OSTs contain the Lustre file system data; each OSS contains multiple OSTs, one per volume.

The configuration of both HA OSS and Standard OSS modules follows this general installation process, including the loading of the operating system, drivers, and Lustre file system. However, hardware verification, RAID configuration, and Lustre file system setup are completed automatically on the HA OSS modules, but must be completed manually by Sun Professional Services for Standard OSS modules. This deployment method greatly simplifies the installation of a Lustre file system. Standard hardware along with software improvements make a Sun Lustre Storage System implementation a much better and simpler solution than building a Lustre solution from various components.

Note: If the recommended Sun Lustre Storage System hardware architecture is not followed, a custom Lustre file system implementation engagement must be scheduled with Sun Professional Services (additional charges may apply).
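The exact configuration files written by the installer are not shown in this article. As a rough illustration of what a heartbeat (version 1 style) failover configuration for the HA MDS pair can look like, consider the sketch below; the node names, the private heartbeat interface, the md device, and the mount point are assumptions.

    # Hypothetical /etc/ha.d/ha.cf shared by the two metadata servers:
    keepalive 2                 # seconds between heartbeats
    deadtime 30                 # declare a node dead after 30 s of silence
    auto_failback off           # stay on the takeover node after recovery
    bcast eth1                  # private heartbeat network interface
    node mds1 mds2

    # Hypothetical /etc/ha.d/haresources: mds1 normally owns the MDT, a
    # RAID 10 md device mounted as a Lustre target; on failure heartbeat
    # mounts it on mds2 instead:
    mds1 Filesystem::/dev/md0::/mnt/lustre/mdt::lustre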


Performance Evaluation

Note: Updated testing of the HA OSS modules was performed for this revision of this article. Standard OSS module testing was not updated for this revision; the Standard OSS results are representative of testing performed in April 2009.

Two performance evaluations were conducted for the following Sun Lustre Storage System configurations:

• HA OSS module (see "HA OSS Testing and Results" on page 17)
  Two HA OSS modules, each with dual Sun Fire X4270 servers and four Sun Storage J4400 arrays, were used for testing performed in October 2009. This new HA OSS module used two Sun Fire X4270 servers, each with two quad-core Intel Xeon 5500 series (Nehalem) processors and each configured with 24 GB RAM; Quad Data Rate (QDR) InfiniBand; and Lustre 1.8 file system software. The HA MDS module consisted of two Sun Fire X4270 servers and a shared Sun Storage J4200 array.

• Standard OSS module (see "Standard OSS Testing and Results" on page 22)
  One Standard OSS module with a single Sun Fire X4540 server was used for this test, performed in April 2009. This Standard OSS module used a Sun Fire X4540 server with dual quad-core AMD processors, 32 GB RAM, and 48 internal SATA drives; Dual Data Rate (DDR) InfiniBand; and Lustre 1.6 file system software. The HA MDS module consisted of two Sun Fire X4250 servers and a shared Sun Storage J4200 array.

For more details on these configurations, refer to "Standard OSS Module" on page 9 and "HA OSS Module" on page 11.


The purpose of this performance evaluation was to provide accurate data to enable sizing of configurations. To enable this, the tests were constructed to determine the maximum sustained performance that can be achieved on two HA OSS modules and a single standard OSS module. For data on scaling Lustre file system implementations with additional OSS modules and thousands of clients in a production environment, see “Proven Scalability” on page 26. For each test, client nodes were available to run the IOzone benchmark and drive I/O to the OSS module. The number of client nodes used in these tests is not indicative of the number of nodes that an OSS module can support, and should not be viewed as a minimum or a maximum for an OSS module. It is important to note that node count alone is not a good indicator of cluster throughput needs. For example, a small cluster could run applications that have very high per-node I/O demands. Similarly, a large cluster could run applications that require minimal I/O throughput. When sizing a Sun Lustre Storage System solution, one should look at the aggregate bandwidth required of the cluster regardless of the number of nodes in the cluster.
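As a rough illustration of sizing by aggregate bandwidth rather than node count, the short sketch below estimates how many HA OSS modules a cluster might need. The 2.35 GB/sec figure is the sustained per-module write rate reported later in this section; the required bandwidth is an assumed input, and real sizing should also account for capacity and workload mix.

    #!/bin/bash
    # Hypothetical sizing helper: estimate the number of HA OSS modules needed
    # for a required aggregate write bandwidth (GB/sec).
    required_gbps=${1:-10}      # aggregate write bandwidth the cluster needs
    per_module=2.35             # sustained write rate per HA OSS module (GB/sec)
    modules=$(awk -v r="$required_gbps" -v m="$per_module" \
        'BEGIN { n = r / m; printf "%d\n", (n == int(n)) ? n : int(n) + 1 }')
    echo "~$modules HA OSS module(s) for $required_gbps GB/sec of writes"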

The performance details that follow show that the HA OSS module, which was tested using QDR InfiniBand, can sustain an I/O transfer rate of up to 2.35 GB/sec on writes and 2.99 GB/sec on reads. The Standard OSS module, tested using DDR InfiniBand, can sustain an I/O transfer rate of 970 MB/sec on initial writes.

HA OSS Testing and Results

Two HA OSS modules consisting of Sun Fire X4270 servers, Sun Storage J4400 arrays, and Sun QDR InfiniBand infrastructure were used for the HA OSS testing. Each HA OSS module included two Object Storage Servers that were configured in a failover pair for high availability.

HA OSS Benchmark Configuration

Figure 10 shows the hardware configuration used for the HA OSS testing.

Figure 10. Hardware diagram for the HA OSS benchmark environment. (Diagram: three Sun Blade 6000 Modular Systems with 30 Sun Blade X6250 server modules, each with 2x quad-core Xeon E5410 processors at 2.33 GHz, 4 GB memory, and QDR InfiniBand Express Modules, connect through a Sun Datacenter InfiniBand Switch 72 (QDR) to the HA MDS module and two HA OSS modules; a separate management network and SAS cabling to the storage arrays complete the environment.)

The following tables, Table 5 and Table 6, describe the hardware and software configuration used for the HA OSS performance evaluation.

Table 5. Hardware configuration used for the HA OSS benchmark testing.

Component: HA OSS Module (2 complete HA OSS modules used for testing)
Description: Two Sun Fire X4270 servers per module, each configured with:
  • 2 Intel Xeon X5570 quad-core processors at 2.93 GHz
  • 24 GB RAM
  • 2 dual-ported SAS HBAs
  • 4x QDR InfiniBand HCA
  Four Sun Storage J4400 arrays per module, each configured with:
  • Twenty-four 1 TB SATA 7200 rpm drives

Component: HA MDS Module
Description: HA MDS module, as described in "HA MDS Module" on page 8

Component: Clients
Description: 3 Sun Blade™ 6000 Modular Systems with 30 Sun Blade X6250 Server Modules, each containing a QDR InfiniBand Express Module

Component: Network
Description: Sun Datacenter InfiniBand Switch 72 (QDR)

Table 6. Software configuration used for the HA OSS benchmark testing.

Component            Description
Operating system     RHEL 5.3
Lustre file system   Version 1.8.1.1
Host device driver   LSI driver 4.16.00
Benchmark utility    IOzone 3.315

RAID and Disk Configuration

The Sun Lustre Storage System's RAID configuration consists of eight RAID 6 8+2 OST volumes per HA OSS module. External journals are located on mirrored pairs, and the OSS operating systems reside on mirrored internal SAS drives in the Sun Fire X4270 servers. The Lustre file system is mounted with the default stripe count, starting OST, and stripe size. Write-through caching is used and NCQ is enabled. On each Sun Storage J4400 array, the data RAID 6 drives are drives 0 through 9 and drives 10 through 19, and the external journals are located on drives 20 through 23. For further details, see Table 4 on page 14.

IOzone Benchmark Runs

Tests were run utilizing the IOzone benchmark tool to evaluate the file system performance of the Lustre file system with the HA OSS modules. This publicly available benchmark utility generates and measures a variety of file operations (including read, re-read, write, and re-write operations) and can be used to provide a broad analysis of file system performance. See http://www.iozone.org/ for more information on the IOzone benchmark utility.
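IOzone's cluster mode (the -+m option used in the runs below) drives I/O from many client nodes at once and reports aggregate throughput. The sketch below shows the general shape of such a run; the client host names, paths, and list file are illustrative assumptions rather than the exact files used in this evaluation.

    # Hypothetical client list for IOzone cluster mode (one line per client:
    # host name, working directory on the Lustre mount, path to iozone):
    #   client01  /mnt/lustre/iozone  /opt/iozone/iozone
    #   client02  /mnt/lustre/iozone  /opt/iozone/iozone
    #   ...

    # Aggregate write and read test across 16 listed clients, 1 MB records,
    # 24 GB file per client (matching the parameters used in this paper):
    /opt/iozone/iozone -M -t 16 -s 24g -r 1M -i0 -i1 -+m clients.txt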

The testing was performed with 30 client nodes that were configured to run the IOzone benchmark. A file size of 24 GB was used to minimize the impact of client-side caching. Tests were run with 1 MB block sizes; previous testing has shown negligible performance differences between 256 KB, 512 KB, and 1 MB block sizes. Two sets of tests were run, the first with one HA OSS module and the second with two HA OSS modules, to demonstrate the scalability of the Lustre file system. Results from the test runs are included in Table 7.

Table 7. Test results with 1 and 2 HA OSS modules (QDR InfiniBand).

Clients   1 Module Write (KB/sec)   1 Module Read (KB/sec)   2 Modules Write (KB/sec)   2 Modules Read (KB/sec)
4         1461152.09                1905492.28                1475918.25                 1826406.72
8         2463621.62                3122975.19                2811736.47                 3536164.84
12        2333728.91                2913391.42                3996334.44                 4890684.41
16        2379572.73                2880095.92                4956592.94                 6074346.78
20        -                         -                         4834387.83                 6064760.86
24        -                         -                         4720750.31                 5915174.48

Figure 11 graphically depicts the read and write performance for test runs with up to 24 clients.

Figure 11. HA OSS: read and write performance, up to 24 clients. (Chart: throughput in GB/sec versus number of clients from 4 to 24, with separate read and write curves for one and two HA OSS modules.)

As shown in Figure 11, peak performance with a single HA OSS module was reached with eight clients attaining a sustained initial write speed of 2.35 GB/sec. A plateau effect for initial write performance was observed after eight clients. Initial read performance on this same single module achieved results just over 2.99 GB/sec.

The second test with two HA OSS modules (a total of four OSSs and eight Sun Storage J4400 arrays) demonstrates the scalability of the Lustre file system and the Sun Lustre Storage System. As shown in Figure 11, two HA OSS modules can sustain 6 GB/sec on reads and approximately 4.7 GB/sec on writes, demonstrating near-linear scalability. These benchmark results show that over 2.35 GB/sec can be sustained for a single Sun Lustre Storage System HA OSS module and over 4.7 GB/sec with two HA OSS modules. To achieve higher aggregate performance, multiple HA OSS modules can be deployed within a single cluster. Note that the number of clients is neither a maximum nor a minimum, since this is a synthetic benchmark and may not reflect actual workloads. The Lustre file system has been proven to scale to thousands of clients.

Sample IOzone Benchmark Output

This section includes sample IOzone benchmark output from a test run using two HA OSS modules (a total of four OSSs) and 16 clients. Other test runs used similar commands and generated similar output. The following IOzone parameters were used in this sample run:

# /opt/iozone/iozone -M -t 16 -s 24g -r 1M -i0 -i1 -+m clients-16x1.txt


This particular run tests 16 clients, each writing a single file. A 1 MB block size is used for I/O operations, flush and fsync activities are included in timing calculations, file size is set to 24 GB, the write/re-write and read/re-read test suites are executed, and various metric gathering and reporting options are enabled.

Note: Specific details of the IOzone parameters for this command can be found at http://www.iozone.org/docs/iozone_msword_98.pdf.

Output from this test run of the IOzone benchmark utility is listed in Table 8.

Table 8. Example IOzone performance test output (2 HA OSS modules, 16 client run).

    Iozone: Performance Test of File I/O
            Version $Revision: 3.315 $
            Compiled for 64 bit mode.
            Build: linux-AMD64

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

    Run began: Mon Oct 19 03:02:37 2009

    Machine = Linux rsb-09 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Sep 29 15:
    File size set to 25165824 KB
    Record Size 1024 KB
    Network distribution mode enabled.
    Command line used: /opt/iozone/iozone -M -t 16 -s 24g -r 1M -i0 -i1 -+m clients-16x1.txt
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 16 processes
    Each process writes a 25165824 Kbyte file in 1024 Kbyte records

    Test running:
    Children see throughput for 16 initial writers = 4956592.94 KB/sec
            Min throughput per process              =  264562.47 KB/sec
            Max throughput per process              =  324076.25 KB/sec
            Avg throughput per process              =  309787.06 KB/sec
            Min xfer                                = 20542464.00 KB

    Test running:
    Children see throughput for 16 rewriters        = 5120604.94 KB/sec
            Min throughput per process              =  269572.88 KB/sec
            Max throughput per process              =  334712.06 KB/sec
            Avg throughput per process              =  320037.81 KB/sec
            Min xfer                                = 20269056.00 KB

    Test running:
    Children see throughput for 16 readers          = 6074346.78 KB/sec
            Min throughput per process              =  279408.12 KB/sec
            Max throughput per process              =  399309.31 KB/sec
            Avg throughput per process              =  379646.67 KB/sec
            Min xfer                                = 17609728.00 KB

    Test running:
    Children see throughput for 16 re-readers       = 6084403.50 KB/sec
            Min throughput per process              =  278865.00 KB/sec
            Max throughput per process              =  397386.06 KB/sec
            Avg throughput per process              =  380275.22 KB/sec
            Min xfer                                = 17661952.00 KB

    Test cleanup:

    iozone test complete.

Standard OSS Testing and Results

A Sun Lustre Storage System Standard Object Storage Server (OSS) module is built on a single Sun Fire X4540 server platform. The following performance test was run using a single Standard OSS module with DDR InfiniBand for the interconnect. The purpose of this benchmark was to test the maximum sustained performance that can be achieved on a single Standard OSS module over DDR InfiniBand. Based on the architecture of the Lustre file system, it has been shown that adding additional OSS modules will scale aggregate throughput in a near-linear fashion while also adding capacity.

Standard OSS Benchmark Configuration

One Sun Lustre Storage System Standard OSS module was used in this performance evaluation. Figure 12 shows the hardware configuration used for the Standard OSS testing.

Figure 12. Hardware diagram for the Standard OSS benchmark environment. (Diagram: a Sun Blade 6048 Modular System with 12 Sun Blade X6220 server modules, each with 2x quad-core processors at 3.0 GHz, 16 GB memory, and DDR InfiniBand Express Modules, connects through a 24-port InfiniBand 4x DDR switch to the HA MDS module and the Standard OSS module; a management network and SAS cabling complete the environment.)

The following tables, Table 9 and Table 10, describe the hardware and software configuration used for the Standard OSS performance evaluation.

Table 9. Hardware configuration used for the Standard OSS performance evaluation.

Component: Standard OSS Module
Description: Sun Fire X4540 server:
  • 2 AMD Opteron™ quad-core processors at 2.2 GHz
  • 32 GB RAM
  • Forty-eight 1 TB 7200 rpm SATA drives
  • 4x DDR InfiniBand HCA

Component: HA MDS Module
Description: Two Sun Fire X4250 servers, each configured with:
  • 2 Intel Xeon E540 processors at 3.0 GHz
  • 8 GB RAM
  • 2 dual-ported SAS HBAs
  • 4x DDR InfiniBand HCA
  A shared Sun Storage J4200 array configured with twelve 300 GB SAS 15K rpm drives

Component: Clients
Description: Sun Blade™ 6048 Modular System with 12 Sun Blade X6220 Server Modules, each containing an InfiniBand Express Module (EM)

Component: Network
Description: 24-port InfiniBand DDR switch

Table 10. Software configuration used for the Standard OSS benchmark testing.

Component            Description
Operating system     CentOS 5.2
Lustre file system   Version 1.6.7
Host device driver   LSI driver 4.16
Benchmark utility    IOzone 3.315

Disks in the Sun Fire X4540 system were grouped into production-ready RAID 6 configurations and deployed as object storage targets (OSTs). See "Standard OSS Module" on page 9 for further details of the RAID architecture and disk layouts.

IOzone Benchmark Runs

Tests were run utilizing the IOzone benchmark tool to evaluate the file system performance of the Lustre file system with the Standard OSS module. This publicly available benchmark utility generates and measures a variety of file operations (including read, re-read, write, and re-write operations) and can be used to provide a broad analysis of file system performance. See http://www.iozone.org/ for more information on the IOzone benchmark utility.

The testing was performed with nine nodes that were configured to run the IOzone benchmark. A file size of 16 GB was used to minimize the impact of client-side caching. Tests were run with 256 KB, 512 KB, and 1 MB block sizes. Performance differences between block sizes were found to be negligible; for brevity, only 1 MB block size results are captured in this document.

Peak performance with a nine-client compute cluster was observed with 63 threads (7 threads per client), attaining a sustained initial write speed of approximately 974 MB/sec. These benchmark results show that approximately 970 MB/sec can be sustained for a single Sun Lustre Storage System Standard OSS module on initial writes. To achieve higher aggregate performance, multiple Standard OSS modules can be deployed within a single cluster.

IOzone Benchmark Output

Table 11 includes IOzone benchmark output from a test run using nine clients, each running seven threads.

Table 11. Example Standard OSS IOzone performance test output.

    Iozone: Performance Test of File I/O
            Version $Revision: 3.315 $
            Compiled for 64 bit mode.
            Build: linux-AMD64

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

    Run began: Tue Apr  7 00:36:29 2009

    Excel chart generation enabled
    Include fsync in write timing
    Machine = Linux mds02 2.6.18-92.1.10.el5_lustre.1.6.6.20081218100335smp #1 SMP
    Excel chart generation enabled
    File size set to 16777216 KB
    Record Size 1024 KB
    Network distribution mode enabled.
    Command line used: /opt/iozone3_315/src/current/iozone -Rb
      /root/iozone.1239086189.312636000/1M.63.oss-9.16g.7threads.test.xls
      -e -M -R -t 63 -s 16g -r 1M -i0 -i1 -+m /opt/snowbird/tools/client_list9c7tg
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 63 processes
    Each process writes a 16777216 Kbyte file in 1024 Kbyte records

    Test running:
    Children see throughput for 63 initial writers  = 974560.83 KB/sec
            Min throughput per process              =  13552.75 KB/sec
            Max throughput per process              =  26112.33 KB/sec
            Avg throughput per process              =  15469.22 KB/sec
            Min xfer                                = 8699904.00 KB

    Test running:
    Children see throughput for 63 rewriters        = 750776.55 KB/sec
            Min throughput per process              =  10417.62 KB/sec
            Max throughput per process              =  20125.73 KB/sec
            Avg throughput per process              =  11917.09 KB/sec
            Min xfer                                = 8692736.00 KB

    Test running:
    Children see throughput for 63 readers          = 903548.76 KB/sec
            Min throughput per process              =   7314.68 KB/sec
            Max throughput per process              =  18577.52 KB/sec
            Avg throughput per process              =  14342.04 KB/sec
            Min xfer                                = 6610944.00 KB

    Test running:
    Children see throughput for 63 re-readers       = 896220.91 KB/sec
            Min throughput per process              =   7359.98 KB/sec
            Max throughput per process              =  18619.98 KB/sec
            Avg throughput per process              =  14225.73 KB/sec
            Min xfer                                = 6632448.00 KB

    Test cleanup:
    Throughput report (record size = 1024 Kbytes, output in Kbytes/sec):
            Initial write   974560.83
            Rewrite         750776.55
            Read            903548.76
            Re-read         896220.91

    iozone test complete.

Proven Scalability

Based on the architecture of the Lustre file system, it has been shown that adding additional OSSs scales performance and capacity in a near-linear fashion. Sun can reference a number of installations that have achieved these scalability results.

CLUMEQ Supercomputing Consortium

CLUMEQ is a supercomputing consortium of universities in the province of Quebec, Canada. It includes McGill University, Université Laval, and all nine components of the Université du Québec network. CLUMEQ supports scientific research in disciplines such as climate and ecosystems modeling, high energy particle physics, cosmology, nanomaterials, supramolecular modeling, bioinformatics, biophotonics, fluid dynamics, data mining, and intelligent systems.

In March 2009, Sun began to assist CLUMEQ with the implementation of an end-to-end solution based on the Sun Constellation System. The Sun Constellation System, the world's first open HPC architecture, provides an open petascale computing environment that combines ultra-dense high-performance computing, networking, storage, and software into one system. The system features 960 nodes with a total of 7680 processor cores, and is built with ten Sun Blade 6048 Modular Systems fully loaded with Sun Blade X6275 Server Modules. Connecting the system is a Quad Data Rate infrastructure based on the Sun Datacenter InfiniBand Switch 648, Sun Datacenter InfiniBand Switch 36, QDR Network Express Modules, and QDR HCAs.

For their storage needs, CLUMEQ has deployed a Sun Lustre Storage System with an HA MDS module and 9 HA OSS modules (18 total OSSs), for a total raw capacity of over three-quarters of a petabyte. For acceptance testing, the IOR benchmark was used to measure sustained throughput for the Sun Lustre Storage System, and near-linear scaling with the addition of benchmark clients was demonstrated (see Figure 13). Maximum sustained performance against the 9 HA OSS configuration was measured at 18.26 GB/sec.

Figure 13. Maximum sustained file system write performance. (Chart: write speed in GB/sec versus number of writing clients at 72, 144, and 216 clients.)

Texas Advanced Computing Center (TACC)

An additional scalability reference is the Texas Advanced Computing Center's Ranger system (see http://www.tacc.utexas.edu/resources/hpcsystems/#ranger), where Sun has demonstrated near-linear scalability in a configuration encompassing 50 similar previous-generation Sun OSS modules with a single HA MDS module supporting a file system that was 1.2 petabytes in size.

Figure 14 shows the Lustre file system throughput on the $SCRATCH file system on TACC's Ranger system. TACC has observed throughput rates of 46 GB/sec on this file system, which uses 50 Sun Fire X4500 servers as OSSs, each of which is capable of more than 900 MB/sec of throughput. In addition, TACC has observed a single application driving 35 GB/sec of throughput to the Lustre file system.

Figure 14. Lustre file system performance at TACC. (Charts: $SCRATCH file system throughput and $SCRATCH application performance, plotting write speed in GB/sec against the number of writing clients from 10 to 10000 on a logarithmic scale, for stripe counts of 1 and 4.)

Summary

The Sun Lustre Storage System simplifies architecting and deploying a Lustre storage solution while enabling consistent and predictable results. The Sun Lustre Storage System features the Sun Fire servers and Sun Open Storage products that are best suited for typical Lustre file system deployments. Through iterative testing and benchmarking, Sun has determined the best configurations and published the results from performance testing. With those results, users can make the proper sizing decisions to achieve the overall performance consistent with their design decisions. Testing shows that a configuration with two HA OSS modules, tested using QDR InfiniBand, can sustain an I/O transfer rate of up to 4.7 GB/sec on writes and 6.0 GB/sec on reads. Performance demonstrated on a Standard OSS module, tested using DDR InfiniBand, is approximately 970 MB/sec on initial writes.


While originally developed to drive the world's largest compute clusters, the Lustre file system has evolved into a valuable option even for moderate-sized clusters. Through this evolution, ease-of-use, management, and configuration features and tools have been added to the product. The Sun Lustre Storage System takes this further by providing tools that automate storage configuration and layout in accordance with best practices developed in this effort and verified through benchmarking. Deploying a verified software stack that includes all the necessary tools simplifies the process of building a compute cluster using the Lustre file system. With future software releases and new hardware platforms, the performance and ease of use of the Sun Lustre Storage System are expected to continue to improve over time.

About the Authors

Sean Cochrane is a Principal Field Engineer in Sun's Global HPC Sales organization and is the primary architect for the Sun Lustre Storage System at Sun Microsystems. Sean has been at Sun since 1997, when he joined Sun from the University of Utah, where he helped to establish the statewide network for education. Sean holds a Bachelor of Science degree in Computer Science and has over 25 years of experience in the computer industry.

Ken Kutzer is the Marketing Team Lead, Storage for High Performance Computing, and works within the Systems Marketing Group at Sun. He is responsible for driving the strategy and efforts to help raise customer and market awareness of Sun's innovative storage products and technologies for High Performance Computing. Ken holds a Bachelor of Science degree in Electrical Engineering and has over 15 years in the computer and storage industries.

Larry McIntosh is a Principal Systems Engineer at Sun Microsystems and works within the Global Engineering Solutions Group. He is responsible for designing and implementing high performance computing technologies at Sun's largest customers. Larry has 35 years of experience in the computer, communications, and storage industries and has been a software developer and consultant in the commercial, government, education, and research sectors, as well as a computer science college professor. Larry's recent work has included the deployment of the Ranger system servicing the National Science Foundation and researchers at the Texas Advanced Computing Center (TACC) in Austin, Texas.

Acknowledgements

The authors would like to recognize the following individuals for their contributions to this article:
• Dr. Marc Parizeau, Deputy Director of CLUMEQ and Professor of Computer Engineering at Laval University
• Dr. Tommy Minyard, Associate Director, Advanced Computing Systems, Texas Advanced Computing Center
• Dr. Karl Schulz, Associate Director, High Performance Computing, Texas Advanced Computing Center


• Michael Berg, Atul Vidwansa, Greg Drobish, Joey Jablonski, Kelly Lewandowski, Craig Flaskerud, John Fragalla, Warner Hersey, Shuichi Ihara, Bryon Neitzel and Peter Bojanic at Sun Microsystems

References

Table 12. References for more information.

Web Sites
Lustre File System: sun.com/software/products/lustre
Sun Lustre Storage System: sun.com/servers/hpc/storagecluster
Sun Storage & Archive Solution for HPC: sun.com/storage/hpc/
IOzone benchmark utility: www.iozone.org/
Video from SC08 by Sun Microsystems showing near-linear scaling of the Lustre file system: link.brightcove.com/services/player/bcpid1640183659?bctid=8899392001

Ordering Sun Documents

The SunDocs(SM) program provides more than 250 manuals from Sun Microsystems, Inc. If you live in the United States, Canada, Europe, or Japan, you can purchase documentation sets or individual manuals through this program.

Accessing Sun Documentation Online

The docs.sun.com Web site enables you to access Sun technical documentation online. You can browse the docs.sun.com archive or search for a specific book title or subject. The URL is http://docs.sun.com. To reference Sun BluePrints Online articles, visit the Sun BluePrints Online Web site at: http://www.sun.com/blueprints/online.html.

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com © 2009 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Lustre, StorageTek, Sun Blade, and Sun Fire are trademarks or registered trademarks of Sun Microsystems, Inc. or its subsidiaries in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the US and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Information subject to change without notice.  Printed in USA  11/09