Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Raghunath Raja Chandrasekar, M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University 2014

Dissertation Committee:
Dr. Dhabaleswar K. (DK) Panda, Advisor
Dr. Ponnuswamy Sadayappan
Dr. Radu Teodorescu
Dr. Kathryn Mohror

© Copyright by

Raghunath Raja Chandrasekar 2014

Abstract

In high-performance computing (HPC), tightly-coupled, parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of the hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to experience faults, with Mean Time Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors. A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally-shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these periodically saved snapshots. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with a single checkpoint taking on the order of tens of minutes to hours to write. On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources. On supercomputing systems geared for Exascale, parallel applications will have a wider range of storage media to choose from: on-chip/off-chip caches, node-level RAM, Non-Volatile Memory (NVM), distributed RAM, flash storage (SSDs), HDDs, parallel file systems, and archival storage. Current-generation checkpointing middleware and frameworks are oblivious to this storage hierarchy, in which each medium has unique performance and data-persistence characteristics.

This thesis proposes a cross-layer framework that leverages this hierarchy in storage media to design scalable and low-overhead fault-tolerance mechanisms, which are inherently I/O-bound. The key components of the framework include: CRUISE, a highly-scalable in-memory checkpointing system that leverages both volatile and Non-Volatile Memory technologies; Stage-FS, a light-weight data-staging system that leverages burst-buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file system agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; Power-Check, an energy-efficient checkpointing framework for transparent and application-aware HPC checkpointing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures. The components of this framework have been evaluated up to a scale of three million compute processes, have reduced the checkpointing overhead on scientific applications by a factor of 30, and have reduced the amount of energy consumed by checkpointing systems by up to 48%.


Dedicated to Amma, Appa, and Aarthy.


Acknowledgments

This work would not have been possible if not for the support of several people throughout the course of my doctoral study. I would like to thank and acknowledge the following people in particular:

My doctoral advisor, Dr. Dhabaleswar K. Panda, for his continual guidance and support. In addition to guiding my research itself, Dr. Panda has constantly encouraged me and helped me wade through several hurdles during my PhD, and for that, I am highly grateful to him. His devotion to research and his high standards are nothing short of a strong inspiration. Choosing to become his doctoral advisee was one of the best decisions I have taken in my life.

My dissertation committee members, Dr. P. Sadayappan and Dr. R. Teodorescu, for agreeing to serve on the committee, and for their valuable feedback during and after the candidacy proposal that helped improve the dissertation.

My mentors at the Lawrence Livermore National Laboratory, Adam Moody and Dr. Kathryn Mohror, who guided my research during and after my internships at the lab. Adam, both a mentor and a friend, is someone I truly look up to. His desire to learn and innovate is infectious, and I am thankful to have collaborated with him. Kathryn is a real pleasure to work with and has always encouraged me and patted me on my back for my achievements. I am thankful to her for agreeing to serve on my dissertation committee as an external member.


All the present and past members of NOWLAB who have played an integral part in the evolution of my PhD thesis. I am thankful to the research personnel and staff — Sayantan, Khaled, Xavier, Hari, Xiaoyi, Hao, Deva, Jonathan, and Mark; and all the students — Xiangyong, Miao, Sreeram, Krishna, Jithin, Siddhesh, Nishanth, Wasi, Nusrat, Akshay, and all the other junior students. The learning never stops when you are around these people. Nor does the desire to drink infinite cups of coffee.

All my friends who stepped in as family during my formative years at OSU — Aarthi, Deepak, Isha, Kathik, Srinivas, Vignesh, Zoya, and many others. My friends from back home who have always rooted for me, particularly — Anoop, Avinash, Bharathan, Karan, Karthik, Nachi, Raagini, Ranjani, Ravi, Rupak, Shravanthi, Sethu, Sountheriya, Sunil, and Venkat, amongst many others.

And lastly, my family, which has gone above and beyond in supporting my journey towards a PhD. My sister and brother-in-law, for taking pride in what I do. My dad, for his wisdom and affection, and my mom for her true love. Being able to make my parents proud has been my biggest achievement in life so far. My fiancée, Aarthy, for her patience and resilience over the years. This dissertation would not have been possible without her never-ending support.


Vita

February 11, 1988 . . . . . Born - Chennai, India
2005-2009 . . . . . B. Tech. Information Technology, Anna University, Chennai, Tamil Nadu, India
Summer 2012 . . . . . Summer Research Scholar, Lawrence Livermore National Laboratory, Livermore, California, USA
Summer 2013 . . . . . Summer Research Scholar, Lawrence Livermore National Laboratory, Livermore, California, USA
2009-Present . . . . . Graduate Research Associate, The Ohio State University, Columbus, Ohio, USA

Publications

Research Publications

1. R. Rajachandrasekar, A. Venkatesh, K. Hamidouche, and D. K. Panda, Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters (Under Review), IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2015

2. R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, Md. Rahman and D. K. Panda, MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture, ACM Symposium on High-Performance Parallel and Distributed Computing, June 2014

3. R. Rajachandrasekar, A. Moody, K. Mohror and D. K. Panda, A 1 PB/s File System to Checkpoint Three Million MPI Tasks, ACM Symposium on High-Performance Parallel and Distributed Computing, June 2013

4. R. Rajachandrasekar, A. Moody, K. Mohror and D. K. Panda, Thinking Beyond the RAM Disk for In-Memory Checkpointing of HPC Applications, OSU Tech. Report (OSU-CISRC-1/13-TR02), Jan. 2013

5. R. Rajachandrasekar, J. Jaswani, H. Subramoni and D. K. Panda, Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework, IEEE Cluster, Sept. 2012

6. R. Rajachandrasekar, X. Besseron and D. K. Panda, Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI, Workshop on System Management Techniques, Processes, and Services (SMTPS), held in conjunction with IPDPS '12, May 2012

7. R. Rajachandrasekar, X. Ouyang, X. Besseron, V. Meshram and D. K. Panda, Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?, Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids (Resilience), held in conjunction with Euro-Par, Aug. 2011

8. R. Rajachandrasekar, J. Perkins, K. Hamidouche, M. Arnold and D. K. Panda, Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface, ACM EuroMPI/ASIA, Sept. 2014

9. X. Ouyang, R. Rajachandrasekar, X. Besseron, H. Wang, J. Huang and D. K. Panda, CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart, Int'l Conference on Parallel Processing (ICPP), Sept. 2011

10. X. Ouyang, R. Rajachandrasekar, X. Besseron and D. K. Panda, High Performance Pipelined Process Migration with RDMA, IEEE Int'l Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2011

11. X. Ouyang, N. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang and D. K. Panda, SSD Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, Int'l Conference on Parallel Processing (ICPP), Sept. 2012

12. A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS), May 2014

13. X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration for MPI over InfiniBand, IEEE Cluster, Sept. 2010


14. M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar and D. K. Panda, High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA (Under Review), IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS), May 2015

15. N. S. Islam, X. Lu, M. W. Rahman, R. Rajachandrasekar and D. K. Panda, In-Memory I/O and Replication for HDFS with Memcached: Early Experiences, IEEE Int'l Conference on Big Data, Oct. 2014

16. M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Int'l European Conference on Parallel Processing (Euro-Par), Aug. 2014

17. V. Meshram, X. Besseron, X. Ouyang, R. Rajachandrasekar, R. P. Darbha and D. K. Panda, Can a Decentralized Metadata Service Layer Benefit Parallel Filesystems?, Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS), held in conjunction with Cluster '11, Sept. 2011

18. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Int'l Conference on Supercomputing (SC '12), Nov. 2012

Fields of Study

Major Field: Computer Science and Engineering


Table of Contents

Abstract . . . . ii
Dedication . . . . iv
Acknowledgments . . . . v
Vita . . . . vii
List of Tables . . . . xiv
List of Figures . . . . xv

1. Introduction . . . . 1
   1.1 Problem Statement . . . . 5
   1.2 Research Framework . . . . 6
   1.3 Organization of this Thesis . . . . 11

2. Background . . . . 12
   2.1 Checkpoint-Restart Mechanisms in HPC . . . . 12
   2.2 Checkpoint-Restart I/O Characteristics . . . . 13
   2.3 InfiniBand Interconnect . . . . 14
   2.4 Quality-of-Service Support in InfiniBand . . . . 16
   2.5 Filesystem in User-Space (FUSE) . . . . 18
   2.6 Berkeley Lab Checkpoint-Restart (BLCR) . . . . 19
   2.7 Distributed MultiThreaded CheckPointing (DMTCP) . . . . 20
   2.8 Scalable Checkpoint-Restart Framework (SCR) . . . . 20
   2.9 The Xeon Phi Architecture . . . . 21
   2.10 The Fault-Tolerance Backplane (FTB) . . . . 23
   2.11 Running-Average Power Limit (RAPL) . . . . 24
   2.12 Intelligent Platform Management Interface (IPMI) . . . . 25

3. Stage-FS: RDMA-Based Hierarchical Checkpoint Data-Staging . . . . 26
   3.1 Detailed Design . . . . 27
   3.2 Performance Evaluation . . . . 31
       3.2.1 Experimental Testbed . . . . 31
       3.2.2 Profiling of a Stand-Alone Staging Server . . . . 32
       3.2.3 Scalability Analysis . . . . 33
       3.2.4 Evaluation with Applications . . . . 34
   3.3 Related Work . . . . 36
   3.4 Summary . . . . 37

4. Stage-QoS: Network Quality-of-Service Aware Checkpointing . . . . 39
   4.1 Design Goals and Alternatives . . . . 40
   4.2 Detailed Design . . . . 41
       4.2.1 Configuring the IB Subnet Fabric . . . . 41
       4.2.2 Enabling Quality-of-Service in the Filesystem . . . . 43
   4.3 Experimental Evaluation . . . . 45
       4.3.1 Micro-Benchmark Evaluation . . . . 45
       4.3.2 Impact of SL Weights . . . . 48
       4.3.3 Impact on Applications . . . . 50
   4.4 Related Work . . . . 51
   4.5 Summary . . . . 52

5. CRUISE: Efficient In-Memory Checkpoint Data Management . . . . 54
   5.1 Design Alternatives . . . . 55
       5.1.1 Intercepting Application I/O . . . . 55
       5.1.2 In-Memory File Storage . . . . 58
   5.2 Architecture and Design . . . . 61
       5.2.1 The Role of CRUISE . . . . 61
       5.2.2 Data Structures . . . . 63
       5.2.3 Spill Over Capability . . . . 65
       5.2.4 Remote Direct Memory Access . . . . 66
       5.2.5 Simplifications . . . . 69
       5.2.6 Lock Management . . . . 69
   5.3 Implementation of CRUISE . . . . 70
       5.3.1 Initializing the File System . . . . 71
       5.3.2 write() Operation . . . . 72
   5.4 Failure Model with SCR . . . . 74
   5.5 Experimental Evaluation . . . . 75
       5.5.1 Experimentation Environment . . . . 75
       5.5.2 Microbenchmark Evaluation . . . . 76
       5.5.3 Intra-Node Scalability . . . . 80
       5.5.4 Large-Scale Evaluation . . . . 84
   5.6 Related Work . . . . 85
   5.7 Summary . . . . 86

6. MIC-Check: Scalable Checkpointing for Heterogeneous HPC Systems . . . . 87
   6.1 I/O Limitations on the Xeon Phi Architecture . . . . 87
       6.1.1 Intrinsic Limitations . . . . 88
       6.1.2 Extrinsic Limitations . . . . 91
   6.2 Architecture and Design . . . . 92
   6.3 Implementation . . . . 95
       6.3.1 MIC-Check Proxy (MCP) and I/O Interception Library (MCI) . . . . 95
       6.3.2 MIC-Check MVAPICH . . . . 98
   6.4 Results . . . . 99
       6.4.1 Experimental Setup . . . . 99
       6.4.2 Impact of the pipeline chunk size . . . . 100
       6.4.3 Intra-node scalability . . . . 101
       6.4.4 Inter-node scalability . . . . 103
       6.4.5 Resource Utilization . . . . 104
       6.4.6 Evaluation with Real-World Applications . . . . 105
   6.5 Related Work . . . . 107
   6.6 Summary . . . . 109

7. Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters . . . . 110
   7.1 Architecture of Power-Check . . . . 112
   7.2 Design and Implementation . . . . 115
       7.2.1 Design Scope and Assumptions . . . . 115
       7.2.2 libpowercheck: Measuring Energy and Actuating Power . . . . 116
       7.2.3 Enhanced Checkpointing Libraries . . . . 119
       7.2.4 I/O Funneling Agent . . . . 121
   7.3 Experimental Evaluation . . . . 124
       7.3.1 Understanding the Energy-Usage of Checkpointing . . . . 124
       7.3.2 Evaluating Power-Check . . . . 127
   7.4 Related Work . . . . 132

8. FTB-IPMI: Low-Overhead Fault Prediction . . . . 136
   8.1 Design and Implementation . . . . 137
       8.1.1 Querying IPMI . . . . 139
       8.1.2 Sensor State Analysis . . . . 140
       8.1.3 FTB Event Publication . . . . 142
       8.1.4 Rule-Based Prediction in MVAPICH2 . . . . 143
       8.1.5 Applications of FTB-IPMI . . . . 144
   8.2 Experimental Evaluation . . . . 145
       8.2.1 Resource Utilization . . . . 145
       8.2.2 Scalability . . . . 148
       8.2.3 Proactive Process Migration in MVAPICH2 . . . . 149
   8.3 Related Work . . . . 152
   8.4 Summary . . . . 153

9. Conclusions and Contributions . . . . 154
   9.1 Impact on the HPC Community . . . . 158
   9.2 Open-source contributions to the community . . . . 159

Bibliography . . . . 161

List of Tables

3.1 Size of the checkpoint files . . . . 34
5.1 I/O throughput for the storage hierarchy on the OSU-RI system described in Section 5.5.1 . . . . 57
5.2 Impact of Non-Uniform Memory Access on Bandwidth (GB/s) . . . . 77
5.3 CRUISE throughput (MB/s) with Spill-over . . . . 80
6.1 Peak bandwidth of different channels on Xeon Phi systems (Path# indicated in Fig 6.3) . . . . 91
6.2 MIC-Check support for various execution and checkpointing modes . . . . 93
8.1 FTB-IPMI Sensor Readings from a single compute-node on Cluster A (See Section 8.2) . . . . 141
8.2 FTB-IPMI rules to generate FTB event . . . . 142
8.3 FTB Events published by FTB-IPMI . . . . 142

List of Figures

1.1 Disparity in the performance capabilities of storage media . . . . 4
1.2 Proposed research framework . . . . 7
2.1 Per-VL Buffering and Flow control . . . . 16
2.2 The FUSE Architecture (Courtesy: [7]) . . . . 18
2.3 FTB Architecture (Courtesy: [67]) . . . . 23
3.1 Comparison between the direct checkpoint and the checkpoint staging approaches . . . . 28
3.2 Design of Hierarchical Data Staging Framework . . . . 29
3.3 Throughput of a single staging server with varying number of clients and processes per client (Higher is better) . . . . 32
3.4 Throughput scalability analysis, with increasing number of Staging groups and 8 clients per group (Higher is better) . . . . 33
3.5 Comparison of the checkpoint times between the proposed staging approach and using the classic approach (Lower is Better) . . . . 35
4.1 Design Alternatives . . . . 40
4.2 OpenSM Configuration File . . . . 42
4.3 Impact of I/O noise on MPI Pt-to-Pt Latency (Lower is better) . . . . 47
4.4 Impact of I/O noise on MPI Pt-to-Pt Bandwidth (Higher is better) . . . . 47
4.5 Impact of I/O noise on MPI AlltoAll Collective Latency (Lower is better) . . . . 48
4.6 Impact of Service Level Credit Weights on Pt-to-Pt operations (in the presence of I/O noise) . . . . 49
4.7 Impact of Service Level Credit Weights on Collective operations (in the presence of I/O noise) . . . . 49
4.8 Impact of I/O Noise on End-Applications . . . . 50
5.1 Architecture of CRUISE . . . . 61
5.2 Data Layout of CRUISE on the Persistent Memory Block . . . . 63
5.3 Protocol to RDMA files out of CRUISE . . . . 67
5.4 Pseudo-code for open() function wrapper . . . . 71
5.5 Pseudo-code for write() function wrapper . . . . 73
5.6 Impact of Chunk Sizes . . . . 77
5.7 Intra-Node Aggregate Bandwidth Scalability . . . . 80
5.8 Aggregate Bandwidth Scalability of CRUISE . . . . 83
6.1 Disparity between the Parallel File System throughput as seen by processes on the host and MIC . . . . 88
6.2 VFS architecture on the Xeon Phi . . . . 89
6.3 Communication paths available to Xeon Phi systems . . . . 90
6.4 System-level architecture . . . . 92
6.5 Implementation of MIC-Check . . . . 96
6.6 Connections established by MIC-aware MVAPICH MPI library . . . . 98
6.7 Impact of pipelining chunk sizes on I/O throughput . . . . 100
6.8 Intra-node scaling . . . . 102
6.9 Inter-Node Aggregate Bandwidth Scalability . . . . 103
6.10 System resource utilization of MCP . . . . 104
6.11 Evaluation with applications . . . . 106
7.1 CPU utilization during a checkpoint . . . . 110
7.2 Design of the I/O Funneling Layer . . . . 112
7.3 Execution workflow during a checkpoint . . . . 113
7.4 Design of the I/O Funneling Layer . . . . 122
7.5 Impact of storage media on the energy-footprint . . . . 125
7.6 Impact of write-patterns on energy-footprint . . . . 126
7.7 CPU Utilization during checkpointing . . . . 129
7.8 Percentage of CPU time spent waiting for I/O requests . . . . 129
7.9 Application-level evaluation of Power-Check (DMTCP) . . . . 130
7.10 Application-level evaluation of Power-Check (BLCR) . . . . 130
8.1 FTB-IPMI Architecture . . . . 137
8.2 FTB-IPMI Work-flow . . . . 138
8.3 Real-Time CPU Usage . . . . 146
8.4 Average FTB-IPMI CPU Usage . . . . 146
8.5 Scalability with Multiple Threads . . . . 149
8.6 Execution times for Single Iteration . . . . 150
8.7 Prediction-Triggered Preemptive Fault-Tolerance in MVAPICH2 . . . . 151

Chapter 1: Introduction

Modern High Performance Computing (HPC) clusters continue to grow to ever-increasing proportions. These supercomputing systems allow scientists and engineers to tackle grand challenge problems in their respective domains and make significant contributions to their fields. Examples of such problem domains include astrophysics, earthquake analysis, weather prediction, nanoscience modeling, multi-scale and multi-physics modeling, biological computations, computational fluid dynamics, etc. However, performance gains that could be obtained in traditional single/dual-core processors by employing schemes such as frequency scaling and instruction pipelining have greatly diminished due to problems in power consumption and heat dissipation, and fundamental limitations in exploiting Instruction Level Parallelism. Processor speeds no longer double every 18-24 months. As a result, HPC systems no longer rely on the speed of a single processing element to achieve the desired performance. They instead tend to exploit the parallelism available in a massive number of moderately fast distributed processing elements which are connected together using a high-performance network interconnect. The list of the Top500 high-performance machines in the world [23] clearly indicates this trend. There has been an order-of-magnitude increase in the number of cores in the last five years. The “Tianhe-2” [94], which is currently the world's fastest supercomputer, has over 3 million cores. The fastest supercomputer five years ago, “Roadrunner”, had a little over 100 thousand cores. Multi-core processors and the availability of commodity high-speed interconnects such as InfiniBand [22] have resulted in the trend of using such large clusters. As we usher in the era of peta-flop and exa-flop computing, this trend is poised to continue for many years to come.

Many-core architectures like GPUs and MIC-based coprocessors have caused systems to become heterogeneous by introducing multiple levels of parallelism and varying computation and data-movement costs at each level. Accelerators (such as NVIDIA GPUs) and coprocessors (such as the Intel Many Integrated Core/Xeon Phi) are fueling the growth of next-generation ultra-scale systems that have high compute density and high performance per watt. This is again evident in the recent Top500 list [23] (November 2014), with 28 of the top 100 systems using either accelerators or coprocessors to boost their compute power. This includes #1 Tianhe-2, #2 Titan and #7 Stampede from the top 10.

However, as these clusters scale out, the likelihood of system failures increases as well. Although each component has only a very small chance of failure, the combination of all components has a much higher chance of failing, or providing degraded service. This will have a severe detrimental impact on the Mean-Time-Between-Failures (MTBF). The MTBF for typical HEC installations is currently estimated to be on the order of hours or days due to hardware breakdowns and soft errors [91, 120, 121, 55, 116]. This will continue to degrade as system sizes become larger. Many real-world applications that study Molecular Dynamics [3, 4, 103], Finite Element Analysis [14, 101], etc. take anywhere from a few hours to a couple of days to complete their computation. Given that the MTBF of such modern clusters is smaller than the average running time of these applications, multiple failures can be expected during the lifetime of the application.

In order to continue computing past the MTBF of the system, either the system software or the application itself needs to provide fault-tolerance. Thus, it is highly desirable that next-generation system architectures and software environments be designed with sophisticated fault-tolerant and fault-resilient capabilities. Many of the currently available system-software implementations that provide fault-tolerance are based on fully coordinated checkpointing [48, 133, 106, 81, 128, 29, 115, 43, 83, 49, 72, 141, 65, 34]. When a fault occurs, applications either roll back to a previously saved checkpoint and restart their execution, or migrate a subset of the parallel processes that are on a failing node to a healthy node. Although parallel file systems are optimized for concurrent access by large-scale applications, checkpointing and process-snapshotting overheads can still dominate application run times, where a single checkpoint can take on the order of tens of minutes [76, 112]. A study by Los Alamos National Laboratory shows that about 60% of an HPC application's wall-clock time was spent in checkpoint/restart alone [104]. Similarly, a study from Sandia National Laboratories predicts that a 168-hour job on 100,000 nodes with a node MTBF of 5 years will spend only 35% of its time in compute work and the rest of its time in checkpointing activities [61]. On current HPC systems, checkpointing utilizes 75-80% of the I/O traffic [102, 25]. On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources [55, 93].

Message logging protocols [130, 78, 32, 86, 42, 83, 59, 41, 40, 111] offer another viable solution to fault-tolerance, notably because they allow the computation to make progress even with a low MTBF. Log-based protocols (or message logging protocols) assume that the state of the system evolves according to non-deterministic events. These events (usually the application messages) are logged in order to roll back the failed processes from a previously saved checkpoint [56] and restore the pre-failure state by replaying these non-deterministic events. Although message logging protocols have been well studied, these schemes incur overhead for message replication and also require a significant amount of memory per node to store the logs.
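To put the failure rates quoted above in perspective, a back-of-the-envelope estimate (assuming the 100,000-node system and 5-year node MTBF cited from the Sandia study, and independent node failures) shows that the system-wide MTBF shrinks to well under an hour:

\[
\mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}
= \frac{5 \times 8760\ \mathrm{hours}}{100{,}000}
\approx 0.44\ \mathrm{hours} \approx 26\ \mathrm{minutes},
\]

which is far shorter than the multi-hour run times of the applications discussed above, and is why a 168-hour job can spend most of its wall-clock time in checkpointing and recovery rather than computation.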

Figure 1.1: Disparity in the performance capabilities of storage media (I/O throughput, in MB/s, of Memory, RAMDisk, SSD, ParallelFS, HDD, and NFS)

Hence, it is clear that persistence is a key ingredient of any protocol or mechanism that is designed to handle failures. Rollback-recovery schemes store the state of application processes into snapshot files, message logging schemes buffer communication data that are sent to other processes, redundancy-based recovery schemes persist duplicate copies of data on multiple nodes, process-migration schemes essentially move the snapshot data of processes from one node to another, and parity-based data-recovery schemes store additional parity information that aids in the reconstruction of lost or damaged data. Clearly, any fault-tolerance mechanism has to persist either some amount of metadata or a large amount of actual application data itself, in a globally visible or accessible data store that is guaranteed to retain the data until it is explicitly deleted. Most of the popular parallel file systems satisfy these requirements, but have several shortcomings, including contention at scale, centralized points of failure, etc.

With emerging dense-node architectures, parallel applications will have a wider range of storage media to choose from (on-chip/off-chip caches, node-level RAM, NVM/SCM, distributed RAM, flash storage/SSDs, HDDs, parallel file systems, and archival storage) based on the type of data that is being stored and the access frequency. Each of these levels in the storage hierarchy has distinct performance and functionality traits. Figure 1.1 illustrates this with a comparison of the I/O throughput offered at each level. Clearly, a one-size-fits-all solution will not be able to fully leverage the capabilities of each level in this hierarchy. Cross-layer solutions that consider multiple factors need to be developed to efficiently handle such a diverse I/O and storage environment. These factors include: (a) the recency with which an application needs to access the data that was persisted, (b) storage-medium locality, (c) the low-level I/O primitives supported by a given medium, (d) the inherent characteristics of the medium, (e) knowledge of any memory/energy caps that have been enforced by the runtime, and (f) inter-process data-sharing requirements imposed by the application.

1.1 Problem Statement

This thesis addresses the challenges discussed above, with a particular focus on the I/O demands of checkpointing-based fault-tolerance mechanisms. Specifically, the thesis addresses several pressing research questions, which are otherwise unanswered in the literature:

1. Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?

2. How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic?

3. How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?

4. How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?

5. Are there opportunities to increase the energy-efficiency of supercomputers, without hurting performance, by designing smarter checkpointing and I/O systems?

6. Can low-overhead, timely failure-prediction mechanisms be designed for proactive failure avoidance and recovery?

1.2 Research Framework

Figure 1.2 succinctly illustrates the research framework that this thesis proposes. The following discussion highlights how the framework addresses each of the broad challenges that were enumerated in the problem statement.

Figure 1.2: Proposed research framework

1. Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?

With the rapid advances in technology, many clusters are being built with high-performance commercial components such as high-speed, low-latency networks and advanced storage devices such as Solid State Drives (SSDs). Leadership-class supercomputers are also being provisioned with off-node, high-throughput Burst Buffers. These advanced technologies provide an opportunity to rethink existing solutions to tackle the I/O challenges imposed by Checkpoint-Restart mechanisms. We have designed a hierarchical data-staging architecture, namely Stage-FS, that uses a dedicated set of staging server nodes to offload checkpoint I/O operations from the compute nodes, thereby giving an asynchronous checkpointing capability to applications and MPI libraries. This architecture can leverage the in-memory data-management capabilities of CRUISE to make better use of the deep storage hierarchy.


2. How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic?

This thesis specifically addresses the problem of network contention caused by the simultaneous sharing of network resources by parallel applications and file systems. We leverage the Quality-of-Service (QoS) capabilities of the widely used InfiniBand interconnect to enhance our data-staging file system, making it QoS-aware. This is a user-level framework, called Stage-QoS, which is agnostic of the underlying storage and MPI implementation. Using this file system, we demonstrate the isolation of file system traffic from MPI communication traffic, thereby reducing the network contention.

3. How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?

The thesis addresses this challenge with a new in-memory file system called CRUISE: Checkpoint Restart in User SpacE. CRUISE is optimized for use with multilevel checkpointing libraries to provide low-overhead, scalable file storage on systems that provide some form of Non-Volatile Memory (NVM) that persists beyond the life of a process. CRUISE supports a minimal set of POSIX semantics such that its use is transparent when checkpointing HPC applications. An application specifies a bound on memory usage, and if its checkpoint files are too large to fit within this limit, CRUISE stores what it can in memory and then spills over the remaining bytes to slower but larger storage, such as an SSD or the parallel file system. CRUISE also supports Remote Direct Memory Access (RDMA) semantics that allow a remote server process to directly read files from a compute node's memory.

4. How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?

The advent of heterogeneous architectures provisioned with accelerators (such as GPGPUs) and coprocessors (such as Intel Xeon Phis) is enabling the design of increasingly capable supercomputers within reasonable power budgets. Naive checkpointing protocols, which are predominantly I/O-intensive, face severe performance bottlenecks on such systems, particularly on the Xeon Phi architecture, due to several inherent and acquired limitations. Consequently, existing checkpointing frameworks are not capable of serving distributed MPI applications that leverage heterogeneous hardware architectures. We have analyzed the intrinsic and extrinsic issues that limit the I/O performance when checkpointing parallel applications on Xeon Phi clusters, and have designed a novel distributed checkpointing framework, namely MIC-Check, which works around these limitations and provides scalable I/O performance on such heterogeneous systems.

5. Are there opportunities to increase the energy-efficiency of supercomputers, without hurting performance, by designing smarter checkpointing and I/O systems?

While there are innumerable studies in the literature that have analyzed, and optimized for, the performance and scalability of a variety of checkpointing protocols, not much research has been done from an energy or power perspective. Applications running on future exascale machines will be constrained by a power envelope, and it is important not only to understand the behavior of checkpointing systems under such an envelope but also to adopt techniques that can leverage power-capping capabilities exposed by the OS to achieve energy savings without forsaking performance. We address the problem of marginal energy benefits with significant performance degradation, caused by naive application of power capping around checkpointing phases, by proposing a novel power-aware checkpointing framework — Power-Check. Through data-funneling mechanisms and selective processor power-capping, Power-Check makes efficient use of the I/O and CPU subsystems.

6. Can low-overhead, timely failure-prediction mechanisms be designed for proactive failure avoidance and recovery?

Although most of the individual hardware and software components within a cluster implement mechanisms to provide some level of fault tolerance, these components work in isolation, without sharing information about the faults they encounter. This lack of system-wide coordination of fault information has emerged as one of the biggest problems in leadership-class HPC systems. Fault detection and prediction in HPC clusters and cloud-computing systems are increasingly challenging issues that several researchers are trying to address. This thesis proposes a light-weight, multi-threaded service, namely FTB-IPMI, which provides distributed fault-monitoring for HPC applications. Fault predictors and other decision-making engines that rely on distributed failure information can benefit from FTB-IPMI to facilitate proactive fault-tolerance mechanisms such as preemptive job migration.

1.3

Organization of this Thesis

Chapter 2 introduces the necessary background topics and concepts that are relevant to the thesis. Chapter 3 describes the hierarchical data-staging designs that the thesis proposes, while chapter 4 describes the proposed Quality-of-Service mechanisms. Chapter 5 discusses the in-memory checkpoint data management designs. Chapter 6 discusses how the thesis addresses the challenges involved in checkpointing applications running on accelerators and coprocessors. Chapter 7 describes the designs that make the framework proposed in this thesis energy-efficient. Chapter 8 presents the distributed fault-monitoring system that was developed to predict HPC system failures and proactively avoid them.

11

Chapter 2: Background

2.1

Checkpoint-Restart Mechanisms in HPC

The primary use of Checkpoint-Restart techniques in supercomputing systems has been to enhance the fault-tolerance capabilities of the scientific codes that run on it. Secondary uses include job-migration, dynamic task rescheduling, debugging etc. Checkpointing mechanisms can be broadly classified into application-level or transparent system-level ones. In case of the former, applications tend to write execution state snapshots into checkpoint files in between compute-communicate iterations in order to minimize the amount of information that needs to be stored, to avoid a potential domino-effect [125] and to avoid introducing stray/zombie processes into the system during a restart. Every process of the parallel application writes its own checkpoint that facilitates efficient storage and restart. In case of the later, system middleware, such as the MPI library or the job-scheduler, captures the state of a running application without its knowledge using checkpointing tools like the predominantly used The former approach leverages application-specific information to choose the data that gets checkpointed, and hence is more time and space efficient than the latter. However, the latter allows for preemptive fault-tolerance by being able to checkpoint an application at any instant of time in response to failure prediction hints from system components. Checkpoints are typically written to a globally-shared storage medium 12

that is visible to all processes participating in the application. This is often a parallel file system like Lustre, GPFS or PVFS. Several optimizations have been proposed in literature to optimize the I/O costs of writing checkpoints to shared storage by utilizing node-local storage for low-overhead, frequent checkpointing instead, and using the parallel file system only to write a select few checkpoints. Several optimizations have also been proposed to make the checkpointing protocol uncoordinated from the application’s perspective.

2.2

Checkpoint-Restart I/O Characteristics

Checkpoint/restart I/O workloads have certain characteristics that allows us to optimize the design and implementation of the proposed framework. Here, we detail these characteristics of typical application-level checkpoint I/O workloads. A single file per process. Many applications save state in a unique file per process. This checkpointing style is a natural fit for multilevel checkpointing libraries, such as Scalable Checkpoint-Restart (SCR), which imposes an additional constraint that a process may not read files written by another process. As such, there is no need to share files between processes, so storage can be private to each process, which eliminates inter-process consistency and reduces the need for locking. Dense files. In general, POSIX allows for sparse files in which small amounts of data are scattered at distant offsets within the file. For example, a process could create a file, write a byte, and then seek to an offset later in the file to write more data, leaving a hole. File systems may then optimize for this case by tracking the locations and sizes of holes to avoid consuming space on the storage device. However, checkpoints typically consist of a large volume of data that is written sequentially to a file. Thus, it will suffice to support

13

non-sequential writes in without incurring the overhead of tracking these holes to optimize data placement. Write-once-read-rarely files. A checkpoint file is not modified once written, and it is only read during a restart after a failure, which is assumed to be a rare event relative to the number of checkpoints taken. This property makes it feasible to access file data by external methods such as RDMA without concern for file consistency. Once written, the file contents do not change. Temporal nature of checkpoint data. Since an application restarts from its most recent checkpoint, older checkpoints can be discarded as newer checkpoints are written. Hence, we need not track POSIX file timestamps. Globally coordinated operation. Typically, parallel application processes coordinate with each other to ensure that all message passing activity has completed before saving a checkpoint. This coordination means that all processes block until the checkpointing operation is complete, and when a failure occurs, all processes are restarted at the same time. This means that we can clear all locks when the file system is remounted.

2.3

InfiniBand Interconnect

InfiniBand (IB) is an industry standard switched fabric that is designed for interconnecting compute and I/O nodes in High-End Computing clusters [22]. It has emerged as the most-used internal systems interconnect in the Top 500 list of supercomputers. The latest revision of this list reveals that more than 45% of the systems use IB. Communication Model and Transports: Connection establishment and communication in IB is done using a set of primitives called Verbs. IB uses a queue based model. A process can queue up a set of instructions that the hardware executes. This facility is referred to as

14

a Work Queue (WQ). Work queues are always created in pairs, called a Queue Pair (QP), one for send operations and one for receive operations. In general, the send work queue holds instructions that cause data to be transferred from one process’s memory to another process, and the receive work queue holds instructions about where to place data that is received. The completion of Work Queue Entries (WQEs) is reported through Completion Queues (CQ). IB provides both reliable and unreliable transport modes: Reliable Connection (RC), Reliable Datagram (RD), Unreliable Connection (UC) and Unreliable Datagram (UD). However, only RC and UC are required to be supported for IB implementations to be compliant with the standard. The RC transport is used in all the experiments presented in this report. As its name suggests, RC provides reliable communication and ensures ordering of messages between two end-points. Memory involved in communication through IB should be registered with the IB network adapter. Registration is done using an IB verbs call which pins the corresponding pages in memory and returns local and remote registration keys (lkey and rkey). The keys are used in communication operations as described in the following section. Communication Semantics: InfiniBand supports two types of communication semantics in RC transport: channel and memory semantics. In channel semantics, both the sender and receiver have to be involved to transfer data between them. The sender has to post a send work request entry (WQE) which is matched with a receive work request posted by the receiver. The buffer and the lkey are specified with the request. It is to be noted that the receive work request needs to be posted before the data transfer gets initiated at the sender. The receive buffer size should be equal or greater than the send buffer size. This allows for zero-copy transfers but requires strict synchronization between the two processes.

15

Higher level libraries avoid this synchronization by pre-posting receive requests with staging buffers. The data is copied into the actual receive buffer when the receiver posts the receive request. This allows the send request to proceed as soon as it is posted. There exists a trade-off between synchronization costs and additional copies. In memory semantics, Remote Direct Memory Access (RDMA) operations are used instead of send/receive operations. These RDMA operations are one-sided and do not require software involvement at the target. The remote host does not have to issue any work request for the data transfer. The send work request includes address and lkey of the source buffer and address and rkey of the target buffer. Both RDMA Write (write to remote memory location) and RDMA Read (read from remote memory location) are supported in InfiniBand.

2.4

Quality-of-Service Support in InfiniBand

VL0


InfiniBand is an open-standard high-speed interconnect that provides send-receive semantics, and memory-based semantics called Remote Direct Memory Access (RDMA).

16

RDMA operations allow a node to directly access a remote node’s memory contents without using the CPU at the remote side. These operations are transparent at the remote end since they do not involve the remote CPU in the communication. InfiniBand empowers many of todays Top500 Supercomputers. Fundamentally, network QoS mechanisms provide a way to either increase or limit the priority of data flow based on level of importance given to it when configuring a network. The InfiniBand interconnect architecture provides support for QoS using abstractions called Service Level (SL) and Traffic Class (TClass) at the switch-level and router-level, respectively. These abstractions hide-away the underlying components that help achieving QoS - Virtual Lanes (VL), Virtual Lane arbitration and link-level flow control. Virtual Lanes allow multiple independent data flows to share the same physical link, but with exclusive buffering and flow control resources. Figure 2.1 illustrates how the IB HCA and switch ports use the same physical link but provide independent buffers for each VL. A VL arbiter controls the link usage by selecting appropriate data flows based on a VL arbitration table. The IB specification allows implementers to provide up to 16 VLs - 0 to 15, with VL15 reserved for management traffic and VL0-14 reserved for general purpose traffic. A Service Level (SL) is a 4-bit field in the Local Routing Header of a packet that indicates a class of service that will be offered to that packet. Each component in an IB network maintains a Service Level to Virtual Lane (SL2VL) mapping table that is consulted before sending a packet on a given SL. The SL to VL mapping and the priorities set for each VL are configurable and is deployment-specific. The TClass field in a Global Routing Header mimics the functionality of the SL field, but only at the router-level. All these parameters are configured within the InfiniBand subnet manager.

17

2.5

Filesystem in User-Space (FUSE)

User-Level Filesystem Applications

libfuse

glibc

glibc

User Space Kernel Space FUSE

ext3/4

VFS

NFS

...

Figure 2.2: The FUSE Architecture (Courtesy: [7])

FUSE [7] is a software that allows the creation of a virtual file system in the user level. It relies on a kernel module to perform privileged operations at the kernel level, and provides a user-space library to communicate with this kernel module. FUSE is widely used to create file systems that do not really store the data itself but relies on other resources to effectively store the data. The FUSE module is available with all mainstream Linux kernels starting from version 2.4.x. The kernel module works with a user-space library to provide an intuitive interface for implementing a file system with minimal effort and coding. Given that a FUSE file

18

system can be mounted just as any other, it is straight-forward to intercept application I/O operations transparently.

2.6

Berkeley Lab Checkpoint-Restart (BLCR)

BLCR library has been the most pre-dominantly used system-level checkpoint-restart solution for HPC systems. All the major MPI libraries have a tight integration with BLCR, providing a larger coverage for end applications. BLCR is a kernel-level solution which uses a kernel module to access the kernel data structures when saving the state of a process. BLCR saves a wide variety of information about the application’s execution state such as open files, pipes, pinned and protected memory, threads and sessions, signal handling, and so on. It does not, however, have the capability to natively handle parallel or distributed applications. However, it can be extended using checkpoint callback handlers which are implemented in a user-space library that should be linked to the application. The callback functions are invoked when a checkpoint is about to be performed or if the application is restarted from a checkpoint file. Any user level task like synchronizing with other MPI processes or freeing resources before checkpoint or connection establishment after restart can be performed in the callback handler prior to checkpoint or after restart from checkpoint as the case may be. BLCR is portable across various architectures, and supports all Linux kernels starting from 2.4. In order to handle potential stray, out-of-order, or zombie messages at the network level, MPI libraries which use BLCR ensure that the MPI processes are not communicating during the checkpointing protocol. They achieve this by coordinating with each other to reach a quiescent state, after which all pending communication between processes are

19

put on hold, and in-progress communication is flushed. All communication channels are also disconnected to avoid stray messages during a restart. The state of the application is captured in a checkpoint now, after which the communication channels are restored again for the application to make progress.

2.7 Distributed MultiThreaded CheckPointing (DMTCP)

DMTCP [33] is a widely-used user-space checkpointing tool that can transparently save the state of applications without requiring any modifications. DMTCP also supports the checkpointing of parallel MPI applications. It has an extensible architecture that provides a plugin-capability for third-party modules. It offers two important features that make it extensible: event-hooks and wrapper functions. Event hooks allow a plugin to execute additional action at the time of checkpointing, resuming or restarting. The wrapper functionality allows a plugin to wrap any library or system calls using some prologue or epilogue code, or provide an entirely new implementation for them. This extensible architecture enables users to easily build their own extensions to checkpointing, including customizable support for incremental checkpointing, parity-based checkpointing, remote checkpointing, remote restart, and in-memory checkpointing.
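The wrapper-function idea is not specific to DMTCP, and its plugin API is not reproduced here; the sketch below illustrates the general interposition pattern with an LD_PRELOAD-ed shared object that wraps write() and forwards to the real implementation located via dlsym(RTLD_NEXT, ...).

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

/* Generic illustration of a wrapper function: an LD_PRELOAD-ed shared object
 * interposes on write(), runs prologue/epilogue code, and forwards the call
 * to the real implementation found via dlsym(RTLD_NEXT). */
ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    if (real_write == NULL)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    /* prologue: e.g., record the I/O so it can be redirected or replayed */
    ssize_t ret = real_write(fd, buf, count);
    /* epilogue: e.g., update bookkeeping needed for checkpoint consistency */
    return ret;
}
```

Built with something like gcc -shared -fPIC wrapper.c -o wrapper.so -ldl and activated with LD_PRELOAD=./wrapper.so, every write() issued by the target program passes through the wrapper without modifying the application.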

2.8 Scalable Checkpoint-Restart Framework (SCR)

High-performance computing systems are growing more powerful by using more components. As the system mean time between failures correspondingly drops, applications must checkpoint frequently to make progress. However, at scale, the cost in time and bandwidth of checkpointing to a parallel file system becomes prohibitive. A solution to this problem is multilevel checkpointing.

Multilevel checkpointing allows applications to take both frequent, inexpensive checkpoints and less frequent, more resilient checkpoints, resulting in better efficiency and reduced load on the parallel file system. The slowest but most resilient level writes to the parallel file system, which can withstand an entire system failure. Faster checkpointing for the most common failure modes uses node-local storage, such as RAM, flash, or disk, and applies cross-node redundancy schemes. Most failures only disable one or two nodes, and multi-node failures often disable nodes in a predictable pattern. Thus, an application can usually recover from a less resilient checkpoint level, given well-chosen redundancy schemes. To evaluate this approach in a large-scale, production-system context, LLNL researchers developed the Scalable Checkpoint/Restart (SCR) library [93]. It has been used in production since late 2007 using RAM disks and solid-state drives on Linux clusters. SCR's design is based on two key properties. First, a job only needs its most recent checkpoint; as soon as the job writes the next checkpoint, the previous one can be deleted. Second, the majority of failures only disable a small portion of the system, leaving most of the system intact. Overall, with SCR, it has been found that jobs run more efficiently, recover more work upon failure, and reduce load on critical shared resources such as the parallel file system and the network infrastructure. For example, the results obtained in [93] showed that 85% of failures disabled less than 1% of the compute nodes on the clusters in question.
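As a concrete illustration, an application typically drives SCR with a small set of calls around its existing checkpoint code. The sketch below follows the classic SCR API (with SCR_Init/SCR_Finalize assumed to bracket MPI_Init/MPI_Finalize elsewhere in the program) and should be checked against the SCR documentation for the exact interface.

```c
#include <stdio.h>
#include <stddef.h>
#include "scr.h"

/* Sketch of a checkpoint step using SCR: the library decides which
 * checkpoint level (node-local cache, partner copy, parallel file system)
 * the file ends up on, and applies its redundancy scheme on completion. */
void checkpoint_step(int rank, const void *state, size_t bytes)
{
    int need = 0;
    SCR_Need_checkpoint(&need);        /* ask SCR whether it is time to checkpoint */
    if (!need) return;

    SCR_Start_checkpoint();

    char name[256], path[SCR_MAX_FILENAME];
    snprintf(name, sizeof(name), "ckpt.%d", rank);
    SCR_Route_file(name, path);        /* SCR returns the path to actually write to */

    FILE *fp = fopen(path, "w");
    int valid = (fp != NULL) && (fwrite(state, 1, bytes, fp) == bytes);
    if (fp) fclose(fp);

    SCR_Complete_checkpoint(valid);    /* redundancy/flush handled by SCR */
}
```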

2.9 The Xeon Phi Architecture

The Many Integrated Core (MIC) architecture from Intel aims to boost the performance of highly parallel applications at low power requirements. This is achieved through higher degrees of parallelism obtained from many simplified low-power cores. The key advantage of the MIC architecture is its x86 compatibility, which allows the huge ecosystem of existing tools, libraries, and applications to run on it with little or no modification. The current-generation product that conforms to the MIC architecture, code-named Xeon Phi (SE10P), is equipped with 61 processor cores interconnected by a high-performance bi-directional ring. Each core is an in-order, dual-issue core that can fetch and decode instructions from four hardware threads. There are eight memory controllers which can deliver a theoretical bandwidth of up to 352 GB/s. The Xeon Phi is attached as a PCI device, and any communication between the host processor and the MIC incurs the costs associated with moving data over the PCI channel.

The MIC architecture offers three modes of execution for the MPI programming model: offload, coprocessor-only, and symmetric [24]. In the offload mode, all the MPI processes execute either on the host (regular offload) or on the MIC (reverse offload), with the other platform being used as an accelerator. This mode allows programmers to identify sections of code that can benefit from execution on the complementary platform and offload instructions through annotations or pragmas. In the coprocessor-only mode, all MPI processes reside entirely on the MIC, which lends itself to a highly parallel and energy-efficient mode of execution. Lastly, in the symmetric mode, MPI processes execute both on the host and the MIC. In this mode, the MPI job needs to be run in Multiple Program Multiple Data (MPMD) fashion, which allows for maximum utilization of compute resources.


Figure 2.3: FTB Architecture (Courtesy: [67])

2.10 The Fault-Tolerance Backplane (FTB)

The Fault Tolerance Backplane (FTB), developed as part of the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) initiative [1], is an asynchronous messaging backplane that provides communication between the various system software components. FTB provides a common infrastructure for the operating system, middleware, libraries, and applications to exchange information related to hardware and software failures in real time. Different components can subscribe to be notified about one or more events of interest from other components, as well as notify other components about the faults they detect. The FTB physical infrastructure is shown in Figure 2.3. The FTB framework comprises a set of distributed daemons, called FTB Agents, which contain the bulk of the FTB logic and manage most of the book-keeping and event communication throughout the system. FTB Agents connect to each other to form a tree-based topology. If an agent loses connectivity during its lifetime, it can reconnect itself to a new parent in the topology tree, making the tree fault-tolerant and self-healing.

From the software perspective, the FTB software stack consists of three layers, namely, the Client Layer, the Manager Layer, and the Network Layer. The Client Layer consists of a set of APIs for clients to interact with each other. The Manager Layer handles the bookkeeping and decision-making logic: it manages client subscriptions, subscription mechanisms, and event-notification criteria, and is responsible for matching events and routing them to other FTB Agents. This layer exposes a set of APIs for the Client Layer to interact with it; this interface is internal to FTB and is not exposed to external clients. The Network Layer is the lowest layer of the software stack and deals with sending and receiving data. It is transparent to the upper layers and is designed to support multiple communication protocols such as TCP/IP and shared-memory communication.

2.11 Running-Average Power Limit (RAPL)

RAPL (Running Average Power Limit) [51] is a platform-specific power-management interface provided by Intel x86 processors starting with the Sandy Bridge architecture. The interface allows the use of non-architectural MSRs (Model Specific Registers) to limit power usage, monitor energy status, and so on. The registers that report energy usage are updated periodically by the hardware. The RAPL interface can expose multiple domains of power budgeting and status query within each processor socket. Using the RAPL capabilities in modern Intel processors, users can obtain power-meter and power-clamping functionality by reading and writing various Model-Specific Registers (MSRs) using privileged instructions. Intel provides a kernel module that exposes MSRs as device files, /dev/cpu/<cpu-id>/msr, where specific registers can be read using different offsets into the file. The RAPL-specific registers and the offsets themselves are detailed in Intel's Software Developer's Manual [10].
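As an illustration of the MSR-file mechanism, the sketch below reads the package-level energy counter on CPU 0. The register offsets used (0x606 for MSR_RAPL_POWER_UNIT and 0x611 for MSR_PKG_ENERGY_STATUS) are the commonly documented values for Sandy Bridge-class processors and should be verified against the SDM for the processor at hand; root privileges (or suitable msr-device permissions) are assumed.

```c
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read one MSR: the register number is simply the offset into the msr file. */
static uint64_t read_msr(int cpu, uint32_t reg)
{
    char path[64];
    uint64_t val = 0;
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    pread(fd, &val, sizeof(val), reg);
    close(fd);
    return val;
}

int main(void)
{
    uint64_t unit = read_msr(0, 0x606);                          /* MSR_RAPL_POWER_UNIT  */
    double joules_per_count = 1.0 / (1 << ((unit >> 8) & 0x1F)); /* energy-status units   */
    uint64_t raw = read_msr(0, 0x611) & 0xFFFFFFFF;              /* 32-bit wrapping counter */
    printf("package energy: %.3f J\n", raw * joules_per_count);
    return 0;
}
```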

2.12 Intelligent Platform Management Interface (IPMI)

The Intelligent Platform Management Interface (IPMI) [11] defines a set of common interfaces to a computer system which can be used to monitor system health. IPMI consists of a main controller called the Baseboard Management Controller (BMC) and other management controllers distributed among different system modules, referred to as Satellite Controllers (SC). The BMC connects to SCs within the same chassis through the Intelligent Platform Management Bus/Bridge (IPMB). Amongst other pieces of information, IPMI maintains a Sensor Data Records (SDR) repository which provides the readings from individual sensors present on the system, including sensors for voltage, temperature, and fan speed. IPMI can be used to monitor system health using in-band (running locally) or out-of-band (connected remotely) methods. In-band monitoring can be done using one of several drivers, including a direct Keyboard Controller Style (KCS) interface driver, a Linux SMBus System Interface (SSIF) driver through the SSIF device (i.e., /dev/i2c-0), the OpenIPMI Linux kernel driver (i.e., /dev/ipmi0), and the Sun/Solaris BMC driver (i.e., /dev/bmc). Out-of-band monitoring is done by communicating with a remote node's BMC which has been configured for such communication. There are several tools and libraries available to query the IPMI hardware for system events and sensor readings, including OpenIPMI [19], FreeIPMI [9], ipmiutil, and ipmitool.


Chapter 3: Stage-FS: RDMA-Based Hierarchical Checkpoint Data-Staging

With the rapid advances in technology, many clusters are being built with high-performance commercial components such as high-speed, low-latency networks and advanced storage devices such as Solid State Drives (SSDs). These advanced technologies provide an opportunity to redesign existing solutions to tackle the I/O challenges imposed by checkpointing systems. As part of this thesis, we propose a hierarchical data-staging architecture to address the I/O bottleneck caused by Checkpoint-Restart. Specifically, it addresses the following challenges:

1. How can high-speed networks and new storage media such as SSDs be leveraged to accelerate checkpointing I/O performance?

2. Can a hierarchical and scalable data-staging system be designed to specifically service checkpoint-restart I/O workloads?

3. What are the benefits that end-applications can achieve by using such a data-staging system?


3.1 Detailed Design

The fundamental principles behind fault-tolerance protocols were established several decades ago, when network speeds were far lower than processor and local-memory bandwidths, and flash-based storage was still a thing of the future. Over the past decade, however, we have witnessed this assumption slowly wither away. The latest-generation clusters are provisioned with interconnects that are capable of providing extremely high bandwidth and low latencies. According to the road map of the InfiniBand standard [22], 100 Gbps EDR and 200 Gbps HDR are expected to be available in the coming years. Likewise, flash-based storage media such as Solid-State Disks (SSDs) have been gaining widespread adoption as well. PCIe-based flash solutions are particularly gaining traction owing to their low write latencies and high bandwidth. However, providing SSDs and multiple high-speed network adapters for each node of an Exascale system will be prohibitively expensive. In this context, we propose a Hierarchical Data-Staging Framework that leverages the Remote Direct Memory Access capabilities of the InfiniBand interconnect, and the high-bandwidth storage capabilities of SSDs, to provide a fast, temporary storage area for HPC applications. This temporary storage can absorb the bursts of I/O generated by an application when saving its state of execution in checkpoints. This fast staging area is made available on a subset of nodes, namely, staging servers. In addition to what a generic compute node is usually provisioned with, staging servers are over-provisioned with high-throughput SSDs and multiple high-bandwidth network links. Given that such hardware is expensive, this architecture avoids the need to install it on every compute node.


Figure 3.1: Comparison between the direct checkpoint and the checkpoint staging approaches

Figure 3.1 shows a comparison between the classic direct-checkpointing approach and the proposed checkpoint-staging approach. On the left, with the classic approach, the checkpoint files are directly written to a shared file system, such as Lustre. Due to the heavy I/O burden imposed on the shared file system by the checkpoint I/O requests, the parallel writes get multiplexed and the aggregate throughput is reduced. This increases the time for which the application blocks, waiting for the checkpointing operation to complete. On the right, with the staging approach, the staging nodes are able to quickly absorb the large amount of data thrust upon them by the client nodes, with the help of the scratch space provided by the staging servers. Once the checkpoint data has been written to the staging nodes, the application can resume. The data transfer between the staging servers and the shared file system then takes place in the background and overlaps with the computation. Hence, this approach reduces the idle time imposed on an application by the checkpoint protocol. Regardless of which approach is chosen to write the checkpoint data, it eventually has to reach the same storage medium.


We have designed and developed an efficient software subsystem which can handle large and concurrent snapshot writes from typical rollback recovery protocols, and can leverage the fast storage services provided by the staging server.

Figure 3.2: Design of Hierarchical Data Staging Framework

Figure 3.2 shows a global overview of our hierarchical data-staging framework, which has been designed for use with these staging nodes. A group of clients, governed by a single staging server, represents a staging group. These staging groups are the building blocks of the entire architecture, and our design imposes no restriction on the number of such blocks that can be used in a system. The internal interactions between the compute nodes and a staging server are illustrated for one staging group in the figure. With the proposed design, neither the application nor the MPI stack needs to be modified to utilize the staging service. We have developed a virtual file system based on FUSE [7] to provide this convenience. The applications that run on compute nodes can access this staging file system just like any other local file system. FUSE provides the ability to intercept standard file system calls such as open(), read(), write(), close(), etc., and manipulate the data as needed at user level, before forwarding the call and the data to the kernel. This ability is exploited to transparently send the data to the staging area, rather than writing it to the local or shared file system.

One of the major concerns with checkpointing is the high degree of concurrency with which multiple client nodes write process snapshots to a shared storage subsystem. These concurrent write streams introduce severe contention at the Virtual Filesystem Switch (VFS), which impairs the total throughput. To avoid the contention caused by small and medium-sized writes, which are common in checkpointing, we use the write-aggregation method proposed and studied in [97]. It coalesces the write requests from the application or checkpointing library and groups them into fewer, large-sized writes, which in turn reduces the number of pages allocated to them from the page cache. After aggregating the data buffers, instead of writing them to the local disk, the buffers are enqueued in a work queue which is serviced by a separate thread that handles the network transfers.

The primary goal of this staging framework is to let the application that is being checkpointed resume its computation as early as possible, without penalizing it for the shortcomings of the underlying storage system. The InfiniBand network fabric has RDMA capability, which allows direct reads from and writes to host memory without involving the host processor. This capability is leveraged to directly read the data that is aggregated in the client's memory, which then gets transferred to the staging node that governs it. The staging node writes the data to a high-throughput node-local SSD as it receives chunks of data from the client node (step A1 in Fig. 3.2). Once the data has been written to these staging servers, the application can be certain that the checkpoint has been persisted in a stable medium, and can proceed with its computation phase. The data from the SSDs on individual servers is then moved to a stable distributed file system in a lazy manner (step A2 in Fig. 3.2).

Concerning the reliability of this data-staging approach, one has to note that after the staging protocol completes, all checkpoint files are eventually stored in the same shared file system as in the direct-checkpointing approach. Both approaches therefore provide the same reliability for the saved data. However, with the staging approach, the checkpointing operation is faster, which reduces the odds of losing checkpoint data due to a compute-node failure. During a checkpoint, the staging servers introduce additional points of failure. To counter the effects of such a failure, we ensure that the previous set of checkpoint files is not deleted before all the new ones are safely transferred to the shared file system.
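The aggregate-and-stage path described above can be pictured with the following hypothetical sketch: intercepted writes are coalesced into large buffers, and filled buffers are handed to a background thread that ships them to the staging server. All names here, including the placeholder rdma_send_to_staging_server, are invented for illustration and do not correspond to the actual Stage-FS code.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical write-aggregation queue: full buffers are queued for a
 * background thread so the caller (the intercepted write path) returns
 * quickly and the network transfer proceeds asynchronously. */
#define AGG_BUF_SIZE (8 * 1024 * 1024)

struct agg_buf { char data[AGG_BUF_SIZE]; size_t used; struct agg_buf *next; };

static struct agg_buf *queue_head, *queue_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

static void enqueue(struct agg_buf *b)
{
    pthread_mutex_lock(&q_lock);
    if (queue_tail) queue_tail->next = b; else queue_head = b;
    queue_tail = b;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* Background thread: drains the queue and pushes buffers to the staging server. */
static void *stager_thread(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!queue_head) pthread_cond_wait(&q_cond, &q_lock);
        struct agg_buf *b = queue_head;
        queue_head = b->next;
        if (!queue_head) queue_tail = NULL;
        pthread_mutex_unlock(&q_lock);
        /* rdma_send_to_staging_server(b->data, b->used);  placeholder transfer */
        free(b);
    }
    return NULL;
}

void start_stager(void)
{
    pthread_t t;
    pthread_create(&t, NULL, stager_thread, NULL);
}
```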

3.2 Performance Evaluation

3.2.1 Experimental Testbed

A 64-node InfiniBand Linux cluster was used for the experiments. Each client node has eight processor cores on two Intel Xeon 2.33 GHz Quad-core CPUs. Each node has 6 GB main memory and a 250 GB ext3 disk drive. The nodes are connected with Mellanox MT25208 DDR InfiniBand HCAs for low-latency communication. The nodes are also connected with a 1 GigE network for interactive logging and maintenance purposes. Each node runs Linux 2.6.30 with FUSE library 2.8.5. The primary shared storage partition is backed by Lustre. Lustre 1.8.3 is configured using 1 MetaData Server (MDS) and 1 Object Storage Server (OSS), and is set to use InfiniBand transport. The OSS uses a 12-disk RAID-0 configuration which can provide a 300 MB/s write throughput.


Figure 3.3: Throughput of a single staging server with varying number of clients and processes per client (Higher is better)

The cluster also has 8 storage nodes, 4 of which have been configured as the "staging nodes" (as described in Fig. 3.2) for these experiments. Each of these 4 nodes has a PCI-Express-based SSD card with 80 GB capacity; two of them are Fusion-io ioXtreme cards (350 MB/s write throughput) and the other two are Fusion-io ioDrive cards (600 MB/s write throughput).

3.2.2 Profiling of a Stand-Alone Staging Server

It is important to study the performance of a single staging node in terms of the number of clients that it services. The I/O throughput was computed using the standard IOzone benchmark [13]. Each client writes a file of size 1 GB using 1 MB records. Figure 3.3 reports the results of this experiment. We observe a peak throughput of 550 MB/s when a single client with one process writes data. This throughput is close to the write throughput of the SSD used as the staging area (600 MB/s), which indicates that transferring the files over the InfiniBand network is not a bottleneck. As the number of processes per client node (and, in turn, the total number of processes) increases, there is contention at the SSD which slightly decreases the throughput. For 8 processes per node and 8 client nodes, i.e., 64 client processes, the throughput is 488 MB/s, which represents only an 11% decline.

Figure 3.4: Throughput scalability analysis, with increasing number of staging groups and 8 clients per group (Higher is better)

3.2.3 Scalability Analysis

In this section, we study the scalability of the whole architecture from the application's perspective. In these experiments, we choose to associate 8 compute nodes with a given staging server. We measure the total throughput using the IOzone benchmark for 1 and 8 processes per node. Each process writes a total of 1 GB of data using a 1 MB record size. The results are compared to the classic approach where all processes directly write to the Lustre shared file system. Figure 3.4 shows that the proposed architecture scales as the number of staging groups is increased. This is expected, because the architecture is designed such that I/O resources are added proportionally to the computing resources. Conversely, the Lustre configuration does not offer such a possibility, so the Lustre throughput stays constant. The peak aggregate throughput observed for all the staging nodes combined is 1,834 MB/s, which is close to the sum of the write throughputs of the SSDs available on these nodes (1,900 MB/s).

3.2.4 Evaluation with Applications

As illustrated in Figure 3.1, the purpose of the staging operation is to allow the application to resume its execution faster after a checkpoint. In the next experiment, we measure the time taken to write a checkpoint as seen by an application; specifically, we measure the time during which the computation is suspended because of the checkpoint. We compare the proposed staging approach with the classic method in which the application processes directly write their checkpoints to the parallel Lustre file system. We also measure the time required by the staging node to move the checkpointed data to Lustre in the background once the checkpoints have been staged and the computation has resumed. We used two applications (LU and BT) from the NAS Parallel Benchmarks for this experiment. The class D input has a large memory footprint, and hence large checkpoint files. These applications were run on 32 nodes with MVAPICH2 [64] and were checkpointed using its integrated Checkpoint/Restart support based on BLCR [68]. Table 3.1 shows the checkpoint sizes of these applications for the considered test cases.

            Average size per process   Total size
LU.D.128    109.3 MB                   13.7 GB
BT.D.144    212.1 MB                   29.8 GB

Table 3.1: Size of the checkpoint files

Figure 3.5: Comparison of the checkpoint times between the proposed staging approach and the classic approach (Lower is Better). (a) LU.D.128; (b) BT.D.144.

Figure 3.5 reports the checkpointing time that we measured for the two applications. For the proposed approach, two values are distinctly shown: the checkpoint staging time (step A1 in Figure 3.2) and the background transfer time (step A2 in Figure 3.2). The staging time is the checkpointing time as seen by the application, i.e., the time during which the computation is suspended. The background transfer time is the time to transfer the checkpoint files from the staging area to the Lustre file system, which takes place in parallel with the application execution once the computation resumes. For the classic approach, the checkpoint is directly written to the Lustre file system, so we show only the checkpoint time (step B in Figure 3.2); the application is blocked on the checkpointing operation for the entire duration shown.

The direct checkpoint and the background transfer both write the same amount of data to the same Lustre file system. The large difference between these transfer times (a factor of two or more) arises because our hierarchical architecture reduces the contention on the shared file system: with the direct-checkpointing approach, 128 or 144 processes write their checkpoints simultaneously to the shared file system, whereas with our staging approach, only 4 staging servers write to it simultaneously.

It is most meaningful to compare the direct checkpoint time to the checkpoint staging time, because these correspond to the time seen by the application (for the classic approach and the staging approach, respectively); the background transfer is overlapped with the computation. Our results show the benefit of the staging approach, which considerably reduces the time during which the application is suspended. For both test cases, the checkpoint time as seen by the application is reduced by a factor of 8.3. The time gained can be used to make progress in the computation.

3.3 Related Work

Checkpoint/Restart is supported by several MPI stacks [64, 71, 46] to achieve fault tolerance. Many of these stacks use FTB [67] as a backplane to propagate fault information in a consistent manner. However, checkpointing is well known for the heavy I/O overhead of dumping process images to stable storage [105], and many efforts have been made to tackle this I/O bottleneck.

PLFS [39] is a parallel log-structured file system proposed to improve checkpoint write throughput. This solution only deals with the N-1 scenario, where multiple processes write to the same shared file; hence it cannot handle MPI system-level checkpoints where each process is checkpointed to a separate image file.

SCR [93] is a multi-level checkpointing system that stores data in local storage on compute nodes to improve the aggregate write throughput. SCR stores redundant data on neighboring nodes to tolerate failures of a small portion of the system, and it periodically copies locally cached data to the parallel file system to tolerate cluster-wide catastrophic failures. Our approach differs from SCR in that a compute node stages its checkpoint data to its associated staging server, so that the compute node can quickly resume execution while the staging server asynchronously moves the checkpoint data to a parallel file system.

OpenMPI [70] provides a feature to store process images in a node-local file system and later copy these files to a parallel file system. Dumping a memory-intensive job to a local file system is usually bounded by the local disk speed, and this approach is difficult on disk-less clusters where a RAM disk is not feasible due to the application's high memory footprint. Our approach aggregates node-local checkpoint data and stages it to a dedicated staging server, which takes advantage of the high-bandwidth network and advanced storage media such as SSDs to achieve good throughput.

Isaila et al. [75] designed a two-level staging hierarchy to hide file-access latency from applications. Their design is coupled with Blue Gene's architecture, where dedicated I/O nodes service a group of compute nodes; not all clusters have such a hierarchical structure. DataStager [27] is a generic service for I/O staging which is also based on InfiniBand RDMA. However, our work is specialized for Checkpoint/Restart, so we can optimize the I/O scheduling for this scheme. For example, we give priority to the data movement from the application to the staging nodes to shorten the checkpoint time from the application's perspective.

3.4 Summary

As a part of this work, we explored several design alternatives to develop a hierarchical data staging framework to alleviate the bottleneck caused by heavy I/O contention at the shared storage when multiple processes in an application write their checkpoint snapshots. Using the proposed framework, we have studied the scalability and throughput of hierarchical data-staging and the merits it offers in handling large amounts of checkpoint data. We have evaluated the checkpointing overheads on different applications, and have noted that they are able to resume their computation up to 8.3 times faster with the data-staging framework compared to the classic checkpointing model. As part of the future work, we would like to extend this framework to offload other fault-tolerance protocols to the staging server and relieve the client of additional overhead.


Chapter 4: Stage-QoS: Network Quality-of-Service Aware Checkpointing

Modern interconnect technologies, such as InfiniBand (IB), which have been dominating the commodity networking space in the High-End Computing (HEC) domain, provide several novel features that can be used to avoid contention and congestion. Of particular interest in this context is the Quality-of-Service (QoS) capability, which provides a notion of dedicated bandwidth and controlled latency for selected traffic flows. This opens up the possibility of alleviating, though not completely eliminating, network contention by carefully orchestrating the data flows from the different components in the system that share the same physical interconnect fabric. With this as the motivation, this chapter describes how the proposed framework addresses the following open research challenges:

1. Can the performance impact of network space-sharing between parallel applications and file systems be characterized?

2. How can the QoS capabilities provided by cutting-edge interconnect technologies be leveraged by parallel file systems to minimize network contention?

3. How can existing HPC middleware benefit from such a QoS-enabled parallel file system?


4.1 Design Goals and Alternatives

Figure 4.1: Design Alternatives: (a) optimizations in the MPI stack; (b) optimizations inside the parallel file system; (c) optimizations in a user-level file system (the QoS-enabled data-staging framework).

For a solution aiming to address the problem of network contention between I/O traffic and communication traffic from parallel applications, several design choices are available. Figure 4.1 illustrates the architecture of some of these choices. This discussion assumes that the parallel applications follow the Message-Passing Interface (MPI) programming model [16], but the proposed designs are applicable to any parallel programming model that can take advantage of the InfiniBand interconnect. Prior work from our research group [131] studied the benefits of using multiple VLs to avoid congestion due to Head-of-Line blocking by leveraging the QoS capabilities of InfiniBand inside our MPI library, MVAPICH2 [17]. Such a technique can be used to avoid contention due to I/O by avoiding the default SL (SL0), over which I/O traffic is normally moved. However, such a technique is implementation-specific and not portable, requiring a redesign of each MPI library that wants to use it. The second potential design alternative is to modify parallel file system kernels (Figure 4.1(b)), such as Lustre [15] or PVFS2 [21], making them QoS-aware so that they can isolate their own I/O traffic to a non-standard SL and let the communication traffic from MPI applications stream freely over the default SL. Yet again, such a solution would be specific to the file system that was enhanced, and cannot serve as a system-wide solution when multiple file systems are used for varying purposes, including checkpointing, scratch space, etc. The approach we have chosen, however, achieves both portability and performance by leveraging the QoS capabilities of the IB subsystem in our data-staging framework, as shown in Figure 4.1(c). This is a user-level framework that sits between the backend parallel file systems and the MPI library, making it implementation-agnostic. This framework can intercept the I/O calls that are directed to the backend storage from any of the I/O interfaces, such as POSIX-I/O, MPI-I/O, HDF5, and NetCDF, that the parallel application might be using to write data.

4.2 Detailed Design

4.2.1 Configuring the IB Subnet Fabric

The core logic of the InfiniBand QoS mechanism is configured within the subnet manager. For the purpose of this work, we have used OpenSM, which is a widely used open-source implementation of the InfiniBand subnet manager specification. It is comprised of the Subnet Manager (SM), which initiates and configures the IB subnet, and the Subnet Management Agents (SMA) that are deployed on every device port to monitor their respective hosts. The SM and SMAs communicate using Subnet Management Packets (SMP), which use the exclusive management lane, VL15, for their traffic. As discussed in Section 2.4, OpenSM assigns weights to different VLs based on a VL arbitration table. It also maintains an SL2VL mapping, which is correlated with the arbitration table to identify the weights for different Service Levels. These two key parameters, in addition to a few others, are provided to OpenSM during initialization using a configuration file, opensm.conf. Figure 4.2 is a snapshot of the configuration file used for this work, and enumerates the different parameters that are set inside this file.

qos_ca_max_vls      8
qos_ca_high_limit   255
qos_ca_vlarb_high   0:40,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_ca_vlarb_low    0:0,1:2,2:4,3:8,4:16,5:32,6:64,7:128
qos_ca_sl2vl        0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
qos_swe_max_vls     8
qos_swe_high_limit  255
qos_swe_vlarb_high  0:40,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_swe_vlarb_low   0:0,1:2,2:4,3:8,4:16,5:32,6:64,7:128
qos_swe_sl2vl       0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
log_flags           0x0F

Figure 4.2: OpenSM Configuration File

The qos_ca_max_vls parameter specifies the maximum number of VLs that any given Host Channel Adapter in the subnet can support. Most current-generation HCAs support only up to 8 VLs. The VL arbitration table consists of two sub-tables: a high-priority one and a low-priority one. The VL arbiter employs a weighted round-robin scheme and processes the high-priority table first, followed by the low-priority one, giving each VL a turn to send the corresponding amount of data. qos_ca_high_limit is the maximum number of packets that can be sent by a VL in the high-priority table before yielding to those in the low-priority list; this ensures that the low-priority VLs do not get starved for a turn. qos_ca_vlarb_high and qos_ca_vlarb_low specify the number of 64-byte data units a given VL can send when its turn arrives. qos_ca_sl2vl lists the VLs that should be mapped to the corresponding SLs, in order. Our HCA supports only 8 general-purpose VLs, so SLs 8-15 are mapped to the management VL15. The *_swe_* variants of these parameters have the same significance, but for the ports on the switches. Our arbitration table is configured such that SL0 gets the highest priority. The weights of the other SLs are set to "0" in the high-priority lane. This ensures that the non-zero SLs do not send data over the link as long as SL0 has a packet to send. The non-zero SLs get a turn to send data only when there are no packets in SL0's queue, or when qos_ca_high_limit has been reached. SLs 1-7, listed in the low-priority list, have been given weights in increasing powers of 2 to study the impact of SL weights.

4.2.2 Enabling Quality-of-Service in the Filesystem

InfiniBand uses a memory-based communication abstraction whose interface is known as a Queue Pair (QP). The QP, which is also the virtual end-point of a communication link, is used to achieve direct memory-to-memory transfers between different applications and system software components. In the hierarchical data-staging architecture, the QPs are persistent, in the sense that they are created when the staging file system is mounted on the client nodes. A QP-based connection is established between every client and the staging server that governs it. No connection is established amongst the clients themselves, as data is not moved between them; it is only flushed to the burst buffers available on the staging server. One of the goals of making the staging file system QoS-aware is to give the highest priority to the network communication traffic from the parallel applications, and to move the I/O data only during idle times when the application is busy in a compute phase. Another design goal was to be able to set the SL over which the I/O packets from the clients are sent, both at runtime when mounting the file system, and dynamically based on hints provided to the file system during the file write. To this effect, the file system was enhanced to establish an array of QPs, each of which is attributed to a different non-zero SL ranging from 1 through 7.

The SL over which the file traffic from the client nodes should be sent can be set by one of two methods. One way is to set an environment variable (STAGEFS_SERVICE_LEVEL=<n>) when mounting the file system, where n is the SL number to use. With this method, all of the file system traffic will use SL n. The default Service Level is 1, which has the least weight in the low-priority portion of the VL arbitration table. SL1 was chosen as the default so that all file I/O is given link access only in a best-effort manner, without creating contention that affects other applications' communication latencies. The other method is to provide hints to the file system using file-name suffixes when writing the file at the client side. Adding an ".sl" suffix to a file name hints the file system about the Service Level that needs to be used to send this particular file to a staging server. This allows for per-file QoS, with which applications that perform heavy I/O operations can prioritize data, such as giving higher priority to in-situ processing data that is needed for the next step of the computation, and lower priority to application log files. As indicated in Section 3.1, the data movement from the staging servers to the backend file system does not contribute to network contention, as it happens in a lazy manner when the network is idle; it therefore does not have to be aware of the Service Level abstraction. If the backend parallel file system supports InfiniBand transport, its data packets are simply sent over the default SL0.
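A hypothetical helper illustrating the two selection paths might look as follows; the ".sl<n>" suffix format and the function name are assumptions made purely for illustration and do not reflect the actual Stage-FS implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: decide the Service Level for a file, first from a
 * per-file ".sl<n>" name suffix (assumed format), then from the mount-time
 * environment variable, falling back to the best-effort default SL1. */
static int service_level_for_file(const char *filename)
{
    /* Per-file hint, e.g. "snapshot.ckpt.sl3" selects SL 3 */
    size_t len = strlen(filename);
    if (len >= 4 && strncmp(filename + len - 4, ".sl", 3) == 0) {
        char d = filename[len - 1];
        if (d >= '1' && d <= '7') return d - '0';
    }

    /* Mount-time default, e.g. STAGEFS_SERVICE_LEVEL=2 */
    const char *env = getenv("STAGEFS_SERVICE_LEVEL");
    if (env) {
        int sl = atoi(env);
        if (sl >= 1 && sl <= 7) return sl;
    }
    return 1;   /* best-effort default described in the text */
}
```

The returned value would then select one entry from the array of QPs that the file system keeps open, one per non-zero SL.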


4.3 Experimental Evaluation

In this section, we present the results of experiments that evaluate the benefits of the proposed design with micro-benchmarks and real-world MPI applications. The computing cluster used for these evaluations, Cluster A, is a 160-node Linux-based system. Each compute node has eight Intel Xeon cores organized as two sockets with four cores per socket, and has 12 GB of memory. The nodes are equipped with InfiniBand QDR Mellanox ConnectX-2 HCAs, which support up to 8 Virtual Lanes. The operating system used is Red Hat Enterprise Linux Server release 6 with the 2.6.32-71.el6.x86_64 kernel. Cluster A also has 16 dedicated storage nodes with the same configuration, but with 24 GB of RAM each. Additionally, the storage nodes are equipped with a 300 GB OCZ VeloDrive PCIe SSD.

4.3.1 Micro-Benchmark Evaluation

In this set of experiments, we study the impact of file system noise on MPI communication by characterizing point-to-point latency, bandwidth, and the latencies of some representative MPI collective operations, with varying message sizes. The OSU Micro-Benchmarks (OMB) [20] suite was used for these evaluations. All tests were run with 2 MPI processes residing on two distinct client nodes, with the file system noise generated by one I/O thread on each of these nodes that writes a stream of raw data in 1 MB blocks to the staging system using the dd GNU core utility. Direct I/O was used when writing the raw data to avoid kernel page-cache buffering and to obtain a more accurate representation of the data actually being written. The MPI processes and the I/O writers are mapped to different processing cores of the host CPUs. A single storage node which houses a 300 GB SSD is set up as the staging server that governs these two client nodes. The fast SSD storage helps alleviate storage bottlenecks and gives a clearer picture of the impact on the InfiniBand network.

From Figure 4.3, it can be observed that the latency of MPI application communication suffers when the file system traffic shares SL0 with the MPI traffic. The impact of this noise is more prominent with larger message sizes, as seen in Figure 4.3(c), where the contention in the network increases the latency by up to 400 microseconds for a message of size 4 MB. However, when using the QoS-aware file system to isolate the I/O traffic to a non-zero SL (SL1 in this case), the latency is lower than in the SL0 case by 320 microseconds, which is merely 80 microseconds more than the default latency. A similar trend can also be observed with the bandwidth numbers in Figure 4.4. The I/O traffic that flows over SL0 degrades the bandwidth by up to 700 MB/s for a 4 MB message. However, when this traffic is channeled over SL1 to segregate the I/O and MPI traffic, the bandwidth degradation is reduced by up to 674 MB/s. The impact of file I/O noise is no different in the case of collective operations. Figure 4.5 shows how the operation latency of a representative MPI collective (MPI AlltoAll) is affected: the latency goes up by about 250 microseconds for a 1 MB message, and using SL1 for I/O brings this latency down to about 15 microseconds more than the default.

From the above experiments, it is clear that the proposed solution can benefit both latency-sensitive and bandwidth-sensitive MPI applications. The benefits of the QoS-based I/O noise isolation become more prominent with increasing message sizes. This is beneficial for most communication-intensive MPI applications, which typically aggregate several small messages into a large message locally before sending it to a different MPI rank over the network, rather than paying the overhead of sending several smaller messages.

Figure 4.3: Impact of I/O noise on MPI Pt-to-Pt Latency (Lower is better). (a) Small, (b) medium, and (c) large message latency.

Figure 4.4: Impact of I/O noise on MPI Pt-to-Pt Bandwidth (Higher is better). (a) Small, (b) medium, and (c) large message bandwidth.

Figure 4.5: Impact of I/O noise on MPI AlltoAll Collective Latency (Lower is better). (a) Small, (b) medium, and (c) large message latency.

4.3.2 Impact of SL Weights

It is also desirable to understand the impact of the credit weights that were assigned to the different SLs, as described in Section 4.2.1. We again use the OSU Micro-Benchmarks for this evaluation, with the same type of file system noise as before. Figure 4.6 shows the trends for the latency, bandwidth, and bi-directional bandwidth of MPI point-to-point operations, and Figure 4.7 illustrates the trends for two representative collective operations: MPI AllReduce and MPI AlltoAll. All the graphs in this experiment illustrate the trends only for large messages, as the small and medium messages did not show a significant performance impact, as seen in the earlier experiments. Note that the MPI communication traffic is always sent over SL0, which is the default Service Level. The SL over which the file system traffic was channeled was varied between 1 and 7, and the corresponding MPI performance was observed. For both point-to-point and collective operations, there is no significant difference in performance as the SL used for the I/O flow is varied. The reason for this can be attributed to the way the SL weights were assigned (see Section 4.2.1). As long as SL0 has a data packet to send, SLs 1-7, which have entries only in the low-priority list, will not get a chance to send data unless qos_ca_high_limit is reached. Based on the weights set in the opensm.conf file (shown in Figure 4.2), it is evident that SLs 1-7 are serviced only in a best-effort manner.

Figure 4.6: Impact of Service Level Credit Weights on Pt-to-Pt operations in the presence of I/O noise. (a) Large message latency; (b) large message bandwidth; (c) large message bi-directional bandwidth.

Figure 4.7: Impact of Service Level Credit Weights on Collective operations in the presence of I/O noise. (a) AllReduce large message latency; (b) AlltoAll large message latency.


Figure 4.8: Impact of I/O Noise on End-Applications. (a) Anelastic Wave Propagation application (normalized runtime); (b) NAS Parallel Benchmark Conjugate Gradient, CG.D.64 (normalized communication time).

4.3.3 Impact on Applications

To study the impact of our designs on real applications, we evaluated them with the Anelastic Wave Propagation (AWP-ODC) MPI application, which simulates dynamic rupture and wave propagation during an earthquake in 3D. This application has been scaled to more than a hundred thousand processing cores, and is both computation- and communication-intensive. In between computations, velocity values are exchanged with processes containing neighboring data sub-grids in all directions of the 3D process grid: north, south, east, west, up, and down. For our runs, we set the input data-grid values NX, NY, and NZ to 256. The application was run on 64 cores of Cluster A, with the 3D process grid organized as a 4x4x4 matrix. The TMAX value was set to 20, making the application run for 2,001 iterations. Figure 4.8(a) shows the normalized runtime of this application in the default case, with I/O noise on the default SL0, and with the I/O noise isolated to SL1 using the QoS-aware file system. The file system noise was generated in the same manner as in the previous experiments. Each of these runtimes has been normalized against the default runtime. With I/O traffic going over the same SL as the application, the runtime increased by 17.9%, owing to the contention caused at the network HCA and links. Isolating the file system traffic, however, reduces this overhead and brings the runtime down to just 8% more than the default.

A similar trend can also be observed with other applications, such as the CG kernel from the NAS benchmark suite, which uses a Conjugate Gradient method to solve unstructured sparse linear systems. The kernel runs on a 2D grid with a power-of-two number of processes. There are two transpose communications involved that update vector elements across iterations. Figure 4.8(b) shows the normalized network communication time of the CG kernel running with 64 MPI processes across 8 client nodes, with I/O noise on SL0 and with the I/O noise isolated to SL1. The time spent in communication is 32.77% more than the default in the former case, but just 9.31% more in the latter.

4.4 Related Work

Providing QoS from a file system's perspective might mean several things. Several researchers have looked at techniques to allow applications to use portions of the disk bandwidth exclusively [35, 45, 77]; this involves providing QoS from a disk-scheduling perspective. Wu and Brandt [137] add QoS support to the Ceph [136] object-based file system. This work, however, focuses on disk bandwidth and not network performance. Likewise, Zhang et al. [140] have proposed a QoS mechanism for the PVFS2 file system, which is based on machine-learning techniques that can translate program execution-time goals into I/O throughput bounds. Xu et al. [138] propose a virtualization-based bandwidth-management framework for parallel file systems, where they employ proxy servers that provide differentiated services to applications based on a predefined resource-sharing algorithm. Our work is orthogonal to the above-discussed projects, which solely focus on providing differentiated services to end-users based on storage-subsystem throughput. These solutions are not portable, as discussed in Section 4.1, owing to their implementation-specific designs.

Many researchers have also investigated the use of InfiniBand QoS capabilities. Prior work from our research group [131] proposed the simultaneous use of the multiple virtual lanes provided by IB to avoid Head-of-Line congestion; this work was the first of its kind to make use of multiple virtual lanes at the MPI level. Alfaro [31] proposed an optimal configuration for the VL arbitration table to optimize performance, and Alfaro et al. [30] also proposed a formal model to manage the arbitration tables in InfiniBand. The same group worked on a framework to provide QoS over advanced switching systems [87]. All these solutions focus on the theoretical aspects of the InfiniBand QoS specification, and can work in conjunction with the work presented in this dissertation.

4.5 Summary

In this work, we have developed a data-staging file system framework that takes advantage of the QoS features of the InfiniBand network fabric to reduce contention in the network by isolating the I/O data flow from the MPI communication flow. This is a portable solution that can work with any MPI library and any backend parallel file system in a pluggable manner. We have also studied the impact of our solution with representative micro-benchmarks and real applications. Experimental results show that with the proposed solution, the point-to-point latency of MPI applications in the presence of I/O traffic can be reduced by up to 320 microseconds for a 4 MB message size, and the corresponding bandwidth can be increased by up to 674 MB/s. Collective operations such as MPI AlltoAll also benefit from this work, with their operation latency reduced by about 235 microseconds in the presence of file system noise. The AWP-ODC MPI application's runtime in the presence of I/O traffic was reduced by about 9.89%, and the time spent in communication by the CG kernel with I/O traffic was reduced by 23.46%.

As part of the future work, we plan to carry out experiments at a larger scale. We also plan to study the impact of this solution on workloads from multiple I/O libraries like MPI-I/O, NetCDF, HDF5, etc., which are predominantly used by MPI applications to write checkpoint data and working datasets. Furthermore, we would like to evaluate the impact of such a best-effort service for I/O traffic on real-world file system workloads. Such a study would help in determining a balanced threshold that reduces contention for both MPI workloads and I/O workloads without compromising the performance of one for the other.


Chapter 5: CRUISE: Efficient In-Memory Checkpoint Data Management

Multilevel checkpointing systems are a recent optimization to address the checkpointing bottlenecks and reduce I/O times significantly [93, 38]. They utilize node-local storage for low-overhead, frequent checkpointing, and only write a select few checkpoints to the parallel file system. Node-local storage is appealing because it scales with the size of the application; as more compute nodes are used, more storage is available. Unfortunately, node-local storage is a scarce resource. While a handful of HPC systems have storage devices such as SSDs on all compute nodes, most systems only have main memory, and some of those do not provide any file system interface to this memory, e.g., RAM disk. Additionally, to use an in-memory file system, an application must dedicate sufficient memory to store checkpoints, which may not always be feasible or desirable.

This thesis addresses these problems with a new in-memory file system called CRUISE: Checkpoint Restart in User SpacE. CRUISE is optimized for use with multilevel checkpointing libraries to provide low-overhead, scalable file storage on systems that provide some form of memory that persists beyond the life of a process, such as System V IPC shared memory. CRUISE supports a minimal set of POSIX semantics such that its use is transparent when checkpointing HPC applications. An application specifies a bound on memory usage, and if its checkpoint files are too large to fit within this limit, CRUISE stores what it can in memory and spills over the remaining bytes to slower but larger storage, such as an SSD or the parallel file system. Finally, CRUISE supports Remote Direct Memory Access (RDMA) semantics that allow a remote server process to directly read files from a compute node's memory. The remainder of this chapter describes these proposed solutions.
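The persistence property that CRUISE relies on can be illustrated with plain System V shared memory: a segment created as below survives the exit (or crash) of the creating process and can be re-attached by a newly launched process, until the node reboots or the segment is explicitly removed. The key derivation shown is an arbitrary example, not CRUISE's actual scheme.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create (or re-open) a System V shared-memory segment. The segment and the
 * bytes written into it remain in node memory after the creating process
 * exits, so a restarted process calling this again with the same key can
 * recover the data. */
void *attach_persistent_segment(size_t bytes)
{
    key_t key = ftok("/tmp", 0x5C);                 /* example key derivation */
    int id = shmget(key, bytes, IPC_CREAT | 0600);  /* create or re-open      */
    if (id < 0) return NULL;
    return shmat(id, NULL, 0);                      /* map into this process  */
}
```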

5.1 Design Alternatives

Logically, CRUISE requires two layers of software: the first layer intercepts POSIX calls made by the application or checkpoint library, and the second layer interacts with the storage media to manage file data. We considered several design alternatives for each layer that differ in imposed overheads, performance, portability, and capability to support our design goals.

5.1.1 Intercepting Application I/O

With CRUISE, our objective is to transparently intercept existing application I/O routines such as read(), write(), fread(), and fwrite(), and metadata operations such as open(), close(), and lseek(). We considered two options for implementing the interception layer: FUSE and I/O wrappers.

5.1.1.1 FUSE-based File System

A natural choice for intercepting application I/O in user-space is to use the Filesystem in User Space (FUSE) module [7]. A file system implementation that uses FUSE can act as an intermediary between the application and the actual underlying file system, e.g., a parallel file system.


The FUSE module is available with all mainstream Linux kernels starting from version 2.4.x. The kernel module works with a user-space library to provide an intuitive interface for implementing a file system with minimal effort and coding. Given that a FUSE file system can be mounted just as any other, it is straight-forward to intercept application I/O operations transparently. However, a significant drawback is that FUSE is not available on all HPC systems. Some HPC systems do not run Linux, and some do not load the necessary kernel module. Another problem is relatively poor performance for checkpointing workloads. First, because I/O data traverses between user-space and kernel-space multiple times, FUSE can introduce a significant amount of overhead on top of any overhead added by the file system implementation. Second, the use of FUSE implies a large number of small I/O requests for writing checkpoints. By default, FUSE limits writes to 4 KB units. Although the unit size can be optionally increased to 128 KB, that is relatively small for checkpoint workloads that can have file sizes on the order of hundreds of megabytes per process. When FUSE is used in such workloads, many I/O requests are generated at the Virtual File System (VFS) layer leading to several context switches between the application and the kernel. We quantified the overhead incurred by FUSE using a dummy file system that simply intercepts I/O operations from an application and passes the data to the underlying file system, a kernel-provided RAM disk in this experiment. Direct I/O was used to isolate the effects of the VFS cache. For these runs, we measured the write() throughput of a single process that wrote a 50 MB file to both native RAM disk, and to the dummy FUSE mounted atop the RAM disk. We found that the bandwidth achieved by FUSE was 80 MB/s, while the bandwidth of RAM disk was 1,610 MB/s. Due to the large overheads of using FUSE,


the FUSE file system only gets approximately 5% of the performance of writing to RAM disk directly.

5.1.1.2 Linker-Assisted I/O Call Wrappers

The other alternative we considered for intercepting application I/O was to use a set of wrapper functions around the native POSIX I/O operations. The GNU Linker (ld) supports intercepting standard I/O library calls with user-space wrappers. This can be done statically during link-time, or dynamically at run time using LD_PRELOAD. This method works without significant overhead because all control remains completely in user-space without data movement to and from the kernel. The difficulty is that a significant amount of work is involved to write wrappers for all of the POSIX I/O routines that an application might use. Two goals for CRUISE are portability and low overhead for checkpoint workloads, so in spite of the additional work required to write linker-assisted wrapper functions, we opted for this method due to its better performance and portability.
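As a rough illustration of the link-time flavor of this approach, the sketch below wraps write() using the GNU linker's --wrap option (i.e., linking with -Wl,--wrap=write). The cruise_* helpers are hypothetical stand-ins for the internal routines described later in this chapter, not the actual CRUISE entry points.

/* Minimal sketch of a linker-assisted wrapper, assuming the application is
 * linked with "-Wl,--wrap=write".  The cruise_* helpers are hypothetical. */
#include <sys/types.h>
#include <unistd.h>

extern int cruise_owns_fd(int fd);                              /* hypothetical */
extern ssize_t cruise_write(int fd, const void *buf, size_t n); /* hypothetical */
ssize_t __real_write(int fd, const void *buf, size_t count);    /* resolved by ld */

ssize_t __wrap_write(int fd, const void *buf, size_t count)
{
    if (cruise_owns_fd(fd))                  /* file opened under the CRUISE prefix? */
        return cruise_write(fd, buf, count); /* serve from the in-memory store       */
    return __real_write(fd, buf, count);     /* otherwise fall through to libc       */
}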

Location        Throughput (MB/s)
NFS                         84.50
HDD                         97.43
Parallel FS                764.18
SSD                       1026.39
RAM disk                  8555.26
Memory                   15097.85

Table 5.1: I/O throughput for the storage hierarchy on the OSU-RI system described in Section 5.5.1


5.1.2 In-Memory File Storage

Table 5.1 illustrates the I/O throughput of different levels in the storage hierarchy. We show the performance for several stable storage options: the Network File System (NFS), spinning magnetic hard-disk (HDD), parallel file system, and solid-state disk (SSD). We also show the performance of two memory storage options, RAM disk and shared memory via a memory-to-memory copy operation (Memory). Of course, the memory-based storage options far out-perform stable storage. A key design goal of CRUISE is to store application checkpoint files in memory to improve performance and, more importantly, to serve as a local file system on HPC systems that provide no other form of local storage. Here, we discuss three options that we considered for in-memory storage: RAM disk, a RAM disk-backed memory map, and a persistent memory segment.

5.1.2.1 Kernel-Provided RAM disk

RAM disk is a kernel-provided virtual file system backed by the volatile physical memory on a node. RAM disk can be mounted like any other file system, and the data stored in it persists for the lifetime of the mount. The kernel manages the memory allocated to RAM disk, enabling persistence beyond the lifetime of user-space processes but not across node reboots or crashes. RAM disk also provides standard file system interfaces and is fully POSIX-compliant, making it a natural choice for in-memory data storage. However, by comparing the RAM disk to the memory copy performance in Table 5.1, it is evident that RAM disk does not fully utilize the throughput offered by the physical memory subsystem. Another drawback with RAM disk is that one can not directly access file contents with RDMA.


5.1.2.2 A RAM disk-Backed Memory-Map

The drawbacks regarding performance and RDMA capability could be addressed by memory mapping a file residing in RAM disk. This approach could fully utilize the bandwidth offered by the physical memory subsystem simply by copying checkpoint data from application buffers to the memory-mapped region using memcpy(). Once the checkpoint is written to the memory-map, it can be synchronized with the backing RAM disk file using msync(). Then one can simply read the normal RAM disk file during recovery. However, given that the file backing the memory-map resides in the memory reserved for RAM disk, the checkpoint data occupies twice the amount of space. Moreover, there are difficulties involved with tracking consistency between the memory-mapped region and the backing RAM disk file.
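For concreteness, a minimal sketch of this alternative (not the approach CRUISE ultimately adopts) is shown below; the RAM disk path and the checkpoint layout are illustrative.

/* Checkpointing through a RAM disk-backed memory map, as described above.
 * The mount path is illustrative; error handling is omitted for brevity. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void checkpoint_via_mmap(const void *state, size_t len)
{
    int fd = open("/dev/shm/rank0.ckpt", O_CREAT | O_RDWR, 0600); /* file in RAM disk */
    ftruncate(fd, len);                             /* size the backing file           */
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(map, state, len);                        /* copy at full memory bandwidth   */
    msync(map, len, MS_SYNC);                       /* sync with the backing file      */
    munmap(map, len);
    close(fd);
}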

5.1.2.3 Byte-Addressable Persistent Memory Segment

The third approach we considered was to directly store the checkpoint data in physical memory. Our target systems all provide a mechanism to acquire a fixed-size segment of byte-addressable memory which can persist beyond the lifetime of the process that creates it. This includes systems such as the recent IBM Blue Gene/Q that provides so-called persistent memory, and all Linux clusters that provide System V IPC shared memory segments. The downside of this method is that it requires implementation of memory allocation and management, data placement, garbage collection, and other such file system activities. In short, the difficulty lies in implementing the numerous functions and semantics of a POSIX-like file system.


The advantages are the fine-grained management of the data and access to the entire bandwidth of the memory device. Additionally, we expect this approach to work with future byte-addressable Non-Volatile Memory (NVM) or Storage Class Memory (SCM) architectures. Although the use of a byte-addressable memory segment requires significant implementation effort to perform the activities of a file system, we chose this method for CRUISE for its portability and performance.
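On Linux clusters, the System V IPC route mentioned above might look roughly like the following; the key and block size are illustrative choices, not CRUISE's actual configuration.

/* Acquiring a byte-addressable memory segment that persists beyond the life
 * of the creating process, using System V IPC shared memory. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = ftok("/tmp", 'C');                  /* illustrative key            */
    size_t size = 1UL << 30;                        /* e.g., a 1 GB block          */
    int id = shmget(key, size, IPC_CREAT | 0600);   /* create, or find existing    */
    if (id < 0) { perror("shmget"); return 1; }
    void *block = shmat(id, NULL, 0);               /* base address may differ     */
    if (block == (void *)-1) { perror("shmat"); return 1; } /* on each attach      */
    /* ... lay out metadata and data-chunk regions inside 'block' ...              */
    shmdt(block);                                   /* segment outlives the process */
    return 0;
}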

5.1.2.4 Limitations of the Kernel Buffer Cache

One could argue that the buffer cache maintained in the kernel is a viable alternative that satisfies most of the design goals for CRUISE. The benefits of using the buffer cache include fast writes, asynchronous flush of data to a local or remote file system, and dynamic management of application and file system memory. However, the potential pitfalls of using the buffer cache in a multilevel checkpointing system outweigh these benefits. One, with multilevel checkpointing, there are situations wherein a cached checkpoint need not be persisted to stable storage. The kernel, however, cannot make this distinction and may unnecessarily flush all data in the buffer cache to the underlying storage system. Two, using the buffer cache involves copies between user and kernel space, reducing write throughput. Three, using the buffer cache does not permit direct access to data for the RDMA capability, which is desirable for asynchronous checkpoint staging. And four, we lose control over when data is moved from the compute node to the remote file system. With an in-memory file system like CRUISE, we can orchestrate data movement such that it does not impact the performance of large-scale HPC applications with file system noise. CRUISE is an initial proof-of-concept system intended to work with byte-addressable NVM architectures that cannot be serviced by the buffer cache. 60

5.2 Architecture and Design

In this section, we present our design of CRUISE. We begin with a high-level overview. We follow with details on simplifications we made to support checkpoint files, and our approaches for lock management, spill-over, and RDMA support.

5.2.1 The Role of CRUISE

Figure 5.1: Architecture of CRUISE

In Figure 5.1, we show a high-level view of the interactions between components in SCR and CRUISE. On the left, we show the current state-of-the-art with SCR, and on the

right, we show SCR with CRUISE. In both cases, all compute nodes can access a parallel file system. Additionally, each compute node has some type of node-local storage media such as a spinning disk, a flash memory device, or a RAM disk. In the SCR-only case, the MPI application writes its checkpoints directly to node-local storage, and it invokes the SCR library to apply cross-node redundancy schemes to tolerate lost checkpoints due to node failures. For the highest level of resiliency, SCR writes a selected subset of the checkpoints to the parallel file system. By using SCR, the application incurs a lower overhead for checkpointing but maintains high resiliency. However, SCR cannot be employed on clusters with insufficient node-local storage. In the SCR-CRUISE case, checkpoints are directed to CRUISE. All application I/O operations are intercepted by the CRUISE library. File names prefixed with a special mount name are processed by CRUISE, while operations for other file names are passed to the standard POSIX routines. CRUISE manages file data in a pre-allocated persistent memory region. Upon exhausting this resource, CRUISE transparently spills remaining file data to node-local storage or the parallel file system. This configuration enables applications to use SCR on systems where there is only memory or where node-local storage is otherwise limited. As an additional optimization, CRUISE can expose the file contents stored in memory to remote direct memory access. When SCR determines that a checkpoint set should be written to the parallel file system, an asynchronous file-transfer agent running on a dedicated I/O node can extract this data via RDMA using an CRUISE API that lists the memory addresses of the blocks of the files.


Figure 5.2: Data Layout of CRUISE on the Persistent Memory Block

5.2.2 Data Structures

The CRUISE file system is maintained in a large block of persistent memory. The size of this block can be specified at compile time or run time. So long as the node does not crash, this memory persists beyond the life of the process that creates it so that a subsequent process may access the checkpoints after the original process has failed. When a subsequent process mounts CRUISE, the base virtual address of the block may be different. Thus, internally all data structures are referenced using byte offsets from the start of the block. The memory block does not persist data through node failure or reboot. In those cases, a new persistent memory block is allocated, and SCR restores any lost files by way of its redundancy schemes. Figure 5.2 illustrates the format of the memory block. The block is divided into two main regions: a metadata region that tracks what files are stored in the file system, and the data region that contains the actual file contents. The data region is further divided into fixed-size blocks, called data-chunks. Although not drawn to scale in Figure 5.2, the

memory consumed by the metadata region only accounts for a small fraction of the total size of the block. We assume that a CRUISE file system only contains a few checkpoints at a time, which simplifies the design of the required data structures. As discussed in Section 2.8, SCR deletes older node-local checkpoints once a new checkpoint has been written, freeing up space for newer checkpoints to be stored. Thus, we are safe to assume a small number of files exist at any time. Because CRUISE handles a limited number of files for each process, we design our metadata structures to use small, fixed-size arrays. Each file is then assigned an internal FileID value, which is used as an index into these arrays. CRUISE manages the allocation and deallocation of FileIDs using the free fid stack. When a new file is created, CRUISE pops the next available FileID from the stack. When a file is deleted, its associated FileID is pushed back onto the stack. For each file, we record the file name in the File List array, and we record the file size and the list of data-chunks associated with the file in an array of File Metadata structures. The FileID is the index for both arrays. CRUISE adds the name of a newly created file to the File List in its appropriate position, and sets a flag to indicate that this position is in use. For metadata operations that only provide the file name, such as open(), rename(), and unlink(), CRUISE scans the File List for a matching name to discover the FileID, which can then be used to index into the array of File Metadata structures. For calls which return a POSIX file descriptor, like open(), we associate a mapping from the file descriptor to the FileID so that subsequent calls involving the file descriptor can index directly to the associated element in the File List and File Metadata structure arrays.


The File Metadata structure is logically similar to an inode in traditional POSIX file systems, but it does not keep all of the metadata kept in inodes. The File Metadata structure simply holds information pertaining to the size of the file, the number of data-chunks allocated to the file, and the list of data-chunks that constitute the file. Finally, the free_chunk_stack manages the allocation and deallocation of data-chunks. The size and number of data-chunks are fixed when the file system is created. Each data-chunk is assigned a ChunkID value. The free_chunk_stack tracks ChunkIDs that are available to be assigned to a file. When a file requires a new data-chunk, CRUISE pops a value from the stack and records the ChunkID in the File Metadata structure. When a chunk is freed, e.g., after an unlink() operation, CRUISE pushes the corresponding ChunkID back on the stack.
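The description above maps naturally onto a pair of fixed-size arrays indexed by FileID. The sketch below is only illustrative; the field names, types, and limits are assumptions rather than the actual CRUISE definitions.

/* Illustrative metadata layout for the persistent memory block. */
#include <sys/types.h>

#define MAX_FILES      128   /* assumed bound on simultaneously stored files   */
#define MAX_FILENAME   256
#define MAX_CHUNKS    1024   /* assumed bound on data-chunks per file          */

struct file_list_entry {     /* File List, indexed by FileID                   */
    int  in_use;             /* flag: slot holds a live file                   */
    char name[MAX_FILENAME]; /* full path used as the file name                */
};

struct file_metadata {       /* File Metadata, indexed by FileID               */
    off_t size;              /* current file size                              */
    int   n_chunks;          /* number of data-chunks allocated to the file    */
    int   chunk_ids[MAX_CHUNKS];      /* ChunkIDs in file order                */
    char  chunk_in_spill[MAX_CHUNKS]; /* Section 5.2.3: memory vs. spill-over  */
};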

5.2.3 Spill Over Capability

Some HPC applications use most of the memory available on each compute node, and some also save a significant fraction of that memory during a checkpoint. In such cases, the memory block allocated to CRUISE may be too small to store the checkpoints from the processes running on the node. For this reason, we designed CRUISE to transparently spill over to secondary storage, such as a local SSD or the parallel file system. During initialization, a fixed amount of space on the spill-over device is reserved in the form of a file. As with the memory block, the user specifies the location and size of this file. The file is logically fragmented into a pool of data-chunks, and the allocation of these chunks is managed by the free_spillover_stack, which is kept in the persistent memory block. For each chunk allocated to a file, the File Metadata structure also records a field to indicate whether the chunk is in the memory or the spill-over device. When


allocating a new chunk for a file, CRUISE allocates a chunk from the spill-over storage only when there are no remaining free chunks in memory.

5.2.4 Remote Direct Memory Access

RDMA allows a process on a remote node to access the memory of another node, without involving a process on the target node. The main advantage of RDMA is the zero-copy communication capability provided by high-performance interconnects such as InfiniBand. This allows the transfer of data directly to and from a remote process’ memory, bypassing kernel buffers. This minimizes the overheads caused by context switching and CPU involvement. Several researchers have studied the benefits of RDMA-based asynchronous data movement mechanisms [119, 27, 117]. An I/O server process can pull checkpoint data from a compute node’s memory without requiring involvement from the application processes, and then write the data to slower storage in the background. This reduces the time for which an application is blocked while writing data to stable storage. A vast majority of the asynchronous RDMA-based data movement libraries have two sets of components: one or more local RDMA agents that reside on each compute node, and smaller pool of remote RDMA agents hosted on storage nodes or dedicated data-staging nodes. Typically, each data-staging RDMA agent provides data movement services for a small group of compute nodes rather than serving all of them, making this a scalable solution. On receiving a request to move a particular file to the parallel file system, the compute-node RDMA agent reads a portion of the file from disk to its memory space,


prepares it for RDMA, and then signals the RDMA agent on the data-staging node. However, the additional memory copy to read the file data into memory for RDMA incurs a significant overhead. Given that the data managed in CRUISE is already in memory, this additional memory copy operation can be avoided by issuing in-place RDMA operations. To achieve this, we expose an interface for discovering the memory locations of files for efficient RDMA access in CRUISE. The local agent can then communicate the memory locations to the remote agent. This method eliminates the additional memory copies and enables the remote agent to access the files without further interaction with the local agent.

Figure 5.3: Protocol to RDMA files out of CRUISE

Figure 5.3 illustrates the protocol for the interface, which works as follows: (1) On initialization, the local and remote RDMA agents establish a connection

for RDMA transfers. (2) The local RDMA agent uses the function get data region() exposed by CRUISE to get the starting address of the memory region in which CRUISE stores its data chunks, and the size of this memory region. The local RDMA agent then registers the memory region for RDMA operations. (3) Following this, the local RDMA agent sleeps until it receives a request from SCR to flush a checkpoint file to the parallel file system. (4) On receiving a request from SCR, the local agent invokes get chunk meta list() exposed by CRUISE, which returns a list of metadata information about each data chunk in the file. This includes the logical ChunkID, the memory address of the chunk if it is in memory, the offset of the chunk if it is in a spill-over file, and a flag to indicate if the chunk is located inside the memory region or the spill-over file. If a chunk has been spilled-over to an SSD, the local agent issues a read() to copy that particular chunk to its address space before initiating an RDMA transfer. (5) Then, the local agent sends a control message to the remote agent with the information about the memory addresses to transfer. (6) The remote process reads the data chunks directly from the data region managed by CRUISE, without involving the local RDMA agent or the application processes. (7) After the data has been read from the list of addresses, the remote agent sends a control message to the local agent informing it that it is safe for these buffers to be replaced for subsequent transfers. (8) The remote agent writes the data it receives into the parallel file system. Note that it is the duty of the remote agent to pipeline the loop of steps (5)-(8) to make optimum use of the network bandwidth and to overlap the communication and I/O phases. (9) When the file transfer is complete, the local agent informs SCR to complete the transfer protocol.
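A rough sketch of the local agent's side of this protocol is given below. The two CRUISE calls are the ones named in the protocol above (written here with underscores, as get_data_region() and get_chunk_meta_list()); their exact signatures, the chunk_meta layout, and the remaining helper routines are assumptions made purely for illustration.

/* Sketch of the local RDMA agent's role in the protocol; all helper routines
 * and the chunk_meta layout are assumed, not the real interfaces. */
#include <stddef.h>

struct chunk_meta {
    int   chunk_id;
    int   in_spillover;    /* 1 if the chunk lives in the spill-over file */
    void *mem_addr;        /* valid when the chunk is in memory           */
    long  spill_offset;    /* valid when the chunk is spilled over        */
};

/* assumed CRUISE interface (names from the text) and assumed helpers */
extern void get_data_region(void **base, size_t *len);
extern int  get_chunk_meta_list(const char *file, struct chunk_meta **list, int *n);
extern void rdma_register_region(void *base, size_t len);
extern void read_spilled_chunk(struct chunk_meta *c);     /* copy chunk from SSD  */
extern void send_chunk_addresses(struct chunk_meta *c);   /* step (5)             */
extern void wait_for_remote_completion(void);             /* steps (6)-(8)        */
extern void notify_scr_complete(const char *file);        /* step (9)             */

void agent_init(void)
{
    /* step (1), connection setup, omitted; step (2): register the chunk region */
    void *base; size_t len;
    get_data_region(&base, &len);
    rdma_register_region(base, len);
}

void serve_flush_request(const char *ckpt_file)   /* invoked on an SCR request, step (4) */
{
    struct chunk_meta *list; int n;
    get_chunk_meta_list(ckpt_file, &list, &n);
    for (int i = 0; i < n; i++) {
        if (list[i].in_spillover)
            read_spilled_chunk(&list[i]);   /* stage SSD-resident chunks into memory    */
        send_chunk_addresses(&list[i]);     /* step (5): control message to remote agent */
    }
    wait_for_remote_completion();           /* remote side performs steps (6)-(8)        */
    notify_scr_complete(ckpt_file);         /* step (9)                                  */
}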


5.2.5 Simplifications

We made simplifications over POSIX semantics in CRUISE for directories, permissions, and time stamps. CRUISE does not support directories. However, CRUISE maintains the illusion of a directory structure by using the entire path as the file name. This support is sufficient for SCR and simplifies the implementation of the file system. When files are transferred from CRUISE to the parallel file system, the directory structure can be recreated since the full paths are stored. CRUISE does not support file permissions. Since compute nodes on HPC systems are not shared by multiple users at the same time, there is no need for administering file permissions or access rights. All files stored within CRUISE can only be accessed by the user who initiated the parallel application. SCR restores normal file permissions when files are transferred from CRUISE to the parallel file system. CRUISE does not track time stamps. SCR manages information about which checkpoints are most recent and which can be deleted to make room for new checkpoint files, so time stamps are not required. Typically, versioning mechanisms tend to be a mere sequential numbering of checkpoints, in the order in which they were saved. Updating time stamps on file creation, modification, or access incurs unnecessary overhead, so we remove this feature from CRUISE.

5.2.6 Lock Management

To offer some flexibility between performance and portability, the persistent memory block may either be shared by all processes running on a compute node, or there may be a private block for each process. The patterns of checkpoint I/O supported by SCR do not require

shared-file access between MPI processes; in fact, SCR prohibits it. Given this, we can assume that no two processes will access the same data-chunk, nor will they update the same File Metadata structure. However when using a single shared block, multiple processes interact with the stacks that manage the free FileIDs and data-chunks. When operating in this mode, the push and pop operations must be guarded by exclusive locks. Since stack operations are on the critical path, we need a light-weight locking mechanism. We considered two potential mechanisms for locking common data structures. One option is to use System V IPC semaphores and the other is to use Pthread spin-locks. Semaphores provide a locking scheme with a high-degree of fairness, and processes sleep while waiting to acquire the lock, freeing up compute resources. However, the locking and unlocking routines are heavy-weight in terms of the latency incurred. Spin-locks, on the other hand, provide a low-latency locking solution, but they may lack fairness and can lead to wasteful busy-waiting. When using SCR, all processes in the parallel job synchronize for the checkpoint operation to complete before starting additional computation. This synchronization ensures some degree of fairness between processes across checkpoints. Furthermore, in the case of HPC applications, busy-waiting on a lock does not reduce performance since users do not oversubscribe the compute resources. Thus, we elected to use spin-locks in CRUISE to protect the stack operations.
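In the shared-block mode, one way to realize such a light-weight lock is a process-shared Pthread spin-lock placed inside the persistent memory block itself; the sketch below is illustrative and not the actual CRUISE locking code.

/* Process-shared spin-lock guarding the free-FileID/free-chunk stacks when a
 * single memory block is shared by all processes on the node. */
#include <pthread.h>

int init_stack_lock(pthread_spinlock_t *lock)   /* 'lock' lives in the shared block */
{
    return pthread_spin_init(lock, PTHREAD_PROCESS_SHARED);
}

int pop_free_chunk(pthread_spinlock_t *lock, int *stack, int *top)
{
    pthread_spin_lock(lock);                    /* short critical section: stack pop */
    int chunk_id = (*top > 0) ? stack[--(*top)] : -1;
    pthread_spin_unlock(lock);
    return chunk_id;                            /* -1 means no free chunks remain    */
}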

5.3 Implementation of CRUISE

Here, we illustrate the implementation of the CRUISE file system by detailing initialization and two representative operations: the open() metadata operation and the write() data operation.


open(const char *path, int flags, ...)
    if path matches CRUISE mount prefix then
        lookup corresponding FileID
        if path not in File List then
            pop new FileID from free_fid_stack
            if out of FileIDs then
                return EMFILE
            end if
            insert path in File List at FileID
            initialize File Metadata for FileID
        end if
        return FileID + RLIMIT_NOFILE
    else
        return real_open(path, flags, ...)
    end if

Figure 5.4: Pseudo-code for open() function wrapper

5.3.1 Initializing the File System

To initialize CRUISE, a process must mount CRUISE with a particular prefix by calling a user-space API routine. At mount time, CRUISE creates and attaches to the persistent memory block. It initializes pointers to the different data structures within this block, and it clears any locks which may have been held by previous processes. If the block was newly created, it initializes the various resource stacks. Once CRUISE has been mounted at some prefix, e.g., /tmp/ckpt, it intercepts all I/O operations for files at that prefix. For all other files, it forwards the call to the original I/O routine. Figure 5.4 lists pseudo-code for the open() function. When CRUISE intercepts any file system call, it first checks to see if the operation should be served by CRUISE or if it should be passed to the underlying file system. In open(), CRUISE compares the path


argument to the prefix at which it was mounted. CRUISE intercepts the call if the file prefix matches the mount point; otherwise it invokes the real open(). When CRUISE intercepts open(), it scans the File List to look up the FileID for a file name matching the path argument. If it is not found, CRUISE allocates a new FileID from the free_fid_stack, adds the file to the File List, and initializes its corresponding File Metadata structure. As a file descriptor, CRUISE returns the internal FileID plus a constant RLIMIT_NOFILE. RLIMITs are system-specific limits imposed on different types of resources, including the maximum number of open file descriptors for a process. The CRUISE variable RLIMIT_NOFILE specifies a value one greater than the maximum file descriptor the system would ever return. CRUISE differentiates its own file descriptors from system file descriptors by comparing them to this value.
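From the application's point of view, using CRUISE might look like the short sketch below. The mount routine's name and arguments are hypothetical (the text only states that a user-space API call performs the mount); the POSIX calls that follow are then intercepted as described above.

/* Hypothetical application-side usage: mount CRUISE at a prefix, then
 * checkpoint with ordinary POSIX calls on files under that prefix. */
#include <fcntl.h>
#include <unistd.h>

extern int cruise_mount(const char *prefix, size_t block_size); /* hypothetical name/signature */

void write_checkpoint(const void *state, size_t len)
{
    cruise_mount("/tmp/ckpt", 1UL << 30);      /* e.g., reserve a 1 GB memory block  */
    int fd = open("/tmp/ckpt/rank0.ckpt", O_CREAT | O_WRONLY, 0600); /* intercepted  */
    write(fd, state, len);                     /* data lands in the persistent block */
    close(fd);
}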

5.3.2 write() Operation

Figure 5.5 shows the pseudo-code for the write() function. CRUISE first compares the value of fd to RLIMIT_NOFILE to determine whether fd is a CRUISE or system file descriptor. If it is a CRUISE file descriptor, CRUISE converts fd to a FileID by subtracting RLIMIT_NOFILE. Using the FileID, CRUISE looks up the corresponding File Metadata structure to obtain the current file size and list of data-chunks allocated to the file. From the current file pointer position and the length of the write operation, CRUISE determines whether additional data-chunks must be allocated. If necessary, it acquires new data-chunks from the free_chunk_stack. If the persistent memory block is out of data-chunks, CRUISE allocates chunks from the secondary spill-over pool. It appends the ChunkIDs to the list of chunks in the File Metadata structure, and then it copies the contents of buf to the data-chunks. CRUISE also updates any relevant metadata such as the file size.


write(int fd, const void *buf, size_t count)
    if fd more than RLIMIT_NOFILE then
        FileID = fd - RLIMIT_NOFILE
        get File Metadata for FileID
        compute number of additional data-chunks required to accommodate the write
        if additional data-chunks needed then
            pop data-chunks from free_chunk_stack
            if out of memory data-chunks then
                pop data-chunks from the free_spillover_stack
            end if
            store new ChunkIDs in File Metadata
        end if
        copy data to chunks
        update file size in File Metadata
        return number of bytes written
    else
        return real_write(fd, buf, count)
    end if

Figure 5.5: Pseudo-code for write() function wrapper


5.4 Failure Model with SCR

CRUISE is designed with the semantics of multilevel checkpointing systems in mind. The core principle of multilevel checkpointing is to use light-weight checkpoints, such as those written to CRUISE, to handle the most common failures. Less frequent but more severe failures restart the application from a checkpoint on the parallel file system. In this section, we detail the integration of CRUISE with SCR. SCR supports HPC applications that use the Message Passing Interface (MPI). SCR directs the application to write its files to CRUISE, and after the application completes its checkpoint, SCR applies a redundancy scheme that protects the data against common failure modes. The redundancy data and SCR metadata are stored in additional files written to CRUISE. On any process failure, SCR relies on the MPI runtime to detect the failure and kill all remaining processes in the parallel job. Note that processes can fail or be killed at any point during their execution, so they may be interrupted while writing a file, and they may hold locks internal to CRUISE. If a failure terminates a job, SCR logic in the batch script restarts the job using spare nodes to fill in for any failed nodes. During the initialization of the SCR library by the new job, each process first mounts CRUISE and then invokes a global barrier. During the mount call, CRUISE clears all locks. The subsequent barrier ensures that locks are not allocated again until all processes return from the mount call. After the barrier, each process attempts to read an SCR metadata file from CRUISE. SCR tracks the list of checkpoint files stored in CRUISE, and it records which files are complete. It deletes any incomplete files, and it attempts to rebuild any missing files by way of its redundancy encoding. If SCR fails to rebuild a checkpoint, it restores the job using a checkpoint from the parallel file system.


Note that because CRUISE stores data in persistent memory, like System V shared memory, data is not lost due to simple process failure. All processes in the first job can be killed, and processes in the next job can re-attach to the memory and read the data. However, data is lost if the node is killed or rebooted. In this case, CRUISE creates a new, empty block of persistent memory, and SCR is responsible for restoring missing files using its redundancy schemes. CRUISE also relies on external mechanisms to ensure data integrity. CRUISE relies on ECC hardware to protect file data chunks stored in memory, and it relies on the integrity provided by the underlying file system for data chunks stored in spill over devices. For this latter case, we only need to ensure that CRUISE synchronizes data to the spill over device when the application issues a sync() call or closes a file.

5.5 Experimental Evaluation

Here we detail our experimental evaluation of CRUISE. We performed both single- and multi-node experiments to investigate the throughput and scalability of the file system.

5.5.1 Experimentation Environment

We used several HPC systems for our evaluation. OSU-RI is a 178-node Linux cluster running RHEL 6 at The Ohio State University. Each node has dual Intel Xeon processors with 4 CPUs and 12 GB of memory. OSU-RI also has 16 dedicated storage nodes, each with 24 GB of memory and a 300 GB OCZ VeloDrive PCIe SSD. We used the GCC compilers for our experiments, version 4.6.3. Sierra and Zin are Linux clusters at Lawrence Livermore National Laboratory that run the TOSS 2.0 operating system, a variant of RHEL 6.2. Both of these are equipped with Intel Xeon processors. On Sierra, each node has dual 6-core processors and 24 GB of

memory; and on Zin, each node has dual 8-core processors and 32 GB of memory. Both clusters use the InfiniBand QDR interconnect. The total node counts on the clusters are 1,944 and 2,916 respectively. We used the Intel compiler, version 11.1. Sequoia is an IBM Blue Gene/Q system with 98,304 compute nodes. Each node has 16 compute cores and 16 GB of memory. The compute nodes run IBM’s Compute Node Kernel and are connected with the IBM Blue Gene torus network. We used the native IBM compiler, version 12.1.

5.5.2 Microbenchmark Evaluation

In this section, we give results from several experiments to evaluate the performance of CRUISE. First, we explore the impact of NUMA effects on intra-node scalability. Next, we evaluate the effect of data-chunk sizes on performance. Finally, we evaluate the spill-over capability of CRUISE. All results presented are an average of five iterations.

5.5.2.1 Non-Uniform Memory Access

With the increase in the number of CPU cores and chip density, the distance between system memory banks and processors also increases. If the data required by a core does not reside in its own memory bank, there is a penalty incurred in access latency to fetch data from a remote memory bank. In order to evaluate this cost, we altered CRUISE so that memory pages constituting the data-chunks are allocated in a particular NUMA bank. Table 5.2 lists the outcome of our evaluation on a single node of OSU-RI. OSU-RI nodes have 8 processing cores; 4 cores share a memory bank. The table shows the CRUISE bandwidth obtained by allocating a shared memory block for 4 processes running on the first four CPU cores, either on the local bank, on the remote bank, or by interleaving pages across the two banks. The “local bank” case always delivers the

              Single Memory Block              N-Memory Blocks
# Procs (N)   Local Bank  Remote Bank  Mixed   Local Bank  Remote Bank  Mixed
1             3.74        2.63         3.09    3.74        2.63         3.09
2             6.54        4.51         5.16    6.58        4.50         5.33
3             7.84        5.28         6.33    7.84        5.29         6.33
4             8.29        5.70         6.81    8.28        5.69         6.80

Table 5.2: Impact of Non-Uniform Memory Access on Bandwidth (GB/s)

Figure 5.6: Impact of Chunk Sizes

best bandwidth, the “remote bank” case always performs the worst, and the “interleaved” case strikes a balance between the two. The difference is most exaggerated with 4 processes, for which local bandwidth is 8.3 GB/s compared to only 5.7 GB/s for remote. Thus, CRUISE bandwidth drops by more than 30% if we are not careful to allocate data-chunk memory appropriately. To this end, we determine on which core a process is running when it mounts CRUISE. We use this information to determine from which NUMA bank to allocate data chunks for this process. HPC applications typically pin processes to cores, so processes do not migrate from one NUMA bank to another during the run.
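One way to implement the NUMA-aware placement described above is with libnuma; the sketch below only illustrates the idea and is not CRUISE's actual allocator.

/* NUMA-aware chunk placement: back the data-chunks with memory from the
 * calling process' local bank (compile with -lnuma). */
#define _GNU_SOURCE
#include <stddef.h>
#include <numa.h>
#include <sched.h>

void *alloc_chunks_near_caller(size_t bytes)
{
    int cpu  = sched_getcpu();              /* core the process is pinned to  */
    int node = numa_node_of_cpu(cpu);       /* its local NUMA bank            */
    return numa_alloc_onnode(bytes, node);  /* allocate chunks from that bank */
}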


5.5.2.2 Impact of Chunk Sizes

One important parameter that affects the performance of CRUISE is the size of the data-chunk used to store file data. The chunk size determines the unit of data with which a write() or read() operation works. To study the impact of chunk sizes, we used the same benchmark from before in which 12 processes each write 64 MB of data to a file in CRUISE on a single node of Sierra. We then vary the chunk size from 4 KB up to 64 MB. In Figure 5.6, the x-axis shows the chunk size and the y-axis indicates the aggregate bandwidth obtained. As the graph indicates, we see performance benefits with larger chunk sizes. These benefits can be attributed to the fact that a file of a given size requires fewer chunks with increasing chunk sizes, which in turn leads to fewer bookkeeping operations and fewer calls to memcpy(). However, the aggregate bandwidth obtained here saturates that of the memory bank at 18.2 GB/s when chunks larger than 16 MB are used. Although this trend might remain the same across different system architectures, the actual thresholds could vary. To facilitate portability, we leave the chunk size as a tunable parameter. In addition to having relatively larger chunks for performance reasons, it is also beneficial when draining checkpoints using RDMA as discussed in Section 5.2.4. One-sided RDMA put and get operations are known to provide higher throughput on high-performance interconnects such as InfiniBand when transferring large data sizes.

5.5.2.3 Spill-over to SSD

With the next set of experiments, we use a system with local SSD to evaluate the data spill-over capability in CRUISE. As discussed in Section 5.2.3, if the file data is too large to fit entirely in memory, CRUISE spills the extra data to secondary storage. In such scenarios, we can theoretically estimate the file system throughput using the following formula:


T_{spillover} = \frac{size_{tot}}{\frac{size_{MEM}}{T_{MEM}} + \frac{size_{SSD}}{T_{SSD}}}

where T_{spillover} is the throughput with spill-over enabled; size_{tot} is the total size of the checkpoint; size_{MEM} is the size of the checkpoint stored in memory; size_{SSD} is the size stored to the SSD; and T_{MEM} and T_{SSD} are the native throughput of memory and the SSD device. We developed tests to study the performance penalties involved with saving parts of a checkpoint in memory and the rest to an SSD. Table 5.3 lists seven different test scenarios for a 512 MB-per-process checkpoint. Test #1 is the ideal scenario where 100% of the file is stored in memory, and Test #7 is the worst-case scenario where CRUISE must store the entire checkpoint to disk. With Tests #2-6, the size of the file that spills to the SSD increases by a factor of two. All of these tests were run on a single storage node of OSU-RI that has a high-speed SSD installed. We first ran Tests #1 and #7 to measure the native throughput of memory and the SSD on the system, and we substituted these values into the above formula to compute the expected performance of the other cases. We then limited the memory available to CRUISE according to the test case, and conducted the other tests to measure the actual throughput. The theoretical and actual results are tabulated in Table 5.3. The experiment clearly shows that with an increase in the percentage of a checkpoint that has to be spilled to the SSD or any such secondary device, the total throughput of the checkpointing operation reduces. For instance, in case of Test #6, with exactly half the checkpoint spilling to the SSD, the total throughput is reduced by almost 86%. Also, the actual results closely match the theoretical estimates, which validates our basic formula.
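As a quick sanity check of the formula, plugging in the Test #6 split (256 MB in memory, 256 MB on the SSD) together with the memory and SSD throughputs measured in Tests #1 and #7 reproduces the theoretical value listed in Table 5.3:

T_{spillover} = \frac{512}{\frac{256}{15074.17} + \frac{256}{965.67}} \approx 1815\ \mathrm{MB/s}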


Test #   % in SSD   Spill Size (MB)   Theoretical Throughput   Actual Throughput
1        0          0                 15074.17                 15074.17
2        3.125      16                10349.12                 10586.61
3        6.25       32                 7879.33                  8134.46
4        12.5       64                 5333.61                  5312.26
5        25         128                3240.00                  3110.58
6        50         256                1815.06                  2163.93
7        100        512                 965.67                   965.67

Table 5.3: CRUISE throughput (MB/s) with Spill-over

Figure 5.7: Intra-Node Aggregate Bandwidth Scalability — aggregate bandwidth (GB/s) vs. number of cores for memcpy, CRUISE, and RAM disk; (a) Sierra node, (b) Sequoia node

5.5.3 Intra-Node Scalability

In Figure 5.7, we show the intra-node scalability of CRUISE compared with RAM disk and a memcpy() operation on a single node of Sierra and Sequoia. The x-axis indicates the number of processes on the node, and the y-axis gives the aggregate bandwidth of the I/O operation in GB/s summed across all processes. Each process is bound to a single


CPU-core of the compute node and writes and deletes a file five times, reporting its average bandwidth. On Sierra, the file size was 100 MB; on Sequoia, the file size was 50 MB. The performance of the memory-to-memory copies represents an upper bound on the performance achievable with our in-memory file system. To measure this bound, our benchmark copies data from one user-level buffer to another using standard memcpy() calls (red lines in Figure 5.7). The maximum aggregate bandwidth tops out around 18 GB/s on Sierra and roughly 13 GB/s on Sequoia. One notable trend in the plot for Sierra is the double-saturation curve. Sierra is a dualsocket NUMA machine with 6 cores per NUMA bank. As the process count increases from 1 to 6, all processes are bound to the first socket and the performance of the local NUMA bank begins to saturate. Then, as the process count increases to 7, the seventh process runs on the second socket and uses the other NUMA bank leading to a jump in aggregate performance. Finally, this second NUMA bank begins to saturate as the process count is increased further from 7 to 12. On Sequoia, each node has 16 compute cores, each of which supports 4-way simultaneous multi-threading. Therefore, we can evaluate the aggregate throughput for up to 64 processes on a node. On this system, we found a significant difference in memcpy performance depending on how buffers are aligned. If source and destination buffers are aligned at 64-byte boundaries, a fast memcpy routine is invoked that utilizes Quad Processing eXtension (QPX) instructions. Otherwise, the system falls back to a more general, but slower memcpy implementation. We plot results for both versions. The aligned memory copies (red line) saturate the physical memory bandwidth with a small number of parallel threads. It delivers a peak bandwidth of 13.5 GB/s with 32 processes. The unaligned variant (green line) scales linearly up to 32 processes where it reaches its peak performance of 12 GB/s. 81

We do not see the double-saturation curves as in the case of Sierra, because the compute nodes on Blue Gene/Q systems have a crossbar switch that connects all cores to all of memory, so there are no NUMA effects. However, there are some interesting points where trends change significantly. The Blue Gene/Q architecture configures hardware as though the total number of tasks is rounded up to the next power of two in certain cases. These switch points apparently impact the memory bandwidth available to the tasks, particularly when going from 16 to 17 processes per node and again from 32 to 33. Beyond 32 processes per node, memory bandwidth initially drops but increases to another saturation point with about 45 processes. For process counts from 45 to 64, memory bandwidth steadily decreases again. We are still investigating the reason why memory bandwidth is affected this way. Having said that, applications are unlikely to run with process counts other than powers of two on a node. We now examine the RAM disk performance (blue lines). With each iteration, each process in our benchmark writes and deletes a file in RAM disk. On Sierra, the aggregate bandwidth for RAM disk is nearly half of that for memcpy. On Sequoia, the performance is even worse. The memory copy performance increases with increasing cores, but the RAM disk performance is flat at ∼ 0.6 GB/s. On Sierra, we evaluated the performance of CRUISE with a private block per process (purple, filled triangle) and with all processes on the node sharing a single block (purple, hollow triangle). There is a clear difference in performance between these modes. When using private blocks, the performance of CRUISE is close to that of memcpy, achieving nearly the full memory bandwidth. With a single shared block, CRUISE closely tracks the memcpy performance up to 6 processes, but then it falls off that trend with higher process counts. 82

Figure 5.8: Aggregate Bandwidth Scalability of CRUISE — aggregate bandwidth on a log scale for memcpy, CRUISE, and RAM disk; (a) Zin cluster (Linux), GB/s vs. number of processes; (b) Sequoia cluster (IBM Blue Gene/Q), TB/s vs. number of nodes with 16, 32, and 64 processes per node

A portion of the difference is due to locking overheads. However, experimental results showed these effects to be small for the 64 MB data-chunk size used in these tests. Instead, the majority of the difference appears to be due to the costs of accessing non-local memory. To resolve this problem, we intend to modify CRUISE to manage a set of free chunks for each NUMA bank and then select chunks from the appropriate bank depending on the location of the process making the request. On Sequoia, we currently do not make an effort to align buffers in CRUISE. CRUISE has control over the alignment of the data-chunks, but it has no control over the offset of the buffers passed by the application. Thus, the performance of CRUISE (purple line) closely follows that of the unaligned memcpy (green line). We could modify CRUISE to fragment data-chunks and use aligned buffers more often. This would boost performance at the cost of using more storage space, but it could be a worthwhile optimization for large writes.


5.5.4 Large-Scale Evaluation

CRUISE is designed to be used with large-scale clusters that span thousands of compute nodes. We evaluated the scaling capacity of this framework, and we show the results in Figure 5.8. We conducted these evaluations on Zin and Sequoia. For each of these clusters, we measured the throughput of CRUISE with increasing number of processes. In these experiments, we configured CRUISE to allocate a persistent memory block per process. On Zin, each process writes a 128MB file; on Sequoia, each writes a 50MB file. We compare CRUISE to RAM disk and a memory-to-memory copy of data within a process’ address space using memcpy(). Since CRUISE requires at least one memory copy to move data from the application buffer to its in-memory file storage, the memcpy performance represents an upper-bound on throughput. On Zin (Figure 5.8(a)), the number of processes writing to CRUISE was increased by a factor of two up to 8,192 processes along the x-axis. The y-axis shows the bandwidth(GB/s) in log-scale. As the graphs indicate, a perfect-linear scaling can be observed on this cluster. Furthermore, CRUISE takes complete advantage of the memory system’s bandwidth (the CRUISE plot overlaps the memcpy plot). The throughput of CRUISE at 8,192 processes is 17.6 TB/s, which is only slightly below the memcpy throughput of 17.7 TB/s. The throughput of RAM disk is nearly half that of CRUISE at 9.87 TB/s. These runs used 17.5% of the available compute nodes. Extrapolation of this linear scaling to the full 46,656 processes would lead to a throughput for CRUISE of over 100 TB/s. Figure 5.8(b) shows the scaling trends on Sequoia. Because Sequoia is capable of 4-way simultaneous multi-threading, a total of 6,291,456 parallel tasks can be executed. The xaxis provides the node-count for each data point, and the y-axis shows the bandwidth(TB/s) in log-scale. For clarity, we only show the configurations that deliver the best results for 84

aligned memcpy and RAM Disk. We show the results when using 16, 32, and 64 processes per node for CRUISE. At the full-system scale of 6 million processes (64 processes/node), the aggregate aligned memcpy bandwidth reaches 1.21 PB/s. As observed in Figure 5.7(b), CRUISE nearly saturates this bandwidth to deliver a throughput of 1.16 PB/s when running with 32 processes per node. This is 20x faster than the system RAM disk, which provides a maximum throughput of 58.9 TB/s, and it is 1000x faster than the 1 TB/s parallel file system provided for the system.

5.6 Related Work

Linker support to intercept library calls has been around for a while. Darshan [47] intercepts an HPC application’s calls to the file system using linker support to profile and characterize the application’s I/O behavior. Similarly, fakechroot [6] intercepts chroot() and open() calls to emulate their functionality without privileged access to the system. Other researchers have investigated saving files in memory for performance. The MemFS project from Hewlett Packard [100] dynamically allocates memory to hold files. However, there is no persistence of the files after a process dies and MemFS requires kernel support. McKusick et al. present an in-memory file system [89]. This effort also requires kernel support, and it requires copies from kernel buffers to application buffers which would cause high overhead. MEMFS is a general purpose, distributed file system implemented across compute nodes on HPC systems [123]. Unlike our approach, they do not optimize for the predominant form of I/O on these systems, checkpointing. Another general purpose file system for HPC is based on a concept called containers which reside in memory [80]. While this work does consider optimizations for checkpointing, its focus is on asynchronous movement of


data from compute nodes to other storage devices in the storage hierarchy of HPC systems. Our work primarily differs from these in that CRUISE is a file system optimized for fast node-local checkpointing. Several efforts investigated checkpointing to memory in a manner similar to that of SCR [141, 135, 52, 38, 75, 119]. They use redundancy schemes with erasure encoding for higher resilience. These works differ from ours in that they use system-provided in-memory or node-local file systems, such as RAM disk, to store checkpoints. Rebound also checkpoints to volatile memory, but it focuses on single many-core nodes and optimizes for highly-threaded applications [28].

5.7 Summary

In this work, we have developed a new file system called CRUISE to extend the capabilities of multilevel checkpointing libraries used by today’s large scale HPC applications. CRUISE runs in user-space for improved performance and portability. It performs over twenty times faster than kernel-based RAM disk, and it can run on systems where RAM disk is not available. CRUISE stores file data in main memory and its performance scales linearly with the number of processors used by the application. To date, we have benchmarked its performance at 1 PB/s, at a scale of 96K nodes with three million MPI processes writing to it. CRUISE implements a spill-over capability that stores data in secondary storage, such as a local SSD, to support applications whose checkpoints are too large to fit in memory. CRUISE also allows for Remote Direct Memory Access to file data stored in memory, so that multilevel checkpointing libraries can use processes on remote nodes to copy checkpoint data to slower, more resilient storage in the background of the running application.


Chapter 6: MIC-Check: Scalable Checkpointing for Heterogeneous HPC Systems

The Xeon Phi coprocessor is not naturally conducive to checkpointing schemes, which are I/O-intensive by design. Existing state-of-the-art checkpointing techniques that alleviate the I/O burden on applications by leveraging multi-level and asynchronous checkpointing schemes will also be subject to the I/O performance penalties on the MIC. It is also not clear how checkpointing will be supported in the variety of usage modes that the Xeon Phi offers. In this context, this thesis proposes MIC-Check, a novel distributed checkpointing framework for parallel MPI applications that leverage the compute capabilities of the MIC architecture. The following chapter describes and evaluates the MIC-Check system.

6.1 I/O Limitations on the Xeon Phi Architecture

With the Xeon Phi processor, there is a remarkable disparity between the compute and I/O performance capabilities. Figure 6.1 clearly illustrates the I/O performance of a Xeon Phi (MIC) PCIe-based coprocessor, by comparing it with that of the Xeon-based system (Host) that is hosting it. Along the horizontal axis is the number of processes that are writing to the Lustre parallel file system on the Stampede supercomputing system (described in Section 6.4.1), and along the vertical axis is the aggregate I/O throughput as seen by these processes. In case of the host, the aggregate throughput steadily scales with

Figure 6.1: Disparity between the Parallel File System throughput as seen by processes on the host and MIC — aggregate throughput (MB/s) vs. number of processes writing data, for the Host and the Xeon Phi (MIC)

the number of processes, achieving up to 3.4 GB/s with 8 processes, whilst the processes on the coprocessor reach a peak of 893 MB/s with 4 processes, after which contention amongst them begins to hurt the aggregate throughput, which drops to 41 MB/s with 8 processes. This poor I/O performance trend stems from inherent characteristics of the coprocessor itself, and its interaction with external components like the network infrastructure and the storage subsystem. This section succinctly discusses these intrinsic and extrinsic limitations that affect the I/O performance.

6.1.1 Intrinsic Limitations

The Xeon Phi coprocessor runs an embedded version of Linux with its own Virtual File System (VFS) in the kernel that is very similar to a conventional Linux VFS in terms of functionality. However, it has certain pitfalls that directly affect the checkpointing I/O performance. This section highlights three main hot spots [26] that hurt I/O performance.


Figure 6.2: VFS architecture on the Xeon Phi

Typically, when the page involved in an I/O request is not cached by the kernel in its page-cache, the VFS assigns a new page for the data to be cached in memory. While this memory is often sourced from a per-CPU pool of pages, the kernel page allocator is invoked to request a free physical page when the per-CPU pool is depleted. This is a CPU-intensive operation that involves claiming appropriate zone and LRU locks, and identifying physical pages that can be used to replenish the per-CPU pool from which the page was originally requested. The low-frequency processing units on the Xeon Phi, along with reduced cache sizes, reduce the performance of these operations, which in turn slows down the I/O request. Copying data to and from user-level buffers is achieved using the copy_from_user and copy_to_user routines that involve page-fault handling. This suffers from performance


Figure 6.3: Communication paths available to Xeon Phi systems — the six numbered paths cover MIC-to-local Host, local Host-to-MIC, Host-to-remote Host, and the inter-IOH and intra-IOH paths between a MIC and a remote host or MIC

overheads, as in the case of VFS page allocation, making cache-misses expensive. Furthermore, the kernel data-copy routines do not leverage the vector-processing capabilities of the architecture. In addition to polluting caches with data not needed by the kernel, this makes large-sized data-movement very inefficient. Moreover, in order to maintain consistency of the physical pages, each page is associated with a lock bit which indicates the status of its usage. If a lock has been grabbed already, a new thread wanting to read from, or write to, a page gets added to a queue and sent to sleep. When the lock is released, the kernel computes a hash to identify the next thread from the queue that gets to grab the page lock, instead of reading it from memory. However, this becomes expensive on the Xeon Phi, adding to the overheads involved in the VFS page cache management.


Path#   Sandy Bridge   Ivy Bridge
1       370 MB/s       247 MB/s
2       1079 MB/s      1179 MB/s
3       5280 MB/s      6396 MB/s
4       962 MB/s       3421 MB/s
5       6977 MB/s      6977 MB/s
6       6296 MB/s      6296 MB/s

Table 6.1: Peak bandwidth of different channels on Xeon Phi systems (Path# indicated in Fig 6.3)

6.1.2 Extrinsic Limitations

Although the Xeon Phi coprocessor is an out-of-chip hardware component that communicates with the host system and other system peripherals using the PCIe bus, application processes that execute on the coprocessor have a variety of communication channels to choose from for data movement, either to communicate with other processes, or to read/write data from/to the parallel file system. Figure 6.3 shows the various communication paths that are available between a coprocessor, a host processor, and a remote communication end-point. Table 6.1 that accompanies the figure lists the peak bandwidth that is observed on these different paths for the Sandy Bridge and Ivy Bridge architectures. It is clear from these observations that the peak data movement bandwidth to/from a coprocessor heavily relies on the communication path chosen. Specifically, it is apparent that moving data from MIC memory (read from MIC) to the NIC memory and vice versa (write to MIC) incurs significant bandwidth limitations. Even with the Ivy Bridge architecture, the read performance is particularly bad, especially when the MIC device and the NIC are not co-located on the same IO hub. The proposed checkpointing framework, built on this premise, is designed to leverage the most optimal 91

path for writing the checkpoint snapshots to the parallel file system with the least amount of overhead to the application.

6.2 Architecture and Design

Figure 6.4: System-level architecture — (a) Baseline: processes on the Xeon Phi write checkpoints over PCIe to node-local storage (SSD/HDD) or the parallel file system; (b) MIC-Check: a host-side proxy stages checkpoint data from the Xeon Phi over PCIe and writes it to node-local storage or to the parallel file system over InfiniBand

Figure 6.4 describes the high-level architecture of the MIC-Check framework. To the left (Figure 6.4(a)) is the architecture of one compute node of a supercomputing cluster that employs the Xeon Phi coprocessors on each compute node. The coprocessor resides on the host as a PCIe device, and uses the PCI channel to access the host memory, node-local storage, or any parallel file systems. In the baseline scenario, when the application processes running natively on the MIC begin writing a checkpoint, the I/O requests are forwarded to the parallel file system, and all data movement happens over the PCIe bus. To the right (Figure 6.4(b)) is the MIC-Check architecture on the same system. With the steady increase in the number of processing cores available on host CPUs, the compute capacity available for applications is aplenty.

                     Programming Modes for MIC
Checkpointing Mode   Native      Symmetric   Offload
Application          X           X           X
System               X           X           Conditional

Table 6.2: MIC-Check support for various execution and checkpointing modes

MIC-Check reserves one of these CPU cores exclusively to a servlet, namely the MIC-Check Proxy (MCP), that progresses

checkpointing I/O requests on behalf of the application processes that are residing on the coprocessor. The use of MCP allows the MIC-Check framework to schedule data movement through the most optimal data-movement path. The MCP is exclusively bound to a host processor core on each compute node where the application will be executed. During initialization, all application processes establish a SCIF-based communication channel with the MCP on the node in which it executes. When it is time for an application to take a checkpoint of itself, it sends a request to the MCP, along with metadata about the checkpoint. On receiving this request, the MCP asynchronously pulls the application process’ snapshot out of the coprocessor memory using SCIF one-sided operations, and writes it out to the parallel file system directly. The framework ensures that the onesided transfer and persistent write to the parallel file system are fully-pipelined to extract maximum throughput from the PCIe channel and the InfiniBand channel through which the framework communicates with high-performance parallel file systems such as Lustre. MCP is multi-threaded and progresses the I/O requests from each application process independently. By staging the I/O requests through the host, this framework circumvents the inherent limitations of the Xeon Phi architecture, which were outlined in Section 6.1. While it is easier to port legacy applications written for execution on x64 architectures to run on the Xeon Phi system in the native mode, application scientists and developers


While it is easier to port legacy applications written for execution on x64 architectures to run on the Xeon Phi system in the native mode, application scientists and developers are actively exploring the other alternatives to exploit maximum performance from the coprocessors (see Section 2.9). Hence, MIC-Check was designed such that the use of MCP is oblivious to the execution model adopted by an application, and is not just applicable to the native mode. Table 6.2 summarizes MIC-Check's support for the different MIC execution modes. In the case of the symmetric mode, processes residing on the host machine directly write their checkpoints to the file system, whilst those residing on the coprocessor use the SCIF interface and forward checkpointing requests to MCP. Applications that adopt the offload model of execution tend to save the state of execution in checkpoints after the completion of an offloaded kernel segment. In this case, the process that orchestrates the kernel offloading can directly issue checkpointing requests to MCP. However, MIC-Check cannot yet handle cases where an application copies data onto the MIC and uses it across offload kernel segments without copying the data back to the host after the completion of one of them. The MIC-Check design is also stackable, and is capable of working in unison with several checkpoint I/O optimization techniques that have been proposed in the literature. In particular, the system is designed to work well with multi-level checkpointing solutions, such as SCR [93] and CRUISE [109], which tremendously increase checkpointing efficiency by caching checkpoints closer to compute nodes and writing them to the parallel file system only as needed. The MCP provides APIs that can be used to expose staging buffers to external tools and libraries, which can take control of the data-movement pipeline to either cache checkpoints in Non-Volatile Memory devices, spill over to a local flash storage device when there is memory pressure, or ship them to burst buffers that can absorb the checkpoint I/O and drastically reduce application overheads.


In addition to supporting the application-aware checkpointing paradigm, wherein applications capture the state of their execution in snapshots and write them to persistent storage themselves, our design also supports the transparent system-level checkpointing paradigm. As described in Section 2.1, transparent checkpointing protocols use external libraries, such as BLCR, to record the state of the parallel processes that constitute the application and write them into checkpoint files. The Intel Manycore Platform Software Stack (MPSS), which is the core system software stack needed to run the Xeon Phi coprocessor, supports BLCR by default. The MIC-aware MVAPICH2 MPI library [107, 108] was enhanced to be able to take transparent distributed checkpoints of an application that runs natively on a set of MICs. The following section describes the implementation of MIC-Check in detail, and elaborates on the various design elements that were described in this section.

6.3 Implementation

In this section, we delve into the details of the MIC-Check implementation. As illustrated in Figure 6.5, the MIC-Check framework comprises three core components: a proxy servlet that stages I/O data from MICs through the host, an I/O interception library that can either intercept I/O calls from the application or take hints from the application using the APIs that it exposes, and an enhanced version of the MIC-aware MVAPICH MPI library that provides the transparent system-level checkpointing functionality. For application-level checkpointing, the design makes use of the first two components only.
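The interception component can be realized with a standard dynamic-linker preloading approach. The following is a minimal sketch of how a write() wrapper along these lines could be structured; it is not the actual MCI code, and the helpers is_checkpoint_fd() and forward_to_mcp() are illustrative stand-ins for MCI's internal bookkeeping and its SCIF request path.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

/* Illustrative helpers assumed to exist inside the interception library:
 * one tracks which descriptors belong to checkpoint files, the other
 * forwards the request to the MCP over SCIF. */
extern int     is_checkpoint_fd(int fd);
extern ssize_t forward_to_mcp(int fd, const void *buf, size_t count);

static ssize_t (*real_write)(int, const void *, size_t);

/* Preloaded wrapper: checkpoint writes are delegated to the MCP,
 * everything else falls through to the C library's write(). */
ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

    if (is_checkpoint_fd(fd))
        return forward_to_mcp(fd, buf, count);

    return real_write(fd, buf, count);
}

Built as a shared object, such a wrapper could be activated at run time with LD_PRELOAD, or linked statically, matching the two interception modes described in Section 6.3.1.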

6.3.1 MIC-Check Proxy (MCP) and I/O Interception Library (MCI)

MCP is essentially a stand-alone Linux process that is launched by mpirun_rsh [126] on every compute node allocated to an MPI job, prior to launching the application processes. For runs larger than 256 compute nodes, this phase takes advantage of the hierarchical tree-based ssh feature of mpirun_rsh to reduce start-up overheads.

Figure 6.5: Implementation of MIC-Check (Xeon Phi application processes, MCI, the MCP with its host buffer pools and I/O threads, MVAPICH, and the parallel file system).

The use of mpirun_rsh is not mandatory; the MCP can also be launched using any other process launcher, such as Hydra, SLURM [139], and so on. Once bootstrapped, MCP constantly listens on a known port for incoming SCIF connections. Step 1. The application links to the I/O interception library, which intercepts I/O calls in order to take control of I/O requests from the application, either statically at link time or at run time using the LD_PRELOAD mechanism. MCI wraps all the I/O calls that applications use to interact with checkpoint files, and takes control of the I/O requests to forward them to MCP as needed. During MPI initialization, the MCI instance associated with each MPI process initiates a SCIF connection with the MCP on its compute node. These connections persist for the duration of the job and are terminated only during MPI finalization.


If the job is restarting from a failure, the connections are established again before resuming execution. Step 2. When an application is ready to take a checkpoint, MCI intercepts the open call and sends a request to the local MCP. This request also encapsulates metadata about the checkpoint from the application, including the checkpoint path name, its size, and, more importantly, information about the memory region that was registered with the SCIF interface, which gives the MCP permission to read from and write to this region without any involvement from the application processes. Once the application is ready to write the file, MCI intercepts the write call and sends a control message to MCP, letting it know that the checkpoint can be read out of the MIC's memory. Step 3. For each MPI process that connects to the MCP, the MCP spawns a new thread that progresses the I/O on behalf of the client. Each thread also reserves a region of memory to stage data on the host before it is written to the underlying file system, and registers this region with the SCIF interface to be able to perform one-sided Remote Memory Access (RMA) operations. On receiving the checkpoint control message from MCI, the MCP initiates a pipelined RMA protocol using the scif_readfrom() routine to asynchronously pull data from the MIC processes' memory, in units of a pipeline chunk size. MCP-initiated SCIF reads take advantage of superior host-side DMA performance and relieve the application processes from taking part in the checkpoint I/O. The pipelining is achieved by efficiently overlapping the RMA operations with the writes to the parallel file system, using a multiplexed usage of the staging buffers available to each thread that is progressing the I/O.
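The essence of Step 3 is a loop that alternates between pulling a chunk of the remote checkpoint region and persisting the staged chunk. The sketch below illustrates the idea under simplifying assumptions: it uses a single staging buffer and a blocking scif_readfrom() (via the SCIF_RMA_SYNC flag), whereas the real MCP overlaps the two operations with multiple buffers per thread; names such as drain_checkpoint() are illustrative.

#include <scif.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK_SIZE (1 << 20)   /* 1 MB pipeline chunk (see Section 6.4.2) */

/* Pull a checkpoint of 'ckpt_size' bytes from the MIC process' registered
 * window (starting at 'remote_off') and persist it to 'path'. 'stage_buf'
 * is the host staging buffer and 'stage_off' is its offset in the window
 * registered with scif_register(); both are set up elsewhere. */
static void drain_checkpoint(scif_epd_t epd, off_t remote_off,
                             size_t ckpt_size, const char *path,
                             void *stage_buf, off_t stage_off)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    for (size_t done = 0; done < ckpt_size; done += CHUNK_SIZE) {
        size_t len = (ckpt_size - done < CHUNK_SIZE) ? ckpt_size - done
                                                     : CHUNK_SIZE;

        /* One-sided pull of the next chunk over PCIe via DMA; SCIF_RMA_SYNC
         * makes the call block until the data is in the staging buffer. */
        scif_readfrom(epd, stage_off, len, remote_off + done, SCIF_RMA_SYNC);

        /* Persist the staged chunk to the parallel file system. */
        pwrite(fd, stage_buf, len, (off_t)done);
    }
    close(fd);
}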


6.3.2 MIC-Check MVAPICH

Applications that do not employ their own checkpointing mechanism to handle failures can leverage the transparent system-level checkpointing and job-migration features [98, 65, 99] offered by the MVAPICH MPI library. Transparently checkpointing an application involves three main stages: a) draining in-flight messages, suspending further message-passing activity between MPI processes, and releasing network resources; b) obtaining a snapshot of the application's execution state and writing it to a globally-visible file system; and c) re-establishing communication channels for MPI processes to resume message-passing activity. Stage (a) is critical for transparent checkpointing, as stray messages make the system susceptible to the domino effect, as discussed in Section 2.1.

Figure 6.6: Connections established by the MIC-aware MVAPICH MPI library: (1) intra-host, (2) intra-MIC, (3) intra-node host-MIC, (4) host-host, (5) MIC-MIC, (6) inter-node host-MIC.


In particular, the MIC-aware MVAPICH library establishes several communication channels between the various MPI processes, depending on their locality and the topology of the system, to fully utilize the system resources and make efficient communication progress. Figure 6.6 illustrates the communication channels that are set up inside the MPI library for a representative set of MPI processes that reside on two compute nodes, both of which have been provisioned with MICs. Channels (1, 2, 3) are set up to handle intra-node communication using a combination of SCIF and shared-memory interfaces, while channels (4, 5, 6) are set up to handle inter-node communication using the high-bandwidth InfiniBand network interfaces. The enhanced MIC-Check MVAPICH library ensures that these additional communication channels established in the presence of a MIC are also flushed and suspended prior to a checkpoint, and re-established after it, in order to preserve channel consistency. Stage (b) is similar to the application-assisted checkpointing case. However, the I/O operations are no longer intercepted by MCI, as the MPI library already has control over the I/O phase. MIC-Check MVAPICH generates the checkpoint images by itself with help from the BLCR kernel module.

6.4 Results

6.4.1 Experimental Setup

In this section, we discuss the various experiments that were conducted to evaluate the performance and scalability of the MIC-Check framework. The experiments were conducted on the Stampede supercomputing system at the Texas Advanced Computing Center in Austin. Each Stampede node is a dual-socket system with two octa-core Intel Sandy Bridge (E5-2680) processors running at 2.70 GHz.


Figure 6.7: Impact of pipelining chunk sizes on I/O throughput (aggregate throughput in MB/s for chunk sizes from 128 KB to 32 MB).

Each node has 32 GB of memory, an SE10P (B0-KNC) coprocessor, and a Mellanox IB FDR MT4099 HCA. The host processors run CentOS release 6.3 (Final), with kernel version 2.6.32-279.el6.x86_64. The KNC runs MPSS 2.1.4346-16. The compiler suite used is Intel Composer XE 2013.2.146. In order to evaluate the enhanced MVAPICH library's transparent checkpointing capabilities, we used a local test-bed comprising 2 nodes with a configuration similar to Stampede, with BLCR 0.8.5 installed on the host machines. The same version is also made available on the MIC by the latest Intel MPSS version 3.1.1.

6.4.2 Impact of the pipeline chunk size

The pipelining achieved by the MCP by overlapping RMA operations with I/O operations gives a significant boost to the aggregate throughput seen by application processes. This pipelining happens in units of the pipeline chunk size. Since this is a critical parameter that determines the performance of the MIC-Check framework, it is important to understand its impact.

Figure 6.7 describes the experiment that evaluates various pipelining chunk sizes, ranging from 128 KB to 32 MB. This experiment was conducted on a single node of Stampede that hosted one MCP. An MPI benchmark that measures the aggregate file-write throughput across its processes was used, where each process writes a 256 MB file via MCI. As can be seen from the trends in the graph, small chunk sizes severely affect performance. Small chunks cause multiple small data exchanges over the SCIF interface, which is known to perform poorly with small messages. They also cause multiple I/O requests to be generated from the MCP, which creates contention at the VFS layer and hence hurts the throughput. Likewise, with very large chunk sizes, the multiple MCP threads begin to saturate the InfiniBand network bandwidth and start to contend for access to the physical channel. This contention creates stalls in the pipeline, which translates into poor I/O throughput. Unsurprisingly, the ideal chunk size is highly sensitive to the system configuration, including the network speed, the parallel file system configuration, and the compute node's processor speed. Based on our experiments, a pipeline chunk size of 1 MB is ideal for the Stampede system architecture, and we have used a 1 MB pipeline chunk size for all experiments that follow.

6.4.3 Intra-node scalability

In the next experiment, we studied the intra-node scaling trends of the MIC-Check framework. The Xeon Phi coprocessor has 61 processing cores available for applications to use. However, the degree of parallelism from which applications will benefit depends heavily on their memory requirements, given the limited per-core memory available on MIC processors. Figure 6.8 illustrates the performance of MIC-Check with an increasing number of application clients on a single compute node of the Stampede system.


Figure 6.8: Intra-node scaling (aggregate throughput in MB/s for 1 to 16 writer processes, baseline MIC I/O vs. MIC I/O using MIC-Check).

The MPI benchmark used in the previous experiment is used here as well. The MPI processes in the benchmark, which run natively on the MIC, each write a 256 MB checkpoint file to the production Lustre parallel file system available on the system. The first set of bars (grey) shows the aggregate throughput observed by the benchmark when the MPI processes on the MIC directly write the checkpoints to the file system. The second set of bars (red) shows the aggregate throughput observed when MIC-Check progresses the I/O on behalf of the benchmark. The former case does not scale at all, with an aggregate throughput of 15.14 MB/s with 1 MPI process and 21.41 MB/s with 16 MPI processes. While the peak throughput of moving data out of a coprocessor's memory is a mere 962 MB/s (Table 6.1), this limit is reached with just 4 MPI processes writing checkpoints. Any more than 4 MPI processes (for the Stampede configuration) only hurts the throughput significantly. However, the proposed MIC-Check design scales tremendously well, with a throughput of 59.73 MB/s when the benchmark is run with 1 process and 731.48 MB/s when the benchmark is run with 16 processes on the MIC. With 16 MPI processes on one MIC, the aggregate throughput with MIC-Check was 35 times that of the baseline case. A majority of this benefit comes from efficient usage of the bandwidth offered by Lustre, and the appropriate use of the SCIF channel to move data from MIC memory to host memory.


Figure 6.9: Inter-Node Aggregate Bandwidth Scalability. (a) Baseline vs. MIC-Check (small-scale, 16 to 128 processes); (b) MIC-Check (large-scale, 128 to 4,096 processes).

6.4.4 Inter-node scalability

MIC-Check was designed as a shared-nothing architecture in order to allow it to scale with the number of nodes. We evaluated the framework's inter-node scaling capacity, the results of which are shown in Figure 6.9. Along the horizontal axis is the number of MPI processes writing data to the Lustre file system, and along the vertical axis is the aggregate throughput observed by the MPI benchmark. The number of processes was varied from 16 to 128 in the baseline case, and from 16 to 4,096 in the MIC-Check case, with 16 processes per coprocessor.


Figure 6.10: System resource utilization of MCP. (a) CPU usage profile; (b) Write pattern profile.

Since each compute node hosts an MCP to exclusively serve its MPI processes, or those residing on its coprocessor, the only factor that affects the I/O throughput scaling observed by an application is the file system bandwidth. Unlike the baseline case, even with the MCP serving 16 MIC processes, the checkpointing throughput is not bottlenecked by the InfiniBand bandwidth available to the node (Figure 6.8), allowing the architecture to scale to higher core counts. At a scale of 128 MPI processes, the throughput with MIC-Check was 54 times that of the baseline. At a scale of 4,096 MIC processes, the MIC-Check framework delivers a throughput of 36.17 GB/s.

6.4.5 Resource Utilization

In order to better understand the resource-utilization footprint of the MIC-Check framework, we profiled the CPU utilization and I/O patterns of the MCP. Figure 6.10(a) shows the CPU usage trends of MCP for the duration of a single checkpoint. This experiment was run on a single node of Stampede, with one MCP connecting to 16 MPI processes that each issue a single checkpoint request for a 1 GB snapshot.


Along the horizontal axis is the application run time in seconds, and along the vertical axis is the CPU usage in percent (100% usage implies that a single core of the node is fully utilized). The CPU usage was sampled at fine intervals equal to twice the kernel clock tick rate. The initial spike in CPU usage comes from the SCIF connection-establishment protocol, which involves creating a new thread for each MPI process that connects to MCP. During the checkpointing operation itself, MCP issues scif_readfrom() requests to progress the RMA operations in a manner that uses DMA operations instead of programmed reads/writes, which would otherwise consume a considerable amount of CPU cycles. The average CPU utilization at the end of the test was 38.88%. For the same test, the write requests initiated from the MCP process were profiled for the duration of a single checkpoint. The number of bytes written per sample is plotted along the vertical axis of Figure 6.10(b). As indicated by the results, once the checkpoint requests arrive, the MCP threads are able to constantly overlap the SCIF RMA operations and the writes to the destination Lustre file system, keeping the pipeline busy while efficiently using the file system bandwidth available to it.

6.4.6 Evaluation with Real-World Applications

While MIC-Check has shown promising performance improvements and demonstrated remarkable intra-node and inter-node scaling in comparison to the baseline case, it is important to understand its behavior with real application codes. The following section evaluates the benefits of using MIC-Check with two applications: one that has its own checkpointing capability, and another that uses the transparent checkpointing capability of the MVAPICH MPI library.


Figure 6.11: Evaluation with applications. (a) Application-level checkpointing with ENZO (checkpoint vs. compute time, in seconds, for Baseline and MIC-Check); (b) System-level checkpointing with P3DFFT (resume, checkpoint, suspend, and compute time, in seconds, for Baseline and MIC-Check).

ENZO is a complex astrophysics code for multi-scale and multi-physics applications using large (Eulerian) fixed meshes or multi-level adaptive mesh refinement [58]. The ENZO code has also been ported to run natively on coprocessors [122]. ENZO has an inherent capability to save frequent checkpoints of its execution state into files called data dumps. For this experiment, we used 128-process runs of the application on the Stampede system. Figure 6.11(a) shows the total execution time of two ENZO runs: the baseline run, in which checkpoints from MIC processes are directly written to Lustre, and the MIC-Check case, where data is staged through the host. Both runs simulate cosmology adiabatic expansion using sample initialization files provided with the application itself. A single checkpoint of the application was taken in both cases, and the time it took for the checkpoint to be completely persisted to a file is noted in the graph. In both cases, the aggregate checkpoint size was 5.37 GB. As observed with the benchmark-based evaluation, the MIC-Check framework was able to drastically reduce the checkpointing overhead. The checkpointing time in the baseline case was 44.8 s, while that in the MIC-Check case was just 1.49 s, a 30x improvement in checkpointing throughput.


P3DFFT is a popular library for parallel three-dimensional Fast Fourier Transforms. It is written in Fortran using an MPI+OpenMP hybrid model. P3DFFT itself does not have any checkpointing capability; it relies on the underlying MPI implementation to save execution states transparently. We use it to evaluate the MIC-Check MVAPICH library. The version considered in our runs initializes a 3D array with a 3D sine wave, then performs a 3D forward Fourier transform followed by a backward transform in each iteration. These experiments were run with 32 MPI processes running on two MICs. During the execution, we issue a request to MVAPICH to take a snapshot of the application transparently. MVAPICH then suspends the communication channels, takes a snapshot of the MPI processes using BLCR, and resumes the communication channels. Figure 6.11(b) shows the breakdown of these different times. The checkpointing time drops from 331.07 s in the baseline case to 16.26 s in the MIC-Check case, a 20x improvement.

6.5 Related Work

Finding methods to address I/O bottlenecks, and consequently reduce the costs of checkpointing, is an active area of research in HPC. A majority of these efforts optimize the I/O costs using multi-level [110] or asynchronous [118] checkpointing schemes. Multi-level checkpointing techniques leverage the hierarchical storage architectures provisioned on current-generation supercomputing systems, cache checkpoints in a storage location that is closest to the compute processes, and flush only a limited number of checkpoints to parallel file systems. Likewise, asynchronous schemes allow the application processes to offload the task of checkpointing to an external agent, either on the same node or on a dedicated node, after which they can proceed with execution while the checkpoint files are being written.


A third technique that has been proposed in the literature is incremental checkpointing, where the checkpointing system saves, into the application's snapshot, only the state of system resources and memory pages that have been modified between two checkpoint requests. This drastically reduces checkpoint sizes, which in turn reduces the cost of checkpointing. However, none of these techniques have been studied for MIC architectures yet. As discussed in Section 6.2, MIC-Check was designed with these techniques in mind, and can work with them in a complementary manner. MIC-Check addresses a critical limitation present on the MIC, without which the reduction in checkpointing costs achieved by these techniques would have been minimal. Checkpointing for heterogeneous systems in general has been an active area of research as well. Takizawa et al. [132] designed a tool to checkpoint CUDA applications that benefit from using GPUs as accelerators. Nukada et al. [95] proposed a similar library that has the ability to checkpoint CUDA-enabled applications. However, to the best of our knowledge, there has been no significant effort on optimizing the checkpointing overheads on the MIC architecture. Arya et al. tested the feasibility of using DMTCP [37] on a single Xeon Phi coprocessor, but did not evaluate the checkpointing costs or propose designs to reduce them. Murty et al. [26] have proposed optimizations to the VFS kernel layer, and have studied their impact on volatile memory-backed tmpfs. However, these optimizations do not take into consideration the extrinsic factors that affect network file system performance.


6.6 Summary

In this work, we outline and analyze the intrinsic and extrinsic issues that limit I/O performance when checkpointing parallel applications on Xeon Phi clusters. We propose MIC-Check, a novel checkpointing framework that works around these limitations and provides scalable I/O performance on these systems. The proposed checkpointing framework provides a 35x improvement in the aggregate I/O throughput with 16 processes running on a Xeon Phi, and a 54x improvement with 4096 MPI processes running on 256 MICs. We have demonstrated the benefits of MIC-Check with both application-level and system-level checkpointing using end applications. In ENZO, a real-world astrophysics application, our framework improves the checkpointing time by 30x. With P3DFFT, a widely used FFT library, MIC-Check improves the system-level checkpointing time by 20x. Adapter-based coprocessor solutions are expected to be a mainstay even with the next-generation MIC architectures, Knights Landing and Knights Hill. The solutions discussed in this chapter will be applicable to these emerging coprocessor solutions as well.


Chapter 7: Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters

Modern processors are known to operate at low-power states in the absence of useful compute work. Such characteristics are exhibited by memory-bound workloads, where the processor sparingly uses arithmetic functional units and spends a majority of cycles loading data from some level of the memory hierarchy. Several previous works are motivated by this observation, and metrics such as micro-operations-per-memory-load (UPM) have been used to indirectly arrive at ideal (frequency, voltage) tuples for compute phases [79, 113]. While this observation is true, the following experiment shows that there is room for further optimization.

Figure 7.1: CPU utilization during a checkpoint


Figure 7.1 shows the CPU utilization percentage of a node whose processes are engaged in a checkpointing phase, with an average CPU utilization of about 30-40%. While this utilization rate is considerably lower than when the CPU is active (90-100%), it does indicate that there exists an opportunity to bring utilization down further and hence achieve power and energy savings. More importantly, the figure indicates that while there is scope for further reduction of CPU utilization during I/O phases, the onus is on middleware to understand the nature of the workloads in progress and apply appropriate power-saving schemes to reach the lower power footprint that the hardware is capable of offering. Furthermore, naively applying power throttling around checkpointing phases introduces a disproportionate loss in performance for marginal gains in energy savings and, as such, is not a scalable method. Hence, in order to circumvent these negative aspects of power capping, a non-trivial use of throttling mechanisms is required. This chapter addresses the following questions posed by the above challenges:

1. Is it possible to analyze the power and energy usage of processes during checkpointing phases and reduce CPU utilization during I/O phases to a bare minimum? Can we obtain a reduced energy footprint through this?

2. Can we design a generic framework that can be used by applications and runtimes to achieve power/energy efficiency during checkpointing?

3. Can we exploit I/O-manipulation mechanisms and power-capping techniques in tandem to achieve power/energy savings at no performance loss?


Figure 7.2: Overall architecture of Power-Check

7.1 Architecture of Power-Check

This section provides a high-level overview of the energy-efficient checkpointing framework proposed in this work, Power-Check, while Section 7.2 describes it in further detail. We were guided by a set of goals when designing Power-Check. First and foremost, we wanted to develop a framework that minimizes the energy footprint of checkpointing systems without drastically affecting their performance. Second, we wanted to do this with a minimal amount of modifications to existing production codes and middleware libraries. Third, we wanted it to be portable across various system architectures. Figure 7.2 illustrates the overall architecture of Power-Check. The components introduced or enhanced by Power-Check are highlighted in green. At the core of the framework is the libpowercheck module. It is a user-space library that has several key roles: a) to interact with the Intel Running Average Power Limit (RAPL) subsystem to obtain in-band energy-usage telemetry, as well as to cap the power limits of the CPU sockets and the memory subsystem;

Figure 7.3: Execution workflow during a checkpoint

b) to act as a low-level interface that third-party checkpointing systems, such as Berkeley Lab Checkpoint/Restart (BLCR) [68] and the Distributed MultiThreaded CheckPointing (DMTCP) system [33], can leverage to profile and control the energy usage of checkpointing protocols; and c) to provide a high-level interface that end applications which implement their own checkpointing functionality can use to leverage the benefits of Power-Check. These roles are realized using a set of APIs (enumerated in Section 7.2.2) exposed through the libpowercheck library. The figure also depicts two representative checkpointing systems, BLCR and DMTCP. Both of these components have been enhanced to be aware of energy-usage telemetry and to be capable of capping their rate of energy consumption using RAPL's hardware-enforced power-limiting mechanism.

Both of these core components enable parallel programming middleware to checkpoint end applications transparently. Hence, any unmodified application code will be able to leverage the capabilities of these enhancements, which are described in detail in Sections 7.2.3.2 and 7.2.3. The most important component of Power-Check is the I/O funneling layer. As discussed in the introduction of this chapter, merely capping the power limits for the duration of a checkpoint, or using typical DVFS techniques to throttle down cores before checkpoint I/O operations, does not bode well for performance. To address this challenge, we have designed a complementary I/O management layer that not only reduces the energy footprint of checkpointing while progressing I/O operations, but also improves its performance. It transparently intercepts, and takes control of, any checkpointing I/O requests coming from BLCR, DMTCP, or the application. The interception happens in user space itself, and does not impose any significant overheads. Since this layer works by the principle of interception, the checkpoints are persisted in the storage media for which they were originally intended. Any data-staging or multi-level checkpointing policies are still maintained, and the locality of the checkpoints is not affected. While fulfilling the I/O request by efficiently funneling the data to storage, as detailed in Section 7.2.4, the framework restricts power limits at a finer granularity, yielding higher energy savings. In summary, Figure 7.3 provides a holistic view of the system and demonstrates the flow of execution during a checkpoint. This example is for a system-level checkpoint, but the control flow is similar for application-aware checkpointing as well. When it is time to take a checkpoint, the application sends a request to either DMTCP or BLCR. In the case of application-aware checkpointing, the application processes provide a hint to libpowercheck of the impending checkpoint. The checkpointing library then captures the state of the compute processes' execution by recording the actively used pages, sockets, open file descriptors, and other state-defining kernel structures.

Before handing over the checkpoint data to the I/O funneling agent, the checkpointing library intelligently caps the power limit (discussed in Section 7.2.4). After this, the checkpoint data is written to storage in a pipelined manner by the I/O funneling agent.

7.2 Design and Implementation

7.2.1 Design Scope and Assumptions

Power-Check has a generic design that can work with both transparent system-level checkpointing and application-assisted checkpointing. All the major MPI libraries in use today support transparent checkpointing using one or both of BLCR and DMTCP. While the techniques described in this work have been implemented and evaluated with these two specific libraries, the designs are applicable to any checkpointing system, including modern multi-level checkpointing systems such as SCR [93] and CRUISE [109]. Furthermore, applications that implement their own snapshotting capability can leverage the interfaces exposed by libpowercheck to benefit from this framework. The proposed framework assumes a coordinated checkpointing protocol in which all processes of an MPI job synchronize before a checkpoint to reach a quiescent state. This is the model most widely employed in production systems. While other, uncoordinated, checkpointing protocols are discussed in the literature, they are beyond the scope of this work. Uncoordinated checkpointing brings a different set of challenges and opportunities from an energy perspective, and we plan on extending Power-Check to handle it in the future.


Prior research in the literature (discussed in Section 7.4) has evaluated the various phases of checkpointing and concluded that the I/O phase is the most dominant, both in terms of the overheads imposed and the energy consumed. With that as a premise, the focus of this work is purely on the I/O phase of checkpointing, and not on the pre-/post-coordination phases. In this regard, Power-Check makes certain assumptions about the I/O pattern itself, which hold for most checkpointing I/O workloads. Although this work exclusively focuses on checkpointing for MPI applications, the energy challenges remain for other programming models as well, as noted by [90] for instance. Given that Power-Check focuses on saving energy during the I/O phase, which is an inevitable part of any distributed checkpoint-restore protocol, it can be adopted for any other programming model, such as the PGAS languages. Finally, we use an in-band power monitoring and actuation subsystem, RAPL, which is a more practical choice for use in production systems compared to the out-of-band counterparts employed by prior studies in the literature.

7.2.2 libpowercheck: Measuring Energy and Actuating Power

All the designs proposed in this work rely on libpowercheck to measure the energy consumed and to actuate the power limits enforced by the hardware. This is done in-band, without the need for any external power meters, using Intel's Running Average Power Limit (RAPL) system. RAPL capabilities are available on board all Intel processors starting with the Sandy Bridge architecture. Users can get power-meter and power-clamping functionality by reading and writing various Model-Specific Registers (MSRs) using privileged instructions. Intel provides a kernel module that exposes MSRs as device files (/dev/cpu/<cpu-id>/msr), where specific registers can be read using different offsets into the file.


libpowercheck provides an intuitive API as a means of easy access to these RAPL-specific registers. More information on the RAPL-specific registers and their offsets is detailed in Intel's Software Developer's Manual [10]. Broadly, libpowercheck provides three sets of interfaces: one to sample the registers for profiling, one to control the hardware-enforced power bounds, and a higher-level interface for end applications to demarcate the various phases of checkpointing. The following enumeration describes the key aspects and functions of these APIs.

pcheck_init(); pcheck_finalize(); pcheck_pcap_get_pkg_info(...);

The above interfaces are used to initialize the RAPL subsystem, to finalize it, and to get metadata about the CPU package's power capabilities, respectively. During initialization, the library establishes a file descriptor with which to interact with the MSRs by opening, with read and write permissions, the device file (exposed by the msr kernel module) corresponding to each CPU core. These files remain open until libpowercheck is finalized. The third function is used to get information about the design capabilities of the CPU and memory, such as the Thermal Design Power (TDP) and the maximum and minimum limits that can be set on the power usage of the CPU and memory.


pcheck_energy_read(...);

A call to the above function returns the energy status of the package at the instant it was called. For instance, in the case of the Sandy Bridge processor, this would read the value of the MSR_PKG_ENERGY_STATUS register (offset 0x611) and return it to the user. The components described later in this work use this interface to profile the energy usage of the checkpointing I/O phase.
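For reference, the sketch below shows one way such a reading could be obtained through the msr device files; it is not the libpowercheck implementation, error handling is omitted, and the raw counter still has to be scaled by the energy units advertised in MSR_RAPL_POWER_UNIT (offset 0x606) to obtain joules.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read one 64-bit MSR from a given CPU through the msr device file. */
static uint64_t read_msr(int cpu, off_t reg)
{
    char path[64];
    uint64_t value = 0;

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);          /* requires root privileges     */
    pread(fd, &value, sizeof(value), reg);  /* MSRs are addressed by offset */
    close(fd);
    return value;
}

int main(void)
{
    /* The package energy counter is the low 32 bits of the MSR. */
    uint64_t raw = read_msr(0, 0x611) & 0xffffffffULL;
    printf("MSR_PKG_ENERGY_STATUS raw counter: %llu\n",
           (unsigned long long)raw);
    return 0;
}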

pcheck_pcap_get_sock_limit(...); pcheck_pcap_set_sock_limit(wlimit, window);

These two functions allow the user to get the current power limit, or to set the limit to an explicit power value in watts (wlimit) over a capping window in seconds (window).

pcheck_ckpt_start(); pcheck_ckpt_complete();

In cases where an application writes its own checkpoints without relying on a third-party tool like BLCR or DMTCP, it can explicitly wrap its checkpointing code with the above functions to leverage the benefits of Power-Check.

pcheck_cs_start(); pcheck_cs_end();


In the case of system-level checkpointing, the point in execution at which a checkpoint is taken need not always be optimal from a performance standpoint. For instance, applications tend to have a larger memory footprint when executing compute kernels, as they hold the working dataset, but this might not be the case during a communication phase. A larger memory footprint increases the checkpoint size and, consequently, the checkpointing overhead. The above functions allow applications to mark such critical sections of execution that are not conducive to checkpointing; Power-Check blocks any checkpointing or power-management activity for their duration.
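Putting these interfaces together, an application-level checkpoint could be wrapped as in the sketch below. The pcheck_* names are those listed above, but their exact signatures and header (assumed here to be powercheck.h) are not spelled out in this section, write_snapshot() is a hypothetical application routine, and the 51 W / 1 s cap values are purely illustrative.

#include <stddef.h>
#include "powercheck.h"   /* assumed header exposing the pcheck_* API */

/* Hypothetical application routine that performs the actual snapshot I/O;
 * under Power-Check this write is intercepted by the funneling agent. */
extern void write_snapshot(const char *path, const void *state, size_t size);

void take_checkpoint(const char *path, const void *state, size_t size)
{
    pcheck_init();                        /* open the MSR device files    */

    /* Optionally bound the socket power for the I/O phase. */
    pcheck_pcap_set_sock_limit(51, 1);

    pcheck_ckpt_start();                  /* demarcate the checkpoint I/O */
    write_snapshot(path, state, size);
    pcheck_ckpt_complete();

    pcheck_finalize();                    /* close the MSR descriptors    */
}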

7.2.3 Enhanced Checkpointing Libraries

7.2.3.1 Enhanced MVAPICH-BLCR Integration

The BLCR library has been the most predominantly used system-level checkpoint-restart solution for HPC systems. All the major MPI libraries have a tight integration with BLCR, providing broad coverage for end applications. BLCR is a kernel-level solution that uses a kernel module to access kernel data structures when saving the state of a process. It does not, however, have the capability to natively handle parallel or distributed applications. In the case of Power-Check, we have enhanced the MVAPICH-BLCR integration that provides transparent checkpointing capability to MPI applications that leverage the InfiniBand network. In order to handle potential stray, out-of-order, or zombie messages at the network level, MVAPICH ensures that the MPI processes have coordinated and reached a quiescent state at which no message transfers are in progress. This is done by flushing pending operations and tearing down network connections before taking a checkpoint, and by re-establishing them when resuming a job after a checkpoint. We added prologue and epilogue code to this checkpointing protocol, just before passing control to BLCR, which uses libpowercheck to profile the energy usage of BLCR's checkpointing operation.

A second layer of prologue and epilogue code enforces a power-usage limit, if one is requested, using libpowercheck.

7.2.3.2 Power-Aware Plugins to DMTCP

DMTCP is a widely-used user-space checkpointing tool that can transparently save the state of applications without requiring any modifications. DMTCP also supports the checkpointing of parallel MPI applications. It has an extensible architecture that provides plugin capability for third-party modules. It offers two important features that make it extensible: event hooks and wrapper functions. Event hooks allow a plugin to execute additional actions at the time of checkpointing, resuming, or restarting. The wrapper functionality allows a plugin to wrap any library or system call with prologue or epilogue code, or to provide an entirely new implementation for it. As mentioned earlier, Power-Check provides two additional plugins to DMTCP: one to profile the energy usage during the various phases of checkpointing, and another to actuate power usage by setting power bounds using RAPL. Both of these plugins leverage the APIs provided by libpowercheck to interact with the RAPL subsystem. When a plugin is loaded at runtime, DMTCP looks for the definition of a dmtcp_event_hook symbol in the plugin. Based on the hooks implemented by a plugin, DMTCP transfers control appropriately. The two plugins in Power-Check implement functionality for several DMTCP event hooks. During each event, all processes synchronize on an internal barrier before executing a hook. For both libraries, the energy-profiling component uses the pcheck_energy_*() set of interfaces from libpowercheck to get the energy consumed for the duration of a given event. Of most interest for the scope of this work is the energy consumed during the I/O phase of a checkpoint.

This is the energy consumed between the DMTCP_EVENT_WRITE_CKPT and DMTCP_EVENT_RESUME events in the case of DMTCP, and during the time taken for BLCR to return from a request in the case of the MVAPICH-BLCR integration. This translates into the amount of energy consumed in persisting the checkpoint data to the underlying storage system. The power-capping capability uses the pcheck_pcap_*() set of interfaces to enforce hardware bounds on the power during the checkpoint I/O phase. A particular power bound can be enforced at runtime by setting the environment variables PCHECK_PCAP_LIMIT (in watts) and PCHECK_PCAP_WINDOW (in seconds). Due to the semantics of how DMTCP handles file descriptors inside plugins, the various event hooks had to ensure that any MSR that was opened for reading using libpowercheck was closed using pcheck_finalize() within that same event to ensure correctness.
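The power-capping plugin can be pictured roughly as follows. This is a simplified sketch, not the actual Power-Check plugin: it assumes the dmtcp.h plugin header and the two events named above, uses the pcheck_* interfaces from Section 7.2.2 with assumed signatures, and omits error handling (e.g., unset environment variables). The DEFAULT_PKG_LIMIT_WATTS value is illustrative.

#include <stdlib.h>
#include "dmtcp.h"        /* DMTCP plugin interface (dmtcp_event_hook) */
#include "powercheck.h"   /* assumed libpowercheck header */

#define DEFAULT_PKG_LIMIT_WATTS 115   /* illustrative package default */

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
    (void)data;

    switch (event) {
    case DMTCP_EVENT_WRITE_CKPT:
        /* The checkpoint image is about to be written: apply the cap. */
        pcheck_init();
        pcheck_pcap_set_sock_limit(atoi(getenv("PCHECK_PCAP_LIMIT")),
                                   atoi(getenv("PCHECK_PCAP_WINDOW")));
        pcheck_finalize();   /* MSR descriptors must be closed in-event */
        break;

    case DMTCP_EVENT_RESUME:
        /* Checkpoint I/O is done: restore a default limit. */
        pcheck_init();
        pcheck_pcap_set_sock_limit(DEFAULT_PKG_LIMIT_WATTS, 1);
        pcheck_finalize();
        break;

    default:
        break;
    }
}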

7.2.4 I/O Funneling Agent

The central and most critical piece of the Power-Check architecture is the I/O funneling agent. Figure 7.4 describes how this agent functions. In a typical parallel checkpointing scenario, all compute processes (P1-P4), residing on physical CPU sockets (S1 and S2), write out their checkpoint snapshots to a globally visible storage system such as NFS, Lustre, or Ceph. In the case of hierarchical (or multi-level) checkpointing systems, the compute processes write their checkpoints to a local storage medium, such as a flash device or HDD, and an external agent then asynchronously copies the snapshots over to the global storage in a relaxed manner, depending on factors such as the Mean Time Between Failures (MTBF), network utilization, and the load on the file system.


Figure 7.4: Design of the I/O Funneling Layer

With this approach, the compute processes are actively involved in the checkpointing operation in one of two ways, depending on the checkpointing model used. In the case of application-aware checkpointing, the compute processes themselves progress the I/O operations necessary to persist a checkpoint. This involves writing out a series of computation output datasets or arrays into the checkpoint file(s). In the case of transparent system-level checkpointing, a separate checkpointing thread is invoked by the MPI library to request an external tool like BLCR to capture the state of all MPI processes after a global coordination and write it out to a file. In the former case, the MPI process is actively performing the I/O operation, and in the latter, it is polling on a lock within the MPI library, waiting for the completion of the checkpoint I/O by BLCR. In both cases, the OS cannot do much to conserve energy using the power governors [5].

Instead, Power-Check provides MPI processes the ability to delegate the I/O progress to a single funneling agent (running on S1), while it caps the power usage of the unused socket (S2). This funneling improves the I/O throughput by reducing the VFS-level overheads introduced by concurrent parallel writes, while providing an opportunity to minimize energy. The funneling agent is implemented as a user-level file system that can intercept all POSIX-based I/O operations (such as open(), read(), and write()) and alter their behavior as needed. When mounted, this file system allocates a contiguous region of memory for use with the funneling data path, and organizes it into a pool of buffers. It also initializes a pool of worker threads that progress I/O on behalf of the MPI processes. On a mount, the file system process is explicitly bound to the first core of the first socket. The open() call initializes the libpowercheck and RAPL components with a call to pcheck_init(). Finally, it adds an entry to a hash table of files for each opened file, and initializes internal data structures to maintain metadata about the checkpoint files for use in other file system operations. During a write() operation, the MPI processes lease the required number of buffers from the memory pool managed by the file system and fill them with the checkpoint file data. Once this copy completes, the write() call returns, and the process does not block on its completion. The funneling agent enqueues all the delegated I/O requests, along with metadata about the destination file and the offset within it, in a funneling queue. As requests get enqueued, the I/O threads perform the actual writes to the underlying storage medium using the pwrite() syscall. The leased buffers are returned to the pool once the low-level I/O operation is complete. On a call to close(), the file system ensures that all writes have been flushed to the underlying storage medium before clearing out the metadata structures.
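The sketch below captures the delegated-write path in a stripped-down form: the intercepted write() stages the data and enqueues a request, and a worker thread on socket S1 later issues the pwrite(). It is not the Power-Check implementation; the bounded in-memory queue stands in for the leased buffer pool and funneling queue, and all names are illustrative.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define QDEPTH 64

struct io_request { int fd; void *buf; size_t len; off_t offset; };

static struct io_request queue[QDEPTH];
static int qhead, qtail, qcount;
static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Called from the intercepted write(): copies the checkpoint data into a
 * staging buffer, enqueues the request, and returns without waiting for
 * the low-level I/O to complete. */
ssize_t funnel_write(int fd, const void *data, size_t len, off_t offset)
{
    struct io_request req = { fd, malloc(len), len, offset };
    memcpy(req.buf, data, len);          /* stand-in for a leased buffer */

    pthread_mutex_lock(&qlock);
    while (qcount == QDEPTH)             /* back-pressure when queue is full */
        pthread_cond_wait(&not_full, &qlock);
    queue[qtail] = req;
    qtail = (qtail + 1) % QDEPTH;
    qcount++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&qlock);

    return (ssize_t)len;
}

/* I/O worker thread (bound to a core on socket S1): drains the queue and
 * persists the staged data with pwrite(). */
void *io_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qcount == 0)
            pthread_cond_wait(&not_empty, &qlock);
        struct io_request req = queue[qhead];
        qhead = (qhead + 1) % QDEPTH;
        qcount--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&qlock);

        pwrite(req.fd, req.buf, req.len, req.offset);   /* actual write */
        free(req.buf);                   /* "return the buffer to the pool" */
    }
    return NULL;
}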

7.3 Experimental Evaluation

Having described the architecture and design of the Power-Check framework in detail, we evaluate its impact on performance and energy using benchmarks and application kernels to demonstrate its benefits. All experiments were conducted on an 8-node test-bed, where each node has 16 Intel Sandy Bridge cores (two sockets with 8 cores each), 32 GB of physical memory, an 80 GB Fusion-io ioDrive PCIe SSD, a 1 TB HDD, and an NFS file system.

7.3.1 Understanding the Energy-Usage of Checkpointing

First, it is important to study the energy footprint of the different facets of checkpointing systems to better understand the benefits of Power-Check. In this section, we study the energy characteristics of three aspects of checkpoint I/O: the storage media used, the parallel-write pattern adopted, and the effects of naive power-capping on checkpointing.

7.3.1.1 Impact of the storage media used

Any checkpointing mechanism has to persist data either in a globally-visible data store that is guaranteed to retain the data until it is explicitly deleted, or in a shared-nothing medium while relying on redundancy and parity schemes to tolerate data loss. Most of the popular parallel file systems satisfy these requirements, but have several shortcomings, including contention at scale and centralized points of failure. With emerging dense-node architectures, checkpointing libraries have a wider range of storage media to choose from: node-level RAM, NVM, SSDs, HDDs, etc.


Figure 7.5: Impact of storage media on the energy footprint (energy consumed in joules by the CPU and memory vs. the number of local processes, for RAM disk, SSD, HDD, and NFS).

Figure 7.5 illustrates the results from our study of the energy used when checkpointing to each of these media. For this experiment, we used the IOR benchmark suite (v3.0.1) [12] developed at LLNL. Each test involved 8 processes within a compute node sequentially writing a 64 MB file to a given device in 1 MB chunks. The energy consumed was measured using a wrapper to the job launcher, mpirun_rsh, that reads MSRs from the Sandy Bridge CPU before and after each run of the benchmark and computes the total energy consumed by the CPU as well as the memory (in joules). Each test was run five times, and the averages are shown in the graph. The most intuitive observation is that increasing the number of processes writing to a medium increases the energy consumed, by the CPU and the memory alike, regardless of the kind of storage used. Yet, this is one of the key take-aways from this experiment: by limiting the number of processes progressing the I/O operations, Power-Check reduces the energy consumption, as will be shown by the later experiments. The other observation to be made from this experiment is the behavior of SSD devices.

Figure 7.6: Impact of write patterns on the energy footprint (energy consumed in joules by the CPU and memory vs. the number of processes, for the N-N and N-1 patterns).

The energy consumed by the memory for this I/O workload when writing to the SSD was between 5.3% and 7.4% of that consumed by the CPU, depending on the number of processes involved. For the other media, however, it was consistently between 41.8% and 59.4%, depending on the number of writer processes. This goes to show that the PCIe SSD driver makes efficient use of the data path between the memory and the SSD device. Hence, it is desirable for multi-level checkpointing libraries to prefer SSDs as the local storage option (over even the RAM disk) if energy is of primary importance.

7.3.1.2 Impact of the parallel-write pattern

The other facet of the I/O phase in checkpointing is the parallel-write pattern that is adopted. This tends to be either an N-N or an N-1 pattern. In the case of the former, each application process saves the state of its computation in its own file. No other process can write to this file or read from it at any point during or after the checkpointing protocol. DMTCP follows the N-N model, and so do a majority of the MPI libraries that use BLCR.


In the case of the latter, all the application processes save their state to a single file in a globally-visible storage medium by writing to different offsets within the file, where the offset is typically a function of the process ID (the rank, in the case of MPI). A vast majority of the application-aware checkpointing codes, such as ENZO [96], adopt the N-1 pattern. This is mainly due to the convenience offered by high-level I/O libraries such as MPI-IO, which allow multiple processes to easily write to the same file in a strided or segmented manner. We used the benchmark from the previous experiment to study the energy consumption of these two write patterns. Figure 7.6 shows the results of this experiment. With an increasing number of processes, the amount of energy spent by N processes writing to a single file continually increases. The difference in energy consumed grows from 8.1% with 2 processes to 22.1% with 16 processes. When 16 processes are writing to a shared file in an NFS mount, the CPU uses 316.56 J for the N-1 pattern, but just 259.29 J for the N-N pattern. This is mainly caused by the interspersion of the parallel writes within the file, at unaligned offsets, which usually does not bode well for performance. The sharing of files also leads to locking at multiple places within the file system itself. The results also show that the write pattern adopted has minimal effect on the energy usage of the memory system, as the amount of data moved out of the processes' memory pages remains the same regardless of how the files are written.
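For concreteness, the sketch below contrasts the two patterns for a rank writing a fixed-size snapshot; the file names, the per-rank size, and the assumption that every rank writes the same amount are illustrative choices for the sketch, not part of the benchmark.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define CKPT_SIZE ((size_t)64 << 20)   /* 64 MB per process, as in the benchmark */

/* N-N pattern: each rank writes its snapshot to its own file. */
void write_n_n(int rank, const void *snap)
{
    char path[64];
    snprintf(path, sizeof(path), "ckpt.%d", rank);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    pwrite(fd, snap, CKPT_SIZE, 0);
    close(fd);
}

/* N-1 pattern: all ranks write to one shared file, at an offset that is a
 * function of the rank. */
void write_n_1(int rank, const void *snap)
{
    int fd = open("ckpt.shared", O_WRONLY | O_CREAT, 0600);
    pwrite(fd, snap, CKPT_SIZE, (off_t)rank * (off_t)CKPT_SIZE);
    close(fd);
}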

7.3.2 Evaluating Power-Check

7.3.2.1 Resource Utilization

A significant portion of the benefits obtained by using Power-Check comes from its efficient use of the system resources and from orchestrating an optimal I/O path to store the checkpoint data.


This set of experiments delves into the system-level profiling information obtained for the duration of a checkpoint benchmark with 8 processes writing a 200 MB file in parallel to a node-local HDD. The iostat utility provided by Linux was used to gather CPU and I/O statistics from the operating system. For these experiments, the different metrics were sampled once every second. Figures 7.7(a) and 7.7(b) show the rate at which write I/O requests to the HDD are completed at the block level (top half) and the total CPU utilization (bottom half), for the default case and the Power-Check architecture, respectively. As noted earlier in this chapter, the parallel writes initiated by the checkpointing benchmark use a significant portion of the CPU time, albeit in an ineffective manner. The continuous stream of small write requests initiated in the default case, as seen in the write-request completion rate, not only uses more of the CPU time (30-50%), but also causes contention at the Virtual File System layer within the kernel. Power-Check, however, coalesces the requests into a few larger requests issued from the funneling agent, which significantly reduces the contention on the CPU (down to 10%) while completing the requests sooner. This also reduces the amount of time the CPU spends idling while waiting for I/O requests; the next experiment quantifies this aspect. Figures 7.8(a) and 7.8(b) show the percentage of time that the CPU cores were idle while waiting for an I/O request, in the default and Power-Check cases, respectively. A higher value here indicates inefficient usage of the CPU. It also indicates that energy is being consumed while no effective progress is being made towards completion of the checkpoint I/O. The Power-Check architecture is able to drastically minimize the I/O wait time by streamlining the writes and scheduling the I/O requests entirely from the funneling agent, as seen in Figure 7.8(b).

Figure 7.7: CPU Utilization during checkpointing. (a) Default; (b) Power-Check.

Figure 7.8: Percentage of CPU time spent waiting for I/O requests. (a) Default; (b) Power-Check.

7.3.2.2 Evaluation with Applications

While the previous evaluations gave an insight into the performance and energy characteristics of checkpointing and of the Power-Check framework itself, it is critical to study them using real application codes that are used in the community. The Mantevo [60] miniapps suite is a well-defined set of application kernels that are representative of different domains of science and of different performance characteristics.


Figure 7.9: Application-level evaluation of Power-Check (DMTCP). (a) Energy consumption of the CPU; (b) Energy consumption of memory; (c) Performance (normalized energy in joules and normalized checkpointing time in seconds for native, naive-capped, and powercheck, for miniMD, miniFE, and CloverLeaf).

Figure 7.10: Application-level evaluation of Power-Check (BLCR). (a) Energy consumption of the CPU; (b) Energy consumption of memory; (c) Performance.

We evaluate the Power-Check framework using three core kernels from the Mantevo suite: MiniMD, MiniFE, and CloverLeaf. MiniMD is a light-weight molecular-dynamics code that derives from the original LAMMPS code. MiniFE is an approximation of an unstructured implicit finite element code that includes all of the important computational phases.


Lastly, CloverLeaf is a hydrodynamics code that investigates the behavior and response of materials under varying levels of stress. Figures 7.9 and 7.10 show the normalized energy and performance characteristics of our evaluations with these miniapps, using DMTCP and BLCR, respectively. For each of these libraries, we evaluated three cases: a) the base case, native, in which a checkpoint of the application was taken using the standard DMTCP and MVAPICH-BLCR libraries, without any I/O funneling or power caps; b) naive-capped, in which a checkpoint of the miniapps was taken with a naive power cap (of 51 W) set on the entire node for the duration of the checkpoint; and c) powercheck, in which a checkpoint of the miniapps was taken using the enhanced DMTCP and MVAPICH-BLCR libraries, which selectively cap the power limit of the non-active CPU socket when using Power-Check's I/O funneling capability. As can be seen from these results, and as one would expect, the amount of energy consumed during a checkpoint is reduced by merely setting a limit on the power (naive-capped). In the case of transparent checkpointing using DMTCP, the amount of energy consumed by each node for a single checkpoint of the application (native) was 130.66 J, 155.68 J, and 51.75 J for the MiniMD, MiniFE, and CloverLeaf applications, respectively. With naive capping applied, the energy consumption reduced to 103.59 J, 116.76 J, and 48.2 J, respectively. Across the applications, naive capping was able to reduce the energy consumption by up to 25% per node for a single checkpoint. It does, however, hurt the performance of the checkpointing operation by reducing the I/O throughput and consequently increasing the time to persist the checkpoint to disk. The checkpointing time under power capping increased by up to 11% across these three applications. A similar trend can also be observed with naive-capped in the MVAPICH-BLCR case.

trend can also be observed with naive-capped in the MVAPICH-BLCR case. Here, capping was able to reduce the energy-consumption per checkpoint by up to 24%, but at a loss of performance by up to 9%. More restrictive capping hurts the performance severely, to a point where it is no longer a viable solution. As can be seen from Figures 7.9(b) and 7.10(b), the naive-capping scheme also has a detrimental impact on the energy consumed by the memory subsystem. It increases by up to 6% in case of DMTCP, and up to 9% in case of BLCR. Now, in case of Power-Check, there are clear gains, both in terms of energy-usage, as well as in that of performance of the checkpointing operation. For instance, with DMTCP, the energy-usage of a single node’s CPU reduced to 84.24J with MiniMD, to 104.24J with MiniFE, and to 26.95J with CloverLeaf, in comparison to the native checkpoints. In other terms, Power-Check was able to reduce the energy consumption by up to 48% for a single checkpoint in comparison to the native scheme, as opposed to the 25% gains obtained with the naive-capped scheme. The key benefit of Power-Check comes from the fact that these gains do not have a negative impact on performance. In fact, it was able to decrease the checkpoint time by 14% with MiniMD, 8% with MiniFE, and 34% with CloverLeaf. Moreover, the tandem usage of funneling and selective power-capping noticeably improves the energy usage of the memory system by up to 30% as well.
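The selective capping described above amounts to lowering the package power limit of one CPU socket, which modern Intel processors expose through the Running Average Power Limit (RAPL) capability [10, 51]. The fragment below is only a minimal sketch of how such a per-socket cap could be actuated from user space through the Linux powercap sysfs interface; it is not the Power-Check implementation, and the socket index, helper names, hard-coded 51W value, and exact sysfs path layout are assumptions that may vary across kernels and platforms.

    #include <stdio.h>

    /* Write a package-level RAPL power limit (in microwatts) for one CPU socket
     * through the Linux powercap sysfs interface. Returns 0 on success.
     * The path layout assumes the intel_rapl driver and may differ across kernels. */
    static int set_pkg_power_limit_uw(int socket, long long limit_uw)
    {
        char path[256];
        snprintf(path, sizeof(path),
                 "/sys/class/powercap/intel-rapl:%d/constraint_0_power_limit_uw", socket);
        FILE *fp = fopen(path, "w");
        if (!fp)
            return -1;                      /* typically requires root privileges */
        int rc = (fprintf(fp, "%lld", limit_uw) > 0) ? 0 : -1;
        fclose(fp);
        return rc;
    }

    static long long get_pkg_power_limit_uw(int socket)
    {
        char path[256];
        long long val = -1;
        snprintf(path, sizeof(path),
                 "/sys/class/powercap/intel-rapl:%d/constraint_0_power_limit_uw", socket);
        FILE *fp = fopen(path, "r");
        if (fp) {
            if (fscanf(fp, "%lld", &val) != 1)
                val = -1;
            fclose(fp);
        }
        return val;
    }

    /* Hypothetical usage around the I/O phase of a checkpoint: cap only the socket
     * that is not driving the funneled I/O, then restore the original limit. */
    void checkpoint_io_phase_with_cap(int idle_socket)
    {
        long long saved = get_pkg_power_limit_uw(idle_socket);
        set_pkg_power_limit_uw(idle_socket, 51LL * 1000 * 1000);   /* 51 W, as in the evaluation */

        /* ... funnel checkpoint data to storage here ... */

        if (saved > 0)
            set_pkg_power_limit_uw(idle_socket, saved);            /* restore the original cap */
    }

Programming the corresponding RAPL model-specific registers (e.g., MSR_PKG_POWER_LIMIT) directly would be an alternative actuation path to the same effect.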

7.4 Related Work

With the US Department of Energy classifying system power as a first-class constraint on exascale system performance and effectiveness, there have been multiple studies in the literature that have evaluated the power requirements of the various software and hardware components in the supercomputing environment and identified opportunities to reduce them. Freeh et al. first proposed the use of micro-operations-per-memory-load (UPM) to assign suitable frequencies for energy conservation in power-scalable clusters [62]. This technique requires the use of code instrumentation, which can be arduous for large codes. Lim et al. proposed application-transparent methods of identifying regions within MPI programs and assigning appropriate frequencies to them [85] to conserve energy. The work does require empirically arriving at 'closeness' and 'long enough' parameters, in addition to formulating a function that maps micro-ops-retired to a suitable P-state. In Jitter, Kappiah et al. target MPI processes that are not in the critical path for frequency scaling, so that they arrive at an MPI call 'just in time', and hence conserve energy in programs that suffer from load imbalance [79]. The work assumes that stable iterative phases dominate the core execution time of an application. Rountree et al. proposed Adagio, which identifies slack in MPI programs without global communication and uses a calltrace hash to predict the execution time and energy patterns of recurring 'tasks' to conserve energy in primarily iterative applications [113].

More recently, researchers have studied the energy footprint of system-level fault-tolerance mechanisms such as checkpoint-restart, process migration, and message logging. Diouri et al. [53, 54] study the energy requirements of checkpointing and message logging and provide suggestions on which protocol to adopt based on the available form of local storage media and on the amount of data exchanged by the application. The study focuses on the MPICH-BLCR implementation of checkpointing MPI jobs, and uses an out-of-band energy-monitoring system. Meneses et al. [90] performed a similar study comparing three protocols: checkpoint-restart, message-logging, and parallel recovery. Their study proposes an energy-consumption model and concludes that checkpointing consumes the most energy of the three protocols. Along the same lines, Mills et al. [92] examine the per-component energy consumption trends of a coordinated checkpointing protocol, including the CPU, memory, network, etc., again using out-of-band energy-monitoring hardware modules [74]. Ibtesham et al. [73] present a coarse-grained model to calculate the energy consumption of applications when checkpoint-compression strategies are employed. Using this model, they show that the energy savings due to reduced wall-clock time outweigh the additional cost of compression, which is a CPU-intensive task. Saito et al. [114] have studied the energy consumption of checkpoint I/O to a PCIe-based NAND-flash memory system. They use a model-driven approach to dynamically vary the CPU frequency and the number of I/O threads when checkpointing, based on a Markov model that they have developed. They have also compared their approach to the preset DVFS governors provided by the system kernel and demonstrated noticeable improvements.

Summary

In addition to looking at performance and scalability, this thesis also tackles the challenge of energy-efficient checkpointing for HPC applications. This chapter described Power-Check, a novel and generic power-aware checkpointing framework which supports both system-level and application-aware checkpointing. Power-Check uses intelligent data-funneling mechanisms and selective power-capping to reduce the CPU utilization during the I/O phases of CR, thus increasing the energy savings without negatively affecting performance. Furthermore, this work extends two widely-used checkpointing libraries, BLCR and DMTCP, to support monitoring and actuation of the energy consumed. The evaluation with three different application kernels of the Mantevo miniapps suite (MiniFE, MiniMD and CloverLeaf) shows that Power-Check can achieve as much as 48% energy savings during a checkpoint, while improving checkpointing time by 14%. In contrast, a naive power-capping scheme achieves just a 25% reduction in energy usage while increasing the checkpointing time by 9%. We are actively working on evaluating our work on larger-scale systems.


Chapter 8: FTB-IPMI: Low-Overhead Fault Prediction

Although most of the individual hardware and software components within a cluster implement mechanisms to provide some level of fault tolerance, these components work in isolation. They work independently, without sharing information about the faults they encounter. This lack of system-wide coordination of fault information has emerged as one of the biggest problems in leadership-class HPC systems. Furthermore, fault prediction is a challenging issue that several researchers are trying to address. Fault-prediction models and toolkits will have to work in unison with fault coordination and propagation frameworks, allowing system middleware to make informed decisions. This will also make way for proactive measures and actions that provide resiliency to end-user applications. In this context, our work presented in this chapter addresses the following questions:

1. Can a scalable, light-weight tool that provides services like distributed fault monitoring, failure event propagation and failure prediction be designed?

2. How can existing HPC middleware leverage fault-information from such a service to provide preemptive fault-tolerance?

Figure 8.1 illustrates the fundamental contributions of this work and shows how existing software stacks can make use of the proposed service, namely FTB-IPMI. FTB-IPMI uses two key technologies - the Fault-Tolerance Backplane (FTB) developed as a part of the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) initiative [1] to propagate fault information, and the IPMI interface standard to monitor system events. These two technologies are explained in depth in Chapter 2. A rule-based prediction engine has been developed as part of the MVAPICH2 [17] MPI library to demonstrate how it can benefit from the failure information provided by FTB-IPMI. Such an integration to provide fault-resilience capabilities is possible with any FTB-enabled software that can subscribe to events from FTB-IPMI. Section 8.1.5 discusses how existing FTB-enabled middleware can benefit from FTB-IPMI.

Figure 8.1: FTB-IPMI Architecture

8.1 Design and Implementation

FTB-IPMI is designed to run as a single stand-alone daemon which handles multiple operations, such as reading IPMI sensors, classifying events based on severity, and propagating the fault information via FTB. A single instance of the FTB-IPMI daemon running on one node can manage an entire cluster.

Figure 8.2: FTB-IPMI Work-flow. The diagram shows a set of computing nodes running applications and MPI libraries, the process launcher and job scheduler, the front-end node hosting the FTB-IPMI daemon (built on the FreeIPMI library and the FTB client library), the IPMI physical network connecting the daemon to the nodes, and the Fault Tolerance Backplane (FTB) composed of multiple FTB agents; the markers (A)-(D) correspond to the steps listed below.

Once initialized, the following actions are performed at periodic user-set intervals (as illustrated in Figure 8.2):

(A) Querying IPMI: Collects the values and states of IPMI sensors on all nodes

(B) Sensor State Analysis: Analyzes collected data and identifies relevant events that need to be propagated

(C) Event Publication: Publishes FTB events that correspond to observed sensor state changes

(D) Component Notification: The FTB framework notifies all FTB-enabled system components that subscribe to events from FTB-IPMI


Figure 8.2 describes the work-flow of FTB-IPMI running on a representative computing cluster installation with a front-end node and a set of compute nodes. In such a typical configuration, FTB-IPMI runs on the head node as a central service to monitor all the compute nodes. The following sections describe steps (A) through (C) in detail.

8.1.1 Querying IPMI

FTB-IPMI gathers information about the IPMI sensors of all the nodes using the FreeIPMI library described in Section 2.12. Privileged access to the compute nodes or to the node which hosts the FTB-IPMI daemon is not required, as the sensor data is only being fetched and not written. However, FreeIPMI's out-of-band monitoring interface requires username/password-based authentication to communicate with remote BMCs. FTB-IPMI takes a host-list as a mandatory argument from the user during startup. This host-list should contain the hostnames/IPs of the nodes that need to be monitored for faults by FTB-IPMI. Typical supercomputing systems have a dedicated Ethernet network for IPMI traffic. In such cases, the hostname or the IP address corresponding to this network would have to be specified. Based on this host-list, an internal task-list is generated. Each task corresponds to an IPMI query that must be performed. In order to efficiently collect data from a large list of hosts, the FTB-IPMI daemon is multi-threaded. The number of threads is defined internally during initialization, or via a user-provided configuration file. Threads in this context refer to a pool of worker threads that pick tasks from the task-list as they become idle. Such a mechanism also provides load-balancing amongst the threads. Once assigned a list of tasks, each worker thread uses the out-of-band monitoring interface provided by IPMI to query sensor data from the nodes in its list. On fetching sensor data from a given node, it is removed from the task list. Once


the task list is empty, data collection is considered complete. The internal task list gets regenerated for each periodic iteration. Sensor data queries posted to IPMI using the FreeIPMI library are blocking in nature. A querying thread blocks on the response from a remote BMC without querying any other BMC. Using multiple threads allows several BMCs to be queried in an overlapped manner. Furthermore, these multiple worker threads provide some level of tolerance to node failures. If a node stops responding to IPMI requests, the worker thread handling this IPMI query will be blocked until the operation times out. However, the other threads can progress with their tasks, thereby limiting the delay induced by the node failure. The performance impact of having multiple worker threads is evaluated by means of experimentation in Section 8.2.
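The worker-pool pattern described above can be summarized with a short sketch. The code below is an illustration of the idea rather than the FTB-IPMI source: the host names, the thread count, and the query_sensors_out_of_band() placeholder (which stands in for the blocking FreeIPMI out-of-band sensor read) are assumptions, and the real daemon reuses its threads across iterations instead of recreating them.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Hypothetical IPMI hostnames taken from the user-supplied host-list. */
    static const char *host_list[] = { "node001-ipmi", "node002-ipmi", "node003-ipmi" };
    static const int num_hosts = 3;

    static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_task;                 /* index of the next host to query */

    /* Placeholder for the blocking FreeIPMI out-of-band sensor sweep of one BMC. */
    static void query_sensors_out_of_band(const char *host)
    {
        printf("querying sensors on %s\n", host);
    }

    /* Each worker repeatedly grabs the next pending host until the task list is drained. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&task_lock);
            int task = (next_task < num_hosts) ? next_task++ : -1;
            pthread_mutex_unlock(&task_lock);
            if (task < 0)
                break;                                     /* task list empty: sweep complete */
            query_sensors_out_of_band(host_list[task]);    /* blocks until the BMC answers or times out */
        }
        return NULL;
    }

    /* One periodic iteration: regenerate the task list and run the worker pool. */
    static void sensor_sweep(void)
    {
        pthread_t tid[NUM_THREADS];
        next_task = 0;
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);
    }

    int main(void)
    {
        sensor_sweep();    /* in the daemon this repeats after the user-set iteration delay */
        return 0;
    }

Because a slow or unresponsive BMC only blocks the one worker that happens to be querying it, the remaining workers keep draining the task list, which is the tolerance property described above.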

8.1.2 Sensor State Analysis

Table 8.1 shows an example of the data collected by FreeIPMI for a given node in a single iteration. In addition to the sensor name, information such as its Type, its current Value, the Unit, and its State are fetched. The State of a sensor is decided by FreeIPMI, based on its current value and a sensor-specific threshold. These thresholds and the FreeIPMI behavior can be customized for each node using a configuration file (freeipmi_interpret_sensor.conf). From the sample readings listed in the table, we can notice that some sensors are in the Critical state and report a value of "0". This indicates that the particular hardware sensor unit is not available. This case is quite frequent and can be safely ignored.

Sensor Name          Type               State      Value     Unit
CPU1 Temperature     Temperature        Nominal    25.00     C
CPU2 Temperature     Temperature        Nominal    26.00     C
TR1 Temperature      Temperature        Critical   0.00      C
TR2 Temperature      Temperature        Critical   0.00      C
VCORE1               Voltage            Nominal    0.94      V
VCORE2               Voltage            Nominal    0.94      V
+1.5V ICH            Voltage            Nominal    1.53      V
+1.1V IOH            Voltage            Nominal    1.10      V
+3.3VSB              Voltage            Nominal    3.22      V
+3.3V                Voltage            Nominal    3.24      V
+12V                 Voltage            Nominal    12.10     V
VBAT                 Voltage            Nominal    3.22      V
+5VSB                Voltage            Nominal    4.96      V
+5V                  Voltage            Nominal    4.99      V
P1VTT                Voltage            Nominal    1.14      V
P2VTT                Voltage            Nominal    1.14      V
+1.5V P1DDR3         Voltage            Nominal    1.50      V
+1.5V P2DDR3         Voltage            Nominal    1.50      V
FRNT FAN1            Fan                Nominal    9840.00   RPM
FRNT FAN2            Fan                Critical   0.00      RPM
FRNT FAN3            Fan                Nominal    9840.00   RPM
FRNT FAN4            Fan                Nominal    9520.00   RPM
CPU1 ECC1            Memory             Nominal    N/A       N/A
CPU2 ECC1            Memory             Nominal    N/A       N/A
Chassis Intrusion    Physical Security  Critical   N/A       N/A

Table 8.1: FTB-IPMI Sensor Readings from a single compute-node on Cluster A (See Section 8.2)

For fault-tolerance purposes, the most relevant information is the state change of a sensor rather than its actual value. Based on this observation, FTB-IPMI is designed to maintain the history of states that each sensor shifts through. This allows for easy detection of state changes and also provides a framework for history-based failure prediction. Table 8.2 summarizes the set of internal rules based on which FTB-IPMI generates FTB events. When a sensor moves to the Warning or Critical state, an event with severity WARNING (and not FATAL) is generated. We consider that a sensor in the Warning or Critical state is just an indication of a potential failure and not an actual failure. It is also important to generate an FTB event when a sensor goes back to the Nominal state, in order to roll back or cancel any recently triggered (and possibly incomplete) preventive action.

IPMI Sensor Event           FTB Action
State change to Nominal   ⇒ Publish event (Severity INFO)
State change to Warning   ⇒ Publish event (Severity WARNING)
State change to Critical  ⇒ Publish event (Severity WARNING)
Read Error                ⇒ Publish event (Severity CRITICAL)

Table 8.2: FTB-IPMI rules to generate FTB event
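Taken together, the state-history tracking and the rules of Table 8.2 reduce to a small per-sensor transition filter. The following fragment is a simplified illustration under stated assumptions, not the FTB-IPMI implementation: ftb_publish() is a hypothetical stand-in for the FTB client library's publish call, the structures are illustrative, and the event-name spellings follow Table 8.3.

    #include <stdio.h>

    typedef enum { STATE_NOMINAL, STATE_WARNING, STATE_CRITICAL, STATE_READ_ERROR } sensor_state_t;

    struct sensor_history {
        char name[64];                 /* sensor name, e.g. "CPU1 Temperature" */
        sensor_state_t last_state;     /* most recent state seen for this sensor */
    };

    /* Hypothetical stand-in for the FTB client library's event-publication call. */
    static void ftb_publish(const char *event, const char *severity, const char *payload)
    {
        printf("[%s] %s: %s\n", severity, event, payload);
    }

    /* Apply the rules of Table 8.2: publish an FTB event only when a sensor changes state. */
    void on_sensor_reading(struct sensor_history *h, sensor_state_t new_state, const char *payload)
    {
        if (new_state == h->last_state)
            return;                    /* no state change: nothing worth propagating */

        switch (new_state) {
        case STATE_NOMINAL:            /* back to normal: lets subscribers cancel preventive actions */
            ftb_publish("IPMI_SENSOR_STATE_NOMINAL", "INFO", payload);
            break;
        case STATE_WARNING:
            ftb_publish("IPMI_SENSOR_STATE_WARNING", "WARNING", payload);
            break;
        case STATE_CRITICAL:           /* still only a potential failure, hence WARNING and not FATAL */
            ftb_publish("IPMI_SENSOR_STATE_CRITICAL", "WARNING", payload);
            break;
        case STATE_READ_ERROR:
            ftb_publish("IPMI_ERROR", "CRITICAL", payload);
            break;
        }
        h->last_state = new_state;
    }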

8.1.3 FTB Event Publication

The FTB events are published using the FTB client library, which automatically connects to FTB agents and forwards the events to all interested parties. Publishing an event is a non-blocking action that can also tolerate failures such as network faults.

Event Name                    Payload                                              Severity
IPMI_SENSOR_STATE_NOMINAL     Hostname, SensorID, Name, Type, Value, Unit, State   INFO
IPMI_SENSOR_STATE_WARNING     Hostname, SensorID, Name, Type, Value, Unit, State   WARNING
IPMI_SENSOR_STATE_CRITICAL    Hostname, SensorID, Name, Type, Value, Unit, State   WARNING
IPMI_ERROR                    Hostname, Error                                      CRITICAL

Table 8.3: FTB Events published by FTB-IPMI

The details of the published FTB events and their payload are given in Table 8.3. The payload includes as much information as possible, including the hostname and the sensor ID. Components that subscribe to events from FTB-IPMI can build logical prediction engines by correlating events, the history of sensor states and the sensor information that is packed into the payload.


8.1.4 Rule-Based Prediction in MVAPICH2

In order to demonstrate this capability, we have developed a rule-based prediction engine within the MVAPICH2 MPI library. This prediction engine uses the information provided by FTB-IPMI to predict impending node failures. The prediction from this engine is used to trigger a proactive process-migration protocol which migrates MPI processes from a failing node to a healthy spare node. This prediction engine is an FTB-enabled component that subscribes to events from FTB-IPMI in the FTB.IPMI namespace and publishes prediction information for the MVAPICH2 library to use in the FTB.MPI.MVAPICH2 namespace. The PREDICTOR_NODE_FAILURE event is published with the failing node's hostname and the sensor state information as payload. This component gathers state-history information from the events published by FTB-IPMI, and applies certain rules to predict failures. One such rule expects at least 3 events of WARNING severity from the same sensor to be generated before it deems that hardware to be failing. This rule ignores CRITICAL events from sensors having a reading of "0" in order to curb spurious predictions. In the case of subsequent WARNING events related to the system fan speeds, a failure prediction is not generated unless a rise in CPU temperatures is observed as well. This rule ensures that the automatic fan-speed control mechanisms [36, 127] employed by modern motherboards do not lead to false predictions. Similar logical rules can be added to this fabric to increase the accuracy of predictions, and to completely suppress false alarms. This component can also be ported to any FTB-enabled software. Although there is a direct mapping between the rules and the actions, this engine enables an end fault-tolerance mechanism to proactively predict a failure before it occurs, as sketched below. Section 8.2.3 shows a demonstration of how this component in MVAPICH2 is used to provide proactive fault-tolerance using a job-migration framework.
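A condensed view of this rule set is sketched below. It only illustrates the logic described above and is not the MVAPICH2 source; the structure fields, the warning-threshold constant, and the publish_node_failure_prediction() helper are illustrative assumptions.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define WARNING_THRESHOLD 3       /* WARNING events required from the same sensor */

    struct sensor_track {
        char name[64];
        int  warning_count;           /* WARNING events seen so far for this sensor */
        bool is_fan;                  /* fan-speed sensors need a corroborating CPU-temperature rise */
    };

    /* Hypothetical helper that publishes a PREDICTOR_NODE_FAILURE event
     * to the FTB.MPI.MVAPICH2 namespace. */
    static void publish_node_failure_prediction(const char *host, const char *sensor)
    {
        printf("predicting failure of %s (sensor %s)\n", host, sensor);
    }

    /* Apply the rule set to one incoming FTB-IPMI event. */
    void on_ftb_ipmi_event(struct sensor_track *s, const char *host,
                           const char *severity, double value, bool cpu_temp_rising)
    {
        /* Rule: CRITICAL events with a reading of 0 indicate an absent sensor; ignore them. */
        if (strcmp(severity, "CRITICAL") == 0 && value == 0.0)
            return;

        if (strcmp(severity, "WARNING") != 0)
            return;

        /* Rule: fan-speed warnings alone are not trusted, since automatic fan-speed
         * control can legitimately change RPMs; require a CPU-temperature rise too. */
        if (s->is_fan && !cpu_temp_rising)
            return;

        /* Rule: only predict after enough warnings from the same sensor. */
        if (++s->warning_count >= WARNING_THRESHOLD)
            publish_node_failure_prediction(host, s->name);
    }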

8.1.5 Applications of FTB-IPMI

Several high-performance middleware components, including MPI libraries, checkpointing libraries, network monitoring systems, math libraries, resource managers and parallel applications, have been FTB-enabled. Any of these components can communicate through the backplane to publish and/or subscribe to fault information. This section discusses the applications of the FTB-IPMI service in the context of these FTB-enabled software packages.

The Message Passing Interface (MPI) is one of the most important programming models in high-performance computing. MVAPICH2, MPICH2, and OpenMPI are three of the most popular MPI implementations that dominate the high-performance computing space. All of these MPI libraries have incorporated FTB into their stack. MVAPICH2 and OpenMPI have support for preemptive process migration, which relies on a prediction of node failure. By subscribing to events from FTB-IPMI, these libraries can predict impending failures and trigger the migration protocol, thereby providing resiliency to the end-user application.

BLCR, the Berkeley Lab Checkpoint/Restart library for Linux, is one of the most prominent software packages available for system-level checkpointing. Several high-level libraries, including MPI implementations, use it to ensure a certain degree of fault tolerance in their software. BLCR has adopted FTB to propagate failure information and other checkpointing-related events.

The FT-LA software package is a dense linear algebra library that features algorithm-based fault-tolerant routines. The integration of FTB in FT-LA encompasses an event schema that includes events such as the ability to recover from failures, the time to completion depending on the number of available spare resources should a failure occur, and notification of successful or unsuccessful recoveries. When a node-failure is predicted using data


from FTB-IPMI, the FT-LA library can transfer the complete resource dataset to a reserve node pointed to by the resource manager or MPI library. SLURM is a popular, open-source resource manager and job scheduler developed by Lawrence Livermore National Laboratory. SLURM is FTB-enabled by means of a new notifier plugin to announce events to FTB. Notifications related to monitoring of resources, scheduling of jobs, and failure events internal to SLURM are supported. The SLURM controller daemon, slurmctld, publishes these events to FTB through its various hooks using the notifier plugin. FTB-aware components interested in these events can thus track resource changes, job status and SLURM failures. FTB-IPMI and SLURM can benefit from each other by correlating events from their respective namespaces before predicting a node-failure.

8.2 Experimental Evaluation

In this section, we discuss experimental results that demonstrate the capabilities and evaluate the performance impact of FTB-IPMI. The computing cluster used for these evaluations, Cluster A, is a 160-node Linux-based system. Each compute node has eight Intel Xeon cores organized as two sockets with four cores per socket, and has 12 GB of memory. The nodes are equipped with InfiniBand QDR Mellanox ConnectX-2 HCAs. The operating system used is Red Hat Enterprise Linux Server release 6 with the 2.6.32-71.el6.x86_64 kernel. The cluster also has a separate Ethernet network for the IPMI system to keep its operations independent of user traffic on the regular network.

8.2.1 Resource Utilization

In these experiments, the CPU utilization of FTB-IPMI was profiled. Figure 8.3 compares the real-time CPU usage of an instance of FTB-IPMI which monitors 128 compute nodes using 128 and 64 worker threads. The initial spike that goes up to 39.4% and 24.6%, respectively, is from the operation of spawning all the worker threads during startup. Subsequent smaller spikes are from multiple iterations of sensor sweeps performed by FTB-IPMI. For the purpose of experimentation, each iteration is separated by a 2-second delay. In a typical installation, this iteration delay has to be decided based on the MTBF of the system being monitored, and on the latency with which failure predictions are needed in order to take appropriate preventive measures. The effect of having multiple client nodes assigned to each worker thread is seen in the case of 64 threads, where each thread has to monitor two hosts from the task-list. As discussed in Section 8.1.1, having multiple threads allows several remote BMCs to be queried in an overlapped manner. This explains the additional spikes in the 64-thread case. In this configuration, about 3.7% of the available CPU resources on the node hosting FTB-IPMI are used during a single iteration of querying the sensors.

Figure 8.3: Real-Time CPU Usage (CPU utilization over time for 128 and 64 worker threads)

However, to fully capture the CPU utilization of the FTB-IPMI service, the average CPU usage of the daemon was measured over time, and is plotted in Figure 8.4. For this experiment, the iteration delay was set to 10 seconds and the CPU utilization values were measured for up to 10 iterations. This graph shows an initial spike as well, which can be attributed to the phase where multiple worker threads are spawned. The average CPU utilization increases marginally with increasing number of threads. Considering the case with 128 threads, for instance, as execution progresses over the 10 iterations, the average CPU usage stabilizes at about 1.5%. This would be much lower in a realistic deployment, which would have a much larger delay between iterations. This indicates that the FTB-IPMI service has a very low system-utilization footprint and does not starve other services and libraries of resources. It will not affect any compute-intensive user applications that run on the compute nodes either, as the daemon runs only on the head-node or management-node of a cluster.

Figure 8.4: Average FTB-IPMI CPU Usage

Network resources are critical in a supercomputing environment. However, the communication with the different IPMI BMCs used by FTB-IPMI is carried out over a separate Ethernet network. This ensures that the IPMI out-of-band monitoring traffic does not congest the network that is used for communication by the user application. Given this exclusivity of the network resource, we have not profiled its usage in this work.

8.2.2 Scalability

The ability to scale to a large number of nodes was one of the design goals of FTB-IPMI. In order to demonstrate the scalability of our tool, we conducted several experiments, varying the number of worker threads used and the number of compute nodes monitored. Figure 8.5 provides a comprehensive summary of how FTB-IPMI scales as the number of nodes and the number of threads increase. In this experiment, the number of worker threads used is either equal to or less than the number of nodes in the host list. As the trends in the results indicate, even as the number of nodes increases geometrically, the sensor sweep times increase only sub-linearly.

Figure 8.5: Scalability with Multiple Threads (sensor sweep time as the number of monitored nodes grows from 1 to 128, for 1 to 128 worker threads)

To get a deeper understanding of how the multi-threaded design of FTB-IPMI enhances the scalability of the service, we varied the number of threads used to query sensors across 128 compute nodes from 1 to 128 (see Figure 8.6). The time taken for the execution of a single iteration drops significantly when multiple threads are used. However, adding more threads beyond a certain point does not reduce the sweep time further. This is due to contention at the thread level, where resource allocations get multiplexed. A single iteration of FTB-IPMI to read all the sensor readings from 128 nodes takes just 0.75 seconds with 128 threads.

8.2.3 Proactive Process Migration in MVAPICH2

Section 8.1.4 describes the fault-prediction engine integrated into the MVAPICH2 library. To study how this component interacts with the FTB ecosystem and system software to provide proactive fault-tolerance, we ran the Tachyon [129] MPI ray-tracing application on 128 nodes of Cluster A. Multiple IPMI events indicating a rise in CPU Temperatures (WARNING severity) were injected into the FTB-IPMI service in order to simulate

Figure 8.6: Execution times for Single Iteration (sweep time across 128 nodes as the number of worker threads varies from 1 to 128)

a potential failure. On seeing three consecutive WARNING events from IPMI, the prediction engine determines that one of the predefined rules has been satisfied, and generates a PREDICTOR_NODE_FAILURE FTB event with information about the failing node as payload. This event is then published in the FTB namespace, and delivered to all components that have subscribed to events from the prediction engine. The mpirun_rsh process manager in MVAPICH2, which is one of these subscribers, uses the prediction event as a cue to initiate the process-migration protocol, which moves processes from the failing node to a healthy spare node. In Figure 8.7, the X-axis marks the progress of the application execution in seconds and the Y-axis marks the computation progress in terms of percentage completed. The vertical bars represent either a checkpoint in progress or a preemptive process migration in progress. The labels marked by pointers from these bars indicate the standardized FTB events that

Figure 8.7: Prediction-Triggered Preemptive Fault-Tolerance in MVAPICH2

are published by the MPI library for other system components to make use of when taking preventive or reactive measures to handle failures. In this experiment, the MPI library is configured to record periodic checkpoints of the application. In the absence of a fault-prediction mechanism, the application would be rolled back to a particular checkpoint in case of a failure, from which point the execution resumes during recovery. However, with the help of the failure prediction generated by the rule-based prediction engine, the process-migration protocol is triggered, on completion of which the job resumes execution from the same point at which it was suspended. This protocol significantly improves application resiliency with minimal overhead to application performance.


8.3 Related Work

There are several services and tools that use IPMI to monitor node health. Ganglia [88] and Nagios [18] are two tools that are widely used for node-health monitoring in HPC and Grid computing systems. Ganglia is a scalable distributed monitoring system, where each node monitors itself and sends the data to all other nodes. OVIS 2 [44] is a hierarchical monitoring and analysis tool that collects system health information directly from nodes or from other monitoring solutions, such as Ganglia, for processing with statistical methods and for graphical presentation. The FTB-InfiniBand monitoring software (FTB-IB) [8] publishes fault information related to InfiniBand adapter availability/unavailability, activation status of InfiniBand ports, status of InfiniBand adapter local IDs and protection keys, and information on subnet manager changes as events to the FTB framework. Although there are several related tools and services, FTB-IPMI is the first of its kind that provides a framework for fault monitoring and prediction using the Fault-Tolerance Backplane and Intelligent Platform Management Interface technologies. Supermon [66] is another monitoring system for Linux systems, which acts as a performance monitoring server that queries individual compute nodes and gathers the data, thereby minimizing the number of queries that are sent directly to the compute nodes. The Cluster Systems Management (CSM) [2] tool, designed for AIX systems, enables low-cost management of distributed and clustered IBM Power systems by providing a single point of control that allows fast responses and consistent policies, updates and monitoring by a small staff.

There is constant research in identifying accurate failure-prediction methodologies. Libby et al. discuss the different features provided by IPMI and how system software can leverage these features to predict failures [84]. Leangsuksun et al. have developed a fault-monitoring system [82] as part of the Open Source Cluster Application Resources (OSCAR) project, using the Hardware Platform Interface (OpenHPI) [50]. There have also been efforts to predict system failures based on system logs [134, 124, 142] and Support Vector Machines [63]. Several designs have been proposed in the literature for preemptive migration. The MVAPICH2 and OpenMPI MPI implementations have process-migration support based on the Berkeley Lab Checkpoint/Restart (BLCR) library. The authors of the work discussed in [57] classify preemptive migration into different categories based on the monitoring capabilities available to the compute nodes. Virtual machine migration using VMM-bypass is also available in the Xen framework [69].

8.4 Summary

This chapter presented the design and implementation of FTB-IPMI, a light-weight service for HPC clusters which aids in hardware fault-monitoring and fault-information dissemination. In addition to this service, we have also proposed a portable rule-based fault-prediction engine that can be adopted by any FTB-enabled system software to assist preventive fault-tolerance protocols. Experimental results clearly show that the service is scalable and uses minimal system resources during its operation. As part of future work, we plan to enhance the rule-based prediction engine within the MVAPICH2 library to reduce false predictions by taking historical information from system logs into account. We also plan on adding support to correlate events from other FTB-enabled components, such as FTB-IB and job schedulers, to increase the accuracy of failure predictions. With FTB being supported on a variety of other system architectures such as IBM BlueGene (P and L) and Cray, we would like to explore the possibility of porting FTB-IPMI to these HPC systems.


Chapter 9: Conclusions and Contributions

With the abundance of compute capabilities, compute-bound problems are now turning into I/O-bound ones. With the extreme processing parallelism provided by modern CPU architectures, there is no dearth of computational power. This trend will be prevalent in future supercomputing architectures as well. Parallel I/O often tends to be the bottleneck in application execution pipelines. Nevertheless, it is a key step that applications and middleware require to read input data, write intermediate data that can be used for in-situ visualization and analysis, write checkpoint snapshots to be able to recover in the event of failures, and manage out-of-core data in cases where the working dataset does not fit into the processor's physical memory. It is also clear that persistence is a key ingredient of any protocol or mechanism that is designed to handle failures. Any fault-tolerating mechanism has to persist either some amount of metadata or a large amount of actual application data itself, in a globally-visible or accessible data-store that is guaranteed to preserve the data until it is explicitly deleted. Existing I/O middleware are not versatile enough to handle this paradigm shift in memory and storage hierarchies. This dissertation takes on this challenge by proposing cross-layer solutions that efficiently handle such a diverse I/O and storage environment while improving performance, scalability, and energy-efficiency.


Using Stage-FS, this dissertation explored several design alternatives to develop a hierarchical data-staging framework to alleviate the bottleneck caused by heavy I/O contention for shared storage when multiple processes in an application dump their respective checkpoint data. Using the proposed framework, we have studied the scalability and throughput of hierarchical data staging and the merits it offers when it comes to handling large amounts of checkpoint data. We have evaluated the checkpointing times of different applications, and have noted that they are able to resume their computation up to 8.3 times faster than they normally would in the absence of data staging. This clearly indicates that Checkpoint-Restart mechanisms can indeed benefit from hierarchical data staging.

With the Stage-QoS design, we have developed a data-staging framework that takes advantage of the QoS features of the InfiniBand network fabric to reduce contention in the network by isolating the I/O data flow from the MPI communication flow. This is a portable solution that can work with any MPI library and any backend parallel file system in a pluggable manner. We have also studied the impact of our solution with representative micro-benchmarks and real applications. Experimental results show that with the proposed solution, the point-to-point latency of MPI applications in the presence of I/O traffic can be reduced by up to 320 microseconds for a 4MB message size, and the corresponding bandwidth can be increased by up to 674MB/s. Collective operations such as MPI_Alltoall also benefit from this work, with their latency reduced by about 235 microseconds in the presence of file system noise. The runtime of the AWP-ODC MPI application in the presence of I/O traffic was reduced by about 9.89%. The time spent in communication by the CG kernel with I/O traffic was reduced by 23.46%.

The dissertation also proposes a new file system called CRUISE to extend the capabilities of multilevel checkpointing libraries used by today's large-scale HPC applications.

CRUISE runs in user-space for improved performance and portability. It performs over twenty times faster than a kernel-based RAM disk, and it can run on systems where RAM disk is not available. CRUISE stores file data in main memory, and its performance scales linearly with the number of processors used by the application. To date, we have benchmarked its performance at 1 PB/s, at a scale of 96K nodes with three million MPI processes writing to it. CRUISE implements a spill-over capability that stores data in secondary storage, such as a local SSD, to support applications whose checkpoints are too large to fit in memory. CRUISE also allows Remote Direct Memory Access to file data stored in memory, so that multilevel checkpointing libraries can use processes on remote nodes to copy checkpoint data to slower, more resilient storage in the background of the running application.

We outlined and analyzed the intrinsic and extrinsic issues that limit I/O performance when checkpointing parallel applications on Xeon Phi clusters. We proposed MIC-Check, a novel checkpointing framework that works around these limitations and provides scalable I/O performance on these systems. The proposed checkpointing framework provides a 35x improvement in aggregate I/O throughput with 16 processes running on a Xeon Phi, and a 54x improvement with 4096 MPI processes running on 256 MICs. We have demonstrated the benefits of MIC-Check with both application-level and system-level checkpointing using end applications. In ENZO, a real-world astrophysics application, our framework improves the checkpointing time by 30x. With P3DFFT, a widely used FFT library, MIC-Check improves system-level checkpointing time by 20x. Adapter-based coprocessor solutions are expected to remain a mainstay even with the next-generation MIC architecture, Knights Landing. The solutions discussed in this dissertation will be applicable to these emerging coprocessor solutions as well.

The dissertation also tackles the challenge of energy-efficient checkpointing for HPC applications. It proposes and describes Power-Check, a novel and generic power-aware checkpointing framework which supports both system-level and application-aware checkpointing. Power-Check uses intelligent data-funneling mechanisms and selective power-capping to reduce the CPU utilization during the I/O phases of CR, thus increasing the energy savings without negatively affecting performance. Furthermore, this work extends two widely-used checkpointing libraries, BLCR and DMTCP, to support monitoring and actuation of the energy consumed. The evaluation with three different application kernels of the Mantevo miniapps suite (MiniFE, MiniMD and CloverLeaf) shows that Power-Check can achieve as much as 48% energy savings during a checkpoint, while improving checkpointing time by 14%. In contrast, a naive power-capping scheme achieves just a 25% reduction in energy usage while increasing the checkpointing time by 9%.

We proposed a hardware fault-monitoring and fault-information dissemination service, FTB-IPMI. In addition to this service, we have also proposed a portable rule-based fault-prediction engine that can be adopted by any FTB-enabled system software to assist preventive fault-tolerance protocols. Experimental results clearly show that the service is scalable and uses minimal system resources during its operation. A single iteration of FTB-IPMI to read all the sensor readings from 128 nodes takes just 0.75 seconds with 128 threads, and only a minimal 3.7% of the available CPU resources on the node hosting FTB-IPMI are used during a single iteration.


9.1 Impact on the HPC Community

The techniques proposed as part of this work have significantly influenced related research efforts in the HPC community. The data-staging technique proposed in this framework was the first such solution to leverage the RDMA capability of the InfiniBand interconnect to alleviate the checkpointing overheads on applications. Based on this study, similar research efforts [117] have followed suit and studied the benefits of InfiniBand RDMA-based checkpoint data-staging. The CRUISE file system has been evaluated at a scale of 3 million MPI processes on the Sequoia system at Lawrence Livermore National Laboratory, which at the time of evaluation was the world's fastest supercomputing cluster. Work is currently in progress to integrate CRUISE with the Scalable Checkpoint/Restart (SCR) software package, which is used in production on the Sequoia system, in addition to several other Top500 systems, to reduce the checkpointing overhead on HPC applications running on these respective systems. Likewise, the CIFTS Fault Tolerance Backplane [1] described in this thesis provides a common infrastructure for operating systems, system middleware, libraries, and applications to exchange information related to hardware and software failures in real time. Several widely-used HPC software components, including the Berkeley Lab Checkpoint/Restart library for Linux, the FT-LA dense linear algebra library that features algorithm-based fault-tolerant routines, and the SLURM open-source resource manager and job scheduler, have all been enhanced to exchange failure information using FTB. Consequently, all of these components are directly benefiting from the low-overhead failure-prediction system developed as a part of this thesis.


The Stampede supercomputing system at the Texas Advanced Computing Center is currently the 7th fastest supercomputer in the world. It is a heterogeneous system provisioned with an Intel Xeon Phi coprocessor on each compute node. The cluster is used to run a wide variety of scientific applications spanning the medicine, biology, energy, chemistry and geosciences domains. The benefits of the MIC-Check framework developed as part of this thesis have been demonstrated with real-world applications running on this system.

9.2 Open-source contributions to the community

MVAPICH2 [17] is an open-source implementation of the MPI-3.0 specification over modern high-speed networks such as InfiniBand, 10GigE/iWARP and RDMA over Converged Ethernet (RoCE). This software is being used by more than 2,250 organizations world-wide in 74 countries and is powering some of the top supercomputing centers in the world. As of November 2014, more than 227,000 downloads have taken place from this project's site. This software is also being distributed by many InfiniBand, 10GigE/iWARP and RoCE vendors in their software distributions. Several of the solutions proposed in this thesis have already been, or are in the process of being, incorporated into the MVAPICH2 package. The CRUISE file system sources have also been made available for public use under the BSD License, and can be obtained from github.com/hpc/cruise. Information and publications pertaining to this project are disseminated to other researchers through a web page - computation-rnd.llnl.gov/scr/file-system.php.


Likewise, the FTB-IPMI tool has been released for public use, and is available at the project page - nowlab.cse.ohio-state.edu/projects/ftb-ib/. Information and publications pertaining to this tool have been made available through the project webpage as well. All of these solutions are being adopted by several researchers around the world, to enhance the fault-tolerance capabilities of their supercomputing systems and applications. Support for these software packages is also provided using the public mailing lists: [email protected] and [email protected].


Bibliography

[1] CIFTS Initiative. www.mcs.anl.gov/research/cifts.
[2] Cluster Systems Management. http://www-03.ibm.com/systems/software/csm/.
[3] CP2K Molecular Dynamics. http://cp2k.berlios.de/.
[4] CPMD Ab-Initio Molecular Dynamics. http://cpmd.org.
[5] CPU frequency and voltage scaling code in the Linux kernel. https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt.
[6] fakechroot. https://github.com/fakechroot/fakechroot/wiki.
[7] Filesystem in Userspace. http://fuse.sourceforge.net.
[8] FTB-IB: InfiniBand Monitoring Software. www.mcs.anl.gov/research/cifts/docs/files/ftb_api_05_specification.pdf.
[9] GNU FreeIPMI. www.gnu.org/software/freeipmi.
[10] Intel 64 and IA-32 Architectures Software Developers Manual. http://www.intel.com.
[11] Intelligent Platform Management Interface Specification v2.0.

[12] IOR Benchmark Suite. http://sourceforge.net/projects/ior-sio/.
[13] IOzone Filesystem Benchmark. http://www.iozone.org.
[14] LS-DYNA Finite Element Software. www.lstc.com/lsdyna.htm.
[15] Lustre Parallel Filesystem. http://www.lustre.org.
[16] Message Passing Interface Forum. http://www.mpi-forum.org.
[17] MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. http://mvapich.cse.ohio-state.edu.
[18] Nagios Infrastructure Monitoring. www.nagios.org.
[19] OpenIPMI. http://openipmi.sourceforge.net/.
[20] OSU Micro-Benchmark Suite. http://mvapich.cse.ohio-state.edu/benchmarks.
[21] PVFS2 Parallel Filesystem. http://www.pvfs.org.
[22] The InfiniBand Architecture. www.infinibandta.org.
[23] Top 500 Supercomputers. http://www.top500.org.
[24] Xeon Phi Software Developer's Guide. http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/xeon-phi-software-developers-guide.pdf.
[25] The ASC Sequoia Draft Statement of Work. https://asc.llnl.gov/sequoia/rfp/02_SequoiaSOW_V06.doc, 2008.

[26] Improving File IO performance on Intel Xeon Phi Coprocessors. http://software.intel.com/en-us/blogs/2014/01/07/improving-file-io-performance-on-intel-xeon-phi, January 2014.
[27] Hasan Abbasi, Matthew Wolf, Greg Eisenhauer, Scott Klasky, Karsten Schwan, and Fang Zheng. DataStager: Scalable Data Staging Services for Petascale Applications. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, HPDC '09, pages 39–48, New York, NY, USA, 2009. ACM.
[28] Rishi Agarwal, Pranav Garg, and Josep Torrellas. Rebound: Scalable Checkpointing for Coherent Shared Memory. SIGARCH Comput. Archit. News, 2011.
[29] Adnan Agbaria and Roy Friedman. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations. Cluster Computing, 6(3):227–236, July 2003.
[30] F. Alfaro, J. Sanchez, M. Menduina, and J. Duato. A Formal Model to Manage the InfiniBand Arbitration Tables Providing QoS. IEEE Trans. Comput., 2007.
[31] F. J. Alfaro. A Strategy to Compute the InfiniBand Arbitration Tables. In Proceedings of the 16th International Symposium on Parallel and Distributed Processing, IPDPS '02, 2002.
[32] Lorenzo Alvisi and Keith Marzullo. Message Logging: Pessimistic, Optimistic, Causal, and Optimal. IEEE Transactions on Software Engineering, 24(2):149–159, 1998.

[33] J. Ansel, K. Arya, and G. Cooperman. DMTCP: Transparent checkpointing for cluster computations and the desktop. In IEEE International Symposium on Parallel Distributed Processing, 2009.
[34] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. In International Parallel and Distributed Processing Symposium, Rome, Italy, May 2009.
[35] Walid G. Aref, Khaled El-Bassyouni, Ibrahim Kamel, and Mohamed F. Mokbel. Scalable QoS-Aware Disk-Scheduling. In Proceedings of the 2002 International Symposium on Database Engineering and Applications, IDEAS '02, 2002.
[36] A. Arredondo, P. Roy, and E. Wofford. Implementing PWM Fan Speed Control Within a Computer Chassis Power Supply. In Applied Power Electronics Conference and Exposition, 2005. APEC 2005. Twentieth Annual IEEE, 2005.
[37] Kapil Arya, Gene Cooperman, Andrea Dotti, and Peter Elmer. Use of Checkpoint-Restart for Complex HEP Software on Traditional Architectures and Intel MIC. arXiv preprint arXiv:1311.0272, 2013.
[38] Leonardo Bautista-Gomez, Dimitri Komatitsch, Naoya Maruyama, Seiji Tsuboi, Franck Cappello, and Satoshi Matsuoka. FTI: High Performance Fault Tolerance Interface for Hybrid Systems. In SC, 2011.
[39] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate. PLFS: A Checkpoint Filesystem for Parallel Applications. In SC, 2009.
[40] George Bosilca, Aurelien Bouteiller, Thomas Hérault, Pierre Lemarinier, and Jack J. Dongarra. Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols. In EuroMPI, pages 189–197, 2010.
[41] Aurelien Bouteiller, George Bosilca, and Jack Dongarra. Redesigning the Message Logging Model for High Performance. Concurr. Comput.: Pract. Exper., 22:2196–2211, November 2010.
[42] Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, and Frédéric Magniette. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging. Supercomputing Conference, 0:25, 2003.
[43] Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, and Franck Cappello. Coordinated Checkpoint versus Message Log for Fault Tolerant MPI. IEEE International Conference on Cluster Computing, 0:242, 2003.
[44] J.M. Brandt, B.J. Debusschere, A.C. Gentile, J.R. Mayo, P.P. Pebay, D. Thompson, and M.H. Wong. Ovis-2: A Robust Distributed Architecture for Scalable RAS. In IEEE International Symposium on Parallel and Distributed Processing, 2008.
[45] J. Bruno, J. Brustoloni, E. Gabber, B. Ozden, and A. Silberschatz. Disk Scheduling with Quality of Service Guarantees. In Multimedia Computing and Systems, 1999. IEEE International Conference on, 1999.
[46] D. Buntinas, C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello. Blocking vs. Non-blocking Coordinated Checkpointing for Large-scale Fault Tolerant MPI Protocols. Future Generation Computer Systems, 2008.
[47] Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, and Robert Ross. Understanding and Improving Computational Science Storage Access through Continuous Characterization. 2011.
[48] K. Mani Chandy and Leslie Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63–75, 1985.
[49] Camille Coti, Thomas Hérault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI. Supercomputing Conference, 0:18, 2006.
[50] Dague and Sean. OpenHPI: An Open Source Reference Implementation of the SA Forum Hardware Platform Interface. In Service Availability, 2005.
[51] Howard David, Eugene Gorbatov, Ulf R Hanebutte, Rahul Khanna, and Christian Le. RAPL: memory power estimation and capping. In Low-Power Electronics and Design (ISLPED), 2010 ACM/IEEE International Symposium on, pages 189–194. IEEE, 2010.
[52] Ben Eckart, Xubin He, Chentao Wu, Ferrol Aderholdt, Fang Han, and Stephen Scott. Distributed Virtual Diskless Checkpointing: A Highly Fault Tolerant Scheme for Virtualized Clusters. IEEE International Parallel and Distributed Processing Symposium Workshops, 2012.
[53] M. El Mehdi Diouri, O. Gluck, L. Lefevre, and F. Cappello. Energy Considerations in Checkpointing and Fault Tolerance Protocols. In IEEE/IFIP Int'l Conference on Dependable Systems and Networks Workshops, 2012.
[54] M. El Mehdi Diouri, O. Gluck, L. Lefevre, and F. Cappello. ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, 2013.
[55] E. N. Elnozahy and J. S. Plank. Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery. IEEE Transactions on Dependable and Secure Computing, 2004.
[56] Elmootazbellah N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, 34(3):375–408, 2002.
[57] C. Engelmann, G.R. Vallee, T. Naughton, and S.L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Parallel, Distributed and Network-based Processing, 2009.
[58] Enzo Collaboration. Enzo: An Adaptive Mesh Refinement Code for Astrophysics.
[59] Esteban Meneses, Celso L. Mendes, and Laxmikant V. Kale. Team-based Message Logging: Preliminary Results. In 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (CCGRID 2010), May 2010.
[60] Michael A. Heroux et al. Improving Performance via Mini-applications. Technical report, Sandia National Laboratories, 2009.

[61] K. Ferreira, R. Riesen, R. Oldfield, J. Stearley, J. Laros, K. Pedretti, T. Kordenbrock, and R. Brightwell. Increasing Fault Resiliency in a Message-Passing Environment. Sandia National Laboratories, Tech. Rep. SAND2009-6753, 2009. [62] Vincent W Freeh, David K Lowenthal, Feng Pan, Nandini Kappiah, Robert Springer, Barry L Rountree, and Mark E Femal. Analyzing the energy-time trade-off in highperformance computing applications. Parallel and Distributed Systems, IEEE Transactions on, 2007. [63] Errin W. Fulp, Glenn A. Fink, and Jereme N. Haack. Predicting Computer System Failures Using Support Vector Machines. In Proceedings of the First USENIX conference on Analysis of system logs, 2008. [64] Q. Gao, W. Yu, W. Huang, and D. K. Panda. Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand. In ICPP, 2006. [65] Qi Gao, Weikuan Yu, Wei Huang, and Dhabaleswar K Panda.

Application-

transparent Checkpoint/restart for MPI Programs Over InfiniBand. In Parallel Processing, 2006. ICPP 2006. International Conference on, pages 471–478. IEEE, 2006. [66] R. Gupta, P. Beckman, B.-H. Park, E. Lusk, P. Hargrove, A. Geist, D. K. Panda, A. Lumsdaine, and J. Dongarra. Supermon: High-performance monitoring for Linux clusters. Proceedings of the 5th Annual Linux Showcase and Conference, 2001. [67] R. Gupta, P. Beckman, B.-H. Park, E. Lusk, P. Hargrove, A. Geist, D. K. Panda, A. Lumsdaine, and J. Dongarra. CIFTS: A Coordinated Infrastructure for FaultTolerant Systems. ICPP, 2009. 168

[68] P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters. In SciDAC, 2006. [69] Wei Huang, Jiuxing Liu, Matthew Koop, Bulent Abali, and Dhabaleswar Panda. Nomad: migrating OS-bypass networks in virtual machines. In Proceedings of the 3rd international conference on Virtual execution environments, 2007. [70] J. Hursey and A. Lumsdaine. A Composable Runtime Recovery Policy Framework Supporting Resilient HPC Applications. Technical report, University of Tennessee, 2010. [71] J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI. In IPDPS, 2007. [72] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, 03 2007. [73] Dewan Ibtesham, David DeBonis, Dorian Arnold, and Kurt B. Ferreira. CoarseGrained Energy Modeling of Rollback/Recovery Mechanisms. In Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on, June 2014. [74] James H. Laros III, Phil Pokorny, and David DeBonis. Powerinsight - A Commodity Power Measurement Capability. The Third International Workshop on Power Measurement and Profiling in conjunction with IEEE IGCC, 2013. 169

[75] F. Isaila, J. Garcia Blas, J. Carretero, R. Latham, and R. Ross. Design and Evaluation of Multiple-Level Data Staging for Blue Gene Systems. TPDS, 2011. [76] Kamil Iskra, John W. Romein, Kazutomo Yoshii, and Pete Beckman. ZOID: I/OForwarding Infrastructure for Petascale Architectures. In PPoPP, 2008. [77] Sitaram Iyer and Peter Druschel. Anticipatory Scheduling: a Disk Scheduling Framework to Overcome Deceptive Idleness in Synchronous I/O. SIGOPS Oper. Syst. Rev., 2001. [78] David B. Johnson and Willy Zwaenepoel. Sender-Based Message Logging. In In Digest of Papers: 17 Annual International Symposium on Fault-Tolerant Computing, pages 14–19. IEEE Computer Society, 1987. [79] Nandini Kappiah, Vincent W Freeh, and David K Lowenthal. Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in mpi programs. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005. [80] Dries Kimpe, Kathryn Mohror, Adam Moody, Brian Van Essen, Maya Gokhale, Kamil Iskra, Rob Ross, and Bronis R. de Supinski. Integrated In-System Storage Architecture for High Performance Computing. In Workshop on Runtime and Operating Systems for Supercomputers, 2012. [81] Richard Koo and Sam Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, 13(1):23–31, Jan. 1987. [82] Chokchai Leangsuksun, Tong Liu1, Tirumala Rao, Stephen L. Scott, and Richard Libby. A Failure Predictive and Policy-Based High Availability Strategy for Linux

170

High Performance Computing Cluster. In 5th Linux Cluster Institute Conference, 2004. [83] Pierre Lemarinier, Aur´elien Bouteiller, Thomas H´erault, G´eraud Krawezik, and Franck Cappello. Improved Message Logging versus Improved Coordinated Checkpointing for Fault Tolerant MPI. In CLUSTER ’04: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 115–124, Washington, DC, USA, 2004. IEEE Computer Society. [84] R. Libby. Effective HPC Hardware Management and Failure Prediction Strategy Using IPMI. Proceedings of the Linux Symposium, 2003. [85] Min Yeol Lim, Vincent W Freeh, and David K Lowenthal. Adaptive, transparent frequency and voltage scaling of communication phases in mpi programs. In SC 2006 Conference, Proceedings of the ACM/IEEE, pages 14–14. IEEE, 2006. [86] Soulla Louca, Neophytos Neophytou, Adrianos Lachanas, and Paraskevas Evripidou. MPI-FT: Portable Fault Tolerance Scheme for MPI. Parallel Processing Letters, 10(4):371–382, December 2000. [87] Ra´ul Mart´ınez, Francisco J. Alfaro, and Jos´e L. S´anchez. A Framework to Provide Quality of Service over Advanced Switching. IEEE Trans. Parallel Distrib. Syst., 2008. [88] Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia Distributed Monitoring System: Design, Implementation And Experience. Parallel Computing, 2003.


[89] M. McKusick, M. Karels, and K. Bostic. A Pageable Memory-Based Filesystem. In Proceedings of the United Kingdom UNIX Users Group Meeting, 1990.
[90] Esteban Meneses, Osman Sarood, and Laxmikant V. Kale. Energy Profile of Rollback-Recovery Strategies in High Performance Computing. ParCo, 2014.
[91] Sarah E. Michalak, Kevin W. Harris, Nicolas W. Hengartner, Bruce E. Takala, and Stephen A. Wender. Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer. IEEE Transactions on Device and Materials Reliability, 2005.
[92] Bryan Mills, Ryan E. Grant, Kurt B. Ferreira, and Rolf Riesen. Evaluating Energy Savings for Checkpoint/Restart. In Proceedings of the 1st International Workshop on Energy Efficient Supercomputing, E2SC '13, 2013.
[93] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. In SC, 2010.
[94] NUDT. Tianhe-2 Supercomputer. http://www.top500.org/system/177999.
[95] Akira Nukada, Hiroyuki Takizawa, and Satoshi Matsuoka. NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, pages 104–113. IEEE, 2011.
[96] Brian W. O'Shea, Greg Bryan, James Bordner, Michael L. Norman, Tom Abel, Robert Harkness, and Alexei Kritsuk. Introducing Enzo, an AMR Cosmology Application. In Adaptive Mesh Refinement - Theory and Applications, 2005.

[97] X. Ouyang, R. Rajachandrasekar, X. Besseron, H. Wang, J. Huang, and D. K. Panda. CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart. In ICPP, 2011.
[98] Xiangyong Ouyang, Sonya Marcarelli, Raghunath Rajachandrasekar, and Dhabaleswar K. Panda. RDMA-based Job Migration Framework for MPI over InfiniBand. In Cluster Computing (CLUSTER), 2010 IEEE International Conference on, pages 116–125. IEEE, 2010.
[99] Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, and Dhabaleswar K. Panda. CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart. In Parallel Processing (ICPP), 2011 International Conference on, pages 375–384. IEEE, 2011.
[100] Hewlett Packard. MemFSv2 - A Memory-based File System on HP-UX 11i v2. Technical Whitepaper, 1990.
[101] Jun Peng, Jinchi Lu, Kincho H. Law, and Ahmed Elgamal. ParCYCLIC: Finite Element Modeling of Earthquake Liquefaction Response on Parallel Computers. International Journal for Numerical and Analytical Methods in Geomechanics, 2004.
[102] Fabrizio Petrini. Scaling to Thousands of Processors with Buffer Coscheduling. In Scaling to New Heights Workshop, Pittsburgh, PA, 2002.
[103] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kalé, and Klaus Schulten. Scalable Molecular Dynamics with NAMD. Journal of Computational Chemistry, 2005.

[104] Ian R. Philp. Software Failures and the Road to a Petaflop Machine. In Workshop on High Performance Computing Reliability Issues (HPCRI), 2005.
[105] J. S. Plank, Y. Chen, K. Li, M. Beck, and G. Kingsley. Memory Exclusion: Optimizing the Performance of Checkpointing Systems. Software: Practice and Experience, 1999.
[106] James S. Plank. Efficient Checkpointing on MIMD Architectures. PhD thesis, Princeton University, Princeton, NJ, USA, 1993.
[107] Sreeram Potluri, Devendar Bureddy, Khaled Hamidouche, Akshay Venkatesh, Krishna Kandalla, Hari Subramoni, and Dhabaleswar K. (DK) Panda. MVAPICH-PRISM: A Proxy-based Communication Framework Using InfiniBand and SCIF for Intel MIC Clusters. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, page 54. ACM, 2013.
[108] Sreeram Potluri, Akshay Venkatesh, Devendar Bureddy, Krishna Kandalla, and Dhabaleswar K. Panda. Efficient Intra-node Communication on Intel-MIC Clusters. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pages 128–135. IEEE, 2013.
[109] Raghunath Rajachandrasekar, Adam Moody, Kathryn Mohror, and Dhabaleswar K. Panda. A 1 PB/s File System to Checkpoint Three Million MPI Tasks. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013.
[110] Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram, and Dhabaleswar K. Panda. Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? In Euro-Par 2011: Parallel Processing Workshops, pages 312–321. Springer, 2012.
[111] Thomas Ropars and Christine Morin. Improving Message Logging Protocols Scalability through Distributed Event Logging. In Pasqua D'Ambra, Mario Guarracino, and Domenico Talia, editors, Euro-Par 2010 - Parallel Processing, volume 6271 of Lecture Notes in Computer Science, pages 511–522. Springer Berlin / Heidelberg, 2010.
[112] Rob Ross, Jose Moreira, Kim Cupps, and Wayne Pfeiffer. Parallel I/O on the IBM Blue Gene/L System. Technical report, Blue Gene/L Consortium Quarterly Newsletter.
[113] Barry Rountree, David K. Lowenthal, Bronis R. de Supinski, Martin Schulz, Vincent W. Freeh, and Tyler Bletsch. Adagio: Making DVS Practical for Complex HPC Applications. In Proceedings of the 23rd International Conference on Supercomputing, 2009.
[114] Takafumi Saito, Kento Sato, Hitoshi Sato, and Satoshi Matsuoka. Energy-aware I/O Optimization for Checkpoint and Restart on a NAND Flash Memory System. In Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '13, 2013.
[115] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine, Jason Duell, Paul Hargrove, and Eric Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. International Journal of High Performance Computing Applications, 19(4):479–493, Winter 2005.


[116] Vivek Sarkar. ExaScale Software Study: Software Challenges in Exascale Systems. 2009.
[117] Kento Sato, Naoya Maruyama, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R. de Supinski, and Satoshi Matsuoka. Design and Modeling of a Non-blocking Checkpointing System. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 19:1–19:10, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.
[118] Kento Sato, Naoya Maruyama, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R. de Supinski, and Satoshi Matsuoka. Design and Modeling of a Non-blocking Checkpointing System. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 19. IEEE Computer Society Press, 2012.
[119] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, and Satoshi Matsuoka. Design and Modeling of a Non-blocking Checkpointing System. In SC, 2012.
[120] Bianca Schroeder and Garth Gibson. Understanding Failure in Petascale Computers. Journal of Physics Conference Series: SciDAC, June 2007.
[121] Bianca Schroeder and Garth A. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. In DSN, June 2006.
[122] Karl W. Schulz, Rhys Ulerich, Nicholas Malaya, Paul T. Bauman, Roy Stogner, and Chris Simmons. Early Experiences Porting Scientific Applications to the Many Integrated Core (MIC) Platform. In TACC-Intel Highly Parallel Computing Symposium, Tech. Rep., 2012.
[123] Jan Seidel, Rudolf Berrendorf, Marcel Birkner, and Marc-Andre Hermanns. High-Bandwidth Remote Parallel I/O with the Distributed Memory Filesystem MEMFS. In EuroPVM/MPI, 2006.
[124] M. Shatnawi and M. Ripeanu. Failure Avoidance through Fault Prediction Based on Synthetic Transactions. In Cluster, Cloud and Grid Computing (CCGrid), 2011.
[125] Mukesh Singhal and Niranjan G. Shivaratri. Advanced Concepts in Operating Systems. McGraw-Hill, Inc., 1994.
[126] Jaidev K. Sridhar, Matthew J. Koop, Jonathan L. Perkins, and Dhabaleswar K. Panda. ScELA: Scalable and Extensible Launching Architecture for Clusters. In High Performance Computing - HiPC 2008, pages 323–335. Springer, 2008.
[127] J. Steele. ACPI Thermal Sensing and Control in the PC. In Wescon/98, 1998.
[128] Georg Stellner. CoCheck: Checkpointing and Process Migration for MPI. In International Parallel Processing Symposium, page 526, 1996.
[129] John Stone and Mark Underwood. Rendering of Numerical Flow Simulations Using MPI. In Proceedings of the Second MPI Developers Conference, 1996.
[130] Robert E. Strom and Shaula Yemini. Optimistic Recovery in Distributed Systems. ACM Transactions on Computer Systems, 3(3):204–226, 1985.
[131] Hari Subramoni, Ping Lai, Sayantan Sur, and Dhabaleswar K. (DK) Panda. Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-core InfiniBand Clusters. In Proceedings of the 2010 39th International Conference on Parallel Processing, ICPP '10, 2010.
[132] Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications. In Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference on, pages 408–413. IEEE, 2009.
[133] Yuval Tamir and Carlo H. Séquin. Error Recovery in Multicomputers Using Global Checkpoints. In Proceedings of the 1984 International Conference on Parallel Processing, pages 32–41, August 1984.
[134] J. Thompson, D. W. Dreisigmeyer, T. Jones, M. Kirby, and J. Ladd. Accurate Fault Prediction of BlueGene/P RAS Logs via Geometric Reduction. In Dependable Systems and Networks Workshops (DSN-W), 2010.
[135] Gang Wang, Xiaoguang Liu, Ang Li, and Fan Zhang. In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes. In EuroPVM/MPI, 2009.
[136] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI '06, 2006.
[137] Joel C. Wu and Scott A. Brandt. Providing Quality of Service Support in Object-Based File System. In Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, MSST '07, 2007.


[138] Yiqi Xu, Lixi Wang, D. Arteaga, Ming Zhao, Yonggang Liu, and R. Figueiredo. Virtualization-based Bandwidth Management for Parallel Storage Systems. In 5th Petascale Data Storage Workshop (PDSW), 2010.
[139] Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux Utility for Resource Management. In Job Scheduling Strategies for Parallel Processing, pages 44–60. Springer, 2003.
[140] Xuechen Zhang, Kei Davis, and Song Jiang. QoS Support for End Users of I/O-Intensive Applications Using Shared Storage Systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
[141] Gengbin Zheng, Lixia Shi, and Laxmikant V. Kalé. FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI. In IEEE Cluster, 2004.
[142] Ziming Zheng, Zhiling Lan, B. H. Park, and A. Geist. System Log Pre-Processing to Improve Failure Prediction. In Dependable Systems and Networks (DSN '09), 2009.

