PermaNT: Persistent Shared Memory for Windows NT/2000 Clusters Evan Speight School of Electrical and Computer Engineering Computer Systems Lab, Cornell University, Ithaca, NY 14853 [email protected]

Abstract This paper examines a new architecture for cluster-based shared memory parallel computing on networks of industry-standard workstations. Traditionally, such software distributed shared memory systems have been implemented as a user-level library that is linked statically or dynamically with a shared memory application. The library provides the abstraction of shared memory, relieving the application programmer of the burden of explicitly moving data between nodes in the system. The PermaNT system departs from these systems in that a single instance of the shared memory runtime system hosts multiple applications simultaneously, resulting in lower application initialization overhead, more balanced resource usage, reduced system management, and improved support for fault tolerance. This paper describes the proposed architecture of the PermaNT system.

Keywords: binary redirection, clusters, software distributed shared memory, multiprogramming

1. Introduction In the past decade, the conventional wisdom concerning the appropriate platform for parallel computing has shifted from custom hardware to commodity-based solutions consisting of clusters of high-performance servers or workstations. Because these clusters are composed of separate machines, each with its own operating system, memory hierarchy, network access, address space, etc., providing a cohesive parallel computing environment remains a challenge. Issues such as fault tolerance, load balancing, resource utilization, and prioritized usage are made difficult because each parallel application run on the cluster exists within its own series of address spaces, one per machine, which precludes effective centralized control of the entire cluster.

We propose a new approach to providing software runtime support for distributed shared memory systems. Our approach allows a single instance of the runtime system on each node to support multiple applications concurrently within the same address space. Such a design addresses several issues:
• High initialization overhead - Cluster-based parallel jobs require separate processes to be started on each machine, accompanied by runtime system initialization in addition to any necessary application initialization. The setup time associated with this initialization can be substantial, especially for small-to-medium size parallel jobs.
• Runtime system overhead - Separate parallel jobs running concurrently on a cluster must duplicate the entire runtime system environment, even if an application does not make use of all of the resources allocated by the runtime system. This situation can waste valuable cluster resources that would be better utilized by concurrent applications.
• Reduced system management capabilities - Parallel applications running on a cluster of machines are each controlled by individual processes, and no mechanism exists for centralized control of cluster resources, multiprogramming support, or fault tolerance.
• Uneven resource usage - When multiple applications execute on the nodes in a cluster, inter-application communication does not normally take place. This, coupled with the lack of a centralized management entity, can deprive some applications of cluster resources: an application that is severely network-constrained may leave processors idle that could otherwise perform useful computation for other applications.

In this paper we describe the design and implementation of the PermaNT Parallel Environment, a system that allows multiple applications to be executed within the confines of a single instance of the runtime environment on clusters composed of Windows NT and Windows 2000 SMP servers. The PermaNT system is a persistent layer that is always resident on a cluster, providing shared memory support for parallel applications. PermaNT is implemented through the close cooperation of one Windows NT service per node in the cluster. Each PermaNT service is normally started at boot time. Instead of being executables linked to a runtime system library, applications are written as dynamic link libraries (DLLs) that are injected into the runtime process that always exists on the cluster. This provides several benefits, including reduced application footprints, reduced initialization overheads, built-in support for multiprogramming, and a mechanism for centralized control of currently executing parallel applications. Additionally, the changes required to port a program written for a traditional DSM system to PermaNT are minimal. Security between threads in different applications is maintained by providing multiple “views” of the shared address space, one per application. The rest of the paper is organized as follows.
Section 2 describes the architecture of PermaNT and how it supports multiple user applications within a single address space. Section 3 discusses user-application issues arising from the design of PermaNT and how these issues are addressed in our design. Section 4 details the benefits of the PermaNT system over existing solutions, and Section 5 concludes the paper and outlines avenues for future work.

2. Architecture of PermaNT The architecture of the PermaNT runtime system differs substantially from that of other software shared memory environments [3, 6, 8, 9, 11]. Traditional software DSM systems consist of shared memory application code linked with a runtime library that provides a distributed shared memory abstraction, allowing the application to be executed on a distributed memory machine such as a cluster of workstations. The runtime system is therefore an integral part of each shared memory application, as the parallel runtime library is either statically linked with the application at build time or provided as a dynamic link library at runtime. Either way, the functionality contained within the runtime system is provided to a single application at a time. In PermaNT, the roles are reversed. Instead of the runtime library providing a set of functions used to implement shared memory on a cluster, the PermaNT system is a standalone application that permanently resides on the cluster, even when no shared memory application is present. Applications are dynamic link libraries (DLLs) that are injected into the PermaNT process address space, and threads within the PermaNT address space execute the requested application code. We make use of the Detours package [5] to inject the application DLL into the PermaNT address space. Injection is complicated by the fact that DLLs are typically self-contained pieces of code: because DLLs are usually used to provide a set of functionality to other applications, all of their external references must be resolved at link time. Functionality provided by PermaNT that the application DLL must access once it has been injected into the PermaNT runtime layer, such as shared memory management and synchronization routines, must therefore resolve to valid functions before the DLL can be built successfully.

Figure 1. Binary Redirection in the PermaNT System

2.1. Redirection via Binary Rewriting

We link application DLLs with a small library (stublib.dll in Figure 1) of functions with null bodies that mirror the functions provided by the PermaNT system (such as calls for barriers, lock acquire and release, etc.). Once the DLL has been injected into the PermaNT process address space, Detours redirects calls made by the application DLL to routines in stublib.dll to the proper function in the PermaNT process. In Figure 1, the calls made from sor.dll and lu.dll to the barrier() function in stublib.dll are redirected by Detours to the correct function (Pbarrier()) in PermaNT. The code in stublib.dll is never executed; it serves only as a placeholder that allows sor.dll and lu.dll to be built successfully. The PermaNT process is responsible for spawning the threads that execute code contained in the application DLL. Each new thread calls the predetermined function UserMain(), which is also contained in stublib.dll. Again, because the actual code for the application-specific UserMain() resides in the application DLL and is not available when PermaNT is compiled, stublib.dll also contains a placeholder for UserMain(). PermaNT uses Detours to redirect calls made to UserMain() to the application-specific entry point, allowing calls to UserMain() by PermaNT threads to be directed to different application-provided code. As shown in Figure 1, the application LU can cause PermaNT threads calling UserMain() to actually execute AppLuMain, and the application SOR can cause PermaNT threads calling UserMain() to execute AppSorMain, which allows multiple applications to be started within the same PermaNT process address space. Synchronization ensures that only a single application redirects calls to UserMain() at a time.

2.2. PermaNT Initialization

The PermaNT initialization procedure is similar to that of a standard software DSM system. The PermaNT service is initially brought up on a single node in the cluster, designated as the root node. The root process contacts a remote execution service running on every other node in the cluster that is designated as a participant in the shared memory layer, and that service starts a PermaNT service on its node. Upon successful initiation of a PermaNT service on each machine, network connections are made between each pair of nodes in the cluster. These connections serve as the messaging conduits for runtime coherence and synchronization messages between user applications. A separate network connection is also created to allow the PermaNT services to exchange out-of-band data for control purposes. After the connections between PermaNT processes have been established, each process initializes a region of virtual memory to be used as the shared memory region accessible by applications. Runtime data structures necessary to keep track of shared pages and synchronization objects are also initialized. Because PermaNT exists on each node of the cluster even when no shared memory applications are resident, the region of shared memory is deallocated after the system has been idle for a specified period of time, thereby releasing the memory consumed by the PermaNT process. Finally, the PermaNT system creates the threads that will eventually be used to execute application code once user applications have been injected into the PermaNT layer.
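The redirection scheme of Section 2.1 can be illustrated with a short Python sketch. It is an analogy only: Detours rewrites binary code inside a Win32 process, whereas the sketch simply rebinds attributes on a stub object. All names used here (StubLib, Runtime, p_barrier, launch, make_app) are hypothetical and are not part of PermaNT or Detours.

```python
import threading


class StubLib:
    """Stand-in for stublib.dll: placeholders with null bodies."""

    def barrier(self):
        pass  # placeholder only; replaced before any application runs

    def user_main(self, thread_id):
        pass  # placeholder entry point; replaced per application


class Runtime:
    """Stand-in for the resident runtime process (hypothetical)."""

    def __init__(self):
        self.stubs = StubLib()
        self.barrier_calls = 0
        self._count_lock = threading.Lock()
        # Only one application may redirect user_main at a time.
        self._redirect_lock = threading.Lock()
        # Redirect the stub barrier to the runtime's own implementation.
        self.stubs.barrier = self.p_barrier

    def p_barrier(self):
        # Only counts calls; a real barrier would block until all arrive.
        with self._count_lock:
            self.barrier_calls += 1

    def launch(self, app_main, num_threads):
        with self._redirect_lock:
            # Redirect the placeholder entry point to this application.
            self.stubs.user_main = app_main
            # The target is resolved here, while the redirect is in place.
            threads = [
                threading.Thread(target=self.stubs.user_main, args=(i,))
                for i in range(num_threads)
            ]
        for t in threads:
            t.start()
        return threads


def make_app(name, runtime, log):
    def app_main(thread_id):
        log.append((name, thread_id))
        runtime.stubs.barrier()  # call the stub; lands in p_barrier

    return app_main


def demo():
    rt = Runtime()
    log = []
    threads = rt.launch(make_app("lu", rt, log), 2)
    threads += rt.launch(make_app("sor", rt, log), 2)
    for t in threads:
        t.join()
    return sorted(log), rt.barrier_calls


if __name__ == "__main__":
    print(demo())
```

In this sketch each application supplies its own entry point, and the runtime threads reach it only through the redirected placeholder, which mirrors the roles of UserMain(), AppLuMain, and AppSorMain in Figure 1.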

3. Application Issues This section details several aspects of executing shared memory applications within the confines of the PermaNT runtime environment.

3.1. Application Initiation After a shared memory application DLL has been injected into the PermaNT process address space via the PermaNTInject program, the only required initialization is that necessitated by the application itself. In contrast to other DSM systems, all initialization of DSM runtime buffers, shared page structures, and synchronization variables has already been carried out when the PermaNT runtime layer is installed on the cluster. One potential problem arises from the possible lack of addressable space when many application DLLs are being hosted simultaneously by the PermaNT system. However, as processors move toward true 64-bit address spaces, the amount of addressable memory will be sufficient for as many simultaneous shared memory applications as can reasonably be expected to use a cluster's resources without thrashing and severely hindering performance.

Figure 2. Multiple Views of Shared Memory

3.2. Application Security Because all applications running concurrently in the PermaNT layer execute within the confines of the same process (the PermaNT service), a thread with an errant pointer can modify data that does not belong to its application. Such an access may be unintentional or malicious. In order to protect one application's data from another application's threads, we give each application its own full view of the shared memory region by mapping multiple views of the region. In Windows NT, this is easily accomplished via calls to the Win32 API function MapViewOfFile(). Each view can have its own set of virtual page protection attributes that are maintained separately. As shown in Figure 2, regions of the shared space are assigned to each application, and pages residing outside of an application's shared memory region are protected so that any access to them raises an access violation (indicated by the shaded region in Figure 2). These access violations are trapped by the PermaNT runtime system. Once inside the access violation handler, the runtime system can determine whether the fault resulted from the mechanism used to maintain coherence (i.e., the access is a “valid access” to the thread's own application region), or whether the fault occurred because one application's thread accessed a region of shared memory outside its application space. This provides protection for shared memory between multiple applications, but does not address problems resulting from errant pointers or overrun arrays that may occur in static or heap variables. We are currently investigating ways to address this issue.
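The distinction the access violation handler has to draw can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the PermaNT implementation: pages are plain integers, each application is assigned one contiguous range, and "fetching" a page is reduced to marking it valid. The names SharedRegion, View, Fault, attach, and access are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Fault(Enum):
    NONE = "no fault"
    COHERENCE = "coherence fault"
    VIOLATION = "protection violation"


@dataclass
class View:
    owned: range                              # pages assigned to this application
    valid: set = field(default_factory=set)   # owned pages with a current local copy


class SharedRegion:
    """One shared region with a separate protection view per application."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.next_free = 0
        self.views = {}

    def attach(self, app, pages):
        # Assign the next contiguous block of pages to the application.
        if self.next_free + pages > self.num_pages:
            raise MemoryError("shared region exhausted")
        self.views[app] = View(range(self.next_free, self.next_free + pages))
        self.next_free += pages

    def access(self, app, page):
        """Classify an access the way the fault handler would."""
        if not 0 <= page < self.num_pages:
            raise IndexError("page outside the shared region")
        view = self.views[app]
        if page in view.valid:
            return Fault.NONE           # page already accessible in this view
        if page in view.owned:
            view.valid.add(page)        # stand-in for fetching the page
            return Fault.COHERENCE      # legitimate access; retry succeeds
        return Fault.VIOLATION          # page belongs to another application


if __name__ == "__main__":
    region = SharedRegion(num_pages=8)
    region.attach("sor", 4)
    region.attach("lu", 4)
    for app, page in [("sor", 1), ("sor", 1), ("sor", 5), ("lu", 5)]:
        print(app, page, region.access(app, page).value)
```

The same page number yields different outcomes depending on which application's view is consulted, which is the property the per-application views are meant to provide.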

3.3. Application Control PermaNT provides a single process space in which all shared memory applications on the cluster are run. In this scenario, a thread in a single application may cause the entire process to crash, bringing down all applications currently executing. To address this issue, we trap all exceptions raised by application threads. Exceptions, such as a divide-by-zero exception, must be handled by the PermaNT layer instead of being passed up to the OS for handling, as is typically done. PermaNT determines which application caused the fault, kills all threads associated with that application, reclaims the shared resources used by the application, and contacts the remote PermaNT processes to terminate the application. The execution of other applications may proceed unhindered. This brings up another issue, namely that of application development. The PermaNT layer is designed and intended to be used by mature applications that have already been debugged by the application programmer. In other words, PermaNT should not be used as a development environment, because when a debugger is attached to a user application, the entire process is halted and is eventually terminated when the debugger is detached. Clearly, this is not an option in an environment where other applications may be running concurrently in the same process. PermaNT can be brought up in “single application mode” on the cluster to facilitate application development.
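The exception containment described in this section can be sketched as follows. The sketch is a loose Python analogy rather than the Win32 mechanism: Python threads cannot be killed from outside, so the faulting application's remaining threads are stopped cooperatively through an event, and the message to remote runtime processes is reduced to a list entry. Host, AppRecord, launch, wait, and the two example applications are hypothetical names.

```python
import threading


class AppRecord:
    def __init__(self, name):
        self.name = name
        self.stop = threading.Event()   # cooperative stop signal for this app
        self.threads = []
        self.state = "running"


class Host:
    """Stand-in for a runtime process hosting several applications."""

    def __init__(self):
        self.lock = threading.Lock()
        self.notified = []   # stand-in for messages to remote runtime processes

    def _guard(self, app, body, tid):
        # Every application thread runs inside this wrapper, so an
        # exception never propagates far enough to affect other apps.
        try:
            body(tid, app.stop)
        except Exception as exc:
            self._terminate(app, exc)

    def _terminate(self, app, exc):
        with self.lock:
            if app.state != "running":
                return
            app.state = "terminated: " + type(exc).__name__
            app.stop.set()                   # stop the app's other threads
            self.notified.append(app.name)   # tell the other nodes

    def launch(self, name, body, num_threads):
        app = AppRecord(name)
        app.threads = [
            threading.Thread(target=self._guard, args=(app, body, i))
            for i in range(num_threads)
        ]
        for t in app.threads:
            t.start()
        return app

    def wait(self, app):
        for t in app.threads:
            t.join()
        with self.lock:
            if app.state == "running":
                app.state = "finished"
        return app.state


def faulty(tid, stop):
    if tid == 0:
        return 1 // 0          # raises ZeroDivisionError
    stop.wait(timeout=5)       # sibling thread idles until told to stop


def healthy(tid, stop):
    sum(range(1000))           # ordinary work that completes normally


if __name__ == "__main__":
    host = Host()
    bad = host.launch("bad", faulty, 2)
    good = host.launch("good", healthy, 2)
    print(host.wait(bad), "|", host.wait(good), "|", host.notified)
```

The design point the sketch tries to capture is that the fault is attributed to one application record, so cleanup touches only that application's threads and resources.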

4. PermaNT System in Practice This section briefly outlines the chief benefits of providing a common runtime substrate for all DSM applications as opposed to utilizing separate runtime systems for each application. We also present some preliminary data on the potential for multiprogramming in the context of the PermaNT system by examining the network and CPU utilization of PermaNT applications executing in isolation on our cluster. PermaNT has been implemented on a network of Dell PowerEdge 1500 machines running the Windows 2000 operating system. Each machine contains dual 866 MHz Pentium III processors and 1 GByte of main memory, and the interconnection network is the cLAN architecture manufactured by Emulex. This network provides unidirectional latencies of 5 µsec through a hardware implementation of the VI Architecture, well below those of traditional Ethernet, Fast Ethernet, or Gigabit Ethernet. We examine the network and CPU utilization requirements of 8 shared memory parallel applications running in isolation on the PermaNT system. The applications used include five from SPLASH-2 [10]: Barnes Hut-spatial, a modified version of the original benchmark; LU; Raytrace; Water-nsq; and Water-spatial. 3D-FFT comes from the NAS parallel benchmark suite [2], and SOR is a locally written implementation of successive over-relaxation. Finally, Ilink [4] is a genetic linkage package used to trace genes through family genealogies.

4.1. Resource Usage Cluster resource usage varies widely with application. In many instances, applications may be severely network-bound, leaving processors idle that could be used to complete work for other, less network-intensive applications. Figure 3 shows that the network requirements vary greatly across the 8 applications studied, with only 3D-FFT requiring consistently high network utilization. Barnes shows cyclic utilization, while the network requirements of the other 6 applications are relatively modest. Figure 4 shows that the CPU utilization of the 8 applications studied also varies widely, with only 2 applications utilizing the CPU resources at consistently high levels throughout the time slice examined. One solution to the cluster resource problem is simply to run multiple traditional software DSM layers simultaneously. However, running two such layers introduces twice the overhead of a single DSM system (twice the network connections, twice the shared memory page structure information, twice the synchronization variable overhead, etc.). Especially in a DSM system that makes use of a user-level network such as the Virtual Interface Architecture, in which message buffers must be pinned in physical memory, the overhead in terms of real memory consumed can be high [7]. Additionally, most user-level networks limit the number of concurrently open network connections to the number of hardware queue pairs available on the NIC (1024 in the case of cLAN). Allowing multiple applications to share network endpoints addresses this issue.

Figure 3. Network Requirements of PermaNT Applications (link bandwidth used, in MB, versus time; series: 3D-FFT, Barnes, Ilink, LU, Raytrace, SOR, Water-nsq, Water-spatial)

Figure 4. Processor Utilization of PermaNT Applications (% processor utilization versus time; series: 3D-FFT, Barnes, Ilink, LU, Raytrace, SOR, Water-nsq, Water-spatial)

4.2. Multiprogramming PermaNT provides the first multiprogrammed software DSM runtime environment. Threads from multiple applications run within the same context of the PermaNT runtime system, eliminating the need for multiple instances of redundant runtime support structures. The PermaNT system contains performance monitoring that seeks to prevent starvation of any single application by fairly allocating resources when contention arises. Alternatively, the PermaNT command application allows users to raise or lower the priority of each application individually according to the needs of the cluster users. We are currently investigating the performance improvements gained by using a multiprogrammed software DSM layer. Possible performance improvements may arise from improved load balancing, message sharing between application threads, and a more even resource distribution.

4.3. Fault Tolerance Our previous work on providing fault tolerance in software DSM systems [1] has shown that it is possible to automatically recover from single node failures, add and remove nodes from a running computation on-the-fly, and bring down nodes for maintenance and upgrades without stopping running shared memory cluster applications. PermaNT, by providing a common runtime system for all shared memory applications, allows this fault tolerance work to be extended to cover multiple applications instead of being carried out individually on a per-application basis, greatly reducing the amount of time that must be spent on fault tolerance techniques.

5. Future Work and Conclusions We have presented the design and implementation of the PermaNT system, a runtime system that facilitates the simultaneous execution of multiple software distributed shared memory applications in a cluster setting. PermaNT has many advantages over traditional one-library-per-application shared memory systems, including built-in support for multiprogramming, lower application initialization overhead, more balanced resource usage, reduced system management, and improved support for fault tolerance. Ongoing research in the PermaNT system includes providing cluster management support functionality, developing the correct metrics to dynamically alter application resource usage, and extending our previous work in fault tolerance to provide for the recovery of multiple applications simultaneously.

References
[1] H. Abdel-Shafi, E. Speight, and J. K. Bennett. Efficient User-Level Thread Migration and Checkpointing on Windows NT Clusters. In Proceedings of the Third USENIX Windows NT Symposium, pp. 1-10, July 1999.
[2] D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS Parallel Benchmarks. NASA Ames RNR-91-002, August 1991.
[3] J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence. In Proceedings of the 1990 Conference on the Principles and Practice of Parallel Programming, pp. 168-176, March 1990.
[4] S. Dwarkadas, R. W. Cottingham Jr., P. Keleher, A. A. Schaffer, A. L. Cox, and W. Zwaenepoel. Parallelization of General Linkage Analysis Problems. Human Heredity, vol. 44, pp. 127-141, 1994.
[5] G. Hunt and D. Brubacher. Detours: Binary Interception of Win32 Functions. In Proceedings of the 3rd USENIX Windows NT Symposium, July 1999.
[6] I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. Fine-grain Access Control for Distributed Shared Memory. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 297-306, 1994.
[7] E. Speight, H. Abdel-Shafi, and J. K. Bennett. Multiprogramming in the BrazosMP Parallel Runtime System. Cornell University CSL-TR-2002-1022, February 2002.
[8] E. Speight and J. K. Bennett. Brazos: A Third Generation DSM System. In Proceedings of the First USENIX Windows NT Workshop, pp. 95-106, August 1997.
[9] K. Thitikamol and P. Keleher. Multi-Threading and Remote Latency in Software DSMs. In Proceedings of the 17th International Conference on Distributed Computing Systems, May 1997.
[10] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. Methodological Considerations and Characterization of the SPLASH-2 Parallel Application Suite. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 24-36, June 1995.
[11] D. Yeung, J. Kubiatowicz, and A. Agarwal. MGS: A Multigrain Shared Memory System. In Proceedings of the 23rd International Symposium on Computer Architecture, pp. 44-55, May 1996.