User–Transparent Run–Time Performance Optimization


This paper has been accepted to EHPC ’97, the 2nd International Workshop on Embedded HPC Systems and Applications at the 11th IEEE International Parallel Processing Symposium. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the workshop organizers. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors and by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. This work may not be reposted without the explicit permission of the copyright holder.


Dr. Samuel H. Russ, Brad Meyers, Jonathan Robinson, Matt Gleeson, Laxman Rajagopalan, Chun–Heong Tan, and Bjørn Heckel
Mississippi State University
NSF Engineering Research Center for Computational Field Simulation

Abstract –– High–performance embedded systems are being implemented as parallel and distributed systems with increasing frequency. There is strong motivation to make such systems adaptive and/or dynamic in order to obtain maximum performance and maximum reliability. One important component of an adaptive or dynamic parallel system is the ability to monitor the performance of running programs and make some effort to optimize execution. The goal of this research, part of a project named Hector, is to monitor the performance of parallel programs ‘‘behind the user’s back’’ to try to minimize run time on available resources.

1. Introduction

Development of parallel systems continues. The recent trend in supercomputers away from monolithic sequential systems toward clusters of commodity processors highlights the emergence of parallel programming and architectures into the mainstream. One commonly used parallel programming model is that of shared memory. For example, Linda permits memory sharing on a variety of distributed platforms, and so offers a degree of architecture–independence [4]. This programming model abstracts away the issues associated with data distribution, which leads some to conclude that it presents a simpler interface to the programmer. However, abstracting away the underlying physical implementation can penalize performance, a serious drawback in arenas where high performance is of maximal value. Another model is that of message–passing. Based on Hoare’s Communicating Sequential Processes model [1], it explicitly expresses all communication between sequential tasks. Many architecture–independent standards for message–passing have been proposed, and at least two (PVM [2] and MPI [3]) have gained widespread acceptance. Using these systems has the advantage of expressing parallel programs in a way that can run on a very wide variety of architectures.

There are several critical technologies that must be combined in order to form a complete parallel run–time system capable of fully exploiting whatever resources are available. For example, one is in the area of maintaining awareness of resource availability and actual run–time performance. Another desirable property is total transparency to the parallel programmer. This paper describes a system named Hector that has been developed to maintain such awareness and make run–time performance optimizations transparently. Section 2 describes Hector’s structure at run–time, its library’s ‘‘middleware’’ structure, and modifications to the MPI implementation that permit task migration. Section 3 describes methods of estimating resource usage and of instrumenting the communications library to collect performance results. Section 4 describes the infrastructure needed to collect performance data from distributed systems. Section 5 describes the central optimization process. The paper concludes with a discussion of future plans.

2. Middleware Layer Design to Support User Transparency

2.1. Hector’s Run–Time Structure

Hector is designed to provide the infrastructure to control parallel programs during their execution and monitor their performance. It does this by running in a distributed manner, as shown in Figure 1. The central decision–maker and control process is called a ‘‘master allocator’’ or ‘‘MA’’. Running on each candidate platform (where a ‘‘platform’’ can range from a desktop workstation to an SMP) is a supervisory task called a ‘‘slave allocator’’ or ‘‘SA’’. The SA’s gather performance information from the ‘‘tasks’’ (pieces of MPI programs) under their control and execute commands issued by the MA.
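The control loop implied by this division of labor can be sketched as follows. This is an illustrative skeleton only: the message types, function names, and the 5–second period (taken from Section 3.3) are assumptions, not Hector’s published interface.

```c
/* Hypothetical skeleton of a slave allocator's supervisory loop. */
#include <stdio.h>
#include <unistd.h>

typedef enum { CMD_NONE, CMD_LAUNCH_TASK, CMD_MIGRATE_TASK } ma_command_t;

static void gather_performance(void) { /* procfs reads, per Section 3      */ }
static void report_to_ma(void)       { /* socket send to the MA, Section 4 */ }
static ma_command_t poll_ma_command(void) { return CMD_NONE; /* stub */ }

int main(void)
{
    for (;;) {
        gather_performance();            /* per-task and system-wide stats  */
        report_to_ma();                  /* periodic status update          */
        if (poll_ma_command() == CMD_MIGRATE_TASK) {
            /* signal the chosen task's Hector library to self-migrate      */
        }
        sleep(5);                        /* statistics updated every 5 s    */
    }
}
```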

2.2. Interfaces between the Hector Library, MPI, and Command/Control System

MPI programs are linked with a special MPI library in order to interface with the Hector run–time system. This library provides a complete MPI implementation as well as interfaces to a self–migration facility, to Hector’s command and control structure, and to an instrumentation facility.

This work was funded in part by U.S. Army Grant No. EEC–8907070 Amend #021


[Figure 1: Hector’s Run–Time Structure. The master allocator exchanges commands and performance/system information with a slave allocator on each platform; other slave allocators connect in the same way. Each slave allocator relays commands to, and collects performance information from, its local MPI tasks.]

(The MPI implementation is based on the MPICH implementation developed at Argonne National Laboratory and Mississippi State University.) Thus unmodified MPI programs can be linked with this library and obtain access to services such as task migration, checkpointing, and near–real–time performance estimation. Because these run–time facilities are accessed via a modified MPI library, they are ‘‘invisible’’ to the programmer, a key aspect of Hector’s design. These interfaces are diagrammed in Figure 2. All programmer access into the Hector library is through calls into the MPI library.

[Figure 2: The Hector Library and Its Interfaces. The MPI–based source code (what the programmer ‘‘sees’’) calls into Hector’s library, which contains the MPI implementation, performance instrumentation, a self–migration facility, and an interface to the control infrastructure. The MPI implementation communicates with other MPI tasks, the self–migration facility with migration source/destination machines, and the control interface with the Hector control system.]


Hector’s command and control system (the SA’s and MA) connects into the library via signals and sockets; its interface into the library is therefore a signal handler. This same signal handler is also used to access the self–migration facility. Performance instrumentation is inserted via ‘‘wrapper functions’’ between the programmer’s MPI call and the underlying MPI implementation.

2.3. Modifications for Task Migration

There were two obstacles to the development of MPI–compatible, programmer–transparent task migration. First, a means of migrating a running Unix process had to be developed in such a way as to maintain the program’s state. Second, MPI had to remain intact during and after migration. Details of the development of the migration mechanism and of the ways of maintaining MPI’s integrity are discussed in [6].

3. User–Transparent Instrumentation of Parallel Programs

To track the performance of a parallel program, the underlying communications library can be instrumented to extract meaningful run–time information. In addition, each machine’s kernel can be used to gather aggregate performance information.

3.1. Reading Aggregate Performance and Memory Usage of a Task

Recall that each candidate computer has an SA running on it. The slave allocator collects performance data and relays commands from the central decision maker (the ‘‘master allocator’’ or ‘‘MA’’). Every running task therefore shares its machine with a slave allocator. Performance data can be collected from each running task using the ‘‘procfs’’ interface, through which one process can read the state and resource usage of other processes on the same machine. For example, one call using the procfs interface can read the total amount of memory or CPU time used by a single process. This enables the SA to maintain a picture of an MPI task’s CPU and memory usage. It also enables the SA to maintain a picture of aggregate system performance and availability, as discussed in Section 4.1 below.
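As a concrete illustration of this kind of procfs read, the following minimal sketch collects a process’s CPU time and memory size. The paper uses the Solaris and Irix procfs interfaces; this sketch assumes a Linux–style /proc layout (and an executable name without spaces) purely for illustration.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Fill in user/system CPU time (seconds) and program size (KB) for a pid.
 * Returns 0 on success, -1 on failure. */
int read_task_usage(pid_t pid, double *utime_s, double *stime_s, long *size_kb)
{
    char path[64];
    unsigned long utime, stime, pages;
    long ticks   = sysconf(_SC_CLK_TCK);
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;

    /* Fields 14 and 15 of /proc/<pid>/stat are utime and stime in clock ticks. */
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               &utime, &stime) != 2) { fclose(f); return -1; }
    fclose(f);

    /* The first field of /proc/<pid>/statm is total program size in pages. */
    snprintf(path, sizeof path, "/proc/%d/statm", (int)pid);
    f = fopen(path, "r");
    if (!f) return -1;
    if (fscanf(f, "%lu", &pages) != 1) { fclose(f); return -1; }
    fclose(f);

    *utime_s = (double)utime / ticks;
    *stime_s = (double)stime / ticks;
    *size_kb = (long)pages * page_kb;
    return 0;
}

int main(void)
{
    double u, s;
    long kb;
    if (read_task_usage(getpid(), &u, &s, &kb) == 0)
        printf("user %.2f s, system %.2f s, size %ld KB\n", u, s, kb);
    return 0;
}
```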

3.2. Differentiating Computation and Communication via Instrumentation

The MPI library readily lends itself to detailed performance instrumentation. Since inter–process communication is presented to the programmer as a set of function calls, the functions can be redirected via ‘‘wrapper functions’’ that, in turn, perform any desired degree of instrumentation. One key figure of merit for a parallel task is the degree to which it is CPU–limited or communications–limited. Tracking it is a straightforward matter. Every time the communications library is entered or exited, adjustments are made to globally visible CPU time data. For example, when the program enters the MPI library, the CPU time recorded when the task last exited the library is subtracted from the current total, and the difference is credited to ‘‘computation time’’. The result of these calculations is a data structure that contains the amount of CPU time spent ‘‘communicating’’ (inside the MPI library) and ‘‘computing’’ (outside the MPI library). CPU time is further differentiated into ‘‘system time’’ and ‘‘user time’’. There is an implicit assumption, one that is warranted for message–passing–based parallel programs: time spent inside the MPI library is devoted exclusively to communication, and time outside the library is devoted exclusively to computation.
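A wrapper of this kind can be sketched with the standard MPI profiling interface (PMPI_*), shown below for MPI_Send only. Hector’s own library implements the wrappers internally, so this is an illustration of the bookkeeping rather than Hector’s code; it assumes an MPI–3 style prototype and lumps user and system time together for brevity.

```c
/* Link this translation unit ahead of the MPI library so that MPI_Send
 * resolves here and forwards to PMPI_Send. */
#include <mpi.h>
#include <sys/time.h>
#include <sys/resource.h>

static double comm_time, comp_time;   /* globally visible accumulators        */
static double last_mark;              /* CPU time at the last library exit    */

static double cpu_seconds(void)       /* user + system CPU time of this task  */
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6 +
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1e-6;
}

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    last_mark = cpu_seconds();         /* start the "computation" clock       */
    return rc;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double entry = cpu_seconds();
    comp_time += entry - last_mark;    /* time since last exit = computation  */
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    last_mark = cpu_seconds();
    comm_time += last_mark - entry;    /* time spent inside the library       */
    return rc;
}
```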

3.3. Testing the Overhead of Instrumentation

A series of tests was run to optimize the instrumentation and to determine the overhead associated with it. The tests were run on an otherwise unloaded 110 MHz Sparcstation 5; thus 1 µs in the ensuing discussion corresponds to 110 CPU clock cycles. First, the fastest means of determining a task’s CPU usage under Solaris was determined by experimentation. Second, the function that is called when a task enters or exits MPI was called 1,000,000 times. The resulting program ran in 102.8 sec, corresponding to an average of 102.8 µs per function call. Since about 86 µs is spent reading the CPU time, the remaining 16 µs corresponds to extra overhead and calculations. The conclusion is that gathering fine–grained CPU usage statistics adds about 206 µs of overhead per MPI function call (including both entry and exit). How does this compare with typical usage? A ‘‘typical’’ MPI program may spend several seconds computing and several ms communicating. Even if the call to MPI takes 10 ms (which is optimistic in actual practice), the extra time overhead for tracking performance is about 2%. This overhead is only incurred at the entry and exit points of each function call, and has no impact on available network bandwidth. Third, the overhead associated with an SA reading a single task’s usage was tested. The first method that was implemented opens the file descriptor, reads the data, and closes the descriptor on every call. The second method opens the descriptor once and reads the data multiple times. The second method reduced the CPU time from 1612 µs/call to 581 µs/call, and so is the preferred method. An individual SA currently updates its statistics every 5 seconds. This process takes about 5 ms on a Sun Sparcstation 5, and so corresponds to an extra CPU load of 0.1%. The process of reading the task’s CPU usage adds 581 µs per task every time that the SA updates its statistics (every 5 s). Adding the reading of detailed usage information therefore adds about 0.01% CPU load per task. To summarize, the extra time overhead for keeping detailed CPU usage information occurs in two different places. Consider a 110 MHz Sparcstation 5. First, every call to the MPI library takes on the order of 2% longer in order to track usage properly. Second, the SA requires an extra CPU load of 0.01% per task in order to gather the usage information. For example, if 20 tasks were running on one machine, this would add an extra 0.2% CPU load.
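The percentages quoted above follow from simple ratios; as a check:

$$\frac{206\ \mu\text{s}}{10\ \text{ms}} \approx 2\%, \qquad \frac{5\ \text{ms}}{5\ \text{s}} = 0.1\%, \qquad \frac{581\ \mu\text{s}}{5\ \text{s}} \approx 0.012\% \approx 0.01\%, \qquad 20 \times 0.01\% = 0.2\%.$$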

3.4. Testing the Accuracy of Instrumentation

The MPI library was modified to collect CPU time information. A two–task matrix multiply program was tested on two different workstations using the CPU–time infrastructure. One workstation is a 143–MHz Sparc Ultra and the other a 167–MHz Ultra. The standard deviation in run–time measurements was obtained both for the total CPU time measurement and for the measurement of user CPU time in computation. The standard deviation divided by the average dropped from 1.35% to 1.33% for one task and from 3.8% to 0.69% for the second task. Thus the measurement of run time is made more accurate by ‘‘filtering out’’ time spent in system calls and in communication. Likewise, the products of run time and CPU MHz for each workstation were compared. The difference between the two products dropped from 2.06% to 1.03% for one task and from 5.96% to 1.29% for the second task when system calls and communication time were ‘‘filtered out’’. Thus it was shown that differentiation of computation and communication time produces a more accurate picture of actual performance, and that the product of CPU time and relative performance is nearly constant across nearly identical platforms of varying performance. The next step is to add this capability to the SA. This will require minor modifications to the communications protocol between the SA and each task and additions to the SA’s data structures.

4. Infrastructure to Collect Performance Data

4.1. Developing a Picture of System Performance


The slave allocators are the focal point of gathering run–time performance information. The ‘‘procfs’’ system enables the SA’s to sum the CPU time spent by MPI tasks and the SA itself. This sum is considered ‘‘internal load’’. To gain an accurate picture of external load, the SA is run with root permission and reads the total CPU time credited to all tasks (except idle time, of course). The difference in CPU time is considered ‘‘external load’’. When the ratio of external load to total load exceeds a pre–defined threshold, the SA notifies the MA. The MA recomputes a load distribution and moves the tasks off of the machine. Conversely, when the ratio drops below a pre–defined threshold, the SA notifies the MA. The MA recomputes a load distribution and may move running tasks onto the newly available machine. The load samples are taken every 5 seconds and are averaged over time to smooth out transient loads.

The SA uses kernel reads (under Solaris) or ‘‘sysmp’’ reads (under Irix) to determine the amount of memory that is available. Currently there is no differentiation of physical and virtual memory. This information is used to constrain allocation and prevent running out of memory. Tests run during the day show that a 15% external load limit provides a reasonable degree of responsiveness to interactive users. Often Hector is able to run small jobs without interactive users noticing the extra load, because the time to off–load the job is on the order of the time to page interactive jobs back in.
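As a concrete illustration of the internal/external load split described above, the check the SA performs might look like the following sketch; the structure, field names, and function are assumptions for illustration, and only the 15% figure and the internal/external definitions come from the text.

```c
#include <stdio.h>

typedef struct {
    double internal_cpu;   /* CPU time used by local MPI tasks + the SA   */
    double total_cpu;      /* all non-idle CPU time credited on the host  */
} load_sample_t;

#define EXTERNAL_LOAD_LIMIT 0.15   /* paper: 15% keeps interactive users responsive */

/* Returns 1 if the SA should tell the MA to move tasks off this machine. */
static int machine_too_busy(const load_sample_t *s)
{
    double external = s->total_cpu - s->internal_cpu;
    return s->total_cpu > 0.0 &&
           (external / s->total_cpu) > EXTERNAL_LOAD_LIMIT;
}

int main(void)
{
    load_sample_t s = { .internal_cpu = 3.2, .total_cpu = 4.0 };  /* 20% external */
    printf("too busy: %d\n", machine_too_busy(&s));               /* prints 1     */
    return 0;
}
```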

4.2. Gathering Information from Multiple Machines

Each SA is responsible for maintaining detailed information about its host system and the MPI tasks under its control. This information is sent to the MA periodically (via Unix sockets) so that it can remain fully informed about the status of the entire cluster and use the information to make allocation decisions. One of the limits of Hector’s scalability is the speed with which the MA can process status messages from the SA’s. To test this time, a master allocator was run on an otherwise unloaded 110 MHz Sparcstation 5. A slave allocator was run on a different, comparable Sparc 5 and was connected over a loaded, conventional 10 Mbit/s Ethernet. The wall–clock time to process an ‘‘update’’ message was measured as the number of running tasks was varied from 1 to 10. The results are shown below in Figure 3. A single status message takes about 1.16 + 0.18·NT ms to process, where NT is the number of tasks. For example, a message from an SA on a machine that is running 10 MPI tasks takes about 2.9 ms to process. Assuming that the MA should spend about 3 seconds out of every 5 processing messages (with the remainder devoted to optimization), this implies that a single MA running on a 110 MHz Sparcstation 5 could support about 1000 SA’s, each supervising 10 tasks.



3 2 1 0 0

5 Number of Tasks

10

Figure 3: Time to Process Messages vs. Number of Tasks
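A quick check of the capacity figure quoted above, under the paper’s assumption that 3 of every 5 seconds go to message processing:

$$t_{\text{msg}}(N_T) \approx 1.16 + 0.18\,N_T\ \text{ms}, \qquad t_{\text{msg}}(10) \approx 2.96\ \text{ms}, \qquad \frac{3000\ \text{ms}}{2.96\ \text{ms}} \approx 1000\ \text{SA's}.$$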



5. Construction of the Central Optimizer

5.1. Algorithm Interface Design

The Master Allocator is designed to collect performance information and make allocation decisions. Besides collecting CPU load and CPU relative performance information, the allocator collects memory availability information and idle/non–idle information. The optimizer is invoked when resources become idle or non–idle, when jobs are launched, and, optionally, at periodic intervals to maintain a balance. The optimization algorithm is coded as a separate function so that it can be easily modified and/or replaced. It accepts as input CPU Usage, Relative CPU Performance, Task Memory Usage, and Available Memory of Each CPU. Future versions will also accept Fraction of Time Communicating, Fraction of Time Computing, Program Topology and Communications Traffic, Physical Network Topology, Node Fault Information, and a List of Suspended Jobs. After performing some optimization, it has the authority to launch jobs and migrate tasks. Future versions will be able to checkpoint jobs, suspend them, resume suspended jobs, and kill them.

The relative performance of the CPU is measured by the SA when the SA is started. Aggregate CPU usage and available memory are found by reading information from the kernel. The CPU and memory usage of individual tasks is found by reading that task’s ‘‘procfs’’ information. Fractions of time communicating and computing will be tracked by individual tasks, as will program topology and communications traffic. The physical network topology will likely be provided directly to the MA from system administration information. Node fault information can be garnered from dropped or missing status updates from SA’s.
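The interface just described amounts to a plain data structure handed to a replaceable optimization function. The sketch below is illustrative only: Hector’s actual types are not given in the paper, so every name, field, and limit here is hypothetical.

```c
/* Hypothetical sketch of the optimizer's calling interface. */
#define MAX_HOSTS 256
#define MAX_TASKS 1024

typedef struct {
    int    num_hosts, num_tasks;
    double cpu_usage[MAX_HOSTS];        /* aggregate CPU load per host        */
    double relative_power[MAX_HOSTS];   /* measured when each SA starts       */
    long   free_memory_kb[MAX_HOSTS];   /* from kernel / sysmp reads          */
    long   task_memory_kb[MAX_TASKS];   /* from each task's procfs data       */
    int    task_host[MAX_TASKS];        /* current placement, -1 if hostless  */
    /* future inputs: communication fractions, topology, fault data, ...      */
} opt_input_t;

typedef enum { ACT_LAUNCH, ACT_MIGRATE } action_kind_t;

typedef struct {
    action_kind_t kind;                 /* launch a job or migrate a task     */
    int           task;
    int           dest_host;
} opt_action_t;

/* The MA calls a single, easily replaceable function and then carries out
 * the returned actions (launches and migrations). */
int optimize(const opt_input_t *in, opt_action_t *actions, int max_actions);
```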

5.2. Optimization Algorithms

An exhaustive optimizer was coded in order to determine the CPU time required. Note that the run time of the exhaustive search is O(H^T), where T is the number of tasks and H is the number of hosts. This proved to be prohibitively slow. For example, it took 734 seconds to map 8 tasks to 8 hosts [7]. A heuristic optimizer was then coded, and works as follows.

The first phase of the algorithm determines the ‘‘optimal’’ allocation of tasks to hosts. This is done by dividing the number of tasks among the hosts in an amount proportional to each machine’s relative performance, as shown in equation (1):

Ideal_i = ( Power_i / Σ_j Power_j ) × Σ_j Tasks_j    (1)

where Ideal_i is the ideal number of tasks to place on machine i, Power_i is the relative computational power of machine i, and Tasks_i is the current number of tasks on machine i. For example, if one machine was twice as fast as another, and there were six tasks to allocate, the faster machine should get 4.0 tasks and the slower 2.0. If there is not much available memory among the available computers, the allocation policy switches from ‘‘processor–limited’’ to ‘‘memory–limited’’. In the latter case, allocation is governed by available memory, and so that is the criterion used to evaluate changes in allocation. The former case requires more optimization.

The second phase is to assign ‘‘hostless’’ tasks to machines that have less than the ‘‘optimal’’ number of tasks. Hostless tasks are either newly launched tasks or tasks running on machines that have become too busy to run jobs. The algorithm first tries to find ‘‘really hungry’’ machines and then falls back to ‘‘slightly hungry’’ ones. The ‘‘hungriness’’ of a machine is the difference between the ideal number of tasks and the actual number of tasks. To continue the example, if the faster machine already had 2 tasks running on it and the slower had 1, the faster machine would have a deficit of 2 tasks and the slower a deficit of 1. Thus the faster machine is ‘‘hungrier’’. The algorithm searches to find machines with a task deficit greater than 1. There is no attempt to sort tasks by deficit, as that would increase the order of the search process. Instead, the first machine found to have a task deficit greater than one is assigned a hostless task. If it fails to find a machine with such a high task deficit, which can happen if the load is nearly optimally balanced, it searches to find a machine with a nonnegative task deficit. Notice that there is always at least one machine with a nonnegative task deficit, because the definition of ‘‘task deficit’’ forces the sum of all task deficits to be 0.

The third phase is to search for machines that have a substantial deficit of tasks (not enough work to do). If it finds one, it then tries to find a machine with a surplus and, if so, migrates a task to maintain a load balance. Because this phase of the optimization uses incremental changes to the original allocation, it inherently reduces the number of load–balancing task migrations. This third phase is governed by the size factor, a heuristic coefficient less than 1. It searches for a host with a task deficit above the size factor. For example, if the factor is 0.85, it looks for a host with a deficit of 0.85 tasks. If it finds one, it then tries to find a host with a surplus of tasks at least as large as the size factor. If such a host is also found, the task is ordered to migrate. The algorithm has matched a host that is not busy enough with one that is too busy. As the size factor is made smaller, the algorithm will find more candidates for migration.

The first phase is order H+T (where H = number of hosts and T = number of tasks), the second is order HT, and the third order H^2. (The third could be reduced to order H·logH if a more efficient search process were coded.) Thus the overall algorithm is order (HT + H^2), and so is order T if T>H or order H^2 if H>T.
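The first two phases lend themselves to a compact sketch. The code below mirrors the description above and its worked example (a machine twice as fast as another, six tasks to allocate); the function and array names are invented, and phase 3 (the size–factor migration pass) is only indicated in a comment.

```c
#include <stdio.h>

/* Phase 1: Ideal_i = (Power_i / sum of Power) * total number of tasks   (1) */
static void compute_ideal(int nhosts, const double *power,
                          int total_tasks, double *ideal)
{
    double total_power = 0.0;
    for (int i = 0; i < nhosts; i++) total_power += power[i];
    for (int i = 0; i < nhosts; i++)
        ideal[i] = (power[i] / total_power) * total_tasks;
}

/* Phase 2 helper: give one hostless task to the first machine whose task
 * deficit ("hungriness") exceeds min_deficit; returns the host or -1.       */
static int place_hostless(int nhosts, const double *ideal, int *ntasks,
                          double min_deficit)
{
    for (int i = 0; i < nhosts; i++) {
        double deficit = ideal[i] - ntasks[i];
        if (deficit > min_deficit) {
            ntasks[i]++;                 /* assign the task here              */
            return i;
        }
    }
    return -1;
}

/* Phase 2: try "really hungry" machines first, then any nonnegative deficit.
 * Phase 3 (not shown) would pair a host with deficit > size_factor against
 * one with a surplus of at least size_factor and order a migration.         */
int assign_one_task(int nhosts, const double *ideal, int *ntasks)
{
    int host = place_hostless(nhosts, ideal, ntasks, 1.0);
    if (host < 0)
        host = place_hostless(nhosts, ideal, ntasks, -1e-9);
    return host;
}

int main(void)
{
    double power[]  = { 2.0, 1.0 };      /* one machine twice as fast         */
    int    ntasks[] = { 2, 1 };          /* current placement                 */
    double ideal[2];
    compute_ideal(2, power, 6, ideal);   /* six tasks total -> 4.0 and 2.0    */
    printf("hostless task assigned to host %d\n",
           assign_one_task(2, ideal, ntasks));   /* deficit 2 > 1 -> host 0   */
    return 0;
}
```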


5.3. Testing and Results

The heuristic optimizer was tested to determine both the percentage of time that it produced a successful optimization (where ‘‘success’’ was defined as not running out of memory) and the run–time of the optimization algorithm itself. Some of the results are summarized below. (More detailed information may be found in [8].) A simulation program was written to submit random distributions of mapped tasks, hosts with extra available memory, and unmapped tasks. These distributions were varied over numbers of hosts, numbers of tasks, and the fraction of tasks that were considered ‘‘hostless’’. As long as there were more hosts than unmapped tasks, all allocations succeeded. As memory became tightly constrained, the algorithm’s performance degraded gracefully until allocation became impossible (more hostless tasks than available memory).

The same simulation environment was used to test the execution time. Figure 4 shows the algorithm run–time for the case when all tasks were ‘‘hostless’’. Each tick of the vertical scale corresponds to an average run time of 10 ms. The horizontal axis is the number of tasks, ranging from 10 to 100 in steps of 10. The ‘‘Y’’ axis (the one that runs into the paper) is the number of hosts, ranging from 10 to 100 in steps of 10.

[Figure 4: Average Run Time for 100% Hostless Tasks. Surface plot of average run time (vertical scale in 10 ms increments, roughly 10 to 20 ms) versus number of tasks (10 to 100) and number of hosts (10 to 100).]

The maximum time for all scenarios was 31.03 ms, and represented the case with 100 tasks, 100 hosts, and 70% hostless tasks. As expected, the run time increases with both hosts and tasks. Curve–fitting software was used to fit second–order equations to all run–time information. The results are shown in Figure 5 for different fractions of hostless tasks. Runtime is the run time in ms, H is the number of hosts, and T is the number of tasks.

Runtime_100% hostless = −0.0006T^2 + 0.0011H^2 + 0.0027HT − 0.0482H + 0.1546T + 1.2938    (2)
Runtime_70% hostless = 0.0004T^2 − 0.0001H^2T − 0.0012H^2 + 0.0069HT + 0.0781H + 0.0670T + 2.2327    (3)
Runtime_40% hostless = 0.0003T^2 + 0H^2 + 0.0036HT + 0.0211H + 0.0851T + 1.4130    (4)
Runtime_0% hostless = −0.0002T^2 − 0.0002H^2 + 0.0003HT + 0.1054H + 0.1378T + 0.0550    (5)

Figure 5: Equations to Describe Run–Time as a Function of Hosts and Tasks

The primary conclusion drawn from these results is that the runtime is primarily a function of the number of hosts and the number of tasks, and not a strong function of either HT or H^2. Consequently, it is believed that the algorithm will scale well up to hundreds of tasks and hosts.

6. Future Work

There are several ways that this work will be extended. First, determining the fraction of time the application spends in computation versus communication must be added to the SA and incorporated into the MA’s optimization process. Second, new optimization heuristics and algorithms can be tested. Third, the information–gathering infrastructure can be applied to completely different parallel programming paradigms. For example, data–parallel algorithms, which re–divide the problem domain to maintain a load balance, can benefit from Hector’s ongoing awareness of resources and their usage.

7. Bibliography

[1] C.A.R. Hoare, ‘‘Communicating Sequential Processes’’, Communications of the ACM, vol. 21, no. 8, pp. 666–677, August 1978.
[2] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam, PVM: Parallel Virtual Machine, Cambridge, Mass.: The MIT Press, 1994.
[3] William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI, Cambridge, Mass.: The MIT Press, 1994.
[4] Sudhir Ahuja, Nicholas Carriero, and David Gelernter, ‘‘Linda and Friends’’, Computer, vol. 19, no. 8, pp. 26–34, August 1986.
[5] Samuel H. Russ, Brian Flachs, Jonathan Robinson, and Bjørn Heckel, ‘‘Hector: Automated Task Allocation for MPI’’, Proceedings of the 10th International Parallel Processing Symposium, Honolulu, HI, 1996.
[6] Jonathan Robinson, Samuel H. Russ, Brian Flachs, and Bjørn Heckel, ‘‘A Task Migration Implementation for the Message–Passing Interface’’, Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing (HPDC–5), Syracuse, NY, 1996.
[7] Samuel H. Russ, Jonathan Robinson, Brian K. Flachs, and Bjørn Heckel, ‘‘The Hector Parallel Run–Time Environment’’, submitted to IEEE Transactions on Parallel and Distributed Systems.
[8] Samuel H. Russ, Brad Meyers, Jonathan Robinson, Matt Gleeson, Laxman Rajagopalan, and Chun–Heong Tan, ‘‘Dynamic Load Balancing for MPI Jobs under Hector’’, submitted to The Journal of Parallel and Distributed Computing.

