Implications of I/O for Gang Scheduled Workloads

Walter Lee, Matthew Frank, Victor Lee, Kenneth Mackenzie, and Larry Rudolph
M.I.T. Laboratory for Computer Science
Cambridge, MA 02139, U.S.A.
{walt, mfrank, wklee, kenmac, rudolph}@lcs.mit.edu
http://cag-www.lcs.mit.edu/~walt

Abstract

This paper examines the implications of gang scheduling for general-purpose multiprocessors. The workloads in these environments include both compute-bound parallel jobs, which often require gang scheduling, and I/O-bound jobs, which require high CPU priority to achieve interactive response times. Our results indicate that an effective interactive multiprocessor scheduler must weigh both the benefits and costs of gang scheduling when deciding how to allocate resources to jobs. This paper answers a number of questions about gang scheduling in the context of a variety of synthetic applications and SPLASH benchmarks running on the FUGU scalable multiprocessor workstation. We show that gang scheduling interferes with the performance of I/O-bound jobs, that applications do not benefit equally from gang scheduling, that most real applications can tolerate at least a small amount of scheduling skew without major performance degradation, and that messaging statistics can provide important clues to whether applications require gang scheduling. Taken together these results suggest that a multiprocessor scheduler can deliver interactive response times by dynamically monitoring and adjusting its resource allocation strategy.

1 Introduction

This paper examines the problems of gang scheduling workloads with an I/O-intensive component. Whereas gang scheduling provides better performance than uncoordinated scheduling for compute-bound parallel jobs with frequent synchronization, conventional gang scheduling leads to poor performance of I/O-bound jobs, which require short, high-priority bursts of processing. This paper presents the results of several experiments addressing four important multiprocessor scheduling issues: the costs of gang scheduling in terms of the response time of I/O-bound jobs; the class of parallel jobs that benefit from gang scheduling; how jobs behave in a gang scheduling environment with perturbations; and how a scheduler can make simple runtime measurements to determine whether a job will benefit from gang scheduling.

Multiprocessor architecture is expanding from its supercomputing niche into the mainstream. Networks of workstations, scalable workstations, and SMP clusters attempt to fill the role of traditional uniprocessor workstations. A major challenge is to provide good response time with workloads which consist of both I/O-bound jobs and compute-bound parallel jobs. This paper takes on this challenge.



This research is funded in part by ARPA contract #N00014-94-1-0985, in part by NSF Experimental Systems grant #MIP-9504399, and in part by an NSF Presidential Young Investigator Award.

Uniprocessor schedulers provide good response times to I/O-bound jobs by giving them higher priority than compute-bound jobs. Traditional multiprocessor gang schedulers, on the other hand, schedule round-robin and do not collect priority information. We find that traditional gang scheduling schemes under-utilize disk resources when I/O-bound jobs are present. Because the scheduler has no notion of priority it often allocates processors to compute-bound jobs rather than I/O-bound jobs, leaving disks idle, and leading to unnecessarily long response times. The I/O-bound jobs would benefit significantly from an ability to interrupt compute-bound jobs. In addition, these I/O-bound jobs may not use their allocated time-slices particularly efficiently under gang scheduling. Because each individual process in a parallel I/O-bound job may only use the cpu for a small portion of its dedicated time-slice, the cpu may also sit idle for periods of time. Cpu utilization can be improved if other jobs can be found to run in these fragmented slots.

This raises two questions about compute-bound jobs. The first question considers whether and how compute-bound jobs can utilize fragmented resources. The second question concerns how well compute-bound jobs tolerate interruptions. We answer both of these questions by studying the degree to which compute-bound jobs benefit from gang scheduling. Applications with fine-grain synchronization require the abstraction of a dedicated machine; these applications benefit fully from gang scheduling and in fact require it to make progress. Load balanced applications with coarse-grain synchronization, on the other hand, can make progress with a balanced allocation of computing resources. Gang scheduling these applications provides minimal advantages over scheduling them evenly but non-simultaneously across the processors, a scheduling criterion known as interprocess fairness. Finally, some applications synchronize little and have the capability of balancing their loads under any scheduling condition. For these applications, gang scheduling provides no advantage at all over any other form of scheduling. We study several synthetic applications and SPLASH benchmarks and identify attributes which influence the degree to which applications benefit from gang scheduling.

The studies on both I/O-bound jobs and compute-bound jobs provide evidence that a scheduler needs to base its scheduling decisions on attributes of jobs in its workload. Two attributes are relevant for this purpose: whether a job is I/O-bound or compute-bound, and how much a job benefits from gang scheduling. The first challenge in building such a scheduler is how the scheduler can identify the two attributes. The final part of the paper examines this problem.

Identifying whether a job is I/O-bound or compute-bound is a simple problem. Traditional schedulers can collect information about I/O easily because I/O primitives are kernel calls which feed information back to the scheduler. From this information, standard methodology developed in the uniprocessor environment can be extended to identify whether a job is I/O-bound or compute-bound. The analysis might be slightly more complicated because jobs may contain multiple processes, but there is no fundamental obstacle to the problem.

Identifying the degree to which a job benefits from gang scheduling, on the other hand, is a poorly understood problem. To avoid large message-passing software latencies, it has become common to provide direct, user-level access to the low-level message-passing facility. An unfortunate by-product is the lack of direct scheduler knowledge as to how much compute-bound applications benefit from gang scheduling. Two useful methods for making such a determination are identifying synchronization frequency and measuring application progress. Synchronization frequency corresponds directly to how much an application benefits from gang scheduling, and measuring application progress allows the scheduler to detect when it is not meeting the resource requirements of an application.

Coarse-grain messaging statistics contain information about both synchronization frequency and application progress. Messaging statistics contain information about synchronization frequency because synchronization is coordinated via messages. However, this information is hard to isolate because one cannot distinguish messages involved in synchronization from those that are not. Instead, we focus on how such statistics reflect progress, and we show that measuring the length of the receive queues when a job is not gang scheduled can tell us how much the job is benefiting from gang scheduling.
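As a rough illustration of how a scheduler might act on this observation (the function, sample data, and threshold below are hypothetical sketches, not the mechanism evaluated later in the paper), receive-queue lengths sampled while a job runs without full coscheduling can be reduced to a single score:

    # Sketch: estimate how much a job is held back by missing coscheduling from
    # the lengths of its processes' receive queues, sampled only during slices
    # in which the job is not gang scheduled.  Names and threshold are
    # illustrative, not from the Fugu scheduler.

    def gang_benefit_score(queue_samples):
        """queue_samples: one list of per-process queue lengths per sample."""
        if not queue_samples:
            return 0.0
        per_sample = [sum(s) / len(s) for s in queue_samples]
        return sum(per_sample) / len(per_sample)

    # A job whose peers keep sending while its processes are descheduled shows
    # steadily growing queues; an independent job does not.
    sync_heavy  = [[3, 4, 2, 5], [6, 7, 5, 8], [9, 11, 8, 12]]
    independent = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    THRESHOLD = 2.0   # hypothetical cut-off
    for name, samples in (("sync-heavy", sync_heavy), ("independent", independent)):
        score = gang_benefit_score(samples)
        print(name, round(score, 2), "-> gang schedule" if score > THRESHOLD else "-> no need")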

Although gang scheduling improves the performance of many workloads, it conflicts with the goal of providing good response time for workloads containing I/O-bound applications. The results in this paper motivate the need to analyze the costs and benefits of gang scheduling each job, by showing that some jobs benefit only marginally from a dedicated machine abstraction, and by showing that gang scheduling jobs increases the response time of I/O-bound applications. In addition, we show that a scheduler can collect the necessary information for a cost-benefit analysis by monitoring the receive queues of each process.

The rest of the paper is organized as follows. Section 2 describes our experimental environment. Section 3 studies the impact of gang scheduling on I/O-bound applications. Section 4 studies the performance of compute-bound applications in near-gang scheduling environments. Section 5 explores the use of messaging statistics to aid scheduling decisions. Finally, Section 6 and Section 7 present related work and conclusions, respectively.

2 Experimental setup

This section describes the experimental environment used in Sections 4 and 5. It provides information about the Fugu multiprocessor, the scheduler, and the multiprocessor simulator used by the experiments.

Fugu is an experimental, distributed-memory multiprocessor supporting both cache-coherent shared memory and fine-grain message passing communication mechanisms [12]. The applications studied in this paper use only the message-passing mechanism. Messages in Fugu have extremely low overhead, costing roughly 10 cycles to send and roughly 100 cycles to process a null active message via an interrupt. The Fugu operating system, Glaze, supports virtual memory, preemptive multiprogramming and user-level threads. The message system is novel in that messages received when a process is not scheduled are buffered by the operating system at an extra cost.

The Fugu scheduler is a distributed application organized as a two-level hierarchy with a global component and local, per-processor components. The cost of the global communication and computation is amortized by pre-computing a round of several time-slices of work which is then distributed to the local schedulers. Results for this paper employ a four-processor configuration running small workloads, so the cost of the global work is small and the round size is kept minimal. The scheduler uses an Ousterhout-style matrix coscheduling algorithm to assign work to processors. Jobs have fixed processor needs and are assigned to processors statically, one process per processor, at the time the jobs begin. Each job is marked with a gang bit that indicates to the scheduler whether constituent processes may independently yield their time-slices when they have no work to do. Experiments are run on an instruction-level simulator of the Fugu multiprocessor. The simulator counts instructions, not strictly cycles, but since the scheduling issues we are interested in are orthogonal to any memory hierarchy issues, we believe instruction counts will give us the same qualitative results as cycle counts.
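A minimal sketch of the kind of Ousterhout-style matrix described above: rows of the matrix are time slices in a round, columns are processors, and every process of a gang-scheduled job is placed in the same row. The first-fit packing and the job list are illustrative assumptions, not the actual Fugu policy:

    # Ousterhout-style coscheduling matrix sketch for a P-processor machine.
    # Each job occupies `need` columns of a single row so that its processes
    # run simultaneously; packing is simple first-fit (illustrative only).

    P = 4

    def build_round(jobs):
        """jobs: list of (name, processes_needed); returns rows of length P."""
        matrix = []
        for name, need in jobs:
            for row in matrix:
                free = [i for i, slot in enumerate(row) if slot is None]
                if len(free) >= need:
                    for i in free[:need]:
                        row[i] = name
                    break
            else:
                new_row = [None] * P
                for i in range(need):
                    new_row[i] = name
                matrix.append(new_row)
        return matrix

    for t, row in enumerate(build_round([("A", 4), ("B", 2), ("C", 2)])):
        print("slice", t, row)   # slice 0: A on all processors; slice 1: B and C share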

3 Gang Scheduling and I/O Jobs

In this section we study the implications of gang scheduling in the presence of I/O-bound jobs. We find that the requirements of gang scheduling lead to a tradeoff between disk utilization and cpu utilization. Traditional uniprocessor schedulers, based on multilevel feedback queues, take advantage of priority information to effectively overlap I/O requests with processing. Because gang schedulers ignore information about job behavior they make suboptimal choices, leading to slowdowns for both I/O-bound and compute-bound jobs.

Section 3.1 discusses a variety of ways in which gang scheduling can lead to poor I/O and cpu utilization. Section 3.2 demonstrates the tradeoffs that gang scheduling must make between I/O and compute-bound jobs. Our results suggest that gang schedulers require considerable information to make good decisions. Along with the priority information collected by traditional uniprocessor schedulers, a gang scheduler can benefit from knowledge about the coscheduling requirements of compute-bound jobs.

3.1 Costs of Gang Scheduling

The costs of gang scheduling can be divided into two categories: under-utilization of disk resources, which we call priority inversion, and under-utilization of cpu resources, which we call cpu fragmentation. Disk resources can best be utilized if I/O-bound jobs are given priority to use the cpu whenever they are ready to run. This ensures that the job's next I/O request will come as soon as possible after the previous one finishes. Gang schedulers cause priority inversion problems whenever they permit a compute-bound job to use the cpu while I/O-bound jobs are ready to run.

There are two different causes of priority inversion. Either the scheduling quantum length for a compute-bound job can be set too long or the scheduling quantum length for an I/O-bound job can be set too short. The left-hand side of Figure 1 demonstrates the first of these problems. In this case process i of job A (an I/O-bound job) makes an I/O request. Shortly afterward job A's quantum expires and the scheduler switches to running job B. When the I/O request finishes the scheduler does not return to job A, however, because job B's quantum has not yet finished. The time remaining in the quantum is devoted to the compute-bound job, even though the disk is idle.

A second form of priority inversion occurs when the scheduler sets the quantum length for an I/O-bound job too short. This problem is shown in the middle part of Figure 1. Here, job A's quantum ends shortly before process j of job A is ready to make an I/O request. The disk sits idle for the entirety of quantum B before job A is permitted to resume. If quantum A had been slightly longer a disk access could have been overlapped with job B's computation.
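The first of these cases can be made concrete with a small timing calculation (the numbers are illustrative, chosen to match the 20 msec I/O latency used in Section 3.2; this is not output from the paper's simulator):

    # Disk idle time in the left-hand scenario of Figure 1: process i issues an
    # I/O request shortly before job A's quantum expires, the request completes
    # during job B's quantum, and job A cannot run again until B's quantum ends.
    quantum_A = 5.0        # msec, I/O-bound job's quantum
    quantum_B = 40.0       # msec, compute-bound job's quantum
    io_latency = 20.0      # msec, fixed disk latency
    request_time = 4.0     # msec into quantum A when the request is issued

    completion = request_time + io_latency   # measured from the start of quantum A
    resume     = quantum_A + quantum_B       # earliest point at which A runs again
    print("disk idle for", max(0.0, resume - completion), "msec")   # 21 msec here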


Figure 1: Gang scheduling ignores notions of priority. In processor i (on the left), an I/O request finishes before the end of quantum B. Job A is ready to run, but cannot proceed until quantum B finishes, leaving the disk idle. Processor j (middle) demonstrates a similar priority inversion effect. Quantum A ends before the process makes its I/O request, leaving the disk idle for all of quantum B. When job A makes a request before the end of its quantum (processor k, right), the processor remains idle until the start of quantum B.


Figure 2: Increasing the cpu-bound job's quantum length increases the priority inversion effect. An I/O-bound job and a compute-bound job are scheduled against each other on a 32-processor gang scheduled machine. Each process of the I/O job uses the CPU for an average of 5 msec between making 20 msec I/O requests. Three experiments are shown, setting the scheduler quantum for the I/O-bound job to 2.5, 5, and 10 msec. The scheduler quantum for the compute-bound job is varied on the X axis and slowdown for each I/O-bound job versus running on a dedicated machine is given on the Y axis.

In contrast, the right-hand side of Figure 1 demonstrates the cpu fragmentation problem that occurs when an I/O-bound job's quanta are too long. In this case process k of job A makes an I/O request considerably before the end of quantum A. Because job B requires gang scheduling, it will be unable to make progress because the rest of the processors are still running processes of job A. Processor k will remain idle until the beginning of quantum B.

Section 3.2 examines these issues quantitatively. We find that dealing with priority inversion requires that the quanta be allocated dynamically to suit the I/O requirements of the workload. A more flexible scheduling scheme can deal with the problems of priority inversion and resource fragmentation by allowing the characteristics of each job to drive the schedule.

3.2 I/O-CPU Utilization Tradeoffs

We demonstrate the tradeoffs between disk and cpu utilization in the context of a strict gang scheduler in which we can set the quantum length individually for each job. By varying the quantum length for different jobs, we observe the effects discussed in Section 3.1. We observe priority inversion, which causes poor disk utilization, when either the quantum length for an I/O-bound job is too short or when the quantum length for compute-bound jobs is made too long. Disk utilization is improved by increasing the cpu utilization of I/O-bound jobs.

Because the variable quantum policy we used requires considerably more flexibility than is traditionally available in gang schedulers, we built a simple event-driven simulator for this experiment. The experiment consists of gang scheduling a synthetic I/O-bound job against a synthetic compute-bound job. The I/O-bound job alternates between short bursts where it requires the cpu and I/O requests where it simply waits for a disk request to finish. The compute-bound job makes no I/O requests. Each job is allowed to use the cpu only within the quanta allocated to it. The scheduler switches back and forth between the two jobs in a round-robin fashion. We observe the slowdown for each job as we vary the quantum lengths for each job.

We model the workload with a compute-bound job that never makes any I/O request and an I/O-bound job which alternates between waiting for I/O and running on the cpu for an average of 5 msec with an Erlang-5 distribution. The latency of I/O requests is fixed at 20 msec. We vary the gang scheduler quantum allocated to each of the two jobs to observe the effects described in Section 3.1. In the first experiment, three different settings (2.5 msec, 5 msec, and 10 msec) are used for the quantum length of the I/O-bound job. The quantum length for the compute-bound job is varied from 1 msec to 40 msec. Figure 2 shows the results of the first experiment. The "waviness" in the plots for the I/O-bound job is an artifact of the harmonics between the period of the I/O-bound job (about 25 msec) and the scheduling quanta.
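The paper does not list the simulator's code; the sketch below is a much-simplified re-creation of the same experiment for a single process of each job on one processor, using the stated parameters (Erlang-5 compute bursts with a 5 msec mean, a fixed 20 msec I/O latency, and strictly alternating per-job quanta). It illustrates the qualitative priority-inversion trend rather than reproducing the exact curves of Figures 2 and 3:

    # Simplified event-driven model: one I/O-bound process alternates Erlang-5
    # compute bursts (mean 5 msec) with 20 msec I/O waits, and is scheduled
    # round-robin against a compute-bound job; each job has its own quantum.
    import random

    def erlang5(mean, rng):
        return sum(rng.expovariate(5.0 / mean) for _ in range(5))

    def io_job_slowdown(q_io, q_cpu, n_bursts=5000, io_latency=20.0, seed=1):
        rng = random.Random(seed)
        bursts = [erlang5(5.0, rng) for _ in range(n_bursts)]
        t, i, remaining, io_done = 0.0, 0, bursts[0], None
        while i < len(bursts):
            q_end = t + q_io                    # --- I/O-bound job's quantum ---
            while i < len(bursts) and t < q_end:
                if io_done is not None:         # still waiting on the disk
                    t = min(io_done, q_end)
                    if t >= io_done:
                        io_done = None
                    continue
                run = min(remaining, q_end - t)
                t += run
                remaining -= run
                if remaining <= 1e-12:          # burst finished: issue the I/O request
                    io_done = t + io_latency
                    i += 1
                    if i < len(bursts):
                        remaining = bursts[i]
            t = q_end + q_cpu                   # --- compute-bound job's quantum ---
            if io_done is not None and io_done <= t:
                io_done = None                  # completed during B's quantum: A had to wait
        dedicated = sum(bursts) + n_bursts * io_latency
        return t / dedicated

    for q_cpu in (5.0, 10.0, 20.0, 40.0):
        print("I/O quantum 5 msec, cpu quantum", q_cpu, "msec ->",
              round(io_job_slowdown(5.0, q_cpu), 2), "x slowdown")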

Setting the I/O-bound job's quantum length to 2.5 msec demonstrates the costs of priority inversion. At the end of the I/O-bound job's quantum, the probability that any particular process has made a disk request is quite small. The disk will remain idle during the entire period of the compute-bound job's quanta. The result is that, as the quanta for the compute-bound job are made larger, the I/O-bound job experiences severe slowdown. A 2x slowdown is reached when the compute-bound job is given a quantum length of 12 msec.

At a quantum length of 10 msec, the I/O-bound job does considerably better. Because most of the processes are able to make an I/O request each quantum, the slowdown is only minimal as long as the compute-bound job is given a quantum length of less than 20 msec. The slowdown begins to increase, however, when the quantum length of the compute-bound job increases beyond 20 msec, reaching a 2x slowdown when the compute-bound job's quantum is 39 msec. Again, the slowdown is due to a priority inversion effect. The I/O requests all finish during the compute-bound job's quantum, but the I/O-bound job must wait until its next quantum before it can again use the cpu.

Setting the I/O-bound job's quantum length to 5 msec produces a result somewhere in between the previous two results. The slowdown increases after the compute-bound job's quantum reaches 20 msec, reaching a 2x slowdown when the compute-bound job's quantum is 29 msec.

Figure 3 demonstrates that simply increasing the quantum length of the I/O-bound job does not always improve the disk utilization because the slowdown is not monotonic. In this experiment, the compute-bound job's quantum length is held steady at 10, 20, or 40 msec. The I/O-bound job's quantum length is varied from 1 msec to 40 msec. The bumps in the plot are caused by interactions between the period of the I/O-bound job and the quantum length.

The local increase in slowdown when the I/O-bound job's quantum length is set at 24 msec is particularly severe. This is entirely the result of the priority inversion effect. The I/O-bound job uses the cpu for an average of 5 msec and then makes an I/O request that takes 20 msec. At an average length of 25 msec to complete a cycle, there is a very high probability that when the quantum length is set to 24 msec the I/O request will finish just after the scheduler switches to running the compute-bound job. The I/O-bound job will have to sit idle through the entirety of a quantum before it can again use the cpu.

Although the slowdown of the I/O-bound job generally decreases as the quantum length of the I/O-bound job is increased, this policy will also slow down the compute-bound job. For example, if both the I/O-bound job and the compute-bound job are given a quantum length of 40 msec, the compute-bound job will get only 50% of the cpu cycles, resulting in a 2x slowdown, while the I/O-bound job will still slow down by a factor of 1.6.

Summary: These results show that in order to provide interactive response time, a multiprocessor scheduler needs to carefully monitor the requirements of each of its jobs and react accordingly. Currently, gang schedulers provide only round-robin scheduling with a fixed quantum length for all jobs. In this section, we have shown that it is necessary for the scheduler to tune the quantum size to the needs of each individual job.
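The compute-bound job's 2x figure in the example above follows directly from its share of the round; a one-line check (illustrative arithmetic only):

    # With equal 40 msec quanta, the compute-bound job holds the cpu for half of
    # every 80 msec round, so it runs at half speed.
    q_io, q_cpu = 40.0, 40.0
    share = q_cpu / (q_io + q_cpu)
    print("compute-bound slowdown:", 1.0 / share, "x")    # 2.0x, as stated above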


Even an adaptive quantum length is not sufficient to deal completely with the problems of priority inversion and cpu fragmentation. A more flexible scheduling policy is called for, where higher priority jobs can interrupt lower priority jobs in order to keep disk utilization high. In addition, the cpu fragmentation problem can be partially alleviated if compute-bound jobs can be scheduled into the fragmented slots.
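One way to picture such a policy (purely illustrative code; the paper does not specify this mechanism, and the job fields below are assumptions) is a per-processor decision hook that preempts a compute-bound process when an I/O-bound process becomes runnable and backfills an otherwise idle slot with a job whose gang bit says it may run alone:

    # Illustrative local-scheduler decision, not the Fugu implementation.
    # A job here is a dict with hypothetical fields: name, io_bound, gang_bit.

    def pick_next(current, runnable):
        """Prefer a runnable I/O-bound job over a running compute-bound one to
        keep the disk busy; if the slot is otherwise idle, backfill it with a
        job that does not require coscheduling (gang_bit is False)."""
        io_ready = [j for j in runnable if j["io_bound"]]
        if io_ready and current is not None and not current["io_bound"]:
            return io_ready[0]
        if current is None:
            fillers = [j for j in runnable if not j["gang_bit"]]
            if fillers:
                return fillers[0]
        return current

    compute = {"name": "B", "io_bound": False, "gang_bit": True}
    io_job  = {"name": "A", "io_bound": True,  "gang_bit": False}
    filler  = {"name": "C", "io_bound": False, "gang_bit": False}
    print(pick_next(compute, [io_job])["name"])   # A preempts B when its I/O completes
    print(pick_next(None, [filler])["name"])      # C backfills a fragmented slot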




Interrupting a low priority job or scheduling a low priority job in a fragmented slot will only be beneficial, however, if that job is amenable to skew. If a compute-bound parallel job synchronizes frequently, interrupting one of its processes may improve disk utilization only at the cost of a large drop in cpu utilization. The next section explores this issue in more depth.

Figure 3: Increasing the I/O-bound job's quantum length generally reduces the priority inversion effect. The workload parameters are identical to those in the last figure. Three experiments are shown, setting the scheduler quantum for the compute-bound job to 10, 20, and 40 msec. The scheduler quantum for the I/O-bound job is varied on the X axis and slowdown for each I/O-bound job versus running on a dedicated machine is given on the Y axis.

4 Application performance in near-gang scheduling environments

Section 3 raises two questions about compute-bound applications. The first question considers whether and how these applications can utilize fragmented resources. The second question concerns how well they tolerate interruptions. In this section, we answer both of these questions by studying the degree to which compute-bound applications benefit from gang scheduling. These issues are related in the following way: the more a job benefits from gang scheduling, the less efficiently it can utilize non-gang, fragmented resources, and the less it can tolerate interruptions. Our goal is to identify characteristics of an application which relate to its degree of benefit from gang scheduling.

Many studies have measured the benefits of gang scheduling relative to uncoordinated scheduling [1, 3, 8, 17]. Our study differs in that we are interested in the marginal benefit of a purely gang scheduled environment when compared to a gang scheduling environment with disruptions.

In this experiment, we measure the performance of applications under various near-gang scheduling environments on a four-processor machine. These environments are produced by introducing perturbations to a fully ganged environment. We set up four environments, each with a different set of perturbation characteristics.

In the first two environments, each perturbation takes away a time quantum of processing time from a single processor. In the first, the processor is fixed; we call this fixed-processor takeaway. In the second, the processor is selected in a round-robin fashion; we call this round-robin takeaway. See Figure 4. In the other two environments, each perturbation gives an extra time quantum of processing time to a single processor. In the third, called fixed-processor giveaway, the processor is fixed. In the fourth, called round-robin giveaway, the processor is selected by round-robin.

Figure 4: Experimental setup for (a) fixed-processor takeaway and (b) round-robin takeaway.

The exact times of the perturbations are randomly distributed across the runtime of the application in batches of four. Within a batch of four, the time of the first perturbation determines the times for the other perturbations. A fixed interval of three time quanta separates perturbations within a batch. Perturbations are batched in closely spaced groups of four so that round-robin perturbations retain the desired properties. Quantum size is fixed at 500,000 instructions across the runs.
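The schedule just described is easy to generate; the sketch below (an illustrative re-creation, not the original experiment scripts) produces (quantum index, processor) pairs in batches of four, spaced three quanta apart, with the perturbed processor either fixed or chosen round-robin:

    # Generate perturbation events as (quantum_index, processor) pairs.
    import random

    QUANTUM_INSTRUCTIONS = 500_000    # quantum size used in the experiments
    P = 4                             # processors

    def perturbation_schedule(n_batches, run_quanta, mode, fixed_proc=0, seed=0):
        rng = random.Random(seed)
        events, rr = [], 0
        for _ in range(n_batches):
            start = rng.randrange(run_quanta - 3 * 3)   # leave room for a batch of four
            for k in range(4):
                proc = fixed_proc if mode == "fixed" else rr
                if mode != "fixed":
                    rr = (rr + 1) % P
                events.append((start + 3 * k, proc))
        return sorted(events)

    print(perturbation_schedule(2, 200, "round-robin"))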

We run each application under the four scheduling environments, and we measure its change in runtime when compared to the runtime under gang scheduling. The results are presented in two sections below, one for the set of synthetic applications and one for the set of real applications. As expected, we find that for load balanced applications with fine-grain synchronization, the perturbations affect applications significantly. Real applications, however, exhibit internal algorithmic load imbalance and are often somewhat latency tolerant. Because of these factors, the effects of perturbations are between a factor of two and four smaller on a four-processor machine.

4.1 Synthetic applications

We use the following set of synthetic applications:

Emp Each process in this application does a fixed amount of work independently. It is load balanced.

Barrier This application consists entirely of synchronizing barriers.

Workpile In this set of applications, there is a fixed amount of global work which may be broken into independent units of work [18]. The units may then be distributed dynamically to achieve load balancing under a variety of scheduling conditions. We use four such applications, with the granularity of a work unit varying from 14% to 2400% of a quantum.

Msg In this set of applications, phases of request/reply communication are separated by barriers. The requests are asynchronous, but all replies must be received before a process proceeds to the next phase. We use three types of request patterns. In msg-self, each processor sends requests to itself. In msg-neighbor, requests are sent to a neighboring processor, and each processor handles requests from exactly one processor. In msg-all, requests on each processor are sent evenly to all processors.

Based on the experimental results, each application above can be classified as one of three types. Figure 5 presents the characteristic plots of the three types of applications. Each line on the graph plots the number of perturbations versus the change in runtime for an environment. We have plotted the lines for all four experiments on the same graph. The three types of applications are:

Synchronization intensive This type of application cannot make progress unless it is being gang scheduled. When time quanta are taken away from a processor, all other processors stall as well. The entire application slows by the amount of time taken away. When time quanta are given to a processor, the processor stalls also, so the application receives no benefit from the extra time at all. Barrier and all of the msg applications fall into this category.

Of course, we are not claiming that all applications with barriers will behave this way. Synchronization rate is the key parameter here. As we reduce the synchronization rate, applications will behave more and more like embarrassingly parallel ones (see below). We could indeed run an experiment where we vary this rate and observe the change in behavior. But this behavior has been studied before [5, 8], and here we are interested in illustrating the types of behavior at extreme ends of the spectrum.

Embarrassingly parallel This type of application exhibits the same poor behavior as synchronization intensive applications when time is given to or taken away from a single processor. However, the behavior is caused not by synchronization but by load imbalance. In the round-robin experiments, where load balance is maintained, applications of this type perform much better. When time quanta are taken away round-robin, runtime degrades by 1/P quantum (here, P = 4) per quantum taken away. The factor of 1/P arises because the single quantum of lost processing time is jointly recovered by the P processors. Similarly, when quanta are given away round-robin, runtime improves by 1/P. Emp and coarse-grain workpile (work unit = 24 time quanta) belong in this category.

Self load-balancing This type of application performs optimally under all scheduling conditions, because it suffers from neither synchronization nor load imbalance. Performance degrades by 1/P quantum per quantum taken away, and it improves by 1/P quantum per quantum given away. Three of the four workpile applications fall into this category. Their granularities of work unit range from 14% of a time quantum to 240%.

Each class of applications above may also be identified by its minimum scheduling requirements. Synchronization intensive applications require gang scheduling. Embarrassingly parallel applications require fair scheduling of the constituent processes. We call this scheduling criterion interprocess fairness. Self load-balancing applications can utilize any processor resource; they have no requirement at all.
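The reference lines in Figure 5 follow directly from this classification; a small worked check for round-robin perturbations on P = 4 processors (illustrative arithmetic):

    # Expected runtime change, in quanta, after n round-robin perturbations.
    P = 4

    def expected_delta(n, kind, giveaway=False):
        if kind == "synchronization intensive":
            return 0 if giveaway else n     # stalls without ganging; extra time unusable
        return (-n if giveaway else n) / P  # embarrassingly parallel / self load-balancing

    for kind in ("synchronization intensive", "embarrassingly parallel", "self load-balancing"):
        print(kind + ":", expected_delta(12, kind), "quanta slower after 12 takeaways,",
              expected_delta(12, kind, giveaway=True), "after 12 giveaways")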

[Figure 5 panels: synchronization intensive application (barrier); embarrassingly parallel application (emp); self load-balancing (workpile; unit = 14% of quantum). Each panel plots runtime delta versus gang scheduling (in time quanta) against the number of perturbations, with one line each for fixed takeaway, round-robin takeaway, fixed giveaway, and round-robin giveaway.]

Figure 5: Characteristic plots from the giveaway-takeaway experiment for three types of applications. The actual application from which each plot is taken is listed in parentheses below the plot. The four dotted lines are reference lines representing time taken away (worst case slowdown), 1/4 of time taken away (best case slowdown), zero (worst case speedup), and -1/4 of time given away (best case speedup).

4.2 Real applications

We run the takeaway/giveaway experiments on four real applications. Enum finds the total number of solutions to the triangle puzzle (a simple board game) by enumerating all possibilities breadth-first. Barnes, Water, and LU are scientific applications from the SPLASH benchmark suite implemented using CRL, an all-software shared-memory system [11]. See Table 1 for statistics describing the applications. Because the applications are non-homogeneous in time, we obtain each data point by taking the average result from 20 runs, each with a different set of time quanta given or taken away. Because the applications are also non-homogeneous across processors, we run the fixed-processor takeaway experiment on processors 0, 1, and 3.

App.     Quanta/Barrier   Msg Type       Msgs/Proc/Quantum
Enum     50               Non-blocking   254
Water    10               Blocking       12
LU       10               Blocking       3
Barnes   50               Blocking       28

Table 1: Characteristics of the real applications
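Table 1's message rates, combined with the roughly 250-instruction cost charged per buffered message (quoted in the Enum discussion below) and the 500,000-instruction quantum, give a quick estimate of the buffering overhead each application pays per quantum when it is not coscheduled (rough arithmetic, not a measurement from the paper):

    # Per-quantum buffering overhead when a job runs without coscheduling.
    QUANTUM = 500_000          # instructions per quantum
    BUFFER_COST = 250          # instructions per buffered message (approximate)
    msgs_per_proc_quantum = {"Enum": 254, "Water": 12, "LU": 3, "Barnes": 28}

    for app, m in msgs_per_proc_quantum.items():
        overhead = m * BUFFER_COST
        print(f"{app}: ~{overhead} instructions ({100.0 * overhead / QUANTUM:.1f}% of a quantum)")

For Enum this is roughly 13% of a quantum, which is consistent with the figure of up to three quanta of overhead for 20 quanta taken away quoted below.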

Figures 6-9 show the results of the experiments. To better understand the applications for the purpose of explaining the results, we obtain a trace for each application run under gang scheduling, and we plot the progress made on each processor versus time. These traces are presented next to the experimental results. Because the progress plot for Water follows such a regular pattern, we display only a magnified subsection of it.

Enum Of the four sets of results, Enum clearly stands out. Its experimental plot closely resembles that of an embarrassingly parallel application. In reality, Enum has three characteristics which make it embarrassingly parallel:

- It is load balanced, as suggested by its progress plot.
- It has infrequent barrier synchronization.
- It communicates with non-blocking messages.

The results for Enum are actually consistently worse than those of a perfect embarrassingly parallel application. There are three reasons. First, even in the absence of synchronization, failure to gang schedule incurs overhead in the form of buffering cost. At about 250 instructions per buffered message, this overhead can be up to three quanta when 20 quanta are taken away. Note that this cost is smaller for the quantum giveaway experiments because the buffer overhead is spread over P-1 processors.

Moreover, as load balanced as any real application can expect to be, Enum still has some load imbalances. The effect of imbalances on run time is evident when comparing the runtime of the takeaway experiment from processor 3 with the runtime of the takeaway experiments from processors 0 and 1. Processor 3 is the bottleneck processor for over 60% of the application's runtime (as evidenced by the lack of valleys in much of the progress graph). As a result, taking away time from it results in a slower run time than taking away time from processor 0 or 1.

Finally, for the round-robin experiments, the benefit from maintaining interprocess fairness is lost if a barrier interrupts a set of round-robin perturbations. Consequently, the slowdown is noticeably higher than the expected 25% of the time taken away.

Water, LU, and Barnes The results for Water, LU, and Barnes are similar. Because these applications exhibit significant load imbalances (observe the deep and long valleys in their progress plots), their results do not directly resemble those of any of the synthetic applications, which are all load balanced. In fact, load imbalance and blocking messages are two common features which explain most of the results for these applications.

In the fixed-processor quantum takeaway experiments, the amount by which the application slows down depends largely on the degree to which the processor taken away is a bottleneck. We can observe this degree in the progress plot, where a processor bottleneck is marked by full progress (absence of a valley) at a time when one or more processors are not making progress (valleys). For Water, processor 3 is the clear bottleneck, so it slows down by 100% of the time taken away. For LU, processors 0 and 3 alternate as bottlenecks. However, when processor 0 is the bottleneck, it is only slightly behind the other processors. So any other processor with time taken away readily overtakes processor 0's role as half the bottleneck. The plot reflects these bottleneck conditions: processor 3 slows down by close to 100% of time taken away, while processors 0 and 1 slow down by considerably less, with processor 0 slower than processor 1 by a small amount. Finally, in Barnes,

[Figures 6-9: for each application, runtime delta versus gang scheduling (in time quanta) against the number of perturbations, with lines for fixed takeaway (processors 0, 1, and 3), round-robin takeaway, fixed giveaway, and round-robin giveaway, shown next to per-processor progress traces (Proc 0 through Proc 3).]
