Scheduling User-Level Threads on Distributed Shared-Memory Multiprocessors

Eleftherios D. Polychronopoulos and Theodore S. Papatheodorou

High Performance Computing Information Systems Laboratory
Department of Computer Engineering and Informatics
University of Patras, Rio 26 500, Patras, GREECE
e-mail: {edp,tsp}[email protected]

Abstract. In this paper we present Dynamic Bisectioning Scheduling or DBS, a simple but powerful comprehensive scheduling policy for user-level threads, which unifies the exploitation of (multidimensional) loop and nested functional (or task) parallelism. Unlike other schemes that have been proposed and used thus far, DBS is not constrained to scheduling DAGs or singly nested parallel loops. Rather, our policy encompasses the most general type of parallel program model, allowing an arbitrary mix of nested loops and nested DAGs (directed acyclic task-graphs) or any combination of the above. DBS employs a simple but powerful two-level dynamic policy which is adaptive and sensitive to the type and amount of parallelism at hand. At one extreme DBS approximates static scheduling, hence facilitating locality of data, while at the other extreme it resorts to dynamic thread migration in order to balance uneven loads. Even the latter is done in a controlled way so as to minimize network latency. All this is achieved through a simple and uniform scheduling policy that keeps runtime overhead low, and hence can be used with user-level thread runtime systems that support fine-grain parallelism.

1 Introduction

Thread scheduling is a problem that has a significant impact on the performance of parallel computer systems, and has been an active area of research during the last few years. With the proliferation of NUMA parallel computers, efficient scheduling at user level has become more important than ever before. This is particularly true given the fact that these machines are used not only as powerful performance boosters, but also as multi-user systems that execute a multitude of processes of different types at any given time. Thus far, user-level scheduling policies have focused on either simple queueing models that handle scalar threads, or fine-tuned heuristic algorithms that try to optimize the execution of (mostly singly) nested parallel loops. In this paper we present a novel comprehensive scheduling policy that targets the most general scenario of parallel applications: processes that are composed of nested parallel loops, nested parallel regions with scalar threads, and arbitrary combinations of the former two. Unlike other similar schemes, the proposed Dynamic Bisectioning Scheduling (DBS) environment accommodates adaptivity of the underlying hardware as well as of applications: processors can be added or preempted in the middle of program execution without affecting load balancing. (This work was supported by the NANOS ESPRIT Project (E21907).)

Moreover, the proposed scheme promotes locality of data by exploiting parallelism in a depth-first fashion, and inherently degenerates to static scheduling for regular applications (such as matrix multiply) or to a fully dynamic scheme for irregular applications (such as database search). This work originated from an earlier collaboration with the OS kernel development group of SGI, and influenced the scheduling policies used at the kernel level as well as the scheduling algorithms employed for thread scheduling in the libraries used by the Origin 2000. DBS has also been implemented as the user-level scheduling policy of the Illinois-Intel Multithreading Library [Girk98]. In this paper, however, we discuss an independent implementation that does not take advantage of specific kernel support, and can thus be used on any distributed shared-memory architecture. DBS was implemented on the Nano-Threads Programming Model (NPM) as part of NANOS, a large-scale project involving the development of a runtime library for user-level threads, OS support, and a compiler that automatically generates threads using the library interface [Nano97a, Nano98b]. The nano-Threads environment aims at delivering optimized lightweight threads, suitable for both fine-grain parallel processing and event-driven applications [Niko98]. An important feature of nano-Threads is that it provides for coordinated and controlled communication between the user-level and the kernel-level schedulers. In this work we used NthLib, a nano-Threads library [Mart96] built on top of QuickThreads, for the implementation and performance evaluation of the proposed scheduling policy.

Most well-known scheduling policies address simple loop parallelism, and although there is evidence [More95] that functional parallelism can be present in substantial amounts in certain applications, little has been done to address the simultaneous exploitation of functional (or irregular) and loop parallelism, both in terms of compiler support and scheduling policies. Although classical DAG scheduling (list scheduling) heuristics can be used for simple thread models, they address neither loop parallelism nor nested functional parallelism (which, in general, invalidates the "acyclic" nature of DAGs). Unlike functional parallelism, loop-level parallelism and loop scheduling have been studied extensively. Some of the most well-known such policies are briefly outlined in Section 3 and include Static scheduling, as well as dynamic policies such as Affinity Scheduling and Guided Self-Scheduling. They are based on decomposing a loop into parallel tasks and executing them on a multiprocessor, a problem which has been studied extensively [Poly87, Mark92]. Static scheduling assigns iterations to processors statically, thereby minimizing synchronization overhead. Dynamic methods assign iterations to processors at run-time. The aim of dynamic scheduling policies is to evenly balance the load among the executing processors while keeping run-time synchronization overhead low.

DBS is implemented for the NPM on a Silicon Graphics Origin 2000 multiprocessor system. A benchmark suite of five application kernels, described in Section 4, was used to compare four representative user-level scheduling policies. The results indicate that DBS is generally superior, with performance gains ranging between 10% and 80%. These results also demonstrate the importance of functional parallelism, which is not exploited by the other scheduling policies.
As a consequence, enhanced versions of the other policies, integrating the exploitation of functional parallelism, were created and the same set of experiments was repeated. These results demonstrate the importance of exploiting functional parallelism, as well as the superiority of DBS compared to the other scheduling policies and the gain obtained from affinity exploitation.

The rest of this paper is organized as follows. In Section 2, the nano-Threads environment is outlined. DBS is described in Section 3. The experimental framework, an outline of the known policies used for the experiments, and the evaluation results are presented and discussed in Section 4. Finally, a brief conclusion is provided in Section 5.

2 The nano-Threads Environment

[Figure 1 is a block diagram of the NPM software stack: application codes (parallel code, hand-written nanothreads code, sequential code, OpenMP directives) are processed by the parallelizing NANOS compiler and its code generation, which target the user-level execution model/interface and the NANOS user-level threads library; underneath lie the kernel interface and the user-level resource manager, running either on a stock operating system (SGI, DEC) or as an in-kernel implementation on a modified operating system (Chorus, Linux).]

Fig. 1. The Nano-Threads Programming Model

NPM is outlined in Figure 1. The integrated compilation and execution environment consists of three main components [Nano98b].

The compiler support in NthLib (based on Parafrase2) performs fine-grained multilevel parallelization and exploits both functional and loop parallelism from standard applications written in high-level languages such as C and FORTRAN. Applications are decomposed into multiple levels of parallel tasks. The compiler uses the Hierarchical Task Graph (HTG) structure to represent the application internally. Management of multiple levels of parallel sections and loops allows the extraction of all useful parallelism contained in the application. At the same time, the execution environment combines multiprocessing and multiprogramming efficiently on general-purpose shared-memory multiprocessors, in order to attain high global resource utilization.

During the execution of an application, nano-Threads are the entities offered by the user-level execution environment to instantiate the nodes of the HTG, with each nano-Thread corresponding to a node. The user-level execution environment controls the creation of nano-Threads at run-time, ensuring that the generated parallelism matches the number of processors allocated by the operating system to the application. The overhead of the run-time library is low enough to make the management of parallelism affordable. In other words, the nano-Threads environment implements dynamic program adaptability to the available resources by adjusting the granularity of the generated parallelism. The user-level execution environment includes a ready queue which contains nano-Threads submitted for execution. Processors execute a scheduling loop, where they continuously pick up work from the ready queue until the program terminates.

The last component of NPM is the operating system, which distributes physical processors among running applications. The resulting environment is multi-user and multi-programmed, allowing each user to run parallel and sequential applications. The kernel-level scheduling policy is concerned with the allocation of physical processors to the applications currently running in the system. The operating system offers virtual processors to applications as the kernel abstraction of physical processors on which applications can execute in parallel. Virtual processors provide user-level contexts for the execution of nano-Threads. In the remainder of the paper the terms "virtual processor" and "kernel thread" are used interchangeably. The main scheduling objective of NPM is that both application scheduling (that is, the mapping of nano-Threads to virtual processors) and virtual processor scheduling (the mapping of virtual to physical processors) must be tightly coordinated in order to achieve high performance. In addition, the kernel-level scheduler [Poly98] must avoid any interference with the user-level scheduler, in order to keep runtime scheduling overhead low. Applications that cooperate with the system by adapting themselves to the execution conditions are known as cooperative applications.
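The scheduling loop executed by each virtual processor, as described above, can be pictured with the following minimal C sketch. The names nth_desc, ready_queue_get and program_finished are hypothetical placeholders and do not reflect NthLib's actual interface.

/* Minimal sketch of a virtual processor's scheduling loop: repeatedly pick a
 * ready nano-Thread descriptor from the ready queue and run it until the
 * program terminates.  Illustrative only; the names are not NthLib's. */
#include <stddef.h>

typedef struct nth_desc {
    void (*body)(void *);          /* work function instantiating an HTG node */
    void *args;                    /* its arguments */
} nth_desc;

extern nth_desc *ready_queue_get(void);    /* NULL when the queue is empty */
extern int       program_finished(void);   /* set when the last task completes */

static void scheduling_loop(void)
{
    while (!program_finished()) {
        nth_desc *nt = ready_queue_get();
        if (nt != NULL)
            nt->body(nt->args);    /* execute the nano-Thread to completion */
        /* otherwise back off briefly and retry */
    }
}

Section 3 refines this loop with DBS's central queue and per-processor local queues.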

3 The DBS Algorithm

Dynamic Bisectioning Scheduling is designed for multithreading programming environments exploiting loop and functional parallelism. As described earlier, the environment in which DBS is implemented is NthLib, a user-level multithreading environment built for the automatic parallelization and execution of parallel applications on Uniform and Non-Uniform Memory Access (UMA and NUMA) multiprocessors. The NUMA property of the target architectural platform led us to incorporate a central queue and per-processor local queues in order to facilitate scheduling policies that balance the workload.

The centralized queue is accessible by all processors. All tasks generated by the NthLib compiler are enqueued at the tail of the central queue. All processors start searching for work by first visiting the head of the central queue. Since the queue is centralized, the enqueueing and dequeuing operations are rather trivial:

– Enqueueing for the central queue: all generated tasks are enqueued at the tail of the central queue.
– Dequeuing for the central queue: a processor always dequeues the task descriptor at the head of the queue. If the descriptor denotes a regular (serial) task, the processor dequeues the task and executes it. If it denotes a loop (a compound task), then two possibilities exist:

1. All iterations of the loop have been issued to other processors. In this case the processor searches to dequeue another thread.
2. There are N iterations left in the queue that are not yet scheduled (N > 0). Then the processor dispatches N/P iterations and changes the index of the loop in the descriptor, leaving the remaining N − (N/P) iterations to other processors. Here P denotes the total number of processors allocated to the specific application at this moment (a code sketch of this rule is given just before Figure 2).

The per-processor local queues are located in shared memory, hence they can be accessed by all processors working on the specific application. However, a per-processor local queue is accessed faster by its "owner" processor, since it resides in a memory location which is local to the owner. For local queues the enqueueing and dequeuing policies work as follows:

– Enqueueing: Tasks that have been enabled by the processor, either at the central or the local queue, can be enqueued at the head of the local queue or at the head of another processor's local queue, but never in the central queue.
– Dequeuing: A processor first searches for a task to dequeue at the head of its local queue. There are two cases:

1. The processor finds a task (local queue not empty). Then two sub-cases are distinguished: (a) if this is a serial task, the processor dequeues the task and executes it, as in the central queue case; (b) if, however, this is a compound task, the processor dispatches and executes 1/P of the remaining iterations, leaving the remaining iterations in the local queue.
2. The local queue is empty. The processor then first searches the head of the central queue for tasks. If it succeeds, it follows the dequeuing process for the centralized queue. If the central queue is empty, the processor searches the local queues of other processors for work to steal, as described below. If all queues are empty, the processor is self-released. If it finds a non-empty local queue, it bisects the load in that queue, i.e. it steals half of the tasks present and enqueues the stolen tasks at the head of its local queue (a code sketch of this bisectioning steal is given below, after Figure 2).

There are several ways for a processor to search for work in other local queues, and some of these ways need further investigation. One basic goal is to exploit proximity, in the sense that it is better for a processor to steal work from the local queue of a neighboring processor that also works on the same application. This reduces memory page faults and cache misses. Another significant factor is reducing the search overhead, so it is important to use a process that is as simple as possible. One such policy, which we use in this work, is as follows. We define a bit vector V, the size of which is equal to the total number of physical processors, such that at any moment V(j) = 1 if and only if the local queue of processor j is nonempty. If processor j finds its local queue empty, it immediately sets V(j) = 0. If this processor is newly assigned to the application, or if it steals work from other local queues or dispatches work from the central queue, it sets V(j) = 1.
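The central-queue rule for a compound (loop) task above can be made concrete with a minimal C sketch. It assumes a descriptor that records only the bounds of the undispatched iteration space; loop_desc and dispatch_from_central are illustrative names, and locking of the shared descriptor is omitted.

/* Sketch of dequeuing a compound task from the central queue: dispatch N/P of
 * the remaining iterations and update the loop index in the descriptor,
 * leaving N - (N/P) iterations for other processors.  Illustrative only. */
typedef struct loop_desc {
    long next;                         /* first iteration not yet dispatched */
    long last;                         /* last iteration of the loop */
} loop_desc;

/* Returns the number of iterations dispatched; [*lo, *hi] is the range the
 * calling processor will execute.  P is the number of processors currently
 * allocated to the application. */
static long dispatch_from_central(loop_desc *d, int P, long *lo, long *hi)
{
    long N = d->last - d->next + 1;    /* iterations still in the queue */
    if (N <= 0)
        return 0;                      /* case 1: look for another thread */
    long chunk = N / P;                /* case 2: take N/P iterations */
    if (chunk == 0)
        chunk = 1;                     /* always make progress */
    *lo = d->next;
    *hi = d->next + chunk - 1;
    d->next += chunk;                  /* update the loop index in place */
    return chunk;
}

For example, with P = 4 and 100 undispatched iterations, the first processor to reach the descriptor takes 25 iterations and leaves 75 in the central queue.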

/* Start:   Execution starts with the first task(s) placed in the global queue
/*          at the beginning of program execution.
/* End:     Termination occurs when global and local queues are empty and all
/*          processors involved in the execution have attempted a dispatch of
/*          nill_task from the global queue.
/* Declare:
      P: int;                          /* No. of processors
      GQ: global queue;
      LQ(i): local queue of processor i, i=1,..,P;
      done, non_empty_queue: boolean;
/* nill_task is a special task descriptor delimiting the end of the program;
/* it is generated explicitly by the compiler and queued in GQ
/* as the last task descriptor.

Function Dispatch_to_local
begin
   if (head of GQ is nill_task) then
      done = true
   else if (head task is parallel loop) then
      dispatch and queue in LQ a number of iterations from this task as
      dictated by the chosen loop scheduling policy
      (static N/P iterations is the default);
   else                                /* head task is a sequential task
      dispatch entire task to this processor and call Execute_task;
                                       /* explicit queueing to LQ may be avoided
   endif
end Dispatch_to_local

Function Load_steal
begin
   Determine next nearest neighbour;   /* different policies are used
                                       /* for selecting the neighbour
   if LQ of this neighbour is not empty, then
      bisect its local queue and dispatch this load to this processor's LQ;
      set non_empty_queue = true;
   endif
end Load_steal

Function Dispatch_&_execute
begin
   Execute user code;                  /* LQ is accessed as a stack in this case
   Update descendant tasks that are generated during execution and queue each
   task that becomes ready to:
      (i)  LQ if the task is sequential (and set non_empty_queue = true), or
      (ii) GQ if it is a parallel loop;
end Dispatch_&_execute

/* Each processor P(i), i=1,...,P executes the following main program:
main()
   done = false;  non_empty_queue = true;
   while (not done) do
      if (GQ is empty) then Load_steal else Dispatch_to_local endif
      while (non_empty_queue) do
         if LQ(i) is empty then non_empty_queue = false
         else Dispatch_&_execute endif
      endwhile
   endwhile
end_main

Fig. 2. The DBS Algorithm

Processor j uses V when it needs to search other local queues, in order to detect which of them are nonempty from the fact that the corresponding coordinates of V have value equal to 1. The examination of these coordinates may take place in several ways. In the case implemented here, the processor examines the coordinates of V in a circular fashion. The DBS algorithm is presented in pseudo-code in Figure 2.
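The circular examination of V and the bisectioning steal can be sketched as follows. The array representation of V, queue_length and move_half are placeholders for the corresponding NthLib structures; the required locking of the victim's queue is omitted.

/* Sketch of work stealing in DBS: scan V circularly starting from this
 * processor's index and, on the first nonempty local queue found, steal half
 * of its tasks into this processor's local queue.  Illustrative only. */
extern volatile char V[];                     /* V[k] == 1 iff LQ(k) is nonempty */
extern long queue_length(int k);              /* tasks currently in LQ(k) */
extern void move_half(int victim, int thief); /* move half of LQ(victim) to LQ(thief) */

static int bisection_steal(int j, int P)      /* j: this processor, P: processors */
{
    for (int step = 1; step < P; step++) {
        int k = (j + step) % P;               /* circular examination of V */
        if (V[k] && queue_length(k) > 0) {
            move_half(k, j);                  /* bisect the victim's load */
            V[j] = 1;                         /* this local queue is now nonempty */
            return 1;                         /* success */
        }
    }
    return 0;                                 /* all local queues were empty */
}

If the scan fails and the central queue is also empty, the processor releases itself, as described above.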

4 Experimental Evaluation

4.1 Framework

In this section we discuss the implementation of DBS as a user-level scheduler in NthLib and several performance measurements obtained on a dedicated four-cluster SGI Origin 2000, comparing the proposed scheduling policy to the most widely used heuristic algorithms for loop parallelism. Execution time is used as the metric for comparison between the different scheduling policies. In our experiments we used an application suite consisting of five application codes that have been used previously by other researchers for similar comparisons [More95]. These are two "real life", complete applications, namely Computational Fluid Dynamics (CFD) and Molecular Dynamics (MDJ), a Complex Matrix Multiply (CMM), and two kernels that are found in several large applications, namely LU Decomposition (LU) and Adjoint Convolution (AC). We have incorporated the LU and AC kernels in two synthetic benchmarks in order to obtain additional functional parallelism. The two resulting benchmarks are Synthetic LU Decomposition (SLU) and Synthetic Adjoint Convolution (SAC).

In the first set of experiments, the effect of introducing the DBS policy is studied through comparison with three well-established scheduling policies: Static scheduling (STATIC), Affinity Scheduling (AFS) and Guided Self-Scheduling (GSS), outlined later in this section. It is worth noting that, as opposed to DBS, these techniques work with simple loop parallelism only, while DBS exploits multilevel loop parallelism and functional parallelism as well. Thus, in order to supply the other methods with equivalent functionality, we created enhanced versions of AFS, GSS and STATIC scheduling that can also exploit functional parallelism. For our experiments we implemented some of the most common scheduling policies employed by existing commercial systems. Specifically, we implemented traditional and extended versions of dynamic scheduling policies such as GSS and Affinity Scheduling, as well as traditional static partitioning. The scheduling mechanisms are outlined in the following paragraphs.

Static Scheduling (STATIC). Under static scheduling, loop partitioning is done at compile time and allocation is predetermined at run-time. Static scheduling incurs minimal run-time synchronization overhead, but it suffers in load balancing with irregular applications. In static scheduling, load imbalance can occur either when the iterations of a parallel loop do not all take the same amount of time to execute, or when processors start executing iterations at different points in time. If either case occurs, and due to the inherent barrier synchronization at the end of each loop, load imbalance may arise, causing some processors to be idle while others continue execution. With irregular applications, this policy results in low system utilization and long execution times.

Guided Self-Scheduling (GSS) [Poly87] is a dynamic policy that adapts the size of the iteration parcels of a loop dispatched to each processor at run-time. Guided self-scheduling allocates large chunks of iterations at the beginning of the loop in order to reduce synchronization overhead (and promote locality), while it allocates small chunks towards the end of the loop so as to balance the workload among the executing processors. At each step, GSS allocates to each processor 1/P of the remaining loop iterations, where P is the total number of processors. In GSS all processors finish within one iteration of each other and use a minimal number of synchronization operations, assuming that all iterations of the parallelized loop need the same amount of time to complete. One major drawback of GSS is the excessive attempt of processors to access the work-queue towards the end of the loop, competing for loop iterations at the head of the work-queue. Another drawback is that, in the case of unbalanced parallel loops such as triangular loops, the first few processors will end up taking most of the work of the loop, thus resulting in load imbalance. A combination of loop blocking and double-reverse application of GSS alleviates the latter problem.

Affinity Scheduling (AFS) [Mark92] is also a dynamic scheduling policy, based on the observation that for many parallel applications the time spent bringing data into local memory is a significant source of overhead, ranging from 30–60% of the total execution time. AFS attempts to combine the benefits of static scheduling while resorting to dynamic load balancing when processors complete their assigned tasks. This is achieved by partitioning and allocating chunks of N/P iterations to the work queue of each processor, where N is the number of iterations in the loop and P the number of available processors. Each processor dispatches and executes 1/P of the iterations enqueued in its local queue. This dynamic loop partitioning among the available processors aims to minimize the synchronization overhead due to excessive accesses of the work queue. Also, for purposes of load balancing, an idle processor examines the work queues of all processors participating in the execution of a loop and steals 1/P of the iterations from the most loaded queue. Overhead is introduced in keeping track of and updating processor loads, but most importantly from the potential repetitive locking of the same local queue by several processors simultaneously.

4.2 Results and Discussion

The above scheduling policies represent the state of the art in loop scheduling used by research prototypes and commercial systems. However, although recent research activities have focused on tackling the problem of irregular parallelism (task or functional parallelism), at present there exists no commercial implementation that supports a comprehensive parallel execution model: multidimensional parallelism (arbitrarily nested parallel loops), nested task parallelism (nested cobegin/coend parallel regions) or arbitrary combinations of the two. DBS, on the other hand, addresses such a comprehensive model, as does our implementation of the NthLib runtime library. This gives DBS an advantage from the start, due to its ability to exploit general task parallelism. However, previous studies have indicated that the amount of task parallelism at coarse-granularity levels is limited in the majority of numerical applications [More95]. Furthermore, due to the 2-level queue model employed by DBS, the latter may involve more run-time overhead than necessary in computations where parallelism is confined to one-dimensional parallel loops. In order to alleviate the limited ability of GSS, AFS and STATIC to exploit functional parallelism, we implemented enhanced versions of these policies which enable the exploitation of task-level parallelism.
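To make the comparisons that follow concrete, the chunk sizes produced by the three loop policies outlined in Section 4.1 can be summarized with the small sketch below. It merely illustrates the formulas given above; the ceiling rounding is an assumption, not taken from any particular implementation.

/* Chunk sizes implied by the policy descriptions above (illustrative only).
 * N = total loop iterations, R = iterations still undispatched, P = processors. */
static long ceil_div(long a, long p) { return (a + p - 1) / p; }

long static_chunk(long N, int P)                /* STATIC: one fixed chunk per processor */
{ return ceil_div(N, P); }

long gss_chunk(long R, int P)                   /* GSS: 1/P of whatever remains */
{ long c = ceil_div(R, P); return c > 0 ? c : 1; }

long afs_dispatch(long local_remaining, int P)  /* AFS: 1/P of this local queue */
{ long c = ceil_div(local_remaining, P); return c > 0 ? c : 1; }

For example, with N = 100 and P = 4, STATIC gives each processor a single chunk of 25 iterations, while GSS produces the decreasing sequence 25, 19, 14, 11, 8, 6, 5, ... as the loop drains.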
Our performance studies focused on the execution time as well as the aggregate cache misses, the latter as an approximate estimate of the locality behavior of each of the algorithms studied.

[Figure 3 contains five plots of execution time (seconds) versus number of processors (2 to 16) for AFS, DBS, GSS and STATIC on: Computational Fluid Dynamics (CFD) (128x128), Molecular Dynamics (MDJ) (1000 cells), Complex Matrix Multiplication (CMM) (256x256), Synthetic Adjoint Convolution (SAC) 4x(64x64) and Synthetic LU 4x(64x64).]

Fig. 3. Execution times with Loop & Functional Parallelism for DBS and Loop only for AFS, GSS, and STATIC.

Figure 3 shows the raw execution time for each of the five benchmarks and each of the four scheduling policies, for execution on 2, 4, 8, and 16 processor configurations of the Origin 2000. In this case, DBS has the advantage of exploiting task-level parallelism, unlike the other three (predominantly loop) policies; both the CFD and MDJ codes expose small amounts of functional parallelism. On the other hand, the LU kernel has no functional parallelism, as is the case with CMM. In all five cases DBS outperformed AFS, GSS, and even static scheduling. In the case of CFD, DBS shows a significant improvement in execution time, approaching a speedup of 2 compared to static scheduling and a 40% improvement over AFS, while for the MDJ code DBS outperformed AFS and static scheduling by a factor of approximately 3, and GSS by 58%. The case of CMM is a border case, with four-way functional parallelism and regular loop parallelism, which is ideal for an even-handed comparison of the scheduling algorithms. As expected, all four policies performed within a small range of each other for 16 processors. However, for a small number of processors DBS, AFS, and GSS outperform static scheduling; although somewhat counter-intuitive, this is due to the fact that both DBS and AFS take full advantage of static scheduling by equi-partitioning parallel loops among the available processors in their first phase of scheduling. Unlike static scheduling, however, the dynamic schemes have the ability to balance the load when imbalances arise due to false sharing, asynchronous processor interrupts, context switches, etc. As the plots indicate, the overhead involved in load balancing is low and is more than justified by the benefits of balanced loads.

An important observation from Figure 3 is that DBS scales well with machine size for all five benchmarks, although the number of processors used was at most 16. This is not the case with any of the other three policies. The improved scalability offered by DBS is due, in part, to the additional (functional) parallelism it exploits as well as to its aggressive on-demand load balancing.

[Figure 4 repeats the five plots of Figure 3 (execution time in seconds versus number of processors for AFS, DBS, GSS and STATIC on CFD, MDJ, CMM, SAC and Synthetic LU), now with all four policies exploiting functional parallelism.]

Fig. 4. Execution times for DBS, AFS, GSS and STATIC with Loop and Functional Parallelism

Figure 4 illustrates the results for the same set of experiments with the following important difference: in order to enable the exploitation of functional parallelism by the AFS, GSS and static policies, we encoded functional parallelism (parallel regions) in the form of parallel loops, with each loop iteration corresponding to one of the scalar tasks. All three policies were thus successful at distributing different tasks of a parallel region to different processors. This resulted in a noticeable improvement of AFS for the CFD and LU benchmarks, and to a lesser degree for the SAC code. However, overall the behavior on all five benchmarks in terms of both scalability and absolute performance remained the same. The uniform, stable behavior of DBS is also apparent in Figure 4.

In order to measure a first-order effect of data locality, we profiled the per-processor cache misses for all four algorithms and for each of the five benchmarks. Figure 5 shows the cache miss measurements for all methods. In the case of CFD and MDJ, i.e. the applications for which loop parallelism is not dominant, DBS presents a very sizeable reduction in the number of cache misses: up to about 45% in the CFD case, compared with the best of the remaining policies (AFS), and up to about 85% in the MDJ case, compared to the best of the remaining policies (GSS). In the other cases DBS incurs approximately the smallest number of cache misses along with GSS and STATIC (synthetic LU, competing with STATIC), ranks second to third (SAC, competing mainly with GSS), and third (CMM, in close competition with STATIC and GSS).

[Figure 5 contains five bar charts of aggregate cache misses (x10^4) on 16 processors for DBS, GSS, AFS and STATIC, with "Loop only" and "Loop & Functional" bars, for Computational Fluid Dynamics, Molecular Dynamics, Complex Matrix Multiply, Synthetic Adjoint Convolution and Synthetic LU decomposition.]

Fig. 5. Aggregate Cache Misses, DBS with Loop & Functional Parallelism, AFS, GSS and STATIC with Loop only and Loop & Functional Parallelism for 16 processors

At first glance, these measurements may appear counter-intuitive, since one would expect the smallest number of cache misses (best locality of data) for the STATIC approach and the worst cache behavior from the fully dynamic schemes. The measured results are due to the synergy of two factors. First, static approaches (or dynamic approaches such as AFS that attempt to exploit the benefits of static scheduling from the start) are more susceptible to false sharing; the large cache block of the Origin 2000 exacerbates the cache misses caused by false sharing. Secondly, unlike the other schemes, DBS favors locality of data by giving execution priority to locally-spawned tasks, which are executed entirely by the processors that fire them; such tasks (see Figure 2) are never queued in the global queue. Our measurements make a strong case for functional parallelism which, although not abundant in the majority of applications, is found in substantial amounts in real-world applications such as the MDJ and CFD codes. The results indicate that DBS not only outperforms common loop scheduling schemes, but also scales much better than any of these heuristics. As parallel machines proliferate toward the low end of the spectrum, task-level parallelism exploitation (common in multimedia applications) will become increasingly important.
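To illustrate the first factor, consider a statically partitioned update of a shared array. This is a sketch only; the 128-byte cache block size is an assumption about the Origin 2000, and the code is not taken from the benchmarks.

/* With a 128-byte cache block and 4-byte floats, one block holds 32
 * consecutive elements.  Under static N/P partitioning, the blocks that
 * straddle the boundary between two processors' ranges are written by both
 * processors and ping-pong between their caches (false sharing), even though
 * each processor touches different words. */
void static_update(float *a, long N, int p, int P)  /* p: this processor's id */
{
    long chunk = (N + P - 1) / P;                   /* static N/P partitioning */
    long lo = (long)p * chunk;
    long hi = (lo + chunk < N) ? lo + chunk : N;
    for (long i = lo; i < hi; i++)
        a[i] += 1.0f;                               /* boundary blocks are shared */
                                                    /* with processors p-1, p+1   */
}

DBS reduces this kind of traffic by keeping locally spawned tasks on the processor that created them, as noted above.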

5 Conclusions and Future Work

Dynamic Bisectioning Scheduling was introduced as a novel, comprehensive policy for user-level thread scheduling. The proposed 2-level policy has embedded properties that promote locality of data and on-demand load balancing. Experiments with real-world applications and application kernels indicate that DBS achieves significant performance improvements compared to other common scheduling policies used as default schedulers in today's high-end parallel computers. Moreover, our experiments indicate that DBS is a cache-friendly dynamic scheduler, less susceptible to false sharing than static or other dynamic approaches. This is the case not only for well-behaved applications with regular (loop) parallelism, but also for irregular codes with substantial amounts of functional parallelism. Our future work is oriented towards further investigation of the exploitation of functional parallelism by conducting more experiments on a well-defined and comprehensive application suite.

6 Acknowledgements

We would like to thank Constantine Polychronopoulos for his valuable support of this work, Dimitrios Nikolopoulos for his help in conducting the experiments, and the people at the European Center for Parallelism in Barcelona (CEPBA) for providing us access to their SGI Origin 2000 system.

References

[Girk98] M. Girkar, M. Haghighat, P. Grey, H. Saito, N. Stavrakos and C. Polychronopoulos, "Illinois-Intel Multithreading Library: Multithreading Support for Intel Architecture Based Multiprocessor Systems", Intel Technology Journal, 1st Quarter 1998.
[Mark92] E. P. Markatos and T. J. LeBlanc, "Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors", In Supercomputing '92, pages 104-113, November 1992.
[Mart96] X. Martorell, J. Labarta, N. Navarro and E. Ayguade, "A Library Implementation of the Nano-Threads Programming Model", In 2nd International Euro-Par Conference, pp. 644-649, Lyon, France, August 1996.
[More95] J. E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", PhD thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1995.
[Nano97a] ESPRIT NANOS Project, "Nano-Threads: Programming Model Specification", Project Deliverable M1D1, http://www.ac.upc.es/NANOS/NANOS-delm1d1.ps, July 1997.
[Nano98b] ESPRIT NANOS Project, "n-RTL Implementation", Project Deliverable M2D2, http://www.ac.upc.es/NANOS/NANOS-delm2d2.ps, April 1998.
[Niko98] D. S. Nikolopoulos, E. D. Polychronopoulos and T. S. Papatheodorou, "Efficient Runtime Thread Management for the Nano-Threads Programming Model", Proc. of the Second IEEE IPPS/SPDP Workshop on Runtime Systems for Parallel Programming, LNCS Vol. 1388, pp. 183-194, Orlando, FL, April 1998.
[Poly87] C. D. Polychronopoulos and D. J. Kuck, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers", IEEE Transactions on Computers, December 1987, pp. 1425-1439.
[Poly98] E. Polychronopoulos, X. Martorell, D. Nikolopoulos, J. Labarta, T. Papatheodorou and N. Navarro, "Kernel-Level Scheduling for the Nano-Threads Programming Model", 12th ACM International Conference on Supercomputing, Melbourne, Australia, 1998.
