Priority Queues Are Not Good Concurrent Priority Schedulers

Andrew Lenharth, Donald Nguyen, Keshav Pingali
University of Texas at Austin
[email protected] [email protected] [email protected]

Abstract

The problem of priority scheduling arises in many algorithms. There is a dynamic pool of tasks that can be executed in any order, but some execution orders are more efficient than others. To exploit such schedules, each task is associated with an application-specific priority, and a system must schedule tasks to minimize priority inversions and scheduling overheads. Parallel implementations of priority scheduling tend to use concurrent priority queues. We argue that this obvious solution is not a good one. Instead, we propose an alternative implementation that exploits a fact not exploited by priority queues: algorithms amenable to priority scheduling can tolerate some amount of priority inversion. To evaluate this scheduler, we used it to implement six complex irregular benchmarks. Our results show that our end-to-end performance is 8–40 times better than the performance obtained by using state-of-the-art concurrent priority queues in the Intel TBB library.

Categories and Subject Descriptors: D.1.3 [Concurrent Programming]: Parallel programming
General Terms: Priority scheduling, Concurrent priority queue
Keywords: Priority scheduling, Concurrent priority queue

1. Introduction

The problem of priority scheduling is ubiquitous in computer systems, and it can be formulated abstractly as follows. There is a work-set W whose elements represent tasks that must be executed by some number of processors. The time to execute a task may be unpredictable and may be different for different tasks. When a task is executed, it may add new tasks to W. Tasks in W can be processed in any order; however, some orders may be more efficient than others—for example, the order may affect the time taken to process a given task, and it may even affect the total number of tasks created and executed. Therefore, each task has an associated priority that is a measure of its relative importance for early scheduling. The problem of priority scheduling is to assign tasks to processors in a way that optimizes overall performance metrics such as the total time required to execute all the tasks. In this paper, we focus on a particular instance of this problem that arises in the context of implementing sequential and parallel algorithms. In this context, each task represents an iteration of a loop;


if there are no ordering constraints between loop iterations, such as in a DO-ALL loop, the iterations can obviously be executed in any order. A more interesting example is the Galois unordered set iterator, which iterates over a work-set in an unspecified order. Iterations may read and write the same memory locations, so they may interfere with each other, but a parallel implementation of the set iterator can execute iterations concurrently provided each iteration is executed with transactional semantics [16]. Executing an iteration may add new elements to the work-set, so the work-set must be implemented by a dynamic data structure that permits efficient concurrent insertions and removals. Examples of algorithms amenable to this style of parallelization include algorithms for mesh generation, refinement and partitioning, maxflow, single-source shortest path, minimal spanning tree, points-to analysis, ray tracing and betweenness centrality [2, 5, 9, 11, 14, 16, 17]. Although all execution orders of tasks are legal in such algorithms, it is usually the case that some execution orders are much more efficient than others even on sequential computers. There are several reasons for this. One important reason is that the asymptotic complexity of some algorithms depends on the schedule. An example is the singlesource shortest-path (sssp) problem in graphs with non-negative weight edges, described in more detail in Section 2. The key operation in this algorithm is called an edge relaxation; depending on the schedule of edge relaxations, the total number of relaxations can range from O(|E| log |V |) (Dijkstra’s algorithm) to O(|V ||E|) (Bellman-Ford), where |E| and |V | are the number of edges and nodes in the graph. Therefore, controlling the number of relaxations by appropriate scheduling is very important for efficient processing of large graphs. Another reason why task scheduling matters is that some schedules may exploit locality better than others. Dense linear algebra computations are the classic example: in computations like matrix multiplication and Cholesky factorization, loop iterations can be performed in any order but some orders are much better at exploiting spatial and temporal locality than other orders. In this domain, dependences are known statically so scheduling of loop iterations can be performed at compile time by transformations like loop permutation, skewing, reversal, fusion and fission using polyhedral techniques [10]. Careful scheduling can improve locality in irregular algorithms as well. For example, in Delaunay mesh refinement, the refinement of one badly shaped triangle can create new badly shaped triangles. If the work-set is implemented as a stack, new work is processed right away, and the resulting temporal locality from LIFO scheduling may double performance compared to a schedule that uses random selection [11]. Work and locality concerns are important in parallel implementations as well, but there are additional issues. Exploiting task affinity is important for processor locality: if there is a substantial overlap in the working sets of two tasks, it may be desirable to schedule both of them on the same core or chip provided this does not cause load imbalance. In parallel implementations based on spec-


ulative execution, tasks that share data should not be scheduled at the same time since this increases the likelihood of speculative conflicts, which lead to wasted work because of aborted computations.

The most common approach for implementing such algorithms is priority scheduling: each task is associated with an integer called its priority that is assigned in an application-specific way so that priority ordering of tasks corresponds to one of the good orders for executing the tasks. For example, for sssp relaxations, the priority of a node can be defined to be the length of the shortest known path to that node; processing nodes in increasing priority order minimizes the total number of relaxations (Dijkstra's algorithm) [5]. Similarly, for the preflow-push algorithm, each node is associated with an integer called its height (which changes dynamically as the algorithm is executed), and it is desirable to process nodes in decreasing height order [4]. Sequential implementation of priority scheduling is straightforward, and data structures like heaps have been developed for this purpose. Most concurrent implementations use concurrent priority queues that use either locks or lock-free approaches to synchronize threads. Concurrent priority queues are implemented in Intel's TBB library and in the Java concurrent collections library.

In this paper, we argue that concurrent priority queues do not necessarily make good concurrent priority schedulers. Since priorities are usually chosen heuristically, algorithms amenable to priority scheduling are usually robust to small deviations from a strict priority schedule. We show that by exploiting this seemingly innocuous fact, we can design concurrent priority schedulers that often outperform concurrent priority queues by large factors. The schedulers we describe are optimized along several dimensions.

• Scheduling Overhead. In the applications of interest to us, the scheduler may have to deal with millions of tasks, but each task is relatively short-lived since it may execute only a few thousand instructions. Therefore, it is imperative that scheduling itself be a low-cost operation. This problem is significantly different from the OS scheduling problem in which processes are relatively coarse-grain and may execute for milliseconds between context-switches.

• Priority Inversion. If a substantial number of low priority tasks are scheduled for execution even though higher priority tasks are available, an algorithm may do more work. However, we show that some priority inversion may actually be acceptable provided it helps reduce communication, synchronization and coordination between threads.

• Synchronization and Communication Costs. The memory systems of current multicore architectures are hierarchical in organization. Communication between cores that do not share a cache is expensive. When a scheduler must communicate, it should do so in a way that exploits the hierarchical memory organization of the processor to minimize coherence traffic and expensive communication. For example, frequently queried data should be mostly read-only, so that it may be shared in multiple caches and allow the queries to be answered locally.

In this paper, we exploit these ideas to design a novel scalable priority scheduler and evaluate its performance on a number of benchmarks that are known to benefit from priority scheduling. We show that it significantly outperforms more traditional priority schedulers such as those implemented with Intel's TBB library.

The rest of this paper is organized as follows. In Section 2, we describe a number of algorithms that benefit from priority scheduling and highlight some properties that can be exploited for efficient scheduling. In Section 3 we discuss the major issues that arise during concurrent priority scheduling of these algorithms. In Section 4, we describe the new priority scheduler that attempts to address these issues in a scalable way. In Section 5, we compare


the performance of this scheduler with the performance obtained by using concurrent priority queues in Intel’s TBB library. As we show, the performance we obtain is 8–40 times better. We conclude the paper in Section 6 with a discussion of the main lessons learnt and of future directions of research.

2. Algorithms

2.1 Single-source Shortest Path and Breadth-first Search

Given a weighted, directed graph G = (V, E), a weight function w : E → R mapping edges to real-valued weights, and a source node s ∈ V, the single-source shortest-path (sssp) problem is to find the weight of the shortest path from s to all other nodes in the graph. We assume that weights are non-negative. One algorithm that solves this problem is Dijkstra's algorithm [5], which assigns each node a distance label that stores the length of the shortest path found so far to that node. The algorithm proceeds by visiting nodes in ascending distance order, with nodes updating distance labels on their neighbors through a process called edge relaxation. This algorithm has complexity O(|E| log |V|). Another algorithm is the Bellman-Ford algorithm, which visits each node |V| times, performing edge relaxations. It has complexity O(|V||E|). An asynchronous algorithm that solves the same problem visits nodes in any order and iterates until no nodes change their distance labels. This algorithm will terminate, but the number of iterations depends on the particular structure of the graph. The asynchronous algorithm can use a priority function such as the distance on nodes. Another useful priority is the distance scaled by some fixed value. An explicitly parallel algorithm is the delta-stepping algorithm of Meyer and Sanders [13]. When the graph is unweighted, this problem becomes the same as computing the breadth-first search (bfs) number for a node. One may apply similar algorithms as in the weighted case, but there are specializations as well. For instance, instead of using a priority queue to find the next node to visit as in Dijkstra's algorithm, a FIFO queue can be used.
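To make the asynchronous formulation concrete, the following is a minimal C++ sketch (not the paper's implementation) of one task of the asynchronous sssp algorithm together with a scaled-distance priority. The graph layout, the DELTA scaling constant, and the push callback are illustrative assumptions.

```cpp
// Sketch of one asynchronous sssp task: relax the edges of a node and
// schedule any neighbor whose label improved, at a scaled-distance priority.
#include <atomic>
#include <vector>

struct Edge { int dst; int weight; };
struct Graph { std::vector<std::vector<Edge>> adj; };  // hypothetical layout

constexpr int DELTA = 8;                     // assumed scaling factor
inline int priorityOf(int dist) { return dist / DELTA; }

template <typename PushFn>                   // PushFn: void(int node, int prio)
void relaxNode(const Graph& g, std::vector<std::atomic<int>>& dist,
               int n, PushFn push) {
  int dn = dist[n].load(std::memory_order_relaxed);
  for (const Edge& e : g.adj[n]) {
    int nd = dn + e.weight;
    int old = dist[e.dst].load(std::memory_order_relaxed);
    while (nd < old) {                       // try to lower the neighbor's label
      if (dist[e.dst].compare_exchange_weak(old, nd)) {
        push(e.dst, priorityOf(nd));         // new task at its new priority
        break;
      }                                      // CAS failure refreshed 'old'; retry
    }
  }
}
```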

Boruvka’s Algorithm

Given a weighted, undirected graph G = (V, E) and a weight function w : E → R mapping edges to real-valued weights, a spanning tree is a subset of edges T ⊆ E that forms a tree and connects all nodes in V. The minimal spanning tree (mst) problem is to find the spanning tree with the least weight. There are many algorithms for finding a minimal spanning tree; one algorithm that is often used in parallel settings is Boruvka's algorithm [5]. The algorithm chooses a node v and finds its lightest weight neighbor u. It then creates a new node vu which represents the contraction of v and u into one node. The node vu will have the union of the edges of v and u, taking the lightest weight edge in the case of duplicates. Nodes u and v are removed from the graph, and the node vu is added. The algorithm continues until there are no more nodes in the graph. The set of contracted edges is a minimal spanning tree of G. One possible priority function for this algorithm is to process nodes in ascending number of neighbors. From an implementation standpoint, nodes with many neighbors have a higher overhead to process and contract because they have many edges that need to be examined and copied.
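As a small illustration of that priority, a scheduler key for a contraction task can simply be the node's current degree. This is our own sketch under the assumption of a generic graph interface with a degree() method, not the paper's code.

```cpp
// Sketch: degree-based priority for Boruvka contraction tasks.
// Fewer incident edges => smaller key => contracted earlier (cheaper work first).
struct ContractTask { int node; };

template <typename Graph>
int contractionPriority(const Graph& g, const ContractTask& t) {
  return static_cast<int>(g.degree(t.node));
}
```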

2.3 Betweenness Centrality

The betweenness centrality (bc) of a node is a metric that captures the importance of each individual node in the overall network structure. Betweenness centrality is a shortest path enumeration-based metric. Informally, it is defined as follows. Let G = (V, E) be a graph and let s, t be a fixed pair of graph nodes. The betweenness


score of a node u is the percentage of shortest paths between s and t that include u. The betweenness centrality of u is the sum of its betweenness scores for all possible pairs of s and t in the graph. One algorithm for computing betweenness centrality is Brandes's algorithm [2], which exploits the sparse nature of typical real-world graphs, and computes the betweenness centrality score for all vertices in the graph in O(|V||E| + |V|^2 log |V|) time for weighted graphs, and O(|V||E|) time for unweighted graphs. This algorithm has parallelism at multiple levels. First, we can perform the shortest path exploration from each source node in parallel. Additionally, each of the shortest path computations itself can be done in parallel. An algorithm that only performs simultaneous explorations from different source nodes is an outer loop parallelization. An algorithm that only parallelizes each shortest path computation itself is an inner loop parallelization. The shortest path computation can use similar priorities to those used by sssp and bfs algorithms with additional modifications to account for the additional tasks that must be done to compute betweenness centrality scores.

2.4 Maximum Cardinality Bipartite Matching

Given a graph, a matching M is a subset of edges where no two edges share an endpoint. The cardinality |M| of M is the number of edges in M. Given an unweighted bipartite graph G = (V, E) with V = (A, B), the maximum cardinality bipartite matching (matching) problem is to find a matching with maximum cardinality. Algorithms to solve this problem try to extend existing matchings by finding augmenting paths. When there are no more augmenting paths, the matching is a maximum matching. The Ford-Fulkerson algorithm [12] visits each node in A trying to find an augmenting path at each. It has complexity O(|V||E|). The Alt-Blum-Mehlhorn-Paul (ABMP) algorithm [12] tries to find the shortest augmenting path first. After all the augmenting paths of less than a particular length are found or the number of unmatched nodes is reduced to a particular size, the algorithm applies the Ford-Fulkerson algorithm to complete the matching. The ABMP algorithm has complexity O(|E|√|V|).

2.5 Belief Propagation

Loopy belief propagation [15] (bp) is an algorithm for performing inference on arbitrary graphical models. Belief propagation is a message passing algorithm, which is a class of graph algorithms that iteratively passes messages between neighbors until the messages reach convergence. The particular order messages are applied has a significant impact on the rate of convergence. Synchronous implementations apply updates simultaneously across the entire graph. Asynchronous implementations apply updates immediately at each node. Among asynchronous implementations, there are many possible schedules such as random, fixed or round-robin order. Recently, Elidan et al. have developed the residual belief propagation (RBP) algorithm [6], which processes updates based on the magnitude of updates. The authors observe that RBP converges significantly more often than other methods and significantly reduces running time even when other methods do converge.
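The residual heuristic can be turned into an integer scheduler key in the way sketched below. The message representation and the quantization factor are our own illustrative choices, not those of [6].

```cpp
// Sketch: priority of a pending update = (negated, quantized) residual, i.e.
// how much the newly computed message differs from the currently stored one.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Message = std::vector<double>;

int residualPriority(const Message& oldMsg, const Message& newMsg) {
  double r = 0.0;
  for (std::size_t i = 0; i < oldMsg.size(); ++i)
    r = std::max(r, std::abs(newMsg[i] - oldMsg[i]));
  // Larger residual = more urgent; negate so that "least key = highest
  // priority", the convention used throughout this paper.
  return -static_cast<int>(r * 1000.0);
}
```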

3. Priority Scheduling

The algorithms described in Section 2 benefit from processing tasks in a particular priority order. That benefit may come from a change in algorithmic complexity, as in the case of the relaxation-based algorithm for sssp and the ABMP algorithm for bipartite matching, or it may come from an effective heuristic, as in the case of residual belief propagation. Faced with designing a parallel implementation, a programmer may adopt the strategy used by out-of-order processors: multiple tasks are issued simultaneously, but they must


commit in sequential priority order. Pingali et al. call algorithms that require strict priority-order commits ordered algorithms [16]. In contrast, unordered algorithms may commit in any order. Unordered algorithms are also called asynchronous algorithms [1]. Unordered algorithms are an enticing target for parallelization because of their greater flexibility in scheduling, and it turns out that many ordered algorithms have unordered counterparts [8].

The approach we advocate is to start from an unordered algorithm and then use a scheduler to guide but not force execution towards the priority order. Our results in Section 5 show that this can be a scalable approach. Going back to the processor analogy, multiple tasks are issued in priority order, but they can commit in any order. Since the algorithm is unordered, it is robust to out-of-order commits. The key to efficient implementation of this approach is understanding the trade-offs between strictly following priority schedules and following looser ones. Fundamentally, the trade-off is between reducing the total amount of work, which is an algorithm-level issue, and reducing the overhead of the scheduler including synchronization costs, which is an implementation-level issue. In the remainder of this section, we introduce four implementations of a priority scheduler for unordered algorithms and examine the trade-offs each makes with respect to the algorithmic and implementation issues. We use the unordered (asynchronous) algorithm for single-source shortest path to illustrate the important points, but similar issues occur with all the algorithms of Section 2.

3.1 Schedulers

A priority queue is a data structure that supports two operations: (1) adding an item to the priority queue (push) and (2) removing the highest priority item from the queue (pop). A priority queue may also have an operation to change the priority of an item already added to the queue, but we do not discuss such variations in this paper. The priority of items is defined by a user-supplied priority function that, in traditional priority queues, directly encodes the less-than relation between items. In this paper, we adopt the convention that the highest priority item is the least item according to the priority function. Although not their sole purpose, priority queues are commonly used as the basis for priority schedulers. For sequential execution, a priority queue can be directly used as a scheduler. Any sequential algorithm that, as its main loop, iterates over elements of a priority queue is an example of this. For parallel execution, the scheduler must also be a concurrent data structure. One solution is to use a concurrent priority queue. Another technique is to use per-thread priority queues. There are two common implementations of the per-thread technique, which mainly differ on how new tasks are assigned. The first assigns a new task to the thread that generated that task. We call this a local priority queue scheduler. The second assigns a new task based on a pre-determined assignment of tasks to threads. Assigning tasks based on partitioning nodes of a graph is one example. We call this a partitioned priority queue scheduler. When a per-thread queue becomes empty, a thread may use work-stealing to find new work in another queue. Local schedulers often use work-stealing to reduce load imbalance. Partitioned schedulers often don’t because many applications using partitioned schedulers use the assignment of tasks to threads as a basis for synchronizing access to an underlying data-structure, and workstealing would dynamically change this assignment. In the partitioned graph example, if only one thread is assigned to each partition, then work within a partition does not need to be synchronized among threads because at most one thread will process it. In the following sections, we evaluate three schedulers based on the concurrent priority queue provided by the Intel TBB library: a


Figure 1. Example of how good, bad and empty work can occur in sssp.


Figure 2. Number of iterations of different types of work generated by the sssp algorithm under different schedulers on machine m1 at 8 threads with the large input. Machine and input details are in Section 5.


scheduler that is simply the concurrent priority queue, tbb, and local and partitioned priority queue schedulers using the TBB concurrent priority queue, ltbb and ptbb, respectively. The ltbb scheduler uses work-stealing when a per-thread queue becomes empty, while ptbb does not. We have also evaluated priority schedulers based on concurrent skip-lists [18], but we have found that the absolute performance of the TBB queue is substantially better. In addition, we evaluate our new data structure, which is specifically designed for concurrent priority scheduling of unordered algorithms: the ordered-by-integer-metric scheduler, obim. We defer the full implementation details of obim until Section 4, but briefly, unlike ltbb and ptbb, obim tries to keep threads working on globally high priority work, and unlike tbb, it sacrifices maintaining exact priority order in exchange for reduced synchronization overhead.
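For concreteness, the sketch below shows the general shape of a local priority queue scheduler with work-stealing, in the spirit of ltbb. It uses std::priority_queue guarded by a mutex rather than the TBB concurrent priority queue, and all names are ours; it illustrates the structure only, not the evaluated implementation.

```cpp
// Sketch of a "local priority queue" scheduler: each thread pushes new work
// to its own heap and steals from a victim when its heap runs dry.
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

struct Task { int priority; int payload; };
struct ByPriority {                    // smallest priority value on top
  bool operator()(const Task& a, const Task& b) const {
    return a.priority > b.priority;
  }
};

class LocalPQScheduler {
  struct PerThread {
    std::mutex lock;
    std::priority_queue<Task, std::vector<Task>, ByPriority> pq;
  };
  std::vector<PerThread> queues_;
public:
  explicit LocalPQScheduler(unsigned nThreads) : queues_(nThreads) {}

  void push(unsigned tid, Task t) {    // new work stays with the creating thread
    std::lock_guard<std::mutex> g(queues_[tid].lock);
    queues_[tid].pq.push(t);
  }

  std::optional<Task> pop(unsigned tid) {
    // Try the local queue first, then steal round-robin from the others.
    for (unsigned i = 0; i < queues_.size(); ++i) {
      auto& q = queues_[(tid + i) % queues_.size()];
      std::lock_guard<std::mutex> g(q.lock);
      if (!q.pq.empty()) {
        Task t = q.pq.top();
        q.pq.pop();
        return t;
      }
    }
    return std::nullopt;               // no visible work anywhere
  }
};
```

A partitioned variant (in the spirit of ptbb) would replace the stealing loop with a fixed task-to-queue assignment and only ever pop from the thread's own queue.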

3.2 Algorithmic Issues

Priority scheduling is primarily used to reduce the total amount of work done by an algorithm. That reduction may come from an improvement in algorithmic complexity or from an effective heuristic. If the priority order is a good one, the more a schedule differs from the priority order, the more work is done. That being said, the sensitivity of an algorithm to priority deviations on a particular input may vary significantly. For sssp, we can classify work into three categories. First, good work is an iteration that lowers the distance value of a node to its final value. Second, empty work is an iteration that is asked to lower a distance but that distance value has already been lowered. Empty work occurs because concurrent priority schedulers often do not allow updating the priority of a task after it has been added to the scheduler, so there may be multiple tasks that update the same node. For sssp, a sequential schedule that always applies the least value update will only have good and empty work, and a scheduler that allows the priority of a task to be updated after the task is added will only have good work. When the least value update is not always applied, there is a third type of work that can arise: bad work.

Figure 3. Priorities processed versus iterations executed for Figure 2. Multiple lines correspond to priority values processed for each thread. For reference, the priorities processed by a sequential execution using tbb are shown in black.

Bad work is when an iteration lowers the distance of a node to a non-final value. Figure 1 shows an example of how each type of work can occur. Good work occurs when node A updates its neighbor B. Empty work occurs when the update B → C and A → C are both added to the scheduler, and A → C is processed first. The update B → C becomes empty work. Bad work occurs when the update A → D is applied before A → C → D. In this case, the distance label on D is lowered twice.

Figure 2 shows the breakdown of work for the sssp algorithm for one input. The amount of good work should be the same across schedulers for the same input, but the amount of bad and empty work varies according to the particular scheduler used. We can also characterize the instantaneous behavior of sssp by looking at the priorities processed over time by each thread. Figure 3 shows this


data for the same configuration as Figure 2 using the total iterations executed as a proxy for time.¹ Figure 3a shows that under ltbb, threads quickly diverge from processing the globally highest priority work as defined by the priorities processed by the sequential priority order. The processing of lower quality work generates more low quality work, which in turn leads to threads processing priorities that would have never been generated when following the sequential priority order. The additional lower priority work is exactly the bad work of Figure 2. The threads eventually converge to processing the higher priority work through work-stealing. Each of the drops in priorities processed corresponds to a thread stealing higher priority work from another thread.

Figure 3b shows that ptbb is much better at keeping threads working on high priority work. While there is not exact agreement in priorities between threads, the priorities processed are similar and do not differ too much from the sequential priority order. The execution length of ptbb is shorter than the sequential priority order due to measurement error. Also, recall that ptbb works by partitioning work between threads. Thus, another way to view these results is that the average highest priority among t partitions (where t is the number of threads) is close to the highest priority globally.

Figure 3c shows how obim achieves the same effect as ptbb, reducing the amount of bad work, through a different mechanism. Instead of assuming that the average highest priority among partitions is close to the highest priority globally, the obim scheduler tries to keep all the threads working on the globally highest priority work. The priorities processed by each thread closely match the priorities processed in the sequential priority order. The curve is shifted to the right because obim slightly increases the amount of empty work at each priority. Because threads are working closely in the priority space, it is more likely that one thread will generate work that improves on work already added to the scheduler; since work is processed in priority order, the improved work is done first, and the other work becomes empty work.

It is important to note that priority scheduling using ltbb, even with the priority divergence measured in Figure 3a, may be an improvement over not using priorities at all. Using a random scheduler on the same input as Figure 2 produces an execution that times out after two hours, while an execution using ltbb finishes in about 2 minutes with 8 threads (obim completes in 11 seconds). In other cases, priority scheduling is the difference between termination and non-termination. For example, for belief propagation, in some cases, a priority scheduler (the RBP algorithm) produces an execution that converges while a random scheduling does not.

It is also important to point out that these results are input dependent. As noted above, the ptbb scheduler is likely to do well when the average highest priority among partitions is close to the highest priority overall. Likewise, ltbb does well when high priority work is generated uniformly among threads. These properties are input and algorithm dependent. There are also inputs for which it is unlikely that we can achieve efficient parallel speedup, such as when the priority order (on that input) is a true total order. Solving sssp on a line graph is one example.
The underlying assumption when parallelizing algorithms using priority scheduling is that the priorities encode a partial order on tasks, and it is possible to simultaneously execute multiple tasks.
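The good/empty/bad taxonomy can be instrumented directly in the sssp operator. The sketch below is our own illustration; it assumes a reference array of final distances computed by a separate sequential run, and all names are hypothetical.

```cpp
// Sketch: classify one sssp update as good, empty, or bad while applying it.
#include <atomic>
#include <vector>

enum class WorkKind { Good, Empty, Bad };

// 'request' is the distance this task was created to install at 'node';
// 'finalDist' holds the true shortest distances from a reference run.
WorkKind classifyAndUpdate(std::vector<std::atomic<int>>& dist,
                           const std::vector<int>& finalDist,
                           int node, int request) {
  int cur = dist[node].load(std::memory_order_relaxed);
  while (request < cur) {
    if (dist[node].compare_exchange_weak(cur, request))
      return request == finalDist[node] ? WorkKind::Good   // settled for good
                                        : WorkKind::Bad;   // will be lowered again
  }
  return WorkKind::Empty;               // label was already as low or lower
}
```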


Figure 4. Scheduler overheads for Figure 2.

3.3 Implementation Issues

If it were only a matter of picking the scheduler that most closely follows the sequential priority order, then picking the best concurrent priority scheduler would be straightforward. Choose the one that minimizes the distance from the sequential priority order. Unfortunately, finding such a concurrent priority scheduler is similar to the relation between entropy and energy in physics. It takes very little energy to make threads process work in random order, but getting threads to process work in a specific order requires a significant increase in energy. In the context of priority scheduling, energy is the execution time to perform scheduling and the synchronization required to find the highest priority work. Figure 4 shows the overhead of schedulers relative to the total application time for sssp. The overhead was calculated by periodic stack sampling. Any thread not executing application code was marked as executing the scheduler. These overheads correspond to an iteration under single-threaded heap-based schedulers taking nearly twice as long as under obim (1.2 microseconds versus 0.6 microseconds). Overhead costs can be viewed as coming from two sources. The first source is simply the sequential cost of performing a scheduling operation. The tbb, ltbb and ptbb schedulers are based on a heap data structure, and the cost of heap operations is generally more than the simple indexing scheme used by obim. The second source is the synchronization cost from making the scheduler concurrent. Of all the priority queue-based schedulers, tbb requires the most synchronization because there is a single global queue that all threads push to and pop from. The local and partitioned variants do less synchronization because there are per-thread queues. The ltbb scheduler incurs a synchronization cost when another thread does work-stealing. The ptbb incurs a synchronization cost because multiple threads may push work to the same queue, although only one thread will pop it2 . The ltbb scheduler appears to have approximately the same overhead as tbb, but recall that the executions of the two schedulers are significantly different. In particular, under ltbb, there is more empty work processed, which takes less time to do than other types of work. This increases the relative proportion of scheduling costs. Comparing absolute times would also be problematic because each scheduler produces a different number of iterations. However, even with these caveats in mind, it is clear that scheduling overheads can be significant for many algorithms. For instance, on this input the average iteration times at 8 threads are 0.62, 8.64, 3.04 and 1.85 microseconds for obim, tbb, ltbb and ptbb, respectively. 1 There

are some technical issues mapping the concurrent execution of multiple threads onto a single time-frame. We use a sampling approach where we periodically sample priorities and compute the mean priority over a window. 2 Multiple producer-single consumer queues may eliminate synchronization instructions per se such as atomic instructions or barriers, but there is still the cost of scanning the multiple queues, which we take as just another form of synchronization cost.



One solution to reduce the overhead of priority queue-based schedulers would be to use a more efficient priority queue implementation like a pairing heap [7], but we know that unordered algorithms can execute in any order, and the previous section showed that at least sssp can accept small deviations from the sequential priority order without significantly increasing the amount of bad work. Therefore, it is also reasonable to use a data structure that does not always return the highest priority work in exchange for reduced overhead or increased scalability. We describe one such data structure in Section 4.

Synchronization costs can manifest themselves outside the scheduler as well. In many cases, concurrent priority scheduling is used to schedule tasks that may interfere with each other when executed simultaneously. In general, the problem of detecting at runtime when two tasks interfere with each other is called conflict detection. One approach to executing these tasks is to use coordinated scheduling to force the scheduler to schedule only non-interfering tasks simultaneously. Partitioned priority queues are one example of this approach. Another approach is to use autonomous scheduling and make the tasks check for interference themselves and to reschedule if necessary. A scalable algorithm needs both scalable scheduling and scalable conflict detection.
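A minimal sketch of autonomous conflict detection follows: each task tries to acquire simple per-node flags for its neighborhood and asks to be rescheduled if any acquisition fails. This is our own illustration of the idea, not a specific library's mechanism.

```cpp
// Sketch: try-lock the nodes a task touches; back off and reschedule on conflict.
#include <atomic>
#include <cstddef>
#include <vector>

class NodeLocks {
  std::vector<std::atomic<bool>> locks_;
public:
  explicit NodeLocks(std::size_t n) : locks_(n) {}   // all flags start false
  bool tryAcquire(int n) { return !locks_[n].exchange(true, std::memory_order_acquire); }
  void release(int n)    { locks_[n].store(false, std::memory_order_release); }
};

// Returns false if the task conflicted and must be re-pushed to the scheduler.
template <typename Body>
bool runIsolated(NodeLocks& locks, const std::vector<int>& neighborhood, Body body) {
  std::vector<int> held;
  for (int n : neighborhood) {
    if (!locks.tryAcquire(n)) {                      // conflict: undo and give up
      for (int h : held) locks.release(h);
      return false;
    }
    held.push_back(n);
  }
  body();                                            // run with exclusive ownership
  for (int h : held) locks.release(h);
  return true;
}
```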

3.4 Discussion

Fundamentally, priority queues and concurrent priority schedulers for unordered algorithms are data structures designed to achieve two completely different goals. Priority queues are designed for efficient retrieval and update of the least element of a set. Concurrent priority schedulers are designed for low overhead, concurrent retrieval and update of high priority tasks. Priority queues are often parameterized by the less-than relation, while priority schedulers are often parameterized by real or integer valued priorities. The difference is especially revealing in how each is evaluated. Priority queues are evaluated by the performance of retrieving and updating the least element of a set. Concurrent priority schedulers are evaluated according to the performance of an algorithm using that scheduler. A concurrent priority scheduler may trade-off adherence to the priority order for reduced overhead, but a priority queue that did the same would not be a valid priority queue because it would no longer return the least element. Unordered algorithms are a large class of algorithms. Unless there are hard constraints on commits, such as being priority ordered, any parallel algorithm using a priority scheduler is effectively an unordered algorithm because it has to be robust to unbounded deviation from priority order. One must be careful, however, when applying priority orders developed for ordered algorithms to unordered ones. Since priority order commits are not guaranteed during unordered execution, many properties of the ordered algorithm do not hold for its unordered counterpart with the same priorities. In particular, the asymptotic complexity may differ. However, the results in Section 5 show that, as long as sufficient attention is paid towards maintaining priorities, unordered algorithms can achieve good performance in practice even compared to their ordered counterparts.
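The difference in contracts can be summarized as two illustrative C++ interfaces (neither is from an actual library): a priority queue must return the least element, while a concurrent priority scheduler only promises cheap access to some high-priority task.

```cpp
#include <optional>

// A priority queue promises to return the least element.
template <typename T>
struct PriorityQueueConcept {
  virtual void push(const T&) = 0;
  virtual std::optional<T> popMin() = 0;   // must be the least item present
  virtual ~PriorityQueueConcept() = default;
};

// A concurrent priority scheduler is parameterized by an integer metric that
// is advisory: pop() may return a slightly out-of-order task.
template <typename Task>
struct PrioritySchedulerConcept {
  virtual void push(const Task&, int priority) = 0;
  virtual std::optional<Task> pop() = 0;   // usually, but not always, minimal
  virtual ~PrioritySchedulerConcept() = default;
};
```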

4. A Scalable Priority Scheduler

The design of our ordered-by-integer-metric (obim) scheduler is informed by the considerations discussed in Section 3. It exploits two forms of hierarchy: logical and physical. The logical hierarchy comes from the structure of priorities as they are used in typical applications. Tasks with similar priorities often can be reordered with respect to one another without increasing the total amount of work or breaking the convergence of the algorithm. Thus, logically, the set of tasks can be viewed as a sequence of unordered sets of work. Tasks within the same set can be executed in any order. The phys-


ical hierarchy comes from the structure of multicore architectures. Each core is part of a CPU package, and multiple packages form a machine. Operations between cores on the same package often are more efficient than operations between cores on different packages. To take advantage of the logical hierarchy, we build our scheduler out of two data structures: one that implements the sequence and one that implements the set. To support sparse sequences, we use an ordered map. To implement the set, we use a bag. To take advantage of the physical hierarchy, we use the following design principle: global communication is expensive but intrapackage communication is relatively cheap. An important corollary is that frequent operations should be read-only or core-private. Read-only operations are cached at the package and core level, so they do not incur much communication. Retrieving and adding work to a bag will be the most frequent operation, so we make the average operation on it core-private. The next most frequent operation is using the ordered map to find the least bag to retrieve work from. This requires some communication between threads in order to find the global minimum. If we cannot make this operation entirely core-private, the next best option is to make it read-only on average. The bag is implemented as per-package lock-free queues or stacks that contain fixed-sized chunks of work of up to 16 to 256 items in our configurations. Whether to use a queue or stack depends on the particular algorithm. Without loss of generality, we assume a queue. Once a thread retrieves a chunk of work, it becomes private to that thread, and all subsequent requests for work on the bag will be satisfied from that chunk until it becomes empty. After a chunk becomes empty, a thread attempts to retrieve another chunk from its per-package queue. If that is empty, it attempts to steal a chunk from another per-package queue. New work is placed in a separate thread-private chunk until it reaches the maximum size, at which point the thread will push it onto its per-package queue, making it available to the other threads. This replication of queues on each package rather than a more traditional single global queue has a significant impact on scalability as we show in Section 4.1. The ordered map is implemented in two parts. Globally, the map is represented as a log-based structure which stores pointer-priority pairs representing insert operations on the logical global map. Each logical insert operation updates a global version number. The global map is never directly queried, rather threads copy versions of the map to their own local map and query that instead. Locally, each thread maintains both its own tree for easy sparse indexing and the version number of the last update it performed from the global log. To add a bag at a priority, a thread first checks its local tree to find the bag associated with that priority. If it cannot find it locally, it scans the global log from the point in the log indicated by its local version number and applies all updates from that point on. Updating from the global log is thus lock free. If the thread still cannot find the bag, it acquires the global lock, updates again, and, if needed, creates a new bag and updates the global structures before releasing the lock. To find a bag with high priority, a thread checks whether the last bag it popped from is empty. 
In most of our applications, priorities tend to increase monotonically, so it makes sense to exhaust the bag at the current priority before checking for a new high priority bag. If the bag is empty, the thread checks, without acquiring a lock, whether the global version number matches the local version number. If the bag is not empty, it updates its local tree just like in the case of adding a task. In either case, it then finds and returns the bag corresponding to the highest priority in its local tree. The average case for retrieving the next task to execute is that a thread has already found a bag, and it has already found a chunk within that bag, so the entire operation is thread-local. The next




most common case is that the chunk becomes empty, and the thread attempts to get a new chunk from its per-package queue, which is a relatively inexpensive operation. Only when the per-package queue is empty will the thread try to steal from another package. It is only at this point that a thread will try to find a new bag in the ordered map. In cases where the priority function is monotonic, this strategy significantly reduces synchronization while largely following priority order. When a thread tries to find a new bag, it first checks the global version number. In the common case, in which the local tree is up-to-date, this will not generate any additional coherence traffic, as there will have been no writes to the global version number and it will not have been invalidated (at least for coherence reasons) from the local cache. When adding a new bag, there are only a few invalidations at the cache level: the tail of the global log and the global version number. Each thread will suffer these misses only the first time it updates its local tree after the addition.

This data structure is clearly not linearizable with respect to a standard priority queue. There are many cases where threads will not retrieve the highest priority work. There are also many cases where one thread will think the scheduler is empty when there still is work to be done. As stated earlier, unordered algorithms can tolerate the former. For the latter, we use a separate termination detection algorithm to determine when no thread can find work.

Figure 5. Speedup (relative to the same baseline as Figure 7) of obim and its variants for sssp with the large input.
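The following is a much-simplified sketch of the structure described in this section. It keeps the two components, priority-indexed bags of chunks and a versioned global log that threads replay into a private ordered map, but replaces the lock-free, per-package machinery with ordinary mutexes and a single chunk queue per bag, and omits work-stealing and termination detection. All names are ours, not the actual source.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <map>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Task { int payload; };
using Chunk = std::vector<Task>;                   // bounded chunk of tasks

struct Bag {                                       // one bag per priority level
  std::mutex lock;                                 // stand-in for a lock-free,
  std::deque<std::shared_ptr<Chunk>> chunks;       // per-package chunk queue
};

class Obim {
  static constexpr std::size_t kChunkSize = 64;

  // Global log of (priority, bag) insertions and its version counter.
  std::mutex globalLock_;
  std::vector<std::pair<int, std::shared_ptr<Bag>>> log_;
  std::atomic<std::size_t> version_{0};

  struct PerThread {
    std::map<int, std::shared_ptr<Bag>> local;     // private copy of the map
    std::size_t seen = 0;                          // log entries already applied
    std::shared_ptr<Chunk> cur;                    // chunk being drained
  };
  std::vector<PerThread> threads_;

  void replayLogLocked(PerThread& t) {             // caller holds globalLock_
    for (; t.seen < log_.size(); ++t.seen)
      t.local.emplace(log_[t.seen].first, log_[t.seen].second);
  }

  void syncLocal(PerThread& t) {
    if (t.seen == version_.load(std::memory_order_acquire))
      return;                                      // common case: already up to date
    std::lock_guard<std::mutex> g(globalLock_);    // simplified: the real obim
    replayLogLocked(t);                            // replays a lock-free log
  }

  std::shared_ptr<Bag> bagFor(PerThread& t, int prio) {
    syncLocal(t);
    if (auto it = t.local.find(prio); it != t.local.end()) return it->second;
    std::lock_guard<std::mutex> g(globalLock_);    // slow path: may create a bag
    replayLogLocked(t);
    if (auto it = t.local.find(prio); it != t.local.end()) return it->second;
    auto bag = std::make_shared<Bag>();
    log_.emplace_back(prio, bag);
    version_.fetch_add(1, std::memory_order_release);
    t.local.emplace(prio, bag);
    return bag;
  }

public:
  explicit Obim(unsigned nThreads) : threads_(nThreads) {}

  void push(unsigned tid, int prio, Task task) {
    auto bag = bagFor(threads_[tid], prio);
    std::lock_guard<std::mutex> g(bag->lock);
    if (bag->chunks.empty() || bag->chunks.back()->size() >= kChunkSize)
      bag->chunks.push_back(std::make_shared<Chunk>());
    bag->chunks.back()->push_back(task);
  }

  bool pop(unsigned tid, Task& out) {
    PerThread& t = threads_[tid];
    if (t.cur && !t.cur->empty()) {                // common case: all thread-local
      out = t.cur->back();
      t.cur->pop_back();
      return true;
    }
    syncLocal(t);
    for (auto& entry : t.local) {                  // lowest priority value first
      Bag& bag = *entry.second;
      std::lock_guard<std::mutex> g(bag.lock);
      if (!bag.chunks.empty()) {
        t.cur = bag.chunks.front();
        bag.chunks.pop_front();
        return pop(tid, out);                      // drain the newly taken chunk
      }
    }
    return false;                                  // no work visible to this thread
  }
};
```

The property the sketch preserves is that the common-case pop touches only thread-local state (the current chunk), and checking for new priority levels costs a single atomic read of the version counter.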

4.1 Evaluation of Optimizations

To test the importance of our design decisions, we implemented several de-optimizations to the obim scheduler.

App      | Input | Machine | Baseline Kind | Baseline Time | Worst Kind | Worst T | Worst Time
bc       | large | m1      | obim | 153.0 | tbb  | 6  | 277.7
bc       | large | m2      | obim | 253.3 | tbb  | 9  | 452.3
bc       | small | m1      | obim | 15.6  | tbb  | 6  | 63.8
bc       | small | m2      | obim | 28.6  | tbb  | 22 | 144.1
bfs      | large | m1      | obim | 75.2  | tbb  | 8  | 391.4
bfs      | large | m2      | obim | 130.5 | tbb  | 23 | 264.4
bfs      | small | m1      | obim | 1.5   | ptbb | 8  | 9.0
bfs      | small | m2      | obim | 2.6   | ptbb | 19 | 119.3
bp       | small | m1      | obim | 43.4  | ptbb | 1  | 47.8
bp       | small | m2      | obim | 80.5  | ltbb | 1  | 83.0
matching | large | m1      | obim | 131.8 | ltbb | 1  | 151.0
matching | large | m2      | obim | 216.7 | ptbb | 13 | 526.5*
matching | small | m1      | obim | 86.1  | tbb  | 1  | 94.6
matching | small | m2      | obim | 139.8 | ptbb | 22 | 438.2
mst      | small | m1      | obim | 9.4   | ptbb | 1  | 33.2
mst      | small | m2      | obim | 14.9  | ptbb | 1  | 54.0
sssp     | large | m1      | obim | 90.5  | ltbb | 7  | 179.4
sssp     | large | m2      | obim | 156.0 | ltbb | 1  | 294.8
sssp     | small | m1      | tbb  | 2.1   | ptbb | 7  | 14.3
sssp     | small | m2      | ptbb | 3.5   | ptbb | 19 | 249.2

Figure 6. Best schedulers with one thread (Baseline) and worst schedulers at any thread count (T). Times in seconds. (*) Several runs of matching timed out after 600 seconds and are not included in this figure. At least one of them was observed to run for at least two hours (7200 seconds).

• obim.b: A scheduler similar to obim except that each bag uses a single shared queue rather than per-package queues. This is obim without memory topology-aware bags.

• obim.p: A scheduler using a shared lock-free concurrent skip-list to manage priority-to-bin mappings rather than a log-based structure. The bags are still memory topology-aware.

• obim.bp: A scheduler using a shared lock-free concurrent skip-list and shared queues for bags.

Figure 5 shows the speedup for sssp for these variants on two machines. Strikingly, we can clearly see the effect of machine parameters on the performance of the variants. On machine m1, which is a two-processor, quad-core machine, there is a knee at four threads. We assigned the first four threads to run on one package and the last four threads to run on the other. Likewise, there are visible changes in slope at the package boundaries on m2, which occur every six threads. The scheduler with neither optimized bag nor ordered map implementation, obim.bp, does the worst of all the schedulers, and the scheduler with just the optimized map implementation, obim.p, fares only slightly better. The scheduler with an optimized map implementation but not an optimized bag performs better still, but it is only with both these implementations combined that we achieve linear scaling on both machines. Though the higher latencies of inter-package communication reduce scaling modestly on m1, this effect becomes more pronounced on m2, which has four packages of six cores each. Here, we see that after two packages (12 threads) scaling significantly decreases. An important point to recognize is that many implementations will perform similarly on machines with few packages, for instance obim and obim.b on m1, but on larger machines, much more attention needs to be paid to architectural details in order to reduce synchronization costs.


5. Evaluation

We evaluated several different priority schedulers using the algorithms described in Section 2. In this section, we refer to these algorithms as bc, an inner-loop parallelization of betweenness centrality; bfs, breadth-first search using the scaled distance priority; bp, belief propagation using the residual belief propagation heuristic; matching, maximum cardinality bipartite matching using the ABMP algorithm; mst, Boruvka’s minimal spanning tree algorithm using the number of neighbors priority; and sssp, the asynchronous algorithm for single-source shortest path using the scaled distance priority. For each algorithm, we chose one or two inputs to evaluate: one called the small input and an optional large input. For bc, the small

input is a scale-free graph produced by the R-MAT generator [3] with 2^20 nodes. The large input is a random directed graph with 2^25 nodes and 4·2^25 edges. All the nodes are connected in a ring and each node selects three other nodes at random to be its neighbors. For bfs, the small input is a road network graph of the western USA from the DIMACS shortest paths challenge with all edge weights made unit. The large input is a random graph with 2^26 nodes and 4·2^26 edges generated like the large input for bc. For bp, the small and only input is a 300×300 spin Ising model with node potentials uniformly sampled from [0, 1) and pairwise potentials φi,j(Xi, Xj) equal to e^(λC) when xi = xj and e^(−λC) otherwise. We sample λ from [−0.5, 0.5]. The C parameter controls the difficulty of the inference task. In our experiments, C = 2.5. This is one kind of input evaluated in [6]. For matching, the small input is a bipartite graph G = (A, B, E) where |A| = |B| = 10^6, and there are 10^8 edges. Nodes in each partition are divided evenly into 10^4 groups. Each node in A has degree d = |E|/|A| and the edges out of a node in group i of A go to random nodes in groups i+1 and i−1 of B. The large input is a similarly constructed graph where each partition has 2·10^6 nodes, and there are 2·10^8 edges and the same number of groups. This is one kind of input evaluated in [12]. For mst, the small and only input is the road network used by bfs except with the original edge weights. For sssp, the inputs are the same as with bfs except the edge weights of the small input are the original edge weights of the graph, and the edge weights of the large input are randomly selected from [0, 100].

We evaluated the schedulers described in Section 3.1 on two machines, running each configuration 3 times:

• m1: a Sun Fire X2270 machine running Ubuntu Linux 10.04 LTS 64-bit. It contains two 4-core 2.93 GHz Intel Xeon X5570 processors. The two CPUs share 24 GB of main memory. Each processor has an 8 MB L3 cache shared among the cores.

• m2: a machine running Ubuntu Linux 10.04 LTS 64-bit. It contains four 6-core 2.00 GHz Intel Xeon E7540 processors. The CPUs share 128 GB of main memory. Each processor has an 18 MB L3 cache shared among the cores.

Figure 7. Speedup relative to best single-threaded scheduler time. "#" indicate runs where matching timed out after ten minutes.

Figure 8. Iterations executed over all runs relative to best single-threaded scheduler. (Panels: bc large and small, bfs large, bp small, matching large and small, mst small, sssp large.)


Figure 9. Speedup relative to best single-threaded scheduler time (panels: bp and mst).

Figure 10. Iterations executed over all runs relative to best single-threaded scheduler (panels: bfs small and sssp small).

Figure 6 gives the best time and scheduler for one thread, as well as the worst time and scheduler for any thread count. Figures 7 and 9 summarize the performance results across algorithms relative to the best single-threaded times. In most cases, obim is the fastest scheduler even in single-threaded execution, and it is almost always the best performing one across threads, frequently by a large margin. The obim scheduler is also significantly faster than tbb at full scale. For instance, at 24 threads on machine m2, on sssp with the large input, obim (6.4 seconds) is about 40 times faster than tbb (255.0 seconds). On mst, obim (4.9 seconds) is about 8 times faster than tbb (42.9 seconds).

The differences in performance between schedulers stem from two competing factors. First, following the priority order reduces the total amount of work done. When a priority scheduler diverges from that order, the total amount of work can vary significantly. Figures 8 and 10 show how many iterations were executed under each scheduler relative to the best single-threaded scheduler. In Section 3.2, we showed one case where changing schedulers changes the relative types of work, so comparing iteration amounts across schedulers must be done with this caveat in mind, but there are several clear trends. The tbb scheduler does the same amount of work as the baseline scheduler, while the iterations executed by ltbb and ptbb fluctuate significantly. The obim scheduler is relatively more robust, though it too can cause significant increases in iterations.

Second, even if the priority order is closely followed, if the overhead required to maintain that order is too high, then performance will suffer relative to a scheduler that is more efficient. The results for the tbb scheduler show that close adherence to the priority order does not result in good overall performance because the cost of maintaining that order is too high. The obim scheduler, on the other hand, shows good performance across the board, suggesting that it more successfully navigates the trade-off between priority order and overhead. The ltbb and ptbb schedulers make sacrifices to reduce synchronization costs, but in many cases, this also increases the total work done, negating the benefits of reduced synchronization. Overall, these experimental results point out that concurrent priority scheduling requires careful attention to both priority order and overheads.

The ltbb and ptbb results show that under some configurations, these schedulers can do well depending on the particular algorithm, input and machine. The lower scale factor compared to obim is


likely due to increased overhead costs. The inverse scaling is due to increased work when these schedulers diverge from the priority order. The discontinuities at 6, 12 and 18 threads on machine m2 are likely caused by the change in cost between intra- and inter-processor communication.

The results of matching with ptbb illustrate an interesting issue that arises with priority scheduling. The matching algorithm works by performing an initial pass over the graph that is priority scheduled and then it hands the resulting graph to a serial pass that completes solving the problem. Depending on the exact order of execution, the resulting graph may be easier or harder to solve by the serial pass. In some cases on the large input on m2, the resulting graph is too hard for the serial pass to solve in the time allotted, and the execution times out. In other cases, such as on m1, the resulting graph is easier to solve than normal, and ptbb achieves non-linear speedup. The ptbb scheduler may be particularly susceptible to this effect due to its a priori partitioning of work to threads. The iteration results confirm that the ptbb scheduler does do less work than the baseline. One may argue that this chaotic behavior is a point in favor of ptbb. Sometimes you lose; sometimes you win big. A counterargument is that the wide swings and potential dependence on partitioning function, machine and input are potentially unattractive to an end user. In any case, improving the priority function to always lead to these winning situations is a far more robust way of achieving performance.

The matching algorithm differs from the other algorithms in another aspect. An average iteration of matching takes more time than in other algorithms. An iteration potentially performs a breadth-first traversal over the entire graph. The increased length of an iteration means that the relative cost of priority queue-based schedulers is less, which is why there is modest scaling for tbb.

The behavior of ltbb and ptbb on the bp algorithm is slightly more mysterious. The obim scheduler is doing more iterations than ltbb and ptbb, which can account for some of the difference. The ltbb and ptbb schedulers may also benefit from locality. The bp algorithm proceeds in rounds, each iterating over the factor graph. Schedulers that assign tasks to the same threads across rounds can exploit significant locality benefits. The difference between ltbb and ptbb may be due to the work-stealing done by the former. More experiments need to be done to isolate the exact causes.

Networks with large diameters, like the road network used as the small input for bfs, sssp and mst, are often a challenging input for parallel implementations of these algorithms because long paths in these graphs can become sequential bottlenecks. Lower scalability is expected for these graphs compared to graphs with small diameters like the random graphs used here. The mst algorithm is a coarsening application so the available parallelism decreases over time. Again, we expect less than linear speedup for this algorithm.

One final thing to note is that in many cases even the worst performing priority scheduler is orders of magnitude better than a non-priority scheduler. Although the scaling may be worse, the reduction in work is often well worth the effort.

6. Conclusion

Priority queues are often used in sequential graph algorithms. They provide a strict order to loop iterations that often allows improved bounds on algorithmic complexity. However, when moving to a parallel context, priority queues are hard to make both scalable and fast. In this paper, we have shown that allowing a small amount of deviation from priority order and paying attention to the cache structure of the machine allows for an efficient implementation of a priority scheduler for parallel loop iterations. We show that


this scheduler is efficient, taking less than 15% of the time per iteration for single-source shortest path, and that it deviates only slightly from the sequential scheduling order. Exploiting logical and physical hierarchy can provide consistently good performance and scalability across a range of algorithms.

References [1] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989. [2] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001. [3] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In In Proceedings of the 2004 SIAM International Conference on Data Mining, 2004. [4] B. V. Cherkassy and A. V. Goldberg. On implementing push-relabel method for the maximum flow problem. In Proc. Intl. Conf. on Integer Programming and Combinatorial Optimization (IPCO), pages 157– 171, London, UK, 1995. [5] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, editors. Introduction to Algorithms. MIT Press, 2001. [6] G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In in Proceedings of the Twenty-second Conference on Uncertainty in AI (UAI), 2006. [7] M. L. Fredman, R. Sedgewick, D. D. Sleator, and R. E. Tarjan. The pairing heap: a new form of self-adjusting heap. Algorithmica, 1:111– 129, January 1986. [8] M. A. Hassaan, M. Burtscher, and K. Pingali. Ordered vs unordered: a comparison of parallelism and work-efficiency in irregular algorithms. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP ’11, pages 3–12, New York, NY, USA, 2011. ACM. [9] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998. [10] K. Kennedy and J. Allen, editors. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann, 2001. [11] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI), 42(6):211–222, 2007. [12] K. Mehlhorn and S. Näher. LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, 1999. [13] U. Meyer and P. Sanders. Delta-stepping: A parallel single source shortest path algorithm. In Proc. European Symp. on Algorithms (ESA), pages 393–404, 1998. [14] R. Nasre and R. Govindarajan. Prioritizing constraint evaluation for efficient points-to analysis. Code Generation and Optimization, IEEE/ACM International Symposium on, 0:267–276, 2011. [15] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. [16] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 12–25, New York, NY, USA, 2011. ACM. [17] A. Reshetov, A. Soupikov, and J. Hurley. Multi-level ray tracing algorithm. In ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, pages 1176–1185, New York, NY, USA, 2005. ACM. [18] N. Shavit and I. Lotan. Skiplist-based concurrent priority queues. In International Parallel and Distributed Processing Symposium/International Parallel Processing Symposium, pages 263–268, 2000.
