Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks

Author: Coleen Freeman
Timothy Armstrong, Mike Wilde, Daniel Katz, Zhao Zhang, Ian Foster
Agnieszka Podsiadło, 23/02/12

Presentation

1

Abstract

● Many-task applications
● Reducing time
● Efficient use of resources

Introduction

Outline: Introduction, Fixed Node Count, Dynamic Allocation, Tail-chopping, Simulation, Experiment, Results, Discussion, Conclusion

Many-task application

● Comprises many independent tasks coupled with explicit I/O dependencies (tasks are single-threaded or support parallelism within one node)
● Focuses on high performance

Introduction

Trailing task problem

● An increasing number of workers remain idle
● A tail of a few tasks continues to execute for some time

Problem description

[Figure: labels "Workers" and "Tasks"]

Problem description

Constraints

● Workers are allocated in a way that fits the allocation policies
● Tasks are scheduled on available workers
● Tasks are not preemptable

Problem description

Optimization

● Minimize time to solution
● Maximize utilization: u = (time spent on tasks) / (total allocated time)
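A minimal sketch of the utilization formula in Python. The function name and the sample numbers are made up for illustration and do not come from the paper; "total allocated time" is taken here as the number of workers times the time they are held.

```python
def utilization(task_durations, workers, time_to_solution):
    """Fraction of the allocated worker time that is spent running tasks."""
    busy = sum(task_durations)               # time spent on tasks
    allocated = workers * time_to_solution   # total allocated time (assumed definition)
    return busy / allocated

# Hypothetical run: 4 workers held for 10 time units, 30 units of task work.
u = utilization([10, 8, 7, 5], workers=4, time_to_solution=10)
print(u)
```

In this made-up run a quarter of the allocated time is idle, so u is 0.75.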



Algorithms for fixed worker counts

● Fixed number of workers (m)
  ● Both goals are equivalent: minimizing time to solution also maximizes utilization
  ● NP-hard (closely related to bin packing)
● Simple approaches: queues (bounds relative to the optimal time to solution, OPT)
  ● Random order: (2 - 1/m) * OPT
  ● Sorted, longest first: (4/3 - 1/(3m)) * OPT
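The queue approach can be sketched as greedy list scheduling: each task, in queue order, goes to whichever worker becomes free first. This is an illustrative sketch, not the authors' code; the function name and the tiny workload are assumptions.

```python
import heapq

def makespan(durations, workers):
    """Time to solution when tasks are taken from a queue in the given order."""
    free_at = [0.0] * workers            # time at which each worker becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)

tasks = [1, 1, 1, 3]                     # made-up workload
t_arrival = makespan(tasks, 2)                        # queue in arrival order
t_sorted = makespan(sorted(tasks, reverse=True), 2)   # longest task first
print(t_arrival, t_sorted)
```

For this workload on two workers, arrival order finishes at time 4 while longest-first finishes at time 3, which is also the optimum because the longest task alone takes 3.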

Algorithms for fixed worker counts

Factors causing a long tail

● Variance in task duration
● Number of tasks not divisible by the number of workers
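The divisibility factor can be seen even when every task takes the same time. The sketch below assumes equal-length tasks run in rounds; the function name is hypothetical.

```python
import math

def equal_task_utilization(num_tasks, workers):
    """Utilization when all tasks have equal length and run in rounds."""
    rounds = math.ceil(num_tasks / workers)   # the last round may be partly empty
    return num_tasks / (rounds * workers)

print(equal_task_utilization(8, 4))   # task count divisible by worker count
print(equal_task_utilization(9, 4))   # one task left over for an extra round
```

With 8 tasks on 4 workers every worker stays busy; with 9 tasks a third round runs a single task while three workers idle, and utilization drops to 0.75.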

Dynamic allocation

● Allocate slightly fewer workers (works for sorted or short tasks)
● Tail-chopping: after chopping, a smaller allocation is requested

Tail-chopping assumptions

● Only one partition of processors is used for the application at a given time
● No time limit: an allocation can be requested for any duration
● Constant time is required to start and stop an allocation
● Tasks cannot migrate: to move a task, it must be cancelled and restarted

Tail-chopping heuristics

● How many workers to allocate?
  ● Minimum task/worker ratio
● When to shrink the number of workers?
  ● Maximum fraction of idle workers
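The two heuristics can be sketched as two small decision functions. This is one possible reading of the slide, not the authors' implementation: the function names, the list of permitted allocation sizes, and the choice of ">=" for the idle threshold are all assumptions.

```python
def workers_to_request(tasks_left, min_tasks_per_worker, allocation_sizes):
    """Largest permitted allocation that keeps tasks/worker at or above the minimum."""
    wanted = tasks_left / min_tasks_per_worker
    fitting = [s for s in allocation_sizes if s <= wanted]
    # If even the smallest allocation is too large, it is the only option left.
    return max(fitting) if fitting else min(allocation_sizes)

def should_chop(idle_workers, total_workers, max_idle_fraction):
    """Shrink once the idle fraction reaches the configured maximum."""
    return idle_workers / total_workers >= max_idle_fraction

print(workers_to_request(1000, 5, [64, 128, 256, 512]))
print(should_chop(50, 100, 0.5))
```

With 1000 tasks left and a minimum ratio of 5, at most 200 workers are wanted, so the sketch picks 128 from the hypothetical size list.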

Tail-chopping hypotheses

● Tail-chopping will not completely solve the utilization problem
● High utilization is hard to achieve if the minimum allocation is large
● Tail-chopping is more beneficial for skewed distributions whose tail tasks run much longer than the average task
● Tail-chopping provides greater benefit for unsorted tasks; with sorted tasks, reallocating loses a lot of work already done
  ● No benefit when combined with sorting if max_length / mean_length > tasks / workers
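The last condition can be checked directly from the task durations. Multiplying both sides by mean_length shows it is the same as saying that the longest task takes longer than an even share of the total work (max_length > total / workers), so with longest-first ordering that single task, started at time zero, already sets the time to solution. A small sketch, with a hypothetical function name:

```python
def sorting_leaves_no_room_for_chopping(durations, workers):
    """True when max_length / mean_length exceeds the tasks/workers ratio."""
    mean_length = sum(durations) / len(durations)
    tasks_per_worker = len(durations) / workers
    return max(durations) / mean_length > tasks_per_worker

print(sorting_leaves_no_room_for_chopping([10, 1, 1, 1, 1], workers=2))
print(sorting_leaves_no_room_for_chopping([3, 3, 3, 3], workers=2))
```

In the first made-up workload the 10-unit task exceeds the even share of 7 units per worker, so the condition holds; in the second it does not.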

Simulation

● All tasks single-threaded
● 12 different numbers of CPU cores
● Measured first, then used in the simulation:
  ● Time from the request to the manager reporting all partitions ready to go
  ● Time between requesting termination and the allocation finishing
● Control over 3 parameters:
  ● Scheduling order
  ● Task/worker ratio
  ● Fraction of idle workers
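A simulation of this kind can be sketched in a few dozen lines of Python. This is an illustrative toy, not the authors' simulator. It assumes tasks run in the order given, cancelled tasks restart from scratch, each allocation costs a constant `overhead` that is added to the time to solution but not charged as allocated worker time, and `sizes` is a hypothetical list of permitted allocation sizes.

```python
import heapq

def simulate(durations, min_ratio, max_idle_fraction, sizes, overhead=0.0, chop=True):
    """Return (time_to_solution, utilization); expects some task with positive length."""
    if min_ratio < 1 or not 0 < max_idle_fraction <= 1:
        raise ValueError("expects min_ratio >= 1 and 0 < max_idle_fraction <= 1")
    pending = list(durations)          # unfinished tasks, in scheduling order
    useful = sum(durations)            # work that counts toward utilization
    now = allocated = 0.0
    while pending:
        # Largest permitted size that keeps tasks/worker >= min_ratio.
        fitting = [s for s in sizes if s <= len(pending) / min_ratio]
        workers = max(fitting) if fitting else min(sizes)
        now += overhead                # assumed constant start/stop cost
        start = now
        queue = pending[::-1]          # queue.pop() yields the next task in order
        running = []                   # min-heap of (finish_time, duration)
        while queue and len(running) < workers:
            d = queue.pop()
            heapq.heappush(running, (now + d, d))
        while running:
            now = running[0][0]        # jump to the next completion
            while running and running[0][0] <= now:
                heapq.heappop(running)
                if queue:
                    d = queue.pop()
                    heapq.heappush(running, (now + d, d))
            idle = workers - len(running)
            # Chop once the queue is empty and enough workers sit idle.
            if (chop and running and not queue and workers > min(sizes)
                    and idle / workers >= max_idle_fraction):
                break
        allocated += workers * (now - start)
        pending = [d for _, d in running]   # cancelled tasks are rerun in full
    return now, useful / allocated

tasks = [10, 1, 1, 1, 1, 1, 1, 1, 1]   # made-up workload with one long task
no_chop = simulate(tasks, 2, 0.5, sizes=[1, 2, 4], chop=False)
with_chop = simulate(tasks, 2, 0.5, sizes=[1, 2, 4], chop=True)
print(no_chop, with_chop)
```

On this toy workload, chopping raises utilization from 0.45 to about 0.82, while time to solution grows from 10 to 13 because the long task is cancelled after 3 time units and rerun on a single worker.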

Simulation - results

● Tail-chopping improved utilization for many sets of parameters
● Increased time to solution (as expected)

Simulation - results

● Skewness of the distribution is crucial for assessing the tail-chopping method
● No single value of the fraction-idle parameter is better or worse
  ● 0.8: aiming for a quick solution
● No further benefit on sorted tasks

Experiment

● With and without tail-chopping
● 15,000 tasks
● Task/worker = 5
● Chopping when 50% of workers are idle

Results

[Figure, two panels: "Without tail-chopping" and "With tail-chopping when 50% of workers are idle"]

Discussion

Major problems:

● Time spent waiting for a new allocation
● Cancelling tasks upon reallocation
● Current heuristics are not sophisticated

Discussion

Time spent waiting for a new allocation

● A smaller allocation of processors alongside the current one
● Ability to downsize the current allocation

Discussion

Cancelling tasks upon reallocation

● Ability to migrate tasks

Discussion

Current heuristics are not sophisticated

● Heuristics using more information

Conclusion

● Described the trailing task problem
● Tail-chopping as a promising way to address the problem
● Several directions for further research
