Detecting Bottlenecks in Parallel DAG-based Data Flow Programs
Björn Lohrmann, Dominic Battré, Matthias Hovestadt, Alexander Stanik, Daniel Warneke
Email: {firstname}.{lastname}@tu-berlin.de
Complex and Distributed IT-Systems, Technische Universität Berlin
Why use clouds for data processing?
■ Fast and unlimited** scale-out
■ Pricing model
  ♦ Pay-as-you-go
  ♦ 10 nodes for 1 day = 1 node for 10 days
■ No long-term obligations
**almost
15.11.2010
Björn Lohrmann- Detecting Bottlenecks in Parallel Dag-based Data Flow Programs
Introduction (2)
Frameworks are required for effective use of clouds
[Figure: frameworks (Eucalyptus, Hadoop, Nephele, etc.) and their responsibilities: job modelling, parallelization, VM management, job monitoring, job scheduling, job deployment; the diagram also contains a "?" marker]
Prerequisites
● Jobs modelled as directed acyclic graphs
  ■ Vertices are tasks
  ■ Edges are communication channels
● Each task has 1..n parallel task instances
● Unidirectional and blocking communication
[Figure: example job DAG with Task 1 to Task 4]
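The job model above can be sketched in a few lines of Python. This is an illustration only, not Nephele's API: the class `JobGraph`, its method names, and the edges of the example DAG are assumptions (the edges of the original figure are not recoverable from the slide text).

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class JobGraph:
    """A job as a DAG: vertices are tasks, edges are communication channels."""
    instances: dict = field(default_factory=dict)  # task name -> parallel instances
    channels: list = field(default_factory=list)   # (producer, consumer) pairs

    def add_task(self, name, instances=1):
        if instances < 1:
            raise ValueError("each task needs at least one instance")
        self.instances[name] = instances

    def add_channel(self, producer, consumer):
        # Channels are unidirectional: data flows from producer to consumer only.
        if producer not in self.instances or consumer not in self.instances:
            raise KeyError("both endpoints must be known tasks")
        self.channels.append((producer, consumer))

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
        indegree = {task: 0 for task in self.instances}
        successors = defaultdict(list)
        for producer, consumer in self.channels:
            successors[producer].append(consumer)
            indegree[consumer] += 1
        ready = deque(task for task, degree in indegree.items() if degree == 0)
        order = []
        while ready:
            task = ready.popleft()
            order.append(task)
            for nxt in successors[task]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
        if len(order) != len(self.instances):
            raise ValueError("job graph is not acyclic")
        return order


# Hypothetical diamond-shaped job; the edges are assumed for illustration.
job = JobGraph()
for name in ("Task 1", "Task 2", "Task 3", "Task 4"):
    job.add_task(name)
job.add_channel("Task 1", "Task 2")
job.add_channel("Task 1", "Task 3")
job.add_channel("Task 2", "Task 4")
job.add_channel("Task 3", "Task 4")
order = job.topological_order()
```

The acyclicity check matters because the later analysis relies on a topological order existing.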
Overview
Key question of this talk:
● Given a DAG-shaped job, how many task instances should I assign to each task?
Our approach:
● Begin with 1 instance for each task
● Iteratively detect bottlenecks and add instances where necessary
[Figure: example job DAG with Task 1 to Task 5; Task 5 is shown twice]
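The iterative approach can be sketched as a simple loop. Everything here is a hedged illustration: `tune_parallelism`, the toy detector, the per-instance rates, and the choice to add exactly one instance per round are assumptions, and each call to the detector stands in for one monitored job run.

```python
def tune_parallelism(tasks, find_bottleneck, max_rounds=50):
    """Start with one instance per task; add one instance per detected bottleneck."""
    instances = {task: 1 for task in tasks}
    for _ in range(max_rounds):
        bottleneck = find_bottleneck(instances)  # stands in for one monitored run
        if bottleneck is None:
            break  # no bottleneck reported: stop scaling out
        instances[bottleneck] += 1
    return instances


# Toy stand-in for a monitored run, with purely hypothetical numbers:
# a linear pipeline where each instance processes a fixed number of records/s.
PIPELINE = ["Task 1", "Task 2", "Task 3"]
RATE_PER_INSTANCE = {"Task 1": 100, "Task 2": 25, "Task 3": 50}
INPUT_RATE = 100  # records/s the job should sustain


def toy_find_bottleneck(instances):
    # Check sinks first, mirroring a reverse topological traversal.
    for task in reversed(PIPELINE):
        if RATE_PER_INSTANCE[task] * instances[task] < INPUT_RATE:
            return task
    return None


result = tune_parallelism(PIPELINE, toy_find_bottleneck)
print(result)
```

With these toy numbers the loop ends at 1 instance for Task 1, 4 for Task 2, and 2 for Task 3.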
Low resource utilization
Time and money wasted
[Figure: Task 1 and Task 2]
Bottlenecks
Types:
● CPU
  ■ Enough input available
  ■ Throughput limited by CPU
  ■ Lack of input for subsequent tasks
● I/O
  ■ Transport infrastructure is overloaded (NICs, switches, etc.)
  ■ Forces tasks to wait
[Figure: two task chains (Task 1, Task 2, Task 3 and Task 1, Task 2), each task with a CPU indicator]
Bottleneck Detection
● Monitor job at runtime:
  ■ Continuously measure CPU load and I/O wait on task instances
  ■ Aggregate to task statistics
● Continuously analyze task statistics:
  ■ Traverse task nodes in reverse topological order and check for CPU bottlenecks
  ■ If none found, traverse edges in reverse topological order and check for I/O bottlenecks
  ■ If bottleneck found: Report it!
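The two-pass analysis can be sketched as follows. This is a minimal illustration, not the Nephele implementation: the function name `detect_bottleneck`, the predicate parameters, returning only the first bottleneck found, and ordering edges by their consumer's position are all assumptions. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def detect_bottleneck(tasks, channels, is_cpu_bottleneck, is_io_bottleneck):
    """Return ("CPU", task), ("I/O", (producer, consumer)), or None."""
    predecessors = {task: set() for task in tasks}
    for producer, consumer in channels:
        predecessors[consumer].add(producer)
    topo = list(TopologicalSorter(predecessors).static_order())
    position = {task: index for index, task in enumerate(topo)}

    # Pass 1: task nodes in reverse topological order (sinks first).
    for task in reversed(topo):
        if is_cpu_bottleneck(task):
            return ("CPU", task)

    # Pass 2, only reached if no CPU bottleneck was found: edges, those
    # closest to the sinks first. Ordering edges by the position of their
    # consumer (then producer) is an assumption of this sketch.
    by_sink_distance = sorted(
        channels,
        key=lambda edge: (position[edge[1]], position[edge[0]]),
        reverse=True,
    )
    for edge in by_sink_distance:
        if is_io_bottleneck(edge):
            return ("I/O", edge)
    return None


# Hypothetical task statistics for a linear job Task 1 -> Task 2 -> Task 3.
tasks = ["Task 1", "Task 2", "Task 3"]
channels = [("Task 1", "Task 2"), ("Task 2", "Task 3")]
cpu_load = {"Task 1": 0.40, "Task 2": 0.95, "Task 3": 0.20}
send_wait = {("Task 1", "Task 2"): 0.93, ("Task 2", "Task 3"): 0.05}

cpu_report = detect_bottleneck(
    tasks, channels,
    lambda task: cpu_load[task] >= 0.9,
    lambda edge: send_wait[edge] >= 0.9,
)
io_report = detect_bottleneck(
    tasks, channels,
    lambda task: False,  # pretend no task is CPU-bound
    lambda edge: send_wait[edge] >= 0.9,
)
```

In the first call the CPU pass already reports Task 2, so the overloaded channel is not examined until that bottleneck is gone, as the second call shows.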
Implementation
● Based on Nephele framework
  ■ Java framework
  ■ 1 master, n workers
  ■ Task instance = Java thread
● Analysis of thread state statistics:
  ■ Threshold for CPU bottleneck:
    ♦ USR + SYS + BLK >= 90% of time
  ■ Threshold for I/O bottleneck:
    ♦ WAIT caused by sending on channel >= 90% of time
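The two threshold checks can be written down directly. In this sketch the `ThreadSample` record, its field names, and aggregating instances by arithmetic mean are assumptions; only the 90% thresholds and the state names come from the slide.

```python
from dataclasses import dataclass
from statistics import mean

THRESHOLD = 0.90  # 90% of the measured time, as on the slide


@dataclass(frozen=True)
class ThreadSample:
    """Fractions of measured time one task instance (thread) spent per state."""
    usr: float        # USR: user-mode CPU time
    sys: float        # SYS: kernel-mode CPU time
    blk: float        # BLK: blocked
    send_wait: float  # WAIT caused by sending on a channel


def is_cpu_bottleneck(samples):
    # Aggregating instances by their mean is an assumption of this sketch.
    return mean(s.usr + s.sys + s.blk for s in samples) >= THRESHOLD


def is_io_bottleneck(samples):
    return mean(s.send_wait for s in samples) >= THRESHOLD


# Hypothetical measurements for two tasks.
busy_task = [ThreadSample(0.80, 0.10, 0.05, 0.02),
             ThreadSample(0.75, 0.12, 0.06, 0.03)]
stalled_sender = [ThreadSample(0.03, 0.02, 0.00, 0.95)]

busy_is_cpu_bound = is_cpu_bottleneck(busy_task)
sender_is_io_bound = is_io_bottleneck(stalled_sender)
```

Here `busy_task` averages 94% in USR + SYS + BLK and counts as a CPU bottleneck, while `stalled_sender` spends 95% of its time waiting to send and counts as an I/O bottleneck.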
Evaluation
Demo Job
[Figure: demo job DAG with the tasks File Reader, OCR, PDF Creator, PDF Writer, Inverted Index, and Index Writer]
Setup:
● Private compute cloud
● Hosts with two Intel Xeon 2.66 GHz CPUs, 32 GB RAM, and Gigabit Ethernet
● KVM guests with one virtual CPU and 2 GB RAM
● Eucalyptus framework for VM allocation/deallocation
Evaluation (2)
Phase 1: Fine tuning
Evaluation (3)
Phase 2: Scale-out
Conclusion
● Bottleneck detection is useful to scale out jobs in the cloud while maintaining high resource utilization
● We presented a simple approach to gather and analyze relevant statistics
● Right now, manual adaptation and job re-runs are necessary to eliminate bottlenecks
● Future work:
  ■ Dynamically and automatically adjust parallelization at runtime