Detecting Bottlenecks in Parallel DAG-based Data Flow Programs

Björn Lohrmann, Dominic Battré, Matthias Hovestadt, Alexander Stanik, Daniel Warneke
Email: {firstname}.{lastname}@tu-berlin.de
Complex and Distributed IT-Systems
Technische Universität Berlin

Introduction (1)
IaaS clouds offer virtual machines on demand.

Why use clouds for data processing?
■ Fast and unlimited** scale-out
■ Pricing model
  ♦ Pay-as-you-go
  ♦ 10 nodes for 1 day = 1 node for 10 days
■ No long-term obligations

** almost

15.11.2010

Björn Lohrmann - Detecting Bottlenecks in Parallel DAG-based Data Flow Programs

Introduction (2)
Frameworks are required for effective use of clouds.

[Figure: Eucalyptus, Hadoop, Nephele, etc., surrounded by the labels Job Modelling, Parallelization, VM Management, Job Monitoring, Job Scheduling, Job Deployment, and a "?"]


Prerequisites
● Jobs modelled as directed acyclic graphs (DAGs)
  ■ Vertices are tasks
  ■ Edges are communication channels
● Each task has 1..n parallel task instances
● Unidirectional and blocking communication

[Figure: example job DAG with Task 1, Task 2, Task 3, and Task 4]
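The job model above can be sketched as a small data structure. This is a minimal illustration, not Nephele's actual API: the class `JobGraph`, its method names, and the use of Python's standard `graphlib` module are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class JobGraph:
    """A job DAG: vertices are tasks, edges are unidirectional channels."""

    # Number of parallel instances per task (each task has 1..n instances).
    instances: dict[str, int] = field(default_factory=dict)
    # Channels stored as (sender, receiver) pairs.
    channels: list[tuple[str, str]] = field(default_factory=list)

    def add_task(self, name: str, instances: int = 1) -> None:
        if instances < 1:
            raise ValueError("a task needs at least one instance")
        self.instances[name] = instances

    def add_channel(self, sender: str, receiver: str) -> None:
        if sender not in self.instances or receiver not in self.instances:
            raise KeyError("both endpoints must be known tasks")
        self.channels.append((sender, receiver))

    def topological_order(self) -> list[str]:
        """Return the tasks so that every sender precedes its receivers.

        Raises graphlib.CycleError if the channels do not form a DAG.
        """
        sorter = TopologicalSorter({name: set() for name in self.instances})
        for sender, receiver in self.channels:
            sorter.add(receiver, sender)  # the receiver depends on the sender
        return list(sorter.static_order())
```

Because `topological_order` fails on cyclic input, the acyclicity prerequisite is checked as a side effect of ordering the tasks.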


Overview
Key question of this talk:
● Given a DAG-shaped job, how many task instances should I assign to each task?

Our approach:
● Begin with 1 instance for each task
● Iteratively detect bottlenecks and add instances where necessary

[Figure: example job DAG with Task 1, Task 2, Task 3, Task 4, and Task 5 (shown twice)]
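The iterative approach can be sketched as a simple loop. The sketch makes several assumptions that the slides do not state: exactly one instance is added per round, the detector is passed in as a function, and `toy_detector` with its per-task work figures is a purely hypothetical stand-in for running the job and analysing it.

```python
from typing import Callable, Optional

# A detector looks at the current instance counts and returns the name of the
# bottleneck task, or None if it finds none. In practice each call would
# correspond to a job run plus analysis.
Detector = Callable[[dict[str, int]], Optional[str]]


def tune_parallelism(tasks: list[str], detect: Detector,
                     max_rounds: int = 20) -> dict[str, int]:
    """Start with one instance per task and grow wherever a bottleneck shows."""
    instances = {task: 1 for task in tasks}
    for _ in range(max_rounds):
        bottleneck = detect(instances)
        if bottleneck is None:
            break  # no bottleneck left, keep the current assignment
        instances[bottleneck] += 1  # assumption: add one instance per round
    return instances


def toy_detector(instances: dict[str, int]) -> Optional[str]:
    """Hypothetical detector: fixed work per task, split evenly over instances.

    The task with the highest work per instance counts as the bottleneck
    as long as that value exceeds 1.0.
    """
    work = {"Task 1": 1.0, "Task 2": 4.0, "Task 3": 2.0}  # made-up numbers
    load = {task: work[task] / n for task, n in instances.items()}
    worst = max(load, key=load.get)
    return worst if load[worst] > 1.0 else None
```

With the made-up work figures, `tune_parallelism(["Task 1", "Task 2", "Task 3"], toy_detector)` ends with 1, 4, and 2 instances respectively. The `max_rounds` cap is a safety net for detectors that never report a bottleneck-free state.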


Bottlenecks
Negative effects of bottlenecks:
■ Input starvation
■ Output blockage

Low throughput of workflow
Low resource utilization
Time and money wasted

[Figure: example job DAG with Task 1, Task 2, Task 3, Task 4, and Task 5 (shown twice)]


Bottlenecks
Types:
● CPU
  ■ Enough input available
  ■ Throughput limited by CPU
  ■ Lack of input for subsequent tasks
● I/O
  ■ Transport infrastructure is overloaded (NICs, switches, etc.)
  ■ Forces tasks to wait

[Figure: Task 1, Task 2, and Task 3, each annotated with a CPU indicator]


Bottleneck Detection
● Monitor job at runtime:
  ■ Continuously measure CPU load and I/O wait on task instances
  ■ Aggregate to task statistics
● Continuously analyze task statistics:
  ■ Traverse task nodes in reverse topological order and check for CPU bottlenecks
  ■ If none found, traverse edges in reverse topological order and check for I/O bottlenecks
  ■ If bottleneck found: report it!
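The two-pass analysis can be sketched as follows. Several points are assumptions, since the slides leave them open: the function name `find_bottleneck` and its predicate parameters are hypothetical, the first bottleneck found is the one reported, and "reverse topological order" for edges is read as ordering each edge by the position of its receiving task, then of its sending task.

```python
from graphlib import TopologicalSorter
from typing import Callable, Optional

Channel = tuple[str, str]  # (sender, receiver)


def find_bottleneck(
    tasks: list[str],
    channels: list[Channel],
    is_cpu_bottleneck: Callable[[str], bool],
    is_io_bottleneck: Callable[[Channel], bool],
) -> Optional[tuple[str, object]]:
    """Return ("cpu", task), ("io", channel), or None if nothing is found."""
    sorter = TopologicalSorter({task: set() for task in tasks})
    for sender, receiver in channels:
        sorter.add(receiver, sender)  # the receiver depends on the sender
    reverse_order = list(sorter.static_order())[::-1]  # sinks first

    # Pass 1: check the tasks for CPU bottlenecks, starting at the sinks.
    for task in reverse_order:
        if is_cpu_bottleneck(task):
            return ("cpu", task)

    # Pass 2: reached only if no CPU bottleneck exists. Edges are visited
    # in the order of their receiver, then their sender (an assumption).
    position = {task: i for i, task in enumerate(reverse_order)}
    for channel in sorted(channels,
                          key=lambda c: (position[c[1]], position[c[0]])):
        if is_io_bottleneck(channel):
            return ("io", channel)
    return None
```

Passing the two checks in as functions keeps the traversal independent of how CPU load and I/O wait are actually measured.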


Implementation
● Based on Nephele framework
  ■ Java framework
  ■ 1 master, n workers
  ■ Task instance = Java thread
● Analysis of thread state statistics:
  ■ Threshold for CPU bottleneck:
    ♦ USR + SYS + BLK >= 90% time
  ■ Threshold for I/O bottleneck:
    ♦ WAIT caused by sending on channel >= 90% time
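The two thresholds can be sketched as simple predicates over thread state statistics. The `ThreadStats` type, its field names, and the representation of time shares as fractions between 0 and 1 are assumptions made for illustration. The slides also do not say how instance statistics are aggregated into task statistics, so the arithmetic mean used in `aggregate` is an assumption as well. The sketch is written in Python for brevity and does not reflect Nephele's Java code.

```python
from dataclasses import dataclass

THRESHOLD = 0.90  # the ">= 90% time" limit from the slide


@dataclass
class ThreadStats:
    """Share of the observed interval that a thread spent in each state."""

    usr: float  # user-mode CPU time
    sys: float  # system-mode CPU time
    blk: float  # time spent blocked
    wait_send: float = 0.0  # WAIT time caused by sending on a channel


def is_cpu_bottleneck(stats: ThreadStats) -> bool:
    """CPU bottleneck: USR + SYS + BLK account for at least 90% of the time."""
    return stats.usr + stats.sys + stats.blk >= THRESHOLD


def is_io_bottleneck(stats: ThreadStats) -> bool:
    """I/O bottleneck: waiting on a channel send takes at least 90% of the time."""
    return stats.wait_send >= THRESHOLD


def aggregate(instances: list[ThreadStats]) -> ThreadStats:
    """Combine instance statistics into task statistics (assumed: the mean)."""
    n = len(instances)
    return ThreadStats(
        usr=sum(s.usr for s in instances) / n,
        sys=sum(s.sys for s in instances) / n,
        blk=sum(s.blk for s in instances) / n,
        wait_send=sum(s.wait_send for s in instances) / n,
    )
```

For example, a task whose threads average 70% USR, 20% SYS, and 5% BLK would be flagged as a CPU bottleneck, while one that spends 95% of its time waiting on a send would be flagged as an I/O bottleneck.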


Evaluation
Demo job:
[Figure: job DAG with the tasks File Reader, OCR, PDF Creator, Inverted Index, PDF Writer, and Index Writer]

Setup:
● Private compute cloud
● Hosts with two Intel Xeon 2.66 GHz CPUs, 32 GB RAM, and 1 Gbit Ethernet
● KVM guests with one virtual CPU and 2 GB RAM
● Eucalyptus framework for VM allocation/deallocation


Evaluation (2)
Phase 1: Fine tuning


Evaluation (3)
Phase 2: Scale-out


Conclusion
● Bottleneck detection is useful to scale out jobs in the cloud while maintaining high resource utilization
● We presented a simple approach to gather and analyze relevant statistics
● Right now, manual adaptation and job re-runs are necessary to eliminate bottlenecks
● Future work:
  ■ Dynamically and automatically adjust parallelization at runtime
