Resource Management and Scheduling for the BioOpera Process Support System

Politecnico di Milano
Facoltà di Ingegneria di Como
Corso di Laurea in Ingegneria Informatica

Resource Management and Scheduling for the BioOpera Process Support System

Advisor (Relatore): Prof. Alfonso Fuggetta
Co-advisors (Correlatori): Prof. Gustavo Alonso, Ing. Win Bausch (ETH Zurich, Switzerland)
Thesis by (Tesi di laurea di): Cesare Pautasso, Matr. 622457

Academic Year (Anno Accademico) 1999-2000


Acknowledgments

I worked on this thesis in the framework of an exchange program (bilateral agreements) between Politecnico di Milano and the Swiss Federal Institute of Technology Zurich.

First of all I would like to thank Prof. Alfonso Fuggetta for his recommendation, which helped me to get accepted at the Swiss Federal Institute of Technology, for being very flexible in arranging our meetings, and for his overall assistance. I would also like to express my sincere gratitude to Prof. Gustavo Alonso for his willingness to accept me as a foreign "Diplomarbeiter" into his friendly research group, a most stimulating working environment, and for agreeing on such an interesting research topic.

A special thanks goes, of course, to my supervising assistant Win Bausch, with whom I shared an office during this past half year, and who forced me to learn to write in LaTeX. Many of the ideas in this thesis come from our conversations.

I am also grateful to Ms. Vittoria Capriccioli, at the CRIFIC office, for her patience and support with my study abroad paperwork. Most of the information on bioinformatics comes from Mike Hallett, Ph.D., one of the leading experts in this field, whose Computational Biology lectures I have had the opportunity of attending.

Although this thesis has been read by several reviewers, all the remaining errors are solely my own responsibility.

Zurich, March 2000


Contents

Acknowledgments . . . iii
Estratto . . . ix

1 Introduction . . . 1
1.1 Context . . . 1
1.2 Objectives . . . 3
1.3 Document structure . . . 4

2 Basic concepts . . . 7
2.1 Resource Management and Scheduling Services . . . 8
2.2 A Taxonomy of Schedulers . . . 9
2.3 Scheduling components . . . 10
2.4 Job Classification . . . 10
2.5 Load indices . . . 11
2.6 General issues . . . 12

3 BioOpera . . . 15
3.1 The Opera project . . . 15
3.1.1 Process Modeling . . . 16
3.1.2 Conceptual representation . . . 17
3.1.3 Textual representation . . . 22
3.1.4 Advanced OCR Features . . . 25
3.1.5 System Architecture . . . 28
3.2 The BioOpera project . . . 31
3.2.1 Features . . . 32
3.2.2 Tailoring Opera for bioinformatics . . . 33

4 Resource management . . . 35
4.1 Resource description . . . 35
4.1.1 Resource description language . . . 36
4.2 Dynamic resource information . . . 37
4.2.1 State of a resource . . . 37
4.2.2 Collection mechanisms . . . 39
4.3 Adaptive Load Sampling . . . 42
4.3.1 Requirements . . . 42
4.3.2 Load sampling strategies . . . 43
4.3.3 Modeling the strategies . . . 47
4.3.4 Modeling the comparison . . . 51

5 Scheduling mechanisms . . . 55
5.1 Job Management . . . 55
5.1.1 Jobs and Job Lists . . . 55
5.1.2 Priority scheduling . . . 56
5.1.3 Producer and Consumer . . . 57
5.2 Placement mechanisms . . . 58
5.2.1 Placing a job . . . 58
5.2.2 Dispatching a job . . . 59
5.3 Addressing load balancing issues . . . 60

6 Scheduling policies . . . 61
6.1 Overview . . . 61
6.2 Selection policy . . . 62
6.2.1 Waiting job selection policies . . . 62
6.3 Transfer policy . . . 63
6.3.1 Overload conditions . . . 63
6.3.2 Availability condition . . . 66
6.4 Placement policy . . . 67

7 System integration . . . 71
7.1 OCR Extensions . . . 71
7.1.1 Resource description . . . 71
7.1.2 Scheduler hints . . . 72
7.1.3 Task priority . . . 73
7.1.4 Waiting state . . . 74
7.2 Architectural changes . . . 75
7.2.1 Starting a task . . . 75
7.2.2 Program Execution Client . . . 77
7.2.3 History space extensions . . . 78
7.3 Support tools . . . 78
7.3.1 Log Analyzer . . . 78
7.3.2 Monitoring tool . . . 79

8 Experiments . . . 81
8.1 Performance evaluation . . . 81
8.1.1 Scheduler alternatives . . . 81
8.1.2 Metrics . . . 82
8.1.3 Workload . . . 84
8.1.4 Results . . . 85
8.1.5 Discussion . . . 89
8.2 Adaptive Load Sampling Simulations . . . 91
8.2.1 Alternatives . . . 91
8.2.2 Experiment and Simulation . . . 91
8.2.3 Simulation Results . . . 93
8.2.4 Comparison Results . . . 99
8.3 Running the All vs All . . . 104
8.3.1 Process description . . . 104
8.3.2 The experiment . . . 107
8.3.3 Discussion . . . 111

9 Conclusion . . . 113
9.1 Summary . . . 113
9.2 Future Work . . . 114

Bibliography . . . 117
Index . . . 121

Estratto

Scheduling and resource management for the BioOpera process support system (translated from the Italian)

This thesis brings together topics from different research areas: on the one hand, process support systems applied to the field of bioinformatics; on the other, computer clusters. Clusters of computers are today an increasingly common alternative to the use of expensive parallel supercomputers for scientific computing applications. An important part of these systems is the software that allows the computers participating in the cluster to share and balance the workload among themselves (cf. Chapter 2 on page 7).

Process support systems can be regarded as metaprogramming tools, used to describe, analyze and coordinate the execution of processes (metaprograms) in a distributed environment. A process consists of a collection of program executions (activities), and of the related data transfers, over a network of heterogeneous computers. BioOpera (cf. Chapter 3 on page 15) is one such process support system, applied to bioinformatics. In particular, processes in BioOpera represent complex bioinformatics algorithms that analyze large biological data sets, such as DNA and amino acid sequences.

Once integrated into the BioOpera system, the resource management and scheduling components designed in this thesis have made it possible to execute processes efficiently and reliably, distributing the workload within a cluster of computers.

In more detail, the resource management component (cf. Chapter 4 on page 35) holds the description of the computational resources that constitute the execution environment of the processes, and is responsible for keeping up to date the information on the availability and utilization state of those resources. This information is used by the scheduler to choose in an optimal way the computer on which to start the execution of the activities. By means of a suitable adaptive sampling strategy for the utilization state, a compromise was sought between two conflicting objectives: minimizing the number of samples (utilization measurements to be sent to the resource manager) and minimizing the discrepancy between the information held by the resource manager and the actual state of the resource. The optimal strategy was chosen with a multi-objective optimization method [20] based on weighted sums (cf. Section 4.3.4 on page 51), starting from results obtained through simulations (cf. Section 8.2 on page 91).

The scheduling component (cf. Chapter 5 on page 55), in turn, has the task of finding a computer on which to execute the activities as soon as they are ready to run, that is, when all the preceding activities in the control flow have completed. If no satisfactory computer can be found that respects the constraints associated with an activity, the activity is placed on a waiting list for later attempts, for example when new computers become available or other activities finish their execution. If a computer crashes (or for any reason is no longer available), all the activities running on it are taken back by the scheduler, which restarts them on other computers of the cluster.

The decisions taken by the scheduling component can be divided into several policies (cf. Chapter 6 on page 61): information, selection, transfer and placement. An information policy determines the kind of information available to the scheduler and how it is kept up to date; in our case this task has been assigned to the resource management component. A selection policy specifies the order in which the waiting activities are processed by the scheduler: the ordering may depend on the arrival order, on user-assigned priorities, to provide different levels of service to different users, or on the size of the activities, to run short activities before longer ones and thereby reduce their response time. A transfer policy defines under which conditions a computer is considered overloaded and, consequently, must be ignored by the scheduler; for example, one of the definitions used compares a threshold value against the measured load index, and if the index exceeds the threshold the computer is considered overloaded. A placement policy prescribes how to decide to which computers the activities are assigned; several algorithms are available, ranging from a random choice to selecting the least loaded computer.

A number of experiments were carried out to compare the performance of the different policies (cf. Section 8.1 on page 81). The results obtained indicate that the average process execution time can vary by 40% around the mean depending on the policy used.

The BioOpera system has been used successfully to run a first bioinformatics application (cf. Section 8.3 on page 104). The purpose of the application is to compare each of the 80,000 amino acid sequences contained in a database against all the others. Since each comparison operation is independent of the others, the algorithm and the corresponding process can exploit this parallelism by executing all the comparison activities concurrently. The scheduling component was essential to distribute the activities over the 38 processors of the computers of a cluster shared with other users. The system ran this process for more than a month, and during this period the number of maintenance interventions was negligible.

In conclusion (cf. Section 9.2 on page 114), the resource management and scheduling components that are the subject of this thesis enable the BioOpera system to distribute the execution of activities within a cluster of computers. When the computers are busy executing activities (or other users' programs), the scheduler avoids overloading them. As new computers become available, if there are waiting activities the scheduler tries to assign them to these new computers. The system, however, has no way of balancing the workload so that all the computers of the cluster are utilized at the same level: this further functionality would require complex mechanisms for migrating running programs from one computer to another.
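The weighted-sum selection of a sampling strategy described above can be sketched in a few lines. This is a hypothetical illustration, not code or data from the thesis: the strategy names, weights, and objective values are invented. Each strategy is summarized by its two simulated objectives (samples sent, mean error), and the strategy minimizing the weighted combination is picked.

```python
# Hypothetical weighted-sum multi-objective selection: each sampling
# strategy has two objectives measured in simulation, and we pick the
# strategy that minimizes their weighted combination.  All numbers below
# are invented for illustration.

def weighted_sum_choice(strategies, w_samples, w_error):
    """Return the strategy minimizing w_samples*samples + w_error*error."""
    return min(strategies,
               key=lambda s: w_samples * s["samples"] + w_error * s["error"])

strategies = [
    {"name": "fixed-1s",  "samples": 3600, "error": 0.02},  # sample every second
    {"name": "fixed-60s", "samples": 60,   "error": 0.15},  # sample every minute
    {"name": "adaptive",  "samples": 300,  "error": 0.04},  # adapt rate to load changes
]

best = weighted_sum_choice(strategies, w_samples=0.001, w_error=10.0)
print(best["name"])
```

With these (invented) weights, the adaptive strategy wins because it keeps the error low at a fraction of the sampling cost; shifting all the weight onto the sampling cost would make the coarse fixed-interval strategy win instead.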

Glossary

A short glossary of the English terms used above follows, with the Italian translations employed in the original abstract.

Activity (Attività): Elementary step of a process: the execution of a program.
Adaptive Load Sampling (Campionamento adattativo): Adaptive sampling of the utilization state of the resources.
First idle (Primo inattivo): Placement policy that chooses the first idle computer.
Job (Compito): An activity ready for execution, while it is being processed by the scheduler.
Load index (Indice di utilizzo): Utilization index of a resource.
Minimum load (Carico minimo): Placement policy that chooses the computer with the lowest load index.
Overhead: Cost due to administrative activities.
Overload (Sovraccarico).
Placement policy (Politica di allocazione).
Process (Processo): A set of tasks and of data exchanges among them; the tasks are executed respecting the control-flow dependencies.
Process enactment (Esecuzione di un processo).
Process Support Systems (Sistemi di supporto ai processi): Used to define, execute and analyze processes.
Resource Management (Gestione risorse).
Scheduling: Assignment of the activities that are ready for execution to the available computers that satisfy the constraints associated with the activities themselves.
Selection policy (Politica di selezione).
Task: A component of a process; it may be an activity or a subprocess.
Threshold (Soglia).
Transfer policy (Politica di trasferimento).

1 Introduction

1.1 Context

This thesis brings together topics from different research areas. On the one side there are process support systems, which provide middleware functionality to coordinate the execution of processes in a distributed environment. In our case these processes model complex scientific computations from the bioinformatics application domain. On the other side there are cluster computing environments, which strive to emulate parallel supercomputers with commodity computers linked by high-performance networks.

Process Support Systems

The notion of process may be used in a variety of application areas to model complex sequences of computer program executions and data exchanges controlled by a metaprogram (the process) [3]. A Process Support System provides the tools and mechanisms necessary to define, analyze and execute processes. Chapter 3 gives an overview of the BioOpera and Opera process support systems, into which the software designed during this thesis has been integrated.

Cluster computing

There are three ways to do anything faster [48]: work harder, work smarter, and get help. In terms of computing technologies, working harder means using faster hardware, working smarter concerns doing things more efficiently with improved algorithms, and getting help refers to using multiple computers working on the same task.

There are physical and economical limits against making the hardware of sequential processors arbitrarily fast. The alternative solution is to connect multiple processors


together and coordinate their computational efforts. The resulting systems are parallel computers, and they allow the sharing of a computational task among multiple processors.

Clusters of computers (also known as Networks of Workstations, NOW; Clusters of Workstations, COW; Piles of PCs; or Workstation Clusters) are a cost-effective kind of parallel computer, and are increasingly being used as an alternative to "massively parallel supercomputers". Cluster computing research attempts to build such systems by adding appropriate software "glue" on top of existing and affordable networked computers. Some projects, such as the Berkeley NOW project, attempt to directly challenge high-performance supercomputers [7, 8], while others, such as the Condor project, insist on harvesting (or scavenging) idle cycles from workstations [44] to build high-throughput systems. In this case the emphasis is on extracting as much processing time as possible from the available resources over long periods of time.

One of the key features of these systems is load balancing (see Chapter 2 on page 7 for the detailed definition and references to the literature). Without load balancing it would not be possible to efficiently use all the computers of the cluster as they become available. Resource management functionality is also needed, since the system should be aware of what resources are available, and monitor their state to prevent overloads and react to failures. Furthermore, the local operating system's process scheduling needs to be integrated with a global scheduler, which decides on which computer programs should be executed, and whether it would be convenient to move processes from a highly loaded computer to a less loaded one.

Bioinformatics

Bioinformatics is a fast-growing new field where bioscience meets information technology [19]. In particular, the complete genomes of different organisms have been established, and ongoing work promises to produce much more valuable data [13]. Such genetic data, in the form of DNA or amino acid sequences, has been collected in several heterogeneous data repositories located all over the world.

Unlike the early years of DNA analysis, when the main goal was to sequence genetic material, today's focus has shifted to the analysis of proteins. This can be done in a laboratory with traditional methods; however, the cost and the staggering amount of raw data have forced researchers to explore computer-based alternatives, under the assumption that "biologically meaningful results may be obtained from considering DNA as a one-dimensional character string, abstracting away the reality of DNA as a flexible three-dimensional molecule, interacting in a dynamic environment with protein

and RNA" [37]. The goal of these efforts, which still lies in the distant future, is to use software tools to predict the function of a protein from its amino acid sequence data [12]. Furthermore, the ultimate grand goals of bioinformatics research are to help in the design of safer and more effective drugs, to find new cures for diseases, and to improve the understanding of how evolution has slowly shaped and changed living organisms [46].

1.2 Objectives

The main objective of this work is to integrate scheduling and resource management functionality into an existing process support system in order to adapt it to the requirements of bioinformatics. The original problem statement may be quoted from [38, p. 15]:

"To speed up execution, the enactment service should execute programs in parallel whenever this is possible. Taking into account the dependencies implied by the process semantics, the resource requirements of the particular programs, and the actual load situation in the distributed system, the enactment service will act as a scheduler, trying to minimize execution times. This requires mechanisms for load detection as well as the deployment of appropriate load balancing policies."

More specifically, the problem may be subdivided into different parts:

1. Resource management. This is the problem of describing the available computational resources, so that information on their state of availability and utilization may be collected. This information is needed to make scheduling decisions.

2. Process model. The process modeling language of Opera (see Section 3.1.1 on page 16) needs to be extended to support scheduler hints, which describe task "sizes", and may put constraints on a task's execution location.

3. Scheduling. When all the control and data-flow dependencies of a task are satisfied, the task becomes ready to be started. It is at this point that the scheduler should act and decide the optimal location to execute the task. If such a location cannot be found, tasks may be queued for later attempts.

4. Priority scheduling. The basic scheduling mechanism may be complicated by including priorities for different tasks and processes. These are useful, for example, to run processes for different users which require a different "quality of service" (very important users' processes may be assigned a higher priority), or to execute short tasks before longer ones, to improve their response time.

5. Scheduling policies. As presented in Section 2.3 on page 10, there are many possibilities when building a scheduling system. For example, a placement policy defines how tasks are allocated to computers for execution; this could be done randomly, or by choosing the (currently) least loaded machine. The performance of different policies should be evaluated under different workload conditions (see Section 8.1 on page 81).
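The scheduling cycle implied by points 3 to 5 can be sketched as follows. This is a minimal, hypothetical illustration (all names, constraint tags, and load numbers are invented; it is not BioOpera's actual implementation): ready jobs are drawn from a priority queue, each job is placed on a non-overloaded host satisfying its constraints, and jobs with no suitable host are queued for a later attempt.

```python
# Minimal sketch of a priority-driven scheduling cycle (invented names,
# not BioOpera's actual code).
import heapq

def schedule(ready_jobs, hosts, threshold=0.8):
    """ready_jobs: list of (priority, job_name, needed_tag) tuples.
    hosts: dict host_name -> {"load": float, "tags": set of str}.
    Returns (placements, waiting)."""
    heapq.heapify(ready_jobs)                     # lowest priority value first
    placements, waiting = {}, []
    while ready_jobs:
        prio, job, needed_tag = heapq.heappop(ready_jobs)
        candidates = [
            (h["load"], name) for name, h in hosts.items()
            if h["load"] < threshold and needed_tag in h["tags"]
        ]
        if candidates:
            load, name = min(candidates)          # least loaded suitable host
            placements[job] = name
            hosts[name]["load"] += 0.5            # assumed cost of one job
        else:
            waiting.append(job)                   # retry when a host frees up
    return placements, waiting

hosts = {
    "alpha": {"load": 0.2, "tags": {"linux"}},
    "beta":  {"load": 0.9, "tags": {"linux", "bigmem"}},
}
placements, waiting = schedule([(0, "blast", "linux"), (1, "align", "bigmem")], hosts)
print(placements, waiting)
```

Here the "blast" job lands on the lightly loaded host, while the "align" job waits because the only host with the required tag is above the overload threshold.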

1.3 Document structure

The rest of the thesis is structured as follows.

Chapter 2 (Basic concepts) defines some basic concepts used throughout this thesis. A Resource Management and Scheduling system is used to efficiently distribute the execution of jobs among computers (Section 2.1 on page 8). A taxonomy of possible system configurations is presented in Section 2.2, and the components of a scheduling policy (information, transfer, selection, and placement policies) are defined in Section 2.3. In Section 2.4, scheduling systems are classified depending on the amount of information about the jobs, which is used to treat jobs differently, for instance depending on their "size". In Section 2.5 various definitions of a "load index" are discussed. Finally, in Section 2.6 we discuss some problems of Resource Management and Scheduling systems: keeping the load information up to date, instability, and administrative overhead. Care has been taken to minimize the impact of these problems on the system described in this thesis.

Chapter 3 (BioOpera) presents the BioOpera process support system, currently under research and development at the Swiss Federal Institute of Technology in Zurich. As the name suggests, the project is an attempt to tailor the previously existing Opera process support kernel to the bioinformatics application domain. In Section 3.1 the original Opera project is described, with the basic and advanced features (transactional properties, exception handling, and event signaling) of the process modeling language, and the basic system architecture.

Then in Section 3.2 the requirements for the new BioOpera system are presented. This system is the context into which this thesis' work on resource management and scheduling has been integrated.

Chapter 4 (Resource management) describes the features of the Resource Manager component. A resource is a model for a computer, workstation, or the like, which is part of the execution environment for processes' tasks. Section 4.1 shows how resources are statically described, Section 4.2 defines their dynamic state of availability and utilization, and Section 4.2.2 discusses how to collect this information. Finally, Section 4.3 (Adaptive Load Sampling) describes a mechanism to adaptively monitor the state of a resource. This scheme attempts a trade-off between minimizing the cost of performing measurements and sending update messages, and minimizing the error of the information known by the Resource Manager.

Chapter 5 (Scheduling mechanisms) describes the basic mechanisms employed by the Scheduler component. These include facilities to manage jobs (Section 5.1) and to place them (Section 5.2): while jobs are in transit through the scheduler, they need to be queued, scheduled, placed and dispatched for execution. This is the infrastructure upon which different scheduling policies may be built.

Chapter 6 (Scheduling policies) defines the various policies (see Section 2.3 for the definition) that may be employed by the Scheduler component. The selection policies are presented in Section 6.2, the transfer policies in Section 6.3, and the placement policies in Section 6.4. The information policy is considered part of the Resource Manager, and is therefore discussed entirely in Section 4.2.

Chapter 7 (System integration) discusses the integration of the Resource Manager and Scheduler components into the rest of the BioOpera system. This has required a number of extensions to the process model (Section 7.1). The changes to the system architecture are briefly described in Section 7.2. Furthermore, Section 7.3 lists the external support tools that have been developed.

Chapter 8 (Experiments) presents experimental results. In Section 8.1 the performance of different scheduling policies is evaluated. The results of the comparison between different load sampling strategies are discussed in Section 8.2. Finally, the story of the first real-world test of the BioOpera system is told in Section 8.3.

Chapter 9 (Conclusion) concludes the work and discusses possible extensions and improvements (Section 9.2).

2 Basic concepts

Scheduling and Load Balancing have been widely studied in the literature over the past years [50, 59]. Contributions range from local scheduling of processes on processors done by multitasking operating systems [9, 55], to scheduling of batch parallel jobs on high-performance supercomputers [26, 21], scheduling for real-time systems [52], and distributed scheduling for cluster computing [15, 16, 7, 11]. All of these approaches focus on integrating scheduling functionality into existing or new operating systems.

The problem of scheduling tasks in a process support system is related to these areas, but with a different perspective. The closest topics are cluster computing and distributed operating systems, since both approaches deal with remote execution of programs, resource management, load sharing and load balancing, and strive to emulate powerful parallel computers in a cost-effective way [7, 31]. The difference is in the abstraction level. A process support system is a high-level meta-programming environment, which coordinates the execution of tasks by networked computers using distributed systems mechanisms and techniques [38]. Distributed operating systems, on the other hand, hide the distribution of the resources from the user's programs, in order to provide performance, transparency and reliability benefits [35].

Load sharing vs. load balancing

The goal of a load-sharing algorithm is to maximize the rate at which a distributed system performs work when work is available [51]. To do so, unshared system states are to be avoided: in these states some hosts are idle, while jobs assigned to other hosts are forced to wait. Load-balancing algorithms also strive to avoid unshared system states, but they additionally attempt to equalize the load on all computers. Load sharing can therefore be seen as a minimum requirement for all load distribution algorithms, while load balancing represents a more intelligent form of load distribution, which tries to avoid unshared states by equalizing the hosts' load in advance, but also requires job transfer mechanisms.
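The distinction can be made concrete with two toy placement rules (a hypothetical illustration, not from the thesis): a load-sharing rule is satisfied with any idle host, while a load-balancing rule always picks the least loaded host, pushing the loads toward equality.

```python
# Toy contrast between load sharing and load balancing (invented example).

def share(loads, job_cost=1):
    """Load sharing: use any idle host; otherwise pick an arbitrary one."""
    idle = [h for h, l in loads.items() if l == 0]
    target = idle[0] if idle else next(iter(loads))
    loads[target] += job_cost
    return target

def balance(loads, job_cost=1):
    """Load balancing: always pick the least loaded host."""
    target = min(loads, key=loads.get)
    loads[target] += job_cost
    return target

loads = {"a": 3, "b": 0, "c": 1}
print(share(dict(loads)))    # the idle host "b"
print(balance(dict(loads)))  # also "b" here; once "b" carries 2, it would pick "c"
```

Both rules avoid the unshared state (host "b" idle while jobs wait), but only the balancing rule keeps choosing the least loaded host after no host is idle any more.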


2.1 Resource Management and Scheduling Services

Resource Management and Scheduling is the act of distributing applications among computers in order to maximize their throughput and to use the available resources efficiently. The services provided by Resource Management and Scheduling environments include [28]:

Heterogeneous Support The computing environment consists of a number of computers with dissimilar hardware architectures and different operating systems.

Scavenging idle cycles It is generally recognized that most workstations are idle between 70% and 90% of the time [44]. Resource Management and Scheduling systems can be set up to utilize these idle CPU cycles: for example, jobs can be submitted to workstations during the night or on weekends, so that interactive users are not affected by external jobs while idle CPU cycles are put to use.

Minimization of the impact on users Running a job on public workstations can have a great impact on their usability for interactive users. To minimize this impact, it is possible to reduce the job's local scheduling priority, or to suspend the job while the user is actually using the workstation. Suspended jobs can be restarted later, or migrated to other workstations.

Process checkpointing and migration A checkpoint is a snapshot of an executing program's state, which can be used to restart the program from the same point at a later time. Checkpointing is generally used as a means of providing reliability, or to implement process migration. A process being migrated is first suspended, then moved, and restarted (or resumed) on another computer [35, p. 409]. Generally, process migration occurs when a computational resource has become too heavily loaded and there are other free resources which can be utilized.

Fault tolerance By monitoring its jobs and resources, a Resource Management and Scheduling system may provide various levels of fault tolerance. For example, jobs on a failed host can be restarted or rerun on a different host, thus guaranteeing that they will be completed.

Load balancing Jobs can be distributed over all the hosts available in a particular organization. This allows for the efficient and effective usage of all the resources, rather than of a few which may be the only ones the users are aware of. Process migration can also be part of the load balancing strategy, where it may be beneficial to move processes from overloaded systems to lightly loaded ones.
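Checkpointing, as described above, can be illustrated with a minimal sketch (not BioOpera's actual mechanism; the file name and state layout are invented): the program periodically serializes its state, so that after a crash or a migration the computation resumes from the last snapshot instead of starting over.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name

def load_state():
    """Resume from the last snapshot if one exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}  # initial state

def save_state(state):
    """Write the snapshot atomically so a crash cannot corrupt it."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_state()
for i in range(state["i"], 10):
    state["total"] += i
    state["i"] = i + 1
    save_state(state)   # after a crash, rerunning skips completed work

print(state["total"])   # 45
os.remove(CHECKPOINT)
```

Restarting the script after an interruption would pick up from the last saved value of `i`, which is the essence of checkpoint-based migration: move the snapshot to another host and resume there.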


Figure 2.1: Classification of scheduling methods (a tree: scheduling divides into local and global; global into static and dynamic; dynamic into distributed and centralized; distributed into cooperative and non-cooperative)

2.2 A Taxonomy of Schedulers

Scheduling methods can be classified using the taxonomy in Figure 2.1 [18].

Global vs. Local Local scheduling is concerned with running jobs on a single host, typically using time slicing techniques. Global scheduling chooses which host a job should be assigned to.

Static vs. Dynamic Static scheduling determines the assignment of jobs to hosts at an early stage; in the case of parallel programming, at compile time [42]. Dynamic scheduling makes scheduling decisions based on information obtained at run time, such as the load on a host. Static scheduling is not applicable to our system, since it requires that the execution times of the jobs and the availability of hosts be known a priori, neither of which is possible in our approach.

Distributed vs. Centralized In a centralized dynamic global scheduler all scheduling decisions are made on a single host, while in a distributed scheduler decision making is physically distributed among the hosts.

Cooperative vs. Non-cooperative Distributed dynamic global schedulers may either cooperate among themselves, or each one may act entirely autonomously and allocate its resources independently of the rest of the system.

Considering this classification, the scheduler for BioOpera is a global dynamic centralized one when there is only one instance of the process support system running, and a global dynamic distributed (possibly cooperative) one when there is more than one instance of the process support system coordinating the execution of processes over the same cluster of hosts at the same time (only the first scenario has been fully implemented).

2.3 Scheduling components

A typical global dynamic scheduling algorithm is defined by these policies [50, p. 309]:

Information policy which specifies what load information is available to the scheduler, and when and how that information is sent to the scheduler (e.g., on demand, periodically, or upon significant change).

Transfer policy which determines the conditions under which a job should be transferred from one host to a different host, and under which new jobs should not be sent to the first host (e.g., the host has become overloaded, or it needs to be shut down).

Selection policy which chooses the suitable job to move, once the transfer policy decides a host has become overloaded. More specifically, a non-preemptive selection policy only allows jobs to be transferred before they have been started, while a preemptive selection policy allows jobs to be suspended and migrated while they are being executed (e.g., the choice could be made according to job sizes).

Placement policy which defines how to find a host on which to place a job (e.g., randomly, in a round robin fashion, or by optimal choice).

The scheduling policies of the BioOpera system are described in Section 4.2 on page 37 (Information policy), Section 6.3 on page 63 (Transfer policy), Section 6.2 on page 62 (Selection policy), and Section 6.4 on page 67 (Placement policies).
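As an illustration, the four policies can be sketched as pluggable functions (a hypothetical decomposition; all names and thresholds are invented, and this is not BioOpera's actual interface):

```python
# Illustrative decomposition of a dynamic global scheduler into the
# four policies; hosts, jobs, and the threshold are hypothetical.

def information_policy(hosts):
    """Decide what load data reaches the scheduler (here: a snapshot)."""
    return {h["name"]: h["load"] for h in hosts}

def transfer_policy(load, threshold=2):
    """A host above the threshold should not receive new jobs."""
    return load > threshold

def selection_policy(jobs):
    """Non-preemptive: only pick a job that has not started yet."""
    waiting = [j for j in jobs if j["state"] == "waiting"]
    return waiting[0] if waiting else None

def placement_policy(loads):
    """Pick the least-loaded host that is not overloaded."""
    candidates = {h: l for h, l in loads.items() if not transfer_policy(l)}
    return min(candidates, key=candidates.get) if candidates else None

hosts = [{"name": "h1", "load": 3}, {"name": "h2", "load": 1}]
jobs = [{"id": 1, "state": "running"}, {"id": 2, "state": "waiting"}]
loads = information_policy(hosts)
job = selection_policy(jobs)
target = placement_policy(loads)
print(job["id"], target)  # 2 h2
```

Swapping any one of the four functions changes the scheduler's behavior without touching the others, which is the point of the decomposition.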

2.4 Job Classification

Scheduling systems may be classified depending on how much information about the jobs [26] is available. This information may include the amount of parallelism of a job, its estimated running time, or its memory requirements. The level of available information ranges from none to exact information about each job:

No knowledge No information is available, therefore all jobs are treated the same.

Workload While no information about specific jobs is known, there is general knowledge about the overall distribution of job requirements. In this case the scheduling policies may be tuned to fit the particular workload. For example, if it is known that most jobs run for a few seconds, but there may be some that last for a much longer time, the scheduler should include a mechanism to prevent short jobs from being delayed by long jobs.

Class Each job is associated with a "class", for which some key characteristics are known. On some systems each job class has its own queue.

Job The execution time of a job on any number of processors is known exactly. This information may be provided by the users, who should estimate job requirements before submitting them to the scheduler. These estimates are usually very difficult to obtain, and users may intentionally attempt to deceive the scheduler so that their jobs receive better treatment. To solve this problem, information about repeatedly executed jobs may be generated automatically from analysis of past execution traces [32, 45].
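The last approach, deriving job information from past execution traces, can be sketched as a simple moving-average estimator (illustrative only; the program names, window size, and default are invented):

```python
from collections import defaultdict

class RuntimeEstimator:
    """Estimate a program's running time from its past executions,
    instead of trusting user-supplied estimates."""

    def __init__(self, window=5):
        self.window = window            # how many recent runs to average
        self.history = defaultdict(list)

    def record(self, program, seconds):
        """Append one observed running time to the program's trace."""
        self.history[program].append(seconds)

    def estimate(self, program, default=60.0):
        """Average the most recent runs; fall back when no trace exists."""
        runs = self.history[program][-self.window:]
        return sum(runs) / len(runs) if runs else default

est = RuntimeEstimator()
for t in (100, 120, 110):
    est.record("blast_search", t)      # hypothetical program name
print(est.estimate("blast_search"))    # 110.0
print(est.estimate("unknown_job"))     # 60.0 (fallback: no trace yet)
```

A windowed average also adapts when a program's typical input size drifts over time, which a single user-supplied estimate cannot do.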

2.5 Load indices

A key issue in the design of resource management systems is how to measure the current load of a host. Generally, a load index is used as a quantitative characterization of the system load. A good load index should [27]:

• reflect the user's qualitative estimate of the current load on a host, e.g. if the user notices a slowdown in the workstation's performance, the load index should quantify it;

• be usable to predict the load in the near future, since the response time of a job will be affected more by the future load than by the present load;

• be relatively stable, i.e., high-frequency fluctuations in the load should be discounted, or ignored;

• have a simple relationship with the overall performance of the host, so that its value can be easily translated into the expected performance of a job transferred to the host;

• impose minimal overhead on the host being measured.

A wide variety of load indices has been proposed [43]. Examples include:

• The length of the waiting queue of the hosts, e.g. a host is loaded if there are many waiting jobs in its queue.

• The utilization level of the hosts, e.g. a host is loaded if its CPU is 100% used.

• The response time of the hosts, e.g. a loaded host responds more slowly than an idle host.

One of the main results of these studies [27, 22] is that simple load indices yield significant performance benefits, while more complex indices (e.g. weighted combinations of simple indices) do not provide further improvement. Moreover, average values taken over a short time interval are better than instantaneous values, but intervals that are too long reduce the indices' sensitivity to changes in load. Finally, load indices based on resource queue lengths are found to perform better than those based on resource utilization.
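The stability requirement, discounting high-frequency fluctuations, is commonly met by averaging the raw samples. The following sketch smooths a run-queue-length index with an exponential moving average (the smoothing factor is an arbitrary illustrative choice, not a value used by BioOpera):

```python
class SmoothedLoadIndex:
    """Exponentially smoothed load index: recent samples dominate,
    but a single spike is discounted rather than taken at face value."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # higher alpha = more sensitive to changes
        self.value = None

    def update(self, sample):
        """Fold one raw sample (e.g. run-queue length) into the index."""
        if self.value is None:
            self.value = float(sample)
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

idx = SmoothedLoadIndex(alpha=0.3)
for q in (1, 1, 8, 1, 1):        # raw run-queue lengths with one spike
    idx.update(q)
print(round(idx.value, 2))       # well below 8: the spike is discounted
```

Choosing alpha is exactly the sensitivity trade-off mentioned above: a small alpha corresponds to averaging over a long interval (stable but slow to react), a large alpha to a short one.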

2.6 General issues

Some of the most important problems of load distribution systems are [14]:

Overhead caused by the scheduling system The scheduling system itself produces system load, because it needs processing and communication resources to gather load information, it has to manage and start waiting jobs, and so forth [30, p. 236]. The higher the load caused by the scheduler itself, the lower the benefits of its decisions. Therefore, the overhead should be kept as small as possible.

Overloading of lightly loaded hosts A lightly loaded host is preferred for the assignment of new jobs. For this reason, it may happen that this target gets overloaded by jobs sent from different non-cooperative schedulers which share the same load information. The same problem may occur with a centralized scheduler that does not wait for load information updates before scheduling other jobs on the lightly loaded host.

Out-of-date load information If the state of a system changes fast and the time intervals after which the load information is updated are too long, the load evaluation component uses out-of-date state information, which leads to wrong decisions. By using smaller update intervals the information is kept up-to-date (network latency problems are not considered, since we assume the delays involved are much shorter than the chosen update interval), but the overhead caused by the load monitoring components increases. Therefore, the length of the update intervals should be of approximately the same order as the changes in the system load.

Instability If a system is heavily loaded, it may happen that many hosts try to migrate jobs to other hosts. As a consequence, these other hosts may also get overloaded, and in turn try to send away the newly arrived jobs. This unstable behavior, also known as thrashing, may continue so that no actual processing gets done, and the termination of the jobs cannot be guaranteed anymore.

While designing the BioOpera scheduler, particular care has been devoted to addressing these problems, as discussed in Section 5.3 on page 60.


3 BioOpera

3.1 The Opera project

Opera stands for "Open Process Engine for Reliable Activities". The project was started in 1996 at the Information and Communication Systems Research Group of the Swiss Federal Institute of Technology Zurich. After two years it was split into the WISE (Workflow based Internet SErvices [2]) and BioOpera (Process Support for Bioinformatics) projects, which continue the effort by applying the generic Opera kernel to business workflows and to scientific process support, respectively.

The focus of the Opera project was to generalize existing concepts and ideas of workflow management systems [57] and process centered software engineering environments, by developing a process modeling language, called OCR (Opera Canonical Representation), and a process support kernel (the Opera system) with emphasis on quality aspects of distributed computing (more information on the Opera project can be found in [38]). These quality improvements involve:

• process modeling, with advanced language constructs, such as exception handling, event signaling, and transactional properties;

• the process enactment service, with fault tolerance, process persistence, and transactions;

• the engine's architecture, with robustness and scalability.

The Opera system has been designed as an extensible process support kernel to be tailored to specific application domains. The kernel provides generic functionality needed by all potential applications, and can be extended to adapt it to the requirements of a specific application domain. The particular requirements of the bioinformatics domain are presented in Section 3.2.2 on page 33.

3.1.1 Process Modeling

As part of the Opera project, a process modeling language has been developed: the Opera Canonical Representation (OCR). This language is used internally by the Opera kernel to represent templates, i.e., process definitions [58], and instances, i.e., active processes that are currently being enacted. Furthermore, OCR has two facets: the conceptual representation is a model to describe processes and their components, which can be expressed in terms of a class structure. In addition, a textual representation is needed for practical purposes. This is generated by process modeling tools and parsed by the Opera system into the appropriate set of objects and references.

Language requirements The requirements from which the OCR language originated are:

Flexibility The canonical representation needs to be general enough to allow mapping of many application-specific process modeling languages.

Extensibility It should be possible to enhance the process representation whenever this is required by the application. This requirement complements the first one: whenever the language is not flexible enough, the system design should include the necessary extension mechanisms. The key to satisfying the extensibility requirement is to use a framework of classes representing the various components of the process; this approach makes it easy to adapt the model, by either modifying existing classes or adding new (inherited) classes.

Simplicity A complicated representation would produce unwanted complexity in the kernel design, and be hard to extend and to store persistently.

Support for run-time analysis All the relevant information should be readily available during process enactment for monitoring purposes.

Support for persistence The language constructs need to support efficient persistent storage, so that the overhead involved in making processes persistent is minimized.

Figure 3.1: Class diagram of the various task types

3.1.2 Conceptual representation

The main entities of the process representation are tasks and programs. The order of execution of the various tasks is specified by control flow dependencies among them, while the information exchanged between the various tasks is modeled by a data flow relationship between the tasks' input and output parameters.

Task types Tasks are entities that can be executed. This includes whole processes, as well as activities (the basic execution steps), and blocks (structured groups of activities). The class diagram in Figure 3.1 shows the hierarchy of different task types:

BasicTask This class at the root of the hierarchy represents all tasks. Its attributes are inherited by all task types. These generic attributes are the input and output "boxes": lists of parameters used to model data exchanges between tasks.

ComplexTask This class represents tasks which can contain other tasks. Its attribute is the whiteboard, used as a communication area for its component tasks.

Task This class represents tasks with control flow dependencies. Such a dependency is represented with a guard condition, used to specify when the task is to be executed with respect to other tasks or external events.

Activity This class represents the basic execution step of a task. Each Activity has an associated Program object, which specifies what has to be executed.

Figure 3.2: Class diagram of various program types

Process This class, derived from the ComplexTask class, represents all "top level" processes.

SubProcess This class introduces nesting capabilities into the process model. In particular, it supports late binding, i.e., the referenced Process is instantiated only when the SubProcess is to be executed.

Block This class is important for structuring large process descriptions. A Block is also used to specify transactional properties, such as atomicity.
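The task hierarchy described above can be sketched as a small class framework (illustrative Python, not the actual OCR implementation; only a subset of the classes and attributes is shown, and the process and program names are invented):

```python
class BasicTask:
    """Root of the task hierarchy: every task has input and output
    parameter lists ("boxes") for data exchange."""
    def __init__(self, name):
        self.name = name
        self.inputs = {}
        self.outputs = {}

class ComplexTask(BasicTask):
    """A task that contains other tasks and offers a whiteboard
    as a shared communication area for them."""
    def __init__(self, name):
        super().__init__(name)
        self.children = []
        self.whiteboard = {}

class Task(BasicTask):
    """A task with a control-flow guard deciding when it runs."""
    def __init__(self, name, guard=None):
        super().__init__(name)
        self.guard = guard

class Activity(Task):
    """Basic execution step, bound to an external Program."""
    def __init__(self, name, program, guard=None):
        super().__init__(name, guard)
        self.program = program

class Process(ComplexTask):
    """A top-level process."""

class Block(ComplexTask):
    """Structuring unit; may also carry transactional properties."""

p = Process("alignment")                        # hypothetical process name
p.children.append(Activity("run", program="blast"))
print([c.name for c in p.children])             # ['run']
```

Extending the model by subclassing, as the extensibility requirement suggests, amounts to adding one more class to this hierarchy without touching the kernel's handling of `BasicTask`.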

Programs Program classes are used to represent the external binding of Activities, i.e., the "real world" action to be performed when the activity has to be executed. These actions are generally thought of as program invocations, although a program may also represent a human action, or an SQL query sent to a database system. The distinguishing characteristic of a Program is that, from the Opera kernel perspective, it is an indivisible black box. A Program class encapsulates all the information needed to start the external program, pass information to it, and receive its output data. The type of this information can differ widely, depending on the program's execution environment. To cope with this heterogeneity, OCR uses a Program class hierarchy. The root class defines a set of generic attributes, such as unique identifiers, and input and output parameter lists. The example in Figure 3.2 on page 18 shows subclasses for describing human actions, UNIX programs, SAP workflows, and CORBA method invocations.


Control flow A rule-like mechanism is used to specify control flow. A guard, attached to each Task, can be seen as a description of when the Task has to be executed. A guard is a tuple (A, C), where A is the activator and C is the condition. The activator is a predicate over the state of the process and its components; when it evaluates positively, the task has to be considered for execution. The condition part of the guard is a predicate over the data objects visible to the task, and is mostly used to model conditional branching. The advantage of separating activators from conditions is efficiency: it is much faster, when a state change occurs, to evaluate only the state-based activators, and to check the data-based conditions only if the activators trigger positively.

Data flow To enable data transfers between programs, each Task has its own input and output parameter lists. When a task becomes ready for execution, data is copied into its input parameters from other tasks' output parameters, or from the parent process' whiteboard and input parameters. When the task's execution is finished, data may be copied out of its output parameters into the process' whiteboard. These two mechanisms allow the specification of a wide variety of data exchanges inside a process. The whiteboard is the OCR equivalent of local variables in programming languages. It is a set of data objects accessible from all of the component tasks of a process, and provides a convenient mechanism for the storage and exchange of temporary data.

Task Instances When a process is started, a process instance is created together with instances for all the process' component tasks. The information contained in the task instances reflects that in their corresponding templates, with the addition of unique instance identifiers and execution state information. The task state diagram given in Figure 3.3 describes the execution states of a task instance.
All state transitions (from state S to state S′) are summarized in Table 3.1 on page 21. The first set of transitions is used to start the execution of a task. The second set of transitions describes the possible outcomes of the program's execution (success, failure, exception, or abort request). The third group of transitions deals with failure recovery, while the last two transitions are used for restartable tasks.
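The guard mechanism introduced under Control flow, a tuple (A, C), can be sketched as follows; the two-stage evaluation shows why separating cheap state-based activators from data-based conditions is efficient (the task names and data are invented for illustration):

```python
class Guard:
    """A guard (A, C): the activator A is a cheap predicate over the
    process state; the condition C is a predicate over visible data,
    evaluated only if the activator triggers."""

    def __init__(self, activator, condition=lambda data: True):
        self.activator = activator
        self.condition = condition

    def ready(self, state, data):
        # State-based check first; data-based check only on demand.
        return self.activator(state) and self.condition(data)

# Hypothetical example: task B runs after task A has finished,
# but only if A actually produced some hits (conditional branching).
guard_b = Guard(
    activator=lambda state: state.get("A") == "finished",
    condition=lambda data: data.get("hits", 0) > 0,
)
print(guard_b.ready({"A": "running"}, {"hits": 5}))   # False (not activated)
print(guard_b.ready({"A": "finished"}, {"hits": 5}))  # True
print(guard_b.ready({"A": "finished"}, {"hits": 0}))  # False (branch not taken)
```

Because `and` short-circuits, the data-based condition is never touched while the activator is false, mirroring the efficiency argument above.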

Figure 5.1: Jobs moving through the scheduler

When the waiting list is empty the Scheduler sleeps, because it has nothing to do. The Scheduler also sleeps once it has attempted to process all the items in the waiting list. Therefore there are two events that wake up the scheduler and force it to start scanning the waiting job list again:

a) A new job is added to the waiting list, or an existing job is removed from the running list. (This indicates that a host has finished a job and may be ready to execute another one, if the waiting list is not empty.)

b) New information about the state of the hosts is collected by the resource manager. (This event only affects the scheduler if the waiting job list is not empty.)
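This sleep/wake behavior maps naturally onto a condition variable: the scheduler thread waits, and both event sources signal it. A sketch (not the actual BioOpera implementation; the class name is invented):

```python
import threading

class SchedulerSignal:
    """Wakes the scheduler when a job list changes or fresh host
    state arrives; the scheduler sleeps otherwise."""

    def __init__(self):
        self.cv = threading.Condition()
        self.pending = False

    def notify(self):
        """Called on (a) job list changes and (b) new host information."""
        with self.cv:
            self.pending = True
            self.cv.notify()

    def wait_for_work(self, timeout=None):
        """Scheduler loop entry: sleep until an event arrives."""
        with self.cv:
            while not self.pending:
                if not self.cv.wait(timeout=timeout):
                    return False   # timed out with no event
            self.pending = False
            return True

sig = SchedulerSignal()
threading.Timer(0.05, sig.notify).start()   # simulate a job completing
print(sig.wait_for_work(timeout=1.0))       # True: the scheduler was woken
```

The `pending` flag guards against the classic lost-wakeup race: a notification arriving just before the scheduler starts waiting is not missed.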


5. SCHEDULING MECHANISMS

5.2 Placement mechanisms

The following mechanisms are used by the scheduler to place and dispatch a job.

5.2.1 Placing a job

This mechanism first builds a list of hosts from the job and resource manager information, then uses the placement policy to select the best host from it, and finally performs the actual placement.

Building the host list This step uses the job information to access the resource manager and retrieve the appropriate host list from it. The job information includes:

• The name of the group g of hosts associated with the program.

• The subsystem s required to properly dispatch the program.

Furthermore, the list is filtered so that it only includes hosts that are:

• Enabled: it makes sense to choose only among hosts that have been enabled by the user;

• Available: it is useless to try to place a job on a host that is known to be unavailable;

• Underloaded: depending on the current transfer policy (see Section 6.3 on page 63 for the definition), only hosts that are not overloaded with respect to some threshold are left in the list.

The resulting list may be empty: this means that, given the current host state information, it is not possible to place the job.

Calling a placement policy A placement policy receives a list of hosts and calculates the best host using its particular selection algorithm (see Section 6.4 on page 67 for more details). A placement policy may not necessarily choose a host, if no host in its input list fits its algorithm's requirements. Some policies may require other parameters to make a decision.

Placing a job to a host Once the placement decision has been made, and a host is selected for a job, the resource manager is requested to update the selected host's state information. If the host is still available and not overloaded, then the following steps are carried out to finalize the placement decision:

1. A reference to the selected host is included into the job.

2. The number of jobs assigned to the host is increased.

3. The selected host is temporarily disabled, so that it can only be used again by the scheduler when some updated state information is received. This is very important to avoid overloading a host, because its state takes some time to change after a job has been placed on it.

4. The job is removed from the waiting job list.

5. The job is sent to the appropriate subsystem to be remotely executed.

If the host was not available, the placement does not complete successfully, and the job must wait for the next scheduler iteration.
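The five steps above can be sketched as a single placement function (all data structures, host and job names are hypothetical stand-ins, not BioOpera's actual interfaces):

```python
from dataclasses import dataclass

@dataclass
class HostState:
    available: bool = True
    overloaded: bool = False
    enabled: bool = True
    job_count: int = 0

@dataclass
class Job:
    name: str
    subsystem: str = "unix"
    host: str = None

class ResourceManager:
    """Minimal stand-in holding one HostState per host name."""
    def __init__(self, states):
        self.states = states
    def refresh(self, host):
        return self.states[host]

class Subsystem:
    """Minimal stand-in recording dispatched jobs instead of running them."""
    def __init__(self):
        self.dispatched = []
    def execute(self, job, host):
        self.dispatched.append((job.name, host))

def place_job(job, host, rm, waiting, subsystems):
    """Finalize a placement decision, following the numbered steps above.
    Returns False if the job must wait for the next scheduler iteration."""
    state = rm.refresh(host)                      # re-check before committing
    if not state.available or state.overloaded:
        return False
    job.host = host                               # 1. reference to selected host
    state.job_count += 1                          # 2. host's job count increases
    state.enabled = False                         # 3. disabled until fresh load data
    waiting.remove(job)                           # 4. leave the waiting job list
    subsystems[job.subsystem].execute(job, host)  # 5. hand over for remote execution
    return True

rm = ResourceManager({"node7": HostState()})      # hypothetical host name
subsystems = {"unix": Subsystem()}
job = Job("blast-42")                             # hypothetical job name
waiting = [job]
print(place_job(job, "node7", rm, waiting, subsystems))  # True
print(waiting, subsystems["unix"].dispatched)            # [] [('blast-42', 'node7')]
```

Step 3 (disabling the host until fresh load data arrives) is the mechanism Section 5.3 later invokes against the "overloading of lightly loaded hosts" problem.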

5.2.2 Dispatching a job

Once the job has been placed it needs to be dispatched for execution: the job's program needs to be started on the host, and when it terminates some steps must be carried out to properly complete the job.

Starting to execute a job Once the job has been placed it is sent to the appropriate subsystem. At this point the subsystem contacts the host and tries to remotely execute the job. If the host is available and successfully starts to execute the job:

1. The job is inserted into the running jobs list.

2. The job's activity state becomes running.

The probability of the host being available is very high, since the resource manager has just checked this fact before the placement decision was completed. Nevertheless, for reliability reasons, the other, unlikely case also needs to be considered. Therefore, if the information on the host's availability was inaccurate, and the host is not responding or cannot start to execute the job:

1. The host problem is reported to the resource manager, which sets the host's state as unavailable.

2. The placement of the job is undone: the host reference is removed from the job, and the job is put back into the waiting list.

This failure, due to inaccurate information in the resource manager, does not affect the activity, which remained in the waiting state all along.
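The failure path described above can be sketched as a continuation of the placement step; the structures and the simulated unreachable host are invented for illustration:

```python
class PlacedJob:
    """Minimal stand-in for a job that has already been placed."""
    def __init__(self, name, host):
        self.name, self.host, self.state = name, host, "waiting"

class ResourceManager:
    """Minimal stand-in: only tracks which hosts were reported down."""
    def __init__(self):
        self.unavailable = set()
    def mark_unavailable(self, host):
        self.unavailable.add(host)

def remote_execute(job, host):
    """Stand-in for contacting the host; 'deadnode' simulates a failure."""
    if host == "deadnode":
        raise ConnectionError(host)

def start_job(job, rm, waiting, running):
    """Try to start a placed job remotely; undo the placement on failure."""
    try:
        remote_execute(job, job.host)
    except ConnectionError:
        rm.mark_unavailable(job.host)   # 1. report the host problem
        job.host = None                 # 2. undo the placement...
        waiting.append(job)             #    ...and rejoin the waiting list
        return False
    running.append(job)                 # job enters the running jobs list
    job.state = "running"               # activity state becomes running
    return True

waiting, running, rm = [], [], ResourceManager()
print(start_job(PlacedJob("j1", "deadnode"), rm, waiting, running))  # False
print(start_job(PlacedJob("j2", "node3"), rm, waiting, running))     # True
print([j.name for j in waiting], rm.unavailable)  # ['j1'] {'deadnode'}
```

As in the text, the failed job simply returns to the waiting list in its original state, so the next scheduler iteration can place it elsewhere.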


Completing a job As can be inferred from the state diagram in Figure 3.3 on page 20, a job completes when its corresponding task leaves the running state and enters one of these states:

finished if the program terminated successfully,

failed if the program returned an error,

aborting when the user decided to abort the program's execution, or

exception when a pre-defined failure was detected and the task is about to handle it.

In every case the resource manager needs to be notified that a job has completed, so that its host's job count can be decreased. Furthermore, if a host becomes unavailable while there are jobs running on it, exceptions are raised for these jobs' corresponding tasks, so that the jobs will eventually be resumed and rescheduled to run on different hosts.

5.3 Addressing load balancing issues

As presented in Section 2.6 on page 12, scheduling systems may suffer from various problems. The Resource Management and Scheduling mechanisms for BioOpera have been designed to address these problems:

Overhead Considering the typical heavy-duty computational job managed by the BioOpera system, the overhead introduced by scheduling decisions and load information gathering is indeed very small.

Overloading of lightly loaded hosts A specific scheduler mechanism (see Section 5.2.1 on page 58), which temporarily disables a host after a job has been sent to it, is used to solve this problem.

Out-of-date load information The adaptive load sampling scheme (see Section 4.3 on page 42) addresses the trade-off between keeping the load information up-to-date and the overhead of the measurements and messages required.

Instability Without a job migration mechanism the instability problem does not occur. There can still be local instability, i.e., thrashing, if too many jobs are assigned to a host and it runs out of free memory, but steps are taken to ensure enough free memory is available when placing a job.


6 Scheduling policies

6.1 Overview

This chapter defines in more detail the selection, transfer and placement policies that may be plugged into the Scheduler component. The information policy is presented in Section 4.2 on page 37 as part of the Resource Manager component.

 "!$#&% '( )*,+ -$./02143

}~€‚ƒ…„‡† ˆx‰Šd‹Œ GIHJLKNMPORQ S T$UVW2X4Y

Z\[

Suggest Documents