Universiteit van Amsterdam

MSc Artificial Intelligence

Machine learning methods to model Grid behaviour

15th October 2010

Author: Gabriele Modena 5850614
Supervisor: dr. Maarten van Someren
Committee: prof. dr. Pieter W. Adriaans, dr.ir. Leo Dorst

Defense date: 22nd October 2010


Abstract

This thesis proposes machine learning techniques to discover events on the Grid and to model behavioural knowledge about component interaction. The dynamic, volatile, non-homogeneous nature of the Grid provides an interesting scenario for novel machine learning and data mining approaches. Possible applications are automated tools that help network administrators, researchers and scientists to proactively identify possible malfunctions and to correlate components across domains. Focusing on a set of job-centric log files that record the status of the EGEE Grid¹, we propose a method for inferring network topology and component behaviour from qualitative and quantitative properties and from their variation over time. We model component behaviour in a vector space setting in terms of clusters of events. Clustering is performed using Affinity Propagation to discover possibly meaningful situations happening on the network. The discovered classes are characterised via feature selection on clusters using Recursive Feature Elimination. Finally, we embed predictive capabilities by training a multiclass Support Vector Machine to classify unseen data instances into the discovered event classes. We present our results and conclude with a discussion of further improvements of the method to better capture the nature of complex systems and to extend it to other domains.

1 The dataset used in this work has been provided by the Grid Observatory. The Grid Observatory is part of the EGEE-III EU project INFSO-RI-222667, an open project that collects, publishes and analyses data on the behaviour of the European Grid for E-science (EGEE).


Contents

1 Introduction
    1.1 Autonomic Computing
    1.2 Learning Framework
    1.3 Thesis outline
2 Grid computing
    2.1 Components
        2.1.1 Users
        2.1.2 Resource brokers
        2.1.3 Computing and Storage elements
        2.1.4 Time
    2.2 Topology properties
    2.3 Challenges
3 Model
    3.1 Problem domain
        3.1.1 Related work
    3.2 Data
    3.3 Grid Model
        3.3.1 Topology representation
        3.3.2 Time constraints
            3.3.2.1 Time representation
    3.4 Model construction
4 Method
    4.1 Similarity Metric
    4.2 Clustering
        4.2.1 Affinity propagation
        4.2.2 Feature selection
    4.3 Classification
5 Evaluation
    5.1 Results
    5.2 Clustering experiment
        5.2.1 Clusters evaluation
    5.3 Classification experiment
    5.4 Baseline clustering algorithms
6 Representation
    6.1 π-calculus
    6.2 A π-calculus grammar of the Grid
        6.2.1 Learning
7 Conclusion
References
A Appendix
    A.1 Component features
    A.2 Clusters output
        A.2.1 Result set

Chapter 1

Introduction

Grid computing is a form of computing in which geographically distributed sites take part in collaborative computation. It combines computer resources from multiple administrative domains and applies them to a common task, usually a scientific, technical or business problem that requires a great number of processing cycles or the processing of large amounts of data. One of the main strategies of grid computing is using software to divide and apportion pieces of a program among several computers, sometimes up to many thousands. Grid computing is distributed, large-scale cluster computing, as well as a form of network-distributed parallel processing. The size of a grid may vary from small (confined, for example, to a network of computer workstations within a corporation) to large (a public collaboration across many companies and networks). A confined grid may also be described as intra-nodes cooperation, whilst a larger, wider grid may be described as inter-nodes cooperation. Such inter- and intra-nodes cooperation across cyber-based collaborative organisations is also known as Virtual Organisations [23].

The complexity of Grid topologies makes monitoring and system administration difficult. Cross-domain knowledge may be difficult (or even impossible) to obtain due to security and privacy related limitations. Given the number of technologies that play a role in the system, correlation of factors may be a key point in understanding the source of malfunctions or unexpected behaviours, and thus in enabling a solution. These operations are complicated by the high volumes of data that the system generates over short amounts of time, which hide structures and events under a level of noise. To approach this problem efficiently, ideas about automating administration and monitoring tasks have emerged and are currently studied by the so-called Autonomic Computing movement.


1.1 Autonomic Computing

Autonomic Computing is an initiative started by IBM in 2001 [7]. It refers to the self-managing characteristics of distributed computing resources, which adapt to unpredictable changes whilst hiding intrinsic complexity from operators and users. In a self-managing autonomic system, the human operator does not control the system directly but defines general policies and rules that serve as input for the self-management process. IBM has defined four functional areas for this process:

1. Self-Configuration: automatic configuration of components;
2. Self-Healing: automatic discovery and correction of faults;
3. Self-Optimization: automatic monitoring and control of resources to ensure optimal functioning with respect to the defined requirements;
4. Self-Protection: proactive identification of, and protection from, arbitrary attacks.

IBM also defined five evolutionary levels, the Autonomic deployment model [7]. Level 1 is the basic level and represents the current situation, in which systems are essentially managed manually. Levels 2 to 4 introduce increasingly automated management functions, while level 5 represents the ultimate goal of autonomic, self-managing systems.

A precursor of the Autonomic Computing movement has been known since the eighties under the name Adaptive System Management. The idea comes from the natural sciences, where complex systems are often studied in terms of statistical theory, and has been applied to computer systems, in particular large-scale networks, which generate large amounts of data in short time frames [8] whose analysis is not feasible by manual intervention. Large computer networks share common properties with complex biological systems: large volumes of data generated from multiple sources in short time frames, heterogeneity of the components belonging to the system, and dynamicity and volatility of the system over time. We refer to these as complex systems, that is, systems composed of interconnected parts that as a whole exhibit one or more properties (behaviour among them) not obvious from the properties of the individual parts [6].

In a seminal work [22] the author outlined adaptive management as beginning with the central tenet that management involves a continual learning process that cannot conveniently be separated into functions like ‘research’ and ‘ongoing regulatory activities’, and probably never converges to a state of blissful equilibrium involving full knowledge and optimum productivity. The author characterised adaptive management by the following properties:


- representing existing knowledge in models of dynamic behaviour that identify assumptions and predictions, so that experience can further learning;
- representing uncertainty and identifying alternate hypotheses;
- designing policies to provide continued resource productivity and opportunities for learning.

An adaptive management system consists of two elements: a monitoring system that measures key indicators and the current status of things, and a response system that enables modifying key indicators. Autonomic Computing extends this approach to the ideal limit of a fully self-managing system; research in the field is active, but many more steps will be required before this goal is fully achieved. Though inspired by an autonomic view of computing, the goals underlying Adaptive System Management are closer to the nature and outcome of this project and better describe our approach. We will present a model and framework that embeds the monitoring system element, by analysing, inferring and correlating information directly and indirectly observable from Grid systems, and that aims at providing a response in terms of proactive feedback about the observed structure.

1.2 Learning Framework

In this thesis we present a learning framework to discover and characterise events and behaviour of Grid components, so as to assist the automation of monitoring and administration tasks.

Figure 1.1: Learning Framework (diagram with three stages labelled Dataset, Cluster and Classification)

We infer job and component information from a job-centric dataset of Grid logs and use this knowledge to model a representation of the Grid. Topology is reconstructed in the form of a graph, and job and component information is then modelled in a vector space. We characterise Grid behaviour in terms of events by clustering vectors of jobs and components that share similar properties and selecting features that can describe them. Finally, we use clusters and selected features to train a classifier to predict the evolution of events. The learning framework is presented as a multi-step strategy, as shown in Figure 1.1.


Looking back at the definition of adaptive system management, our learning method consists of a set of indicators that carry quantitative information and are represented by the vectors describing component statuses. The response system is here represented by steps 2 and 3.
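As a rough illustration of the first step of this framework, the sketch below reconstructs a topology graph from job-centric records and derives a toy vector for each component. The record layout, the field order and all identifiers are hypothetical and are not taken from the Grid Observatory data; the status-count vectors are only a stand-in for the features used in the thesis.

```python
from collections import defaultdict

# Hypothetical job-centric records: (job_id, user, broker, computing_element, status).
LOG = [
    ("j1", "vo_a/user1", "rb1", "ce1", "done"),
    ("j2", "vo_a/user1", "rb1", "ce2", "aborted"),
    ("j3", "vo_b/user2", "rb2", "ce2", "done"),
]

def build_topology(records):
    """Reconstruct an undirected graph from the path each job followed."""
    graph = defaultdict(set)
    for _job, user, broker, ce, _status in records:
        for a, b in ((user, broker), (broker, ce)):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def component_vectors(records):
    """Describe each component by its job counts per status (a toy vector space)."""
    statuses = sorted({record[4] for record in records})
    counts = defaultdict(lambda: [0] * len(statuses))
    for _job, user, broker, ce, status in records:
        for component in (user, broker, ce):
            counts[component][statuses.index(status)] += 1
    return statuses, dict(counts)

topology = build_topology(LOG)
statuses, vectors = component_vectors(LOG)
print(sorted(topology["ce2"]))    # brokers that routed jobs to ce2
print(statuses, vectors["ce2"])   # status axes and the vector for ce2
```

Vectors of this kind are what the later clustering and classification steps would operate on.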

1.3 Thesis outline

The remainder of this thesis is organised as follows. Chapter 2 describes the characteristics of Grid systems we are interested in modelling and identifies the inherent challenges that make them interesting material for machine learning research. Chapter 3 presents the general operational framework of this work and the data used to build and evaluate the model. We state the problem domain, describe the data preparation phase and propose our learning strategy. Chapter 4 describes in detail the methods and steps previously introduced. We propose a similarity metric between jobs and a clustering solution based on the Affinity Propagation algorithm to discover classes of events, as well as feature selection and classification approaches to, respectively, explain the discovered classes and allow prediction over unseen job instances. In Chapter 5 we present, analyse and evaluate the performance of our method on Grid log data. Finally, Chapters 6 and 7 conclude this thesis by summarising its content and providing suggestions and considerations about further developments and research.

Chapter 2

Grid computing

Grid computing is usually referred to as a form of distributed computing whereby a virtual computer is composed of a cluster of networked, loosely coupled machines acting in concert to perform very large tasks. Although this statement gives an idea of the overall concept that drives this computing paradigm, it is difficult to define Grid computing precisely. Grids operating in different application domains may present peculiar deployments, both in terms of hardware and software resources, to best fit the particular use cases of the domain in which they are employed. As a result we may be tempted to bind the definition of Grid computing to particular, data-driven scenarios. Nonetheless IBM, one of the big players among enterprise Grid network providers, defines grid computing as the ability, using a set of open standards and protocols, to gain access to applications and data, processing power, storage capacity and a vast array of other computing resources over the Internet. A grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of resources distributed across multiple administrative domains based on their (resources) availability, capacity, performance, cost and users’ quality-of-service requirements [20].

2.1 Components

From a high-level perspective this environment can be represented as a distributed topology of networks, exploited by users belonging to the multitude of domains that take part in the global network to perform intensive computational tasks. From a practical point of view we can identify a Grid system as a set of middleware software packages that enable nodes of the network to take part in the computation. Multiple such middlewares exist and are deployed in real-world environments; each has its own peculiarities and terminology for describing nodes and dynamics. Nonetheless, the concepts they implement can be generalised across software packages. In the remainder of this section we will describe grid components using the terminology adopted by the gLite project¹, one of the most famous and widely deployed² Grid middlewares and, moreover, the source of the dataset we used as a testbed for our learning method.

A term that will recur often in this work, and that we have already used informally, is component. In this context we define it as follows.

Definition: a component is any entity (human being, network node or software) that takes an active role in the collaborative computational process.

Definition: a network node is a physical machine (single or cluster) that hosts a Grid component.

Definition: a resource is a particular capability offered by a component, usually a computing or storage element (offering CPU time or storage space).

We start by distinguishing a set of relevant components:

Virtual Organisation (VO): a grouped user community (department, company).
Computing Element (CE): a set of computing resources located at a site (i.e. a cluster, a computing farm).
Storage Element (SE): provides uniform access to data storage resources.
Workload Management System (WMS): accepts user jobs, assigns them to the most appropriate Computing Element, records their status and retrieves their output.
Resource Broker (RB): the machine where the WMS services run.
User Interface (UI): enables a user to authenticate and be authorised to use resources (i.e. submit jobs).

Figure 2.1 shows a graphical representation of a Grid network highlighting the components just described.

Figure 2.1: Grid network

1 http://www.glite.org
2 As reported by the Grid Observatory project, http://www.grid-observatory.org


In the remainder of this chapter we will describe the relevant components addressed in our study and their high-level representation, which can be incorporated in our method.
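The component vocabulary defined in this section could be captured in a small data model such as the one sketched below. The class names, fields and resource units are illustrative assumptions and do not correspond to any gLite API or to the thesis implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

class Kind(Enum):
    VIRTUAL_ORGANISATION = "VO"
    COMPUTING_ELEMENT = "CE"
    STORAGE_ELEMENT = "SE"
    WORKLOAD_MANAGEMENT_SYSTEM = "WMS"
    RESOURCE_BROKER = "RB"   # the machine where the WMS services run
    USER_INTERFACE = "UI"

@dataclass(frozen=True)
class Resource:
    name: str      # e.g. "cpu_hours" or "storage_gb" (hypothetical units)
    amount: float

@dataclass
class Component:
    ident: str
    kind: Kind
    node: str                                  # network node hosting the component
    resources: list = field(default_factory=list)

    def offers(self, resource_name):
        """Total amount of a named resource offered by this component."""
        return sum(r.amount for r in self.resources if r.name == resource_name)

ce = Component(
    "ce1",
    Kind.COMPUTING_ELEMENT,
    "cluster-a.example.org",
    [Resource("cpu_hours", 500.0), Resource("cpu_hours", 250.0)],
)
print(ce.kind.value, ce.offers("cpu_hours"))
```

Keeping the hosting node separate from the component mirrors the distinction made in the definitions between a component and the network node that hosts it.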

2.1.1 Users

User interaction with the Grid is expressed in terms of submitting jobs to the system through an interface, monitoring job statuses at given moments in time and retrieving an output when computation terminates. In our work we use the term User to refer both to the human being within a given department (Virtual Organisation) making use of the system and to the software client interface that allows them to manage and monitor jobs on the Grid. Upon submission, jobs are handled by a resource broker that, given an estimate of the resources required, schedules the job on computing elements. On most Grid middleware systems jobs are enriched by a set of requirements that a user wants to embed to drive computation on the Grid. For instance, a user may specify a maximum clock time a job is allowed to be present on the Grid before cancellation, or a retry count limit for cases in which a job gets resubmitted to a broker. When resources are not available, or the broker is not able to properly relate the user’s needs to the pool of Computing Elements associated with it, the job can be deferred or forwarded to another broker within the user’s domain or to other domains the user is authorised to access.

The outcome of this process is usually the output of the job, when it was completed successfully, and a message that describes the status of the job from the Grid’s point of view. If computation has terminated successfully, that is, the network was able to allocate the necessary resources and proceed with computation, the job will be marked as done. This status does not guarantee that the output is correct, or even that it exists; it only indicates that it was feasible for the Grid to process the job. On the other hand, computation may not have been possible. In this case we distinguish two families of events that may have occurred:

1. the job has been aborted by the system or cancelled by the user who submitted it;

2. the job terminated with an error status.

In case 2, multiple situations may have occurred: the job may have exceeded the resources it was allocated, for instance storage space or the granted CPU time. In the remainder of this work we will refer to both situations as failures and we will focus on them as a source of information to identify anomalies on the network.
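The job outcomes described above can be encoded as a small enumeration, with both families of unsuccessful outcomes treated as failures. This is a minimal sketch; the status names are illustrative and are not the literal status strings recorded in the Grid logs.

```python
from enum import Enum

class JobStatus(Enum):
    DONE = "done"            # the Grid could process the job; the output may still be wrong
    ABORTED = "aborted"      # family 1: terminated by the system
    CANCELLED = "cancelled"  # family 1: withdrawn by the submitting user
    ERROR = "error"          # family 2: e.g. exceeded allocated storage or CPU time

FAILURES = {JobStatus.ABORTED, JobStatus.CANCELLED, JobStatus.ERROR}

def is_failure(status):
    return status in FAILURES

def failure_rate(statuses):
    """Fraction of jobs whose final status counts as a failure."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(is_failure(s) for s in statuses) / len(statuses)

observed = [JobStatus.DONE, JobStatus.ERROR, JobStatus.CANCELLED, JobStatus.DONE]
print(failure_rate(observed))
```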


2.1.2 Resource brokers

A resource broker plays in the Grid world the role that a process scheduler plays in an operating system. The broker is the component that evaluates a user job’s needs and allocates resources for the computation. Brokers bring together users and computing elements and allow cross-domain job migration by establishing routes between virtual organisations. For the sake of our work we will focus mainly on four of the operations that characterise the behaviour of resource brokers:

1. accept or discard jobs submitted by users;
2. match jobs to computing elements (matchmaking), given the job requirements;
3. handle events: resubmission, transfer, abort, cancellation;
4. record the status of user jobs and retrieve their output.

We briefly discussed the first operation in the previous section: the Broker’s task is to allow authorised users access to resources (via job submission). Once access is granted, the second step is matching incoming jobs to available resources. One can think of this operation as finding a compromise between the user’s requirements and the most appropriate computing and storage elements the Broker can redirect jobs to. Another operation relevant to our study is event handling. By event we mean one of the following situations:

resubmission: a job associated with a Computing Element is sent back to the Broker with a failed status and requires rematching and rescheduling;
transfer: a Broker is not able to satisfy a user’s job requirements but is aware of other Brokers, in its own or in other domains it is authorised to access, that can allocate the needed resources. It therefore transfers the job to those Brokers and waits for a result or status update;
abort: a job is terminated by the system before it can reach a failure status;
cancellation: a user sends a job cancellation request.

Finally, the last operation that concerns us is job status recording and output retrieval. During computation the Broker acts as the connection point between a user and the resources available on the Grid. Alongside input capabilities, Brokers provide output capabilities: they provide users with the final outcome of their jobs, and record job statuses for monitoring purposes and failure reasons to provide feedback.
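The matchmaking and transfer operations can be sketched as follows. The dictionary fields, the ranking policy (prefer the element with the most free CPU) and the broker names are assumptions made for illustration only; they do not describe how the gLite Workload Management System actually ranks resources.

```python
def match(job, elements):
    """Pick a computing element that satisfies the job requirements, or None.

    Ranking by most free CPU is an illustrative policy, not the middleware's.
    """
    candidates = [
        ce for ce in elements
        if ce["free_cpu"] >= job["cpu"] and ce["free_storage"] >= job["storage"]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda ce: ce["free_cpu"])["id"]

def handle(job, local_elements, peer_brokers):
    """Return (event, broker, element) for a submitted job."""
    ce = match(job, local_elements)
    if ce is not None:
        return ("match", "local", ce)
    # No local element fits: try brokers the user is authorised to reach.
    for broker, elements in peer_brokers.items():
        ce = match(job, elements)
        if ce is not None:
            return ("transfer", broker, ce)
    return ("defer", None, None)

LOCAL = [
    {"id": "ce1", "free_cpu": 4, "free_storage": 10},
    {"id": "ce2", "free_cpu": 8, "free_storage": 20},
]
PEERS = {"rb2": [{"id": "ce7", "free_cpu": 64, "free_storage": 500}]}

print(handle({"cpu": 2, "storage": 5}, LOCAL, PEERS))
print(handle({"cpu": 32, "storage": 100}, LOCAL, PEERS))
print(handle({"cpu": 128, "storage": 1}, LOCAL, PEERS))
```

The three calls exercise a local match, a transfer to a peer broker and a job that no known broker can place.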

2.1.3 Computing and Storage elements

We can identify Computing and Storage Elements as the resource providers available on the Grid. Traditionally these components were single machines within a domain, to which authorised users were granted access to perform computational tasks. In modern Grid setups these components are clusters of machines. As the naming suggests, Computing Elements generally provide computational resources (CPU time, large RAM configurations) whereas Storage Elements provide storage space, usually in terms of shared filesystems or database systems. In the remainder of this work we will group both under the term Computing Elements. These are the parts of the Grid where the job is actually computed; Computing Elements interact with Resource Brokers to notify them which resources are available at a given point in time and to report back the status of assigned jobs. Computing Elements are the components of the Grid that show the characteristic of volatility the most. Within a domain, resources may appear, disappear or change location at a given time. Think, for example, of a database system that needs to be pulled out of a Computing Element cluster for maintenance and is later reintroduced in a new location (a new machine within the network).

2.1.4 Time

We refer to time as a component of Grid systems because, though it is not a hardware or software part, it plays a role both in the variation of the network topology over the short and long term and in a job’s lifetime. From a topological point of view, the shape of the network commonly changes over time as a result of nodes being pulled in and out of the network. This can be the consequence of users joining or leaving the network, or of brokers and computing elements being moved, shut down or added as a consequence of upgrades, maintenance processes or failures. Time is also tightly bound to a job’s lifetime. Though the average lifetime varies with the kind of process being computed and with the type of service provided by the Grid itself, it has been observed that the time elapsed from submission to completion can be expected to be in the order of hours or days. Such a long lifetime means both long computation time, in terms of long permanence on Computing Elements, and multiple submissions to Brokers and Computing Elements, with possible transfers to other Virtual Organisations.
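A job’s lifetime and its number of resubmissions can be derived from a timestamped event trail, as in the sketch below. The timestamps, the event names and the trail layout are hypothetical and do not reproduce the format of the logs used in this work.

```python
from datetime import datetime

# Hypothetical event trail for one job: (ISO timestamp, event name).
EVENTS = [
    ("2010-03-01T08:00:00", "submitted"),
    ("2010-03-01T08:05:00", "matched"),
    ("2010-03-01T20:00:00", "resubmitted"),
    ("2010-03-02T10:30:00", "done"),
]

def lifetime_hours(events):
    """Hours elapsed between the first and the last recorded event."""
    times = [datetime.fromisoformat(stamp) for stamp, _ in events]
    return (max(times) - min(times)).total_seconds() / 3600

def resubmissions(events):
    """Number of times the job was sent back to a broker."""
    return sum(1 for _, name in events if name == "resubmitted")

print(lifetime_hours(EVENTS), resubmissions(EVENTS))
```

In this example the job stays on the Grid for a little more than a day and is resubmitted once, which is the order of magnitude discussed above.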

2.2 Topology properties

The topology resulting from interactions between components is characterised by a set of properties peculiar to Grid computational networks. These influence the overall system’s behaviour in terms of networking (for instance, how links between the machines underlying components are established and vary over time), deployment of functionalities (i.e. which middleware enables computation and communication between nodes) and the kind of tasks a Grid network can be used for.

Grid properties can be summarised as:

1. Dynamic
2. Autonomy of nodes
3. Diversity of tasks
4. Adaptation

Dynamic, in the sense that the resources available on nodes may vary in time; moreover, node location and availability can change. As we previously discussed, Users, Brokers and Computing Elements may join, leave or be unavailable at given points in time. This property is a consequence of the autonomy of nodes: Brokers and Computing Elements are independent and autonomous from each other, and each domain has its own zone of control. In this context autonomy means both that nodes may be managed by different administrators, thus embodying a variety of policies, and that each node is autonomous from an operational point of view: multiple operating systems, software packages and network equipment can be deployed on a given node without explicit knowledge of the other components taking part in the Grid.

As we previously stated, the kind of service provided by the Grid reflects both on its topology and on resource availability, allocation and deployment. Even within a given type of Grid, various tasks consume different resources; this property takes the name of diversity and has an impact, for instance, on the number of domains involved, the number and quality of Computing (and Storage) Elements, and the criteria for authorising and authenticating users. Diversity of tasks also means diversity in job lifetime and resource needs.

To cope with autonomy, functions in Grids are realised through middleware services; this aspect takes the name of adaptation, in the sense that it is necessary to adapt local environments to be able to communicate with other members of the Grid. This process is enforced by means of middleware that runs the functionalities of Grid components and by standard formats that drive operations such as authentication, job submission and communication between components.
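The dynamic property can be made concrete by comparing the set of components observed in two time windows. The sketch below is only an illustration: the snapshot dates, the component identifiers and the churn measure are assumptions, not quantities defined in this thesis.

```python
def churn(before, after):
    """Compare two topology snapshots given as sets of component ids."""
    joined = after - before
    left = before - after
    total = before | after
    rate = (len(joined) + len(left)) / len(total) if total else 0.0
    return sorted(joined), sorted(left), rate

# Hypothetical snapshots of the components seen in two consecutive windows.
SNAPSHOTS = {
    "2010-03-01": {"rb1", "ce1", "ce2", "user1"},
    "2010-03-02": {"rb1", "ce2", "ce3", "user1", "user2"},
}

joined, left, rate = churn(SNAPSHOTS["2010-03-01"], SNAPSHOTS["2010-03-02"])
print(joined, left, rate)
```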

2.3 Challenges

The complexity of Grid topologies makes monitoring and system administration hard tasks. Cross-domain site knowledge may be difficult (or even impossible) to obtain due to security and privacy related limitations. Given the number of technologies that play a role in the system, correlation of factors may be a key point in understanding the source of malfunctions or unexpected behaviours, and thus in enabling a solution. In particular we would like to build a model able to represent and allow the study of:

1. a dynamic and continuously changing topology;
2. job paths;
3. allocation and reservation of resources (resource brokers);
4. performance (or malfunction) prediction.

Chapter 3

Model

In previous chapters we presented an overview of what is meant by grid computing in the context of our work. In this chapter we relate that description more closely to the goal and framework of our study and bridge it to the machine learning world. We introduce the model we designed and implemented to construct a framework that enables the analysis of events on a Grid network by taking into account observable data and inferring the status and behavioural characteristics of the components involved in job processing. First we introduce the problem domain, giving a definition of event and of the kind of problems our method addresses. We then describe the dataset our method has been built and evaluated upon. Section 3.3 discusses the choices we have made for modelling Grid dynamics. Finally, in Section 3.4 we show how a model of the Grid is built.

3.1 Problem domain

In [12] the authors give multiple formal definitions of behaviour types. The work is driven by the need to characterise complex network systems not only in terms of the analysis of transmission facilities, structure, protocol design and network performance (recurring topics in network research) but also from a macroscopic, cognitive angle. A general definition of behaviours given by Lu et al. is the following:

Definition: Behaviours are the modalities or characteristics of some purposeful, distinguishable and measurable process executed by entities. Any entity has the attributes of executing various behaviours. Network behaviour is the embodiment of the capabilities or functions of active entities, which can directly or indirectly affect entity states in network space. [...] Network behaviours can make system states change from one state to another. According to different analysis goals, system states may be security states, control states or confrontation states, etc.

We can give a definition of Grid behaviour as the variation of component statuses and interconnections (topology) as a result of changes, over time, in computational load and in the configurations and requirements of the parties involved.


We propose a model to characterise behaviour in terms of events happening on the network. We call event a particular situation derived from the status and interaction of components processing a given job, whose identification and prediction can lead to three possible benefits: 1. can provide a source of knowledge to a system administrator or Grid user; 2. can be used as a source of feedback to proactively make decisions about Grid management; 3. arises from specific components behaviour. Practically an event is a configuration of components and jobs that share similar characteristics at a given moment in time. In machine learning terms events are discovered by clustering information inferred from log data according to a domain specific similarity metric. Clusters of events are the core subject of our network behaviour analysis. Conventional network monitoring and intrusion prevention system solutions defend a network’s perimeter by using packet inspection, signature detection and blocking. These tools operate at very specific low level protocols and are constrained within the topology an admin has full access to. Behavioural solutions, on the other hand, continuously watch what is happening inside the network from a higher perspective, often disregarding operating system and protocol specific quirks, collecting less low level details and aggregating data from many points to support analysis. In a distributed setting such as the Grid, events may be caused by interaction between components located in geographically and authoritative distinct domains. This problem is also aggravated by the speed at which things happen on the network, on the extensive amounts of generated data and application specific usage of the network. 
From an Artificial Intelligence perspective this scenario poses interesting challenges in terms of data-driven modelling and knowledge discovery: the goal is to determine usage, traffic and resource availability patterns that provide a source of feedback to help network monitoring and improve usage. We identify the possible audience of a practical implementation of our method via two use cases: network administrators and users. Operational problems for a system administrator consist of manually analysing large log files and correlating events in order to isolate and understand anomalies or malfunctions on the network. The knowledge that can be accessed is usually limited to the network an administrator is in charge of, and this may introduce overhead that results in a large amount of time spent debugging problems. Given that this is extremely time consuming, in both human and computational terms, a network administrator will usually look at logs only when an anomaly has already happened.

3.1. PROBLEM DOMAIN

A user is an active player on the Grid; he develops applications, specifies requirements and monitors the status of the jobs he has submitted. Common patterns in Grid traffic are generated by user-related issues such as wrongly specified requirements, incorrectly submitted applications and wrongly reported network bugs. Properly characterising user patterns would also remove monitoring overhead from administrators by discriminating human mistakes from hardware failures. Both figures could benefit from a behavioural model of the Grid. An administrator could gain a cross-domain, higher-level overview of what is happening on the network, observe trends in network traffic and submitted jobs, identify possible anomalies and correlate events with information provided by conventional monitoring tools. A user, on the other hand, would receive feedback on the impact his jobs have on the network, debug job-specific problems on his own and isolate broken components so as to report bugs more accurately.

3.1.1

Related work

Applying machine learning techniques to infer the behaviour of Grid component dynamics is a novel field of interest. To our knowledge the resources available in the literature are still modest in quantity. Relevant to our study are the works presented in [4], [27] and [14].

In [4] the authors suggest a symptom-based model to diagnose malfunctioning of the resources involved in the Grid. The method consists of identifying a set of indicators of possible failure sources (nodes that run out of space, access denied to resources) and extracting from these a set of rules to match observations from log files and diagnose situations of possible malfunctioning (symptoms). The authors investigate and compare several unsupervised and semi-supervised methods to predict behaviours. This work reports good examples of events that may be interesting and feasible to discover in Grid log data. As a drawback, the type of analysis performed is static and focused on domain-specific job information that may not be feasible to observe in a cross-domain setup. The authors' approach is closer to conventional traffic and system analysis; it is focused on configuration-specific aspects of Grid sites rather than on the global network.

[27] mines the logging and bookkeeping database using a clustering algorithm. The goal of the authors is to group failed jobs into a given number of clusters, each of which represents a possible reason for failure. Again we find in this paper examples of failure classes as possible events to discover. The authors describe an algorithm able to refine similarities between paths over subsequent iterations. This method, though, does not take component status and the variation of topology over time into account. Both the algorithm and the experimental setup are aimed at discovering similarity by analysing job hops, without considering the status of exit and entrance nodes.

Similarly, [14] aims at finding associations in Grid monitoring data by using an approach based on rule extraction.
The authors do not focus on jobs only but try to extract hidden information from the various components of the Grid. Rules are extracted by first analysing job life cycles over short periods of time and are then refined to obtain associations and meaningful information given a domain-specific

metric. Rules are automatically mined from the observed data over variable amounts of time. Rules consist of variations of job statuses given the network components the jobs traverse; despite taking the variation of job statuses into account, the approach provides no real notion of Grid topology changes and no characterisation of the traversed components. Another difference from our approach is that no analysis is performed directly on the clusters and no semantics is given to the discovered classes; the method relies instead on human knowledge and feedback. Rules are used to trace back failing components.

The method we propose is a novel approach to mining Grid data. Similarly to [14], starting from a set of job-centric log data, we infer hidden component information from job transactions. The inference results both in a reconstruction of the network topology at given moments in time and in a characterisation of component status in the network. We represent this information in a vector space and cluster job vectors in order to discover events. We then construct a classifier for the clusters and use human experts to supply labels.
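As a rough illustration of the clustering step described above, the following minimal sketch groups job vectors with a pluggable similarity function. It is not the algorithm used in this work: the greedy leader clustering, the cosine similarity, the `threshold` value and the three example features (hops, failures, queue wait) are all assumptions chosen only to keep the example small and self-contained.

```python
from math import sqrt
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Stand-in for a domain-specific similarity metric (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_jobs(
    vectors: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
    threshold: float = 0.9,
) -> List[int]:
    """Greedy leader clustering.

    Each vector joins the first cluster whose leader (the first vector
    assigned to that cluster) is at least `threshold` similar to it;
    otherwise it starts a new cluster. Returns one label per vector.
    """
    leaders: List[Vector] = []
    labels: List[int] = []
    for v in vectors:
        for idx, leader in enumerate(leaders):
            if similarity(v, leader) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing leader is similar enough: open a new cluster.
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


# Hypothetical job vectors: (hops, failures, queue_wait_minutes).
job_vectors = [
    (3.0, 0.0, 1.0),
    (3.0, 0.0, 2.0),
    (1.0, 5.0, 0.0),
    (1.0, 4.0, 0.0),
]
labels = cluster_jobs(job_vectors)
print(labels)  # [0, 0, 1, 1]
```

Each resulting cluster is a candidate event; in the workflow described above, a classifier would then be trained on the clusters and human experts would supply the labels. Passing a different `similarity` function is the hook where a domain-specific metric would be plugged in.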

3.2

Data

The dataset used in this work has been provided by the Grid Observatory (www.grid-observatory.org; the Grid Observatory is part of the EGEE-III EU project INFSO-RI-222667), an open project that collects, publishes and analyses data on the behaviour of the European Grid for E-science (EGEE). The project community is divided into 12 federations, consisting of over 70 contractors and over 30 non-contracting participants covering both scientific and industrial applications. The two current pilot application domains are the Large Hadron Collider Computing Grid, supporting physics experiments, and Biomedical Grids. The network is run on top of the gLite middleware, a software suite that implements all the components previously described.

The gLite middleware deployed on the EGEE infrastructure integrates the sites' computing resources through the Workload Management System (WMS). The WMS is a set of middleware-level services responsible for the distribution and management of jobs. The computational resources of each site present a common interface to the WMS, the Computing Element (CE) service. A WMS is de facto an implementation of a Resource Broker and is referred to as such in our work. In addition it provides a facility, Logging & Bookkeeping, that tracks jobs in terms of events (important points in the life of a job, e.g. submission, finding a matching CE, starting execution, etc.) gathered from various Resource Broker components as well as from sites. It comes in the form of a relational (SQL) database whose most important tables are events, short fields and long fields. Upon creation each job is assigned a unique, virtually non-recyclable job identifier, and its interactions with components are recorded as sequential time-stamped entries in events. For each such record, related entries in short fields and long fields contain detailed information about a given event. In this table we can find the host a job

Figure 3.1: Job lifecycle in gLite