A Brief Introduction to Workflow Scheduling and Workflow Mining

Author: Alyson Bruce
Xiao Liu {[email protected]}
CS3, Swinburne University of Technology, Melbourne, Australia

Outline

> Background
– Workflow and Workflow Management System
– Grid Computing, Cloud Computing
– Business Workflow vs. Scientific Workflow
> Workflow Scheduling
– Classification
– Representative Scheduling Algorithms
– Research Issues Related to Workflow Temporal Verification
> Workflow Mining
– Representative Mining Algorithms
– ProM Mining Tool
– Research Issues Related to Workflow Temporal Verification

Workflow and Workflow Management System

> Workflow: the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.
> A Workflow Management System is a system that provides procedural automation of a business process by managing the sequence of work activities and by managing the required resources (people, data & applications) associated with the various activity steps.
-- [Workflow Management Coalition]

Workflow in a Business Process

WFMS Components

[Figure: WfMC workflow reference model. Labels recovered from the figure:]
– Process Definition Tools (Interface 1: Process Definition Import/Export)
– Workflow Enactment Service, containing the Workflow Engine(s)
– Client Apps, Worklist Handler (Interface 2)
– Invoked Applications via Tool Agent: legacy, desktop, etc. (Interface 3)
– Other Workflow Enactment Service(s) and their Workflow Engine(s) (Interface 4: Interoperability)
– Administration & Monitoring Tools (Interface 5)

Computing Infrastructures: Grid/Cloud Computing

> “Computer Utilities” Vision: Implications of the Internet
> 1969 – Leonard Kleinrock, ARPANET project
– “As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country”
> Computers Redefined
– 1984 – John Gage, Sun Microsystems: “The network is the computer”
– 2008 – David Patterson, U. C. Berkeley: “The data center is the computer. There are dramatic differences between developing software for millions to use as a service versus distributing software for millions to run on their PCs”
– 2008 – Rajkumar Buyya: “Cloud is the computer”

Some slides in this section are borrowed from Dr. Buyya’s presentations

What is Grid

> A type of parallel and distributed system that enables the sharing, selection, & aggregation of geographically distributed “autonomous” resources:
– Computers – PCs, workstations, clusters, supercomputers, laptops, notebooks, mobile devices, PDA, etc;
– Software – e.g., ASPs renting expensive special purpose applications on demand;
– Catalogued data and databases – e.g. transparent access to human genome database;
– Special devices/instruments – e.g., radio telescope – SETI@Home searching for life in galaxy.
– People/collaborators.
> depending on their availability, capability, cost, and user QoS requirements.

Grid Architecture

[Figure: layered Grid architecture. Layers: Grid Applications; User-Level Middleware (Grid Tools); Core Grid Middleware; Grid Fabric Software; Grid Fabric Hardware; Worldwide Grid. Other labels recovered from the figure: Science, Commerce, Engineering, Collaboratories, Portals; MPI, ExcellGrid, Gridscape, Workflow, Workflow Engine, X-Parameter Sweep Lang.; Grid Brokers: Nimrod-G, Gridbus Data Broker; Globus, Unicore, Alchemi, NorduGrid, XGrid; Grid Economy: Grid Bank, Grid Market Directory, Grid Storage Economy, Grid Exchange & Federation, Libra; .NET, JVM, Condor, PBS, SGE, Tomcat; Windows, Solaris, Linux, AIX, IRIX, Mac, OSF1; GRIDSIM.]

Prominent Grid Drivers: Emerging e-Science and e-Business Apps

> Next generation experiments, simulations, sensors, satellites, even people and businesses are creating a flood of data.
> e-Science refers to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.

[Figure: example application areas. Labels recovered from the figure: High Energy Physics (BELLE), Brain Activity Analysis, Life Sciences, Digital Biology (CDB, PDB), Astronomy, Quantum Chemistry, Finance: Portfolio analysis, Internet & Ecommerce, Newswire & data mining: Natural language engineering; ~PBytes/sec.]

On Demand Assembly of Services: Interaction Between Grid Components

[Figure: numbered interactions (1–12) between Grid components. Labels recovered from the figure: Application Code, Explore data, Visual Application Composer, Data Source (instruments/distributed sources), Data, Data Catalogue, Data Replicator (GDMP), Grid Resource Broker, Grid Info Service, ASP Catalogue, Grid Market Directory, Grid Service (GS) (Globus), Alchemi, Bill GS, Cluster Scheduler, GTS, CPU or PE, Grid Service Provider (GSP) (e.g., CERN, IBM, UofM, VPAC), Gridbus GridBank GSP (Accounting Service).]

What is Cloud

> Over 20 definitions: http://cloudcomputing.sys-con.com/read/612375_p.htm
> “A Cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualised computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements established through negotiation between the service provider and consumers.” -- Dr Rajkumar Buyya
> Keywords: Virtualisation (VMs), Dynamic Provisioning (negotiation and SLAs), and Web 2.0 access interface
> Cloud Services:
– Infrastructure as a Service (IaaS): CPU, Storage: Amazon.com, Nirvanix, GoGrid, …
– Platform as a Service (PaaS): Google App Engine, Microsoft Azure, Manjrasoft Aneka, …
– Software as a Service (SaaS): SalesForce.Com

Benefits of (Public) Clouds

> No upfront infrastructure investment
– No procuring hardware, setup, hosting, power, etc.
> On demand access
– Lease what you need, when you need it.
> Efficient Resource Allocation
– Globally shared infrastructure can always be kept busy by serving users from different time zones.
> Nice Pricing
– Based on Usage, QoS, Supply and Demand, Loyalty, …
> Application Acceleration
– Parallelism for large-scale data analysis, what-if scenario studies, …
> High Availability
> Supports Creation of 3rd Party Services & Seamless offering
– Builds on infrastructure and follows similar Business model as Cloud

Challenges: Dealing with too many issues and offerings

[Figure: a user surrounded by issues: Storage, Billing, Reliability, Utility Management, Scalability, Web 2.0, Software Eng. Complexity. Caption: “Uhm, I am not quite clear… Yet another complex IT paradigm?”]

Market-oriented Cloud Architecture: QoS negotiation and SLA-based Resource Allocation

Cloud Architecture

[Figure: layered Cloud architecture, from user level down to system level:]
– Cloud applications: Social computing, Enterprise, ISV, Scientific, CDNs, ...
– User-Level Middleware: Cloud programming environments and tools: Web 2.0 Interfaces, Mashups, Concurrent and Distributed Programming, Workflows, Libraries, Scripting
– Core Middleware (Apps Hosting Platforms): QoS Negotiation, Admission Control, Pricing, SLA Management, Monitoring, Execution Management, Metering, Accounting, Billing; Virtual Machine (VM), VM Management and Deployment
– Cloud resources
– Alongside the layers: Adaptive Management; Autonomic / Cloud Economy

Grid vs. Cloud

> Too many to say (if you focus on technical details)
> Yet, too few to say (if you focus on the general items)
> Grid Computing: technical models; Cloud Computing: business models
> Some truths (my personal view):
– Grid Computing is going down and Cloud Computing is rising
– Grid Computing failed to achieve what it promised a decade ago
– Grid Computing focuses on facilitating various resources and dedicating them to solving a single task, especially a scientific task
– Cloud Computing: IaaS, PaaS, SaaS; utility and market oriented.
– Money! Saving money for individual users and enterprise users
– Money! Saving terrible waste on IT (hardware) investments by big companies
– Money! Research needs to promote business profit

E-Business vs. E-Science

[Figure: application areas spanning E-Business and E-Science. Labels recovered from the figure: Utility computing, High-performance computing, Collaborative design, Financial modeling, Collaborative data-sharing, High-energy physics (BELLE), Drug discovery, Life sciences, Data center automation, Natural language processing, Business Intelligence (Data Mining).]

Business Workflow vs. Scientific Workflow

Generally speaking, a scientific workflow can be characterised by four aspects:
> Large scale and complex workflow processes
> Data and computation intensive activities
> Dynamic and distributed high performance infrastructure
> Higher automation and less human interaction

Business workflow is relatively (not in all cases) less intensive on the above aspects.

From a research perspective:
> Unless your research is on a specific eBusiness or eScience application, whether to use a business workflow or a scientific workflow scenario depends on how significant your work will be in each scenario.

Demo

> Kepler: A Scientific Workflow Management System – Kepler Software
> GOGRID: – http://www.gogrid.com/

Questions and Comments?

> Any questions or comments so far?

Workflow Scheduling

> Workflow scheduling is one of the key issues in workflow management.
> Scheduling is a process that maps and manages the execution of interdependent tasks on distributed resources. It allocates suitable resources to workflow tasks so that the execution can be completed to satisfy objective functions imposed by users. Proper scheduling can have a significant impact on the performance of the system.
> In general, the problem of mapping tasks on distributed services is NP-hard.
> In this presentation, we focus on workflow scheduling algorithms in heterogeneous distributed system environments, e.g. grid workflow scheduling algorithms.

Ref: J. Yu and R. Buyya, "Workflow Scheduling Algorithms for Grid Computing," Technical Report GRIDS-TR-2007-10, The University of Melbourne, Australia, May 31, 2007.

Grid Workflow Scheduling

> Many heuristics have been developed to schedule inter-dependent tasks in homogeneous and dedicated cluster environments. However, there are new challenges for scheduling workflow applications in a Grid environment:
– Resources are shared on Grids and many users compete for resources.
– Resources are not under the control of the scheduler.
– Resources are heterogeneous and may not all perform identically for any given task.
– Many workflow applications are data-intensive and large data sets are required to be transferred between multiple sites.
> Therefore, Grid workflow scheduling is required to consider non-dedicated and heterogeneous execution environments. It also needs to address the issue of large data transmission across various data communication links.

Driving Theme: Community Grids vs. Utility Grids

Feature | Community Grids | Utility Grids
User QoS | Best effort | Contract/SLA
Service Pricing | Not considered / free access | Usage, QoS level, market supply and demand
Example Workflow Systems | Triana, MyGrid, Askalon, DAGMan, Pegasus, GrADS, Kepler | Gridbus Grid Workflow Engine

> Scheduling on Community Grids
– Minimise the execution time based on best effort (ignores factors such as the monetary cost of resource access and various users’ QoS satisfaction levels).
> Scheduling on Utility Grids
– Focuses on mapping workflow tasks on services to satisfy users’ QoS constraints (e.g. deadline, the quality of produced data).
– Supports negotiation and establishment of an SLA as a contract between users and providers.
– Optimises performance under the most important QoS constraints imposed by users.
• Minimise execution cost while meeting a specified deadline.
• Minimise execution time while meeting a specified budget.
– Supports SLA-based allocation of resources so that multiple competing demands from users can be managed with the aim of enhancing providers’ profit.

Classification of Grid workflow scheduling algorithms

Best-effort based workflow scheduling

> Best-effort based workflow scheduling algorithms are targeted towards Grids in which resources are shared by different organisations, based on a community model (known as community Grid).
> In community model based resource allocation, monetary cost is not considered during resource access. Best-effort based workflow scheduling algorithms attempt to complete execution at the earliest time, or to minimise the makespan of the workflow application.

QoS-constraint based workflow scheduling

> Many workflow applications require some assurance of quality of service (QoS).
> However, completing the execution within a required QoS depends not only on the global scheduling decision of the workflow scheduler, but also on the local resource allocation model of each execution site. The scheduler must be able to negotiate with service providers to establish a service level agreement (SLA), which is a contract specifying the minimum expectations and obligations between service providers and consumers.
> Users normally would like to specify a QoS constraint for the entire workflow. The scheduler needs to determine a QoS constraint for each task in the workflow, such that the QoS of the entire workflow is satisfied.
> In general, service-oriented Grid services are based on utility computing models. Users need to pay for resource access, and service pricing is based on the QoS level and current market supply and demand. Therefore, unlike the scheduling strategy deployed in community Grids, QoS-constraint based scheduling may not always need to complete the execution at the earliest time. Users may sometimes prefer cheaper services with a lower QoS that is sufficient to meet their requirements.

Representative Workflow Scheduling Algorithms

> Best-effort Based
– Heuristics
• Myopic
• Min-Min and Max-Min
• Sufferage
• Heterogeneous-Earliest-Finish-Time (HEFT)
• Hybrid heuristic
• TANH
– Metaheuristics
• Greedy Randomized Adaptive Search Procedure (GRASP)
• Genetic Algorithms (GAs)
• Simulated Annealing (SA)
• Ant Colony Optimisation (ACO)
> Dynamic Scheduling Techniques
> QoS-constraint based workflow scheduling
– Deadline constrained scheduling
• Heuristics: Back-tracking, Deadline distribution (TD)
• Metaheuristics: GA
– Budget constrained scheduling
• Heuristics: LOSS and GAIN
• Metaheuristics: GA

Myopic algorithm

The Myopic algorithm is an individual task scheduling method, the simplest way to schedule workflow applications: each scheduling decision is based on only one individual task. The algorithm schedules an unmapped ready task to the resource that is expected to complete the task earliest, and repeats until all tasks have been scheduled.
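The Myopic algorithm described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `myopic_schedule`, the `parents` dictionary and the `exec_time` table are hypothetical, data transfer time between tasks is ignored, and the workflow is assumed to be acyclic.

```python
def myopic_schedule(tasks, parents, exec_time):
    """Myopic scheduling sketch.

    tasks     -- list of task names (hypothetical input format)
    parents   -- dict: task -> iterable of parent tasks
    exec_time -- dict: task -> list of estimated run times, one per resource
    Data transfer time is ignored (an assumption of this sketch).
    """
    n_resources = len(exec_time[tasks[0]])
    resource_free = [0.0] * n_resources   # time at which each resource is idle
    finish = {}                           # task -> finish time
    placement = {}                        # task -> chosen resource index
    unmapped = list(tasks)

    while unmapped:
        # Take the first unmapped task whose parents have all been scheduled.
        task = next(t for t in unmapped
                    if all(p in finish for p in parents.get(t, ())))
        ready_at = max((finish[p] for p in parents.get(task, ())), default=0.0)

        def completion(r):
            # The task starts when both the resource and its inputs are ready.
            return max(resource_free[r], ready_at) + exec_time[task][r]

        # Decide using this single task only: earliest expected completion.
        best = min(range(n_resources), key=completion)
        finish[task] = completion(best)
        resource_free[best] = finish[task]
        placement[task] = best
        unmapped.remove(task)

    return placement, finish


# Example: a diamond-shaped workflow A -> (B, C) -> D on two resources.
tasks = ["A", "B", "C", "D"]
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
exec_time = {"A": [2, 3], "B": [4, 2], "C": [3, 5], "D": [1, 1]}
placement, finish = myopic_schedule(tasks, parents, exec_time)
```

Because each decision looks at one task in isolation, the result depends on the order in which ready tasks are picked; the list heuristics that follow replace that arbitrary order with an explicit priority.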

List Scheduling: Min-Min, Max-Min, Sufferage

> A list scheduling heuristic prioritises workflow tasks and schedules the tasks based on their priorities. There are two major phases in a list scheduling heuristic: the task prioritising phase and the resource selection phase.
– The task prioritising phase sets the priority of each task with a rank value and generates a scheduling list by sorting the tasks according to their rank values.
– The resource selection phase selects tasks in the order of their priorities and maps each selected task to its optimal resource.

Min-Min, Max-Min

Sufferage

> Instead of using the minimum MCT (minimum completion time) or the maximum MCT, the Sufferage heuristic sets the priority of tasks based on their sufferage value. The sufferage value of a task is the difference between its earliest completion time and its second earliest completion time.

List Scheduling Algorithms: Comparison

> The Min-Min heuristic schedules the tasks with the shortest execution time first, so a higher percentage of tasks are assigned to their best choice (the resource that can complete them earliest) than with the Max-Min heuristic. Experimental results have shown that the Min-Min heuristic outperforms the Max-Min heuristic.
> However, since Max-Min schedules the tasks with the longest execution time first, a long task has more chance of being executed in parallel with shorter tasks. Therefore, the Max-Min heuristic might be expected to perform better than the Min-Min heuristic in cases where there are many more short tasks than long tasks.
> On the other hand, since the Sufferage heuristic considers the adverse effect on the completion time of a task if it is not scheduled on the resource with the minimum completion time, it is expected to perform better where there are large performance differences between resources. Experimental results show that the Sufferage heuristic produced the shortest makespan of the three heuristics in a high-heterogeneity environment. However, some argue that the Sufferage heuristic could perform worst in the case of data-intensive applications in multiple cluster environments.
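The three list heuristics compared above differ only in how the next task is prioritised, which a small sketch can make concrete. This is a simplified illustration, not the published algorithms in full: it assumes a batch of independent ready tasks, a hypothetical `exec_time` table of estimated run times per resource, and no data transfer cost; the name `batch_schedule` is invented for the example.

```python
def batch_schedule(exec_time, heuristic="min-min"):
    """Schedule independent ready tasks with Min-Min, Max-Min or Sufferage.

    exec_time -- dict: task -> list of estimated run times, one per resource
    Returns (plan, makespan), where plan is a list of
    (task, resource index, completion time) in scheduling order.
    """
    n_resources = len(next(iter(exec_time.values())))
    resource_free = [0.0] * n_resources
    unscheduled = set(exec_time)
    plan = []

    while unscheduled:
        choice = None
        for task in sorted(unscheduled):
            # Completion time of this task on every resource, best first.
            completions = sorted(
                (resource_free[r] + exec_time[task][r], r)
                for r in range(n_resources)
            )
            mct, resource = completions[0]   # minimum completion time
            if heuristic == "min-min":
                priority = -mct              # smallest MCT goes first
            elif heuristic == "max-min":
                priority = mct               # largest MCT goes first
            elif heuristic == "sufferage":
                second = completions[1][0] if n_resources > 1 else mct
                priority = second - mct      # loss if the best resource is missed
            else:
                raise ValueError(f"unknown heuristic: {heuristic}")
            if choice is None or priority > choice[0]:
                choice = (priority, task, resource, mct)

        _, task, resource, mct = choice
        resource_free[resource] = mct
        unscheduled.remove(task)
        plan.append((task, resource, mct))

    return plan, max(resource_free)


# Two short tasks and one long task on two resources.
exec_time = {"a": [2, 4], "b": [3, 6], "c": [10, 12]}
for name in ("min-min", "max-min", "sufferage"):
    plan, makespan = batch_schedule(exec_time, name)
    print(name, makespan, plan)
```

In this toy input the long task dominates, which is the situation in which the slide expects Max-Min to do well; changing the table is an easy way to explore the other cases.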

Heterogeneous-Earliest-Finish-Time (HEFT) algorithm

[Figure: numbered items (1)–(4); content not recoverable]

Hybrid heuristic: dependency mode and batch mode

[Figure: numbered items (1)–(4); content not recoverable]

TANH Algorithm

> Cluster based scheduling and duplication based scheduling are designed to avoid the transmission time of results between data-interdependent tasks. Cluster based scheduling groups tasks into clusters and assigns the tasks in the same cluster to the same resource, while duplication based scheduling uses the idle time of a resource to duplicate some parent tasks that are also scheduled on other resources.
> TANH: a task duplication based scheduling algorithm for network of heterogeneous systems

Greedy Randomized Adaptive Search Procedure (GRASP)

GRASP: Challenges

> Local search?
– How to define the neighbouring solutions?

Genetic Algorithm (GA)

Fundamentals for GA based Scheduling
1. Encoding/Decoding
2. Genetic Operators: Crossover, Mutation and Selection.
3. Fitness Evaluation Function

GA: Challenges

> Random generation of valid initial solutions?
> How to ensure the generation of valid solutions after Crossover or Mutation?
– Valid solutions:
• Correct precedence relationships defined by input workflow process models (the most different yet challenging issue compared with other non-workflow scheduling problems)
> Overheads?

Simulated Annealing (SA)

Fundamentals for SA based Scheduling
1. Initial solution
2. Annealing process
3. Metropolis algorithm

SA: Challenges

> Generation of valid initial solutions?
> “At each cycle, it generates a new solution by applying random change on the current solution”, how?
> Overheads?

Ant Colony Optimisation (ACO)

Fundamentals for ACO based Scheduling
1) Initialization of algorithm
2) Initialization of ants
3) Solution construction
4) Local pheromone updating
5) Global pheromone updating
6) Terminal test: if passed then stop, if failed then go to step 2).
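To make the three SA fundamentals listed above (initial solution, annealing process, Metropolis algorithm) concrete, here is a minimal sketch. It shows one possible answer to the “random change” question, not the method of any particular paper: a solution is a task-to-resource mapping, tasks always run in a fixed topological order, and a neighbour is obtained by moving one task to another resource, so precedence relationships stay valid by construction. All names, the cooling parameters and the `exec_time` table are assumptions made for the example.

```python
import math
import random


def makespan(order, parents, exec_time, mapping):
    """Makespan of a mapping when tasks run in the given topological order."""
    n_resources = len(next(iter(exec_time.values())))
    resource_free = [0.0] * n_resources
    finish = {}
    for task in order:
        r = mapping[task]
        start = max([resource_free[r]] + [finish[p] for p in parents.get(task, ())])
        finish[task] = start + exec_time[task][r]
        resource_free[r] = finish[task]
    return max(finish.values())


def anneal(order, parents, exec_time, temp=10.0, cooling=0.9,
           steps_per_temp=20, min_temp=0.01, seed=0):
    rng = random.Random(seed)
    n_resources = len(next(iter(exec_time.values())))

    # 1. Initial solution: a random mapping is always valid in this encoding.
    current = {t: rng.randrange(n_resources) for t in order}
    current_cost = makespan(order, parents, exec_time, current)
    best, best_cost = dict(current), current_cost

    # 2. Annealing process: lower the temperature geometrically.
    while temp > min_temp:
        for _ in range(steps_per_temp):
            # Random change: move one task to a randomly chosen resource.
            candidate = dict(current)
            candidate[rng.choice(order)] = rng.randrange(n_resources)
            cost = makespan(order, parents, exec_time, candidate)
            delta = cost - current_cost
            # 3. Metropolis criterion: always accept improvements, accept a
            #    worse solution with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_cost = candidate, cost
                if current_cost < best_cost:
                    best, best_cost = dict(current), current_cost
        temp *= cooling

    return best, best_cost


order = ["A", "B", "C", "D"]                      # a topological order
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
exec_time = {"A": [2, 3], "B": [4, 2], "C": [3, 5], "D": [1, 1]}
best, best_cost = anneal(order, parents, exec_time)
```

The overhead question remains: every candidate costs one full makespan evaluation, so the number of temperature levels and steps per level bounds the running time.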

ACO: Challenges

> Definition of Pheromone and Heuristic Information
– e.g. time greedy, cost greedy, overall greedy
> Construction of Solution Schedules
> Pheromone Management
– Pheromone updating

Comparison of Best-effort Workflow Scheduling Algorithms

[Table not recoverable. Legend:]
• v is the number of tasks
• m is the number of resources
• g is the number of tasks in a group

Dynamic Scheduling Techniques

> The heuristics assume that the estimation of the performance of task execution and data communication is accurate. However, it is difficult to accurately predict execution performance in Grid environments due to their dynamic nature.
> Therefore, the workflow scheduler must be able to adapt to resource dynamics and update the schedule using up-to-date system information. In general, two approaches, task partitioning and iterative re-computing, have been proposed to allow these scheduling approaches to allocate resources more efficiently in a dynamic environment.
– Task partitioning partitions a workflow into multiple sub-workflows which are executed sequentially.
– Iterative re-computing (workflow rescheduling) keeps applying the scheduling algorithm on the remaining unexecuted partial workflow during the workflow execution.

Deadline constrained scheduling

> Back-tracking:
– The heuristic assigns available tasks to the least expensive computing resources.
– If there is more than one available task, the algorithm assigns the task with the largest computational demand to the fastest resource in its available resource list. The heuristic repeats the procedure until all tasks have been mapped.
– After each iterative step, the execution time of the current assignment is computed. If the execution time exceeds the time constraint, the heuristic back-tracks to the previous step, removes the least expensive resource from its resource list and reassigns tasks with the reduced resource set. If the resource list is empty, the heuristic keeps back-tracking to the previous step, reduces the corresponding resource list and reassigns the tasks.

Deadline constrained scheduling

> Deadline distribution (TD)
– Instead of back-tracking and repairing the initial schedule, the TD heuristic partitions a workflow and distributes the overall deadline over the tasks based on their workload and dependencies. After deadline distribution, the entire workflow scheduling problem has been divided into several sub-task scheduling problems.
– Once each task has its own sub-deadline, a local optimal schedule can be generated for each task. If each local schedule guarantees that its task execution can be completed within its sub-deadline, the whole workflow execution will be completed within the overall deadline.
– Similarly, the result of the cost minimisation solution for each task leads to an optimised cost solution for the entire workflow.

Budget constrained scheduling

> LOSS and GAIN:
– The LOSS and GAIN approaches adjust a schedule generated by a time-optimised heuristic (LOSS) or by a cost-optimised heuristic (GAIN) so that it meets the user’s budget constraint.
– If the total execution cost of the time-optimised schedule is not greater than the budget, the schedule can be used as the final assignment; otherwise, the LOSS approach is applied. The idea behind LOSS is to accept the minimum loss in execution time for the maximum money savings, while amending the schedule to satisfy the budget.
– If the total execution cost of the cost-optimised schedule is less than the budget, the GAIN approach is applied to use the surplus to decrease the execution time. The idea behind GAIN is to gain the maximum benefit in execution time for the minimum monetary cost, while amending the schedule.
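The LOSS idea above (give up as little execution time as possible per unit of money saved, until the budget is met) can be sketched as a greedy loop. This is a deliberately simplified illustration: it treats each task's time loss in isolation rather than recomputing the workflow makespan, and the function name `loss_adjust` and the `time`/`cost` tables are hypothetical.

```python
def loss_adjust(assignment, time, cost, budget):
    """Greedy LOSS-style repair of a time-optimised schedule.

    assignment -- dict: task -> resource index (time-optimised start point)
    time, cost -- dict: task -> list of run time / price per resource
    Repeatedly applies the reassignment with the smallest
    (time lost) / (money saved) ratio until total cost <= budget.
    """
    assignment = dict(assignment)
    total = sum(cost[t][r] for t, r in assignment.items())

    while total > budget:
        best = None
        for task, r_old in assignment.items():
            for r_new in range(len(cost[task])):
                saving = cost[task][r_old] - cost[task][r_new]
                if saving <= 0:
                    continue              # only moves that save money
                slowdown = time[task][r_new] - time[task][r_old]
                weight = slowdown / saving
                if best is None or weight < best[0]:
                    best = (weight, task, r_new, saving)
        if best is None:
            raise ValueError("budget cannot be met with the given resources")
        _, task, r_new, saving = best
        assignment[task] = r_new
        total -= saving

    return assignment, total


# Resource 0 is fast and expensive, resource 1 is slow and cheap.
time = {"a": [2, 5], "b": [3, 4]}
cost = {"a": [10, 4], "b": [8, 5]}
fastest = {"a": 0, "b": 0}                # time-optimised schedule, cost 18
adjusted, total = loss_adjust(fastest, time, cost, budget=15)
```

GAIN is the mirror image: start from the cheapest schedule and repeatedly apply the move with the largest time gain per unit of extra cost while the budget allows.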

GA and many other metaheuristic algorithms

> Capable of optimising both time and budget
– Minimum time within budget
– Minimum budget within deadline
> Controlled by the evaluation function

Comparison of QoS Workflow Scheduling Algorithms

Research Issues Related to Workflow Temporal Verification

> Grid/Scientific Workflow Scheduling
– Minimum completion time within budget
> Constraint setting: fine-grained temporal constraints for each individual workflow activity
> Temporal adjustment: handling temporal violations
– Compensating occurring time deficits by rescheduling subsequent activities
– GRASP, GA, SA, ACO, PSO and so on: which is the best one for handling temporal violations?
• GA – ICSP’09, submitted to TSE
• ACO – submitted to ASE’09

Questions and Comments?

> Any questions or comments so far?

What is Workflow Mining or Process Mining

Process Mining Overview

[Figure, from www.processmining.org: what can be obtained from event logs: 1) basic performance metrics, 2) process model, 3) organizational model, 4) social network, 5) performance characteristics, 6) auditing/security (If … then …). The example process model contains the activities Start, Register order, Prepare shipment, (Re)send bill, Ship goods, Contact customer, Receive payment, Archive order, End.]

Some slides in this section are borrowed from Prof. van der Aalst’s presentations

The Starting Point

The α Algorithm

α-algorithm - Ordering Relations >, →, ||, #

> Direct succession: x>y iff for some case x is directly followed by y.
> Causality: x→y iff x>y and not y>x.
> Parallel: x||y iff x>y and y>x.
> Unrelated: x#y iff not x>y and not y>x.

Example event log (excerpt):
case 1 : task A
case 2 : task A
case 3 : task A
case 3 : task B
case 1 : task B
case 1 : task C
case 2 : task C
case 4 : task A
case 2 : task B
...

Traces: ABCD, ACBD, EF
Direct succession: A>B, A>C, B>C, B>D, C>B, C>D, E>F
Parallel: B||C, C||B
Causality: A→B, A→C, B→D, C→D, E→F
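The four ordering relations defined above follow mechanically from the traces, as this small sketch shows. It uses the three example traces from the slide (ABCD, ACBD, EF); the function name `ordering_relations` and the representation of a trace as a string of single-letter task names are assumptions made for the example.

```python
from itertools import product


def ordering_relations(traces):
    """Derive the α-algorithm ordering relations from a list of traces."""
    tasks = sorted({task for trace in traces for task in trace})

    # Direct succession: x > y iff x is directly followed by y in some trace.
    succession = {(x, y) for trace in traces for x, y in zip(trace, trace[1:])}

    # Causality: x -> y iff x > y and not y > x.
    causality = {(x, y) for (x, y) in succession if (y, x) not in succession}

    # Parallel: x || y iff x > y and y > x.
    parallel = {(x, y) for (x, y) in succession if (y, x) in succession}

    # Unrelated: x # y iff neither x > y nor y > x.
    unrelated = {
        (x, y) for x, y in product(tasks, repeat=2)
        if (x, y) not in succession and (y, x) not in succession
    }
    return succession, causality, parallel, unrelated


succession, causality, parallel, unrelated = ordering_relations(
    ["ABCD", "ACBD", "EF"]
)
print(sorted(causality))
print(sorted(parallel))
```

The relations are only as good as the log: if the interleaving ACBD had never been observed, B and C would be classified as causally ordered rather than parallel.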

The Basic Ideas

Basic α Algorithm

Step by Step

α-algorithm - Insight

Traces: ABCD, ACBD, EF
Parallel: B||C, C||B
Causality: A→B, A→C, B→D, C→D, E→F

[Figure: the mined net, with nodes A, B, C, D and, separately, E, F]

α-algorithm – Log properties + target nets

> If the log is complete with respect to the relation >, it can be used to mine an SWF-net without short loops.
> Structured Workflow Nets (SWF-nets) have no implicit places, and the following two constructs cannot be used:
– Choice and synchronisation should never meet
– Synchronisation directly preceded by an OR-join is not allowed

α-algorithm – No short loops

> One-length: B>B and not B>B implies B→B (impossible!)
> Two-length: A>B and B>A implies A||B and B||A instead of A→B and B→A

α-algorithm – Common Constructs

Why no short loops?
Why no invisible tasks?
Why no duplicate tasks?
Why noise-free logs?

> No invisible tasks, non-free-choice or duplicate tasks
> No noisy logs

α-algorithm

Summary

Evaluation Dimensions*

> Fitness. The first dimension is fitness, which indicates how much of the observed behaviour is captured by (i.e., “fits”) the process model. For example, the model in Figure 2(c) is only able to reproduce the sequence ABDEI, but not the other sequences in the log. Therefore, its fitness is poor.

> Precision. The second dimension addresses overly general models. For example, the model in Figure 2(d) allows for the execution of activities A – I in any order (i.e., also the sequences in the log). Therefore, the fitness is good, but the precision is poor. Note that the model in Figure 2(b) is also considered to be a precise model, although it additionally allows for the trace ACGHDFI (which is not in the log). Because the number of possible sequences generated by a process model may grow exponentially, it is not likely that all the possible behaviour has been observed in a log. Therefore, process mining techniques strive for weakening the notion of completeness (i.e., the amount of information a log needs to contain to be able to rediscover the underlying process). For example, they want to detect parallel tasks without the need to observe every possible interleaving between them.

> Generalization. The third dimension addresses overly precise models. For example, the model in Figure 2(e) only allows for exactly the five sequences from the log. In contrast to the model in Figure 2(b), which also allows for the trace ACGHDFI, no generalization was performed in the model in Figure 2(e). To determine the right level of generalization remains a challenge, especially when dealing with logs that contain noise (i.e., distorted data). Similarly, in the context of more unstructured and/or flexible processes, it is essential to further abstract from less important behaviour (i.e., restriction rather than generalization). In general, abstraction can lead to the omission of connections between activities, which could mean lower precision or lower fitness (e.g., only capturing the most frequent paths). Furthermore, steps in the process could be left out completely. Therefore, abstraction must be seen as a different evaluation dimension, which needs to be balanced against precision and fitness.

> Structure. The last dimension is the structure of a process model, which is determined by the vocabulary of the modelling language (e.g., routing nodes with AND and XOR semantics). Often there are several syntactic ways to express the same behaviour, and there may be “preferred” and “less suitable” representations. For example, the fitness and precision of the model in Figure 2(e) are good, but it contains many duplicate tasks, which makes it difficult to read.

*A. Rozinat, A.K. Medeiros, C.W. Gunther, A. Weijters, and W.M.P. van der Aalst, Towards an Evaluation Framework for Process Mining Algorithms. BPM Center Report BPM-07-06, BPMcenter.org, 2007

Overview

Quality of the Mined Model: Soundness and Completeness*

A (fully) complete workflow is such that all traces in the log at hand are compliant with some instance of it, whereas a (fully) sound workflow is such that all of its possible enactments have been actually registered in the log.

*G. Greco, A. Guzzo, L. Pontieriand D. Sacca, Discovering Expressive Process Models by Clustering Log Traces, IEEE TKDE, VOL. 18, NO. 8, AUGUST 2006

Multi-Phase Miner

Genetic Mining

Soundness, Completeness, Precision, Generalization, Structure, and more

Region-Based Approaches

Language-Based Regions

Summary – Representative Process Mining Algorithms

Extensions

Balance

Relevance

Learn from Maps

Fuzzy Miner

Discover Other Perspectives

Conformance and Extension

Fitness

Questions and Comments?

> Any questions or comments so far?

People in Process Mining

ProM

Demo

> ProM

Research Issues Related to Workflow Temporal Verification

> Process model is the basis for workflow management

> Mining temporal information is the focus for temporal verification

– Activity duration distribution models

– Statistical information on routing (choice, iteration)

– System bottlenecks, critical activities

> Setting temporal constraints: where?

> Temporal violations: temporal violation patterns?

> Other mining techniques: time-series mining, association mining
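As a minimal sketch of the temporal information listed on the Research Issues slide, the following code derives per-activity duration statistics, choice probabilities at a split, and a candidate bottleneck from an event log. It assumes a hypothetical log format: `(activity, start, end)` tuples with times in seconds for durations, and traces given as strings of activity names for routing. A real log (for example one loaded into ProM) would need its own parsing step, and a normal distribution summarised by mean and standard deviation is only one possible duration model.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def duration_stats(events):
    """Map each activity to (mean duration, sample standard deviation)."""
    samples = defaultdict(list)
    for activity, start, end in events:
        samples[activity].append(end - start)
    return {
        activity: (mean(d), stdev(d) if len(d) > 1 else 0.0)
        for activity, d in samples.items()
    }


def branch_probabilities(traces, split):
    """Estimate how often each successor is chosen right after `split`."""
    counts = Counter(
        nxt
        for trace in traces
        for cur, nxt in zip(trace, trace[1:])
        if cur == split
    )
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()} if total else {}


def bottleneck(events):
    """Return the activity with the largest mean duration (log must be non-empty)."""
    stats = duration_stats(events)
    return max(stats, key=lambda activity: stats[activity][0])


# Hypothetical data for illustration only.
example_events = [("A", 0, 10), ("A", 0, 14), ("B", 10, 40), ("B", 14, 50)]
example_traces = ["ABD", "ABD", "ACD"]

if __name__ == "__main__":
    print(duration_stats(example_events))
    print(branch_probabilities(example_traces, "A"))
    print(bottleneck(example_events))
```

Statistics of this kind could serve as input when deciding where to set temporal constraints, since activities with long or highly variable durations are natural candidates for checkpoints.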

Major Conferences

> Workflow Management Systems and Workflow Mining

– BPM: http://www.uni-ulm.de/in/iui-bpm09.html (A)

– CAiSE: http://caise09.thenetworkinstitute.eu/ (A)

– CoopIS: http://www.onthemove-conferences.org/index.php/coopis (A)

– APWEB: http://apweb-waim09.suda.edu.cn/ (B)

> Grid/Cloud Computing, Workflow Scheduling

– CCGrid: http://grid.sjtu.edu.cn/ccgrid2009/ (A)

– eScience: http://grid.sjtu.edu.cn/ccgrid2009/ (A)

– ICPP: http://www.cse.ohio-state.edu/~icpp2009/ (A)

– IPDPS: http://www.ipdps.org/ (A)

> Temporal Verification and others

– ICSE (A+), FSE (A), ASE (A)

– BI, KDD and Data Mining Conferences / Workshops: http://www.kmining.com/info_conferences.html

– Prof. Hai Jin’s Homepage: http://grid.hust.edu.cn/call/cfp.jsp

Summary

> Workflow Management Systems

– Grid/Cloud workflows

– eBusiness/eScience workflows

> Workflow Scheduling

– Classical NP hard problems

– Major solution for handling temporal violations

– New strategies are required in upcoming cloud age

> Workflow/Process Mining

– Business intelligence, hot research area in business process management, workflow management systems

– Workflow mining is not all about mining process

Major Journals

> TSE: IEEE Transactions on Software Engineering (A*)

> TOSEM: ACM Transactions on Software Engineering and Methodology (A*)

> TPDS: IEEE Transactions on Parallel and Distributed Systems (A*)

> JPDC: Journal of Parallel and Distributed Computing (A*)

> FGCS: Future Generation Computer Systems (A)

> CCPE: Concurrency and Computation: Practice and Experience (A)

> TKDE: IEEE Transactions on Knowledge and Data Engineering (A)

> ESA: Expert Systems With Applications (A)

> JIS: Journal of Information Systems (A)

> DKE: Data & Knowledge Engineering (B)

> TAAS: ACM Transactions on Autonomous and Adaptive Systems (B)

> T-ASE: IEEE Transactions on Automation Science and Engineering

Some Useful Links

> Workflow Management Systems

– http://is.tm.tue.nl/staff/wvdaalst/

– http://is.tm.tue.nl/staff/wvdaalst/workflowcourse/

– http://is.tm.tue.nl/staff/wvdaalst/BPMcenter/

> Grid/Cloud Computing

– http://www.buyya.com/

– https://kepler-project.org/

– http://www.gogrid.com/

> Workflow Scheduling: http://www.buyya.com/

> Workflow Mining: http://prom.win.tue.nl/research/wiki/

> Research Work of WT in CS3, Swinburne:

– Prof. Yun Yang: http://www.ict.swin.edu.au/ictstaff/yyang

– Dr. Jinjun Chen: http://www.swinflow.org/

Questions and Comments?

> Any questions or comments so far?

The End, Thanks!