Quantifying the Impact of Platform Configuration Space for Elasticity Benchmarking

Study Thesis

Nikolas Roman Herbst
At the Department of Informatics
Institute for Program Structures and Data Organization (IPD), Informatics Innovation Center (IIC)

Reviewer: Prof. Ralf Reussner
Advisor: Dr.-Ing. Michael Kuperberg
Second advisor: Dipl.-Inform. Nikolaus Huber

Duration: April 14th, 2011 – August 31st, 2011

KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

www.kit.edu

Disclaimer

The measurements and the results presented in this study thesis have been obtained using prototypic implementations of research ideas, deployed in a non-productive experimental environment. They are neither representative of the performance of IBM System z, nor can they be used for comparison or reference purposes. Neither IBM nor any other mentioned hardware/software vendor has sanctioned or verified the information contained in this document. Any reproduction, citation or discussion of the results contained herein must be accompanied by this disclaimer in complete and untranslated form. Usage of these results for marketing or commercial purposes is strictly prohibited.

Declaration of Originality

I declare that I have developed and written the enclosed Study Thesis completely by myself, and have not used sources or means that do not belong to my intellectual property without declaration in the text.

Karlsruhe, 2011-08-26

Contents

Abstract

1. Introduction
   1.1. Context
   1.2. Contribution

2. Scalability
   2.1. Problem Description
   2.2. Definition of Scalability

3. Elasticity
   3.1. A Definition of Elasticity
   3.2. Elasticity Metrics
   3.3. Direct and Indirect Measuring of Elasticity Metrics
   3.4. Extraction of Provisioning Time using Dynamic Time Warping Algorithm (DTW) for Cause Effect Mapping
   3.5. Interpretation of Provisioning Time Values based on Extraction by DTW Algorithm
   3.6. A Single-valued Elasticity Metric?

4. Elasticity Benchmark for Thread Pools
   4.1. Variable Workload Generation
   4.2. Experiment Setup
   4.3. Extraction of Elasticity Metrics
   4.4. Experiments and Platforms
   4.5. Results and Elasticity Metric Illustrations
        4.5.1. Experiment 1 - Platform 1
        4.5.2. Experiment 3 - Platform 1
        4.5.3. Experiment 5 - Platform 1
   4.6. Observations and Experiment Result Discussion

5. Elasticity Benchmark for Scaling Up of z/VM Virtual Machines
   5.1. Experiment Setup
   5.2. Workload Generation
   5.3. Results

6. Future Work
   6.1. Scale Up Experiment using CPU Cores as Resource
   6.2. Elasticity Benchmark for Scaling out the Number of z/VM Virtual Machine Instances in a Performance Group

7. Conclusion

8. Acknowledgement

Bibliography

Appendix
   A. Experiment 2 and 4 on Platform 1
      A.1. Experiment 2 - Platform 1
      A.2. Experiment 4 - Platform 1
   B. Experiments 1-5 on Platform 2
      B.1. Experiment 1 - Platform 2
      B.2. Experiment 2 - Platform 2
      B.3. Experiment 3 - Platform 2
      B.4. Experiment 4 - Platform 2
      B.5. Experiment 5 - Platform 2

Abstract

Elasticity is the ability of a software system to dynamically adapt the amount of the resources it provides to clients as their workloads increase or decrease. In the context of cloud computing, automated resizing of a virtual machine's resources can be considered as a key step towards optimisation of a system's cost and energy efficiency. Existing work on cloud computing is limited to the technical view of implementing elastic systems, and definitions of scalability have not been extended to cover elasticity. This study thesis presents a detailed discussion of elasticity, proposes metrics as well as measurement techniques, and outlines next steps for enabling comparisons between cloud computing offerings on the basis of elasticity. I discuss results of our work on measuring elasticity of thread pools provided by the Java virtual machine, as well as an experiment setup for elastic CPU time slice resizing in a virtualized environment. An experiment setup is presented as future work for dynamically adding and removing z/VM Linux virtual machine instances to a performance relevant group of virtualized servers.

1. Introduction

The technical report "Defining and Quantifying Elasticity of Resources in Cloud Computing and Scalable Platforms" [1] was an early result of the research work in our team, consisting of Jóakim von Kistowski, Michael Kuperberg and myself as authors. Some contents and figures that have already been published in this technical report are reused in sections 1-4 of this study thesis.

1.1. Context

In cloud computing [2, 3], resources and services are made available over a network, with the physical location, size and implementation of the resources and services being transparent. With its focus on flexibility, dynamic demands and consumption-based billing, cloud computing enables on-demand infrastructure provisioning and "as-a-service" offerings of applications and execution platforms. Cloud computing is powered by virtualization, which enables an execution platform to provide several concurrently usable (and independent) instances of virtual execution platforms, often called virtual machines (VMs). Virtualization itself is a decades-old technique [2, 4], and it has matured over many generations of hardware and software, e.g. on IBM System z mainframes [5, 6]. Virtualization comes in many types and shapes: hardware virtualization (e.g. PR/SM), OS virtualization (e.g. Xen, VMware ESX), middleware virtualization (e.g. JVM) and many more. It has become very popular in industry and academia, leading to a large number of new products, business models and publications. Mature cloud computing platforms promise to approximate performance isolation: a struggling VM with saturated resources (e.g. 100% CPU load) should only minimally affect the performance of VMs on the same native execution platform. To implement this behaviour, a virtual execution platform can make use of a predefined maximum allowed share of the native platform's resources. Some platforms even provide facilities for run-time adaptation of these shares, and cloud computing platforms which feature run-time and/or demand-driven adaptation of provided resources are often called elastic. Platform elasticity is the feature of automated, dynamic, flexible and frequent resizing of the resources that an execution platform provides to an application. Elasticity can be considered a key benefit of cloud computing. Elasticity carries the potential for
optimizing system productivity and utilization while maintaining service level agreements (SLAs) and quality of service (QoS), as well as saving energy and costs. While the space of choices between cloud computing providers in each domain (public, private and hybrid clouds) is getting wider, the means for comparing their features and service qualities are not yet fully developed. Several security issues for virtualised systems running on the same physical hardware remain open, as does guaranteeing performance independence from so-called "noisy neighbours". Scalability metrics were already proposed by M. Woodside et al. in 2000 [7], evaluating a system's productivity at different levels of scale. For assessing how "elastic" a system is, these metrics are insufficient because they do not take the temporal aspect of automated scaling actions into account. The viewpoint of dynamic scalability demands further observations and metrics on how often, how fast and how significantly scaling of a system can be executed. The relevance of researching this topic is confirmed by the Gartner study "Five Refining Attributes of Public and Private Cloud Computing" from May 2009 [8], which states elasticity of virtual resources as one central highlight of modern cloud computing technology. The term "elasticity" itself is already widely used in the advertising of cloud infrastructure providers; Amazon even names its infrastructure "Elastic Compute Cloud (EC2)". As an example of automated resizing actions, Amazon offers an automated scaling API to its EC2 clients, which is described in detail in [9]. The client can control the number of virtual machine instances via policies that observe the average CPU usage in a scaling group of virtual machine instances. The utility of these features is demonstrated in a number of use cases within this manual. However, Amazon has not yet published any figures on how fast scaling actions are executed, and the scaling automatism is restricted to controlling the number of VMs in a group by observing CPU usage. A finer granularity of automated, programmable scaling actions, such as resizing a VM's virtual resources at run time, could be useful to Amazon's cloud clients.
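
To illustrate the kind of policy-driven scaling described above, the following is a minimal, hypothetical sketch of a reactive scaling loop; the CloudApi interface, the thresholds and the evaluation period are my own assumptions for illustration and do not reflect Amazon's actual API.

    import java.util.concurrent.TimeUnit;

    /** Minimal sketch of a reactive scaling policy. The CloudApi interface and all
     *  threshold values are hypothetical; they only illustrate the idea of a policy
     *  that observes the average CPU usage of a scaling group. */
    public class SimpleScalingPolicy {

        interface CloudApi {                       // hypothetical provider API
            double averageCpuUsage();              // 0.0 .. 1.0 across the scaling group
            int instanceCount();
            void setInstanceCount(int target);
        }

        public static void run(CloudApi api) throws InterruptedException {
            final double scaleOutAt = 0.80, scaleInAt = 0.30;
            final int min = 1, max = 20;
            while (true) {
                double cpu = api.averageCpuUsage();
                int n = api.instanceCount();
                if (cpu > scaleOutAt && n < max) {
                    api.setInstanceCount(n + 1);   // provision one more VM instance
                } else if (cpu < scaleInAt && n > min) {
                    api.setInstanceCount(n - 1);   // release one VM instance
                }
                TimeUnit.MINUTES.sleep(1);         // evaluation period of the policy
            }
        }
    }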

1.2. Contribution

In this study thesis I outline the basic idea of elasticity in the context of cloud computing. I discuss the term scalability as the feature that enables a system to behave elastically. Furthermore, I explain several key metrics that characterize resource elasticity and discuss possible ways towards direct and indirect measurement of these elasticity metrics. An appropriate definition of resource elasticity in the context of cloud computing could not yet be found in the research literature, for example in [10, 11, 7, 12]. A clear definition of resource elasticity is therefore provided in this study thesis, as well as a detailed outline of key metrics characterizing the elastic behaviour of a system. The basic concepts of resource elasticity metrics are transferable to other (also non-IT) contexts where changing resource demands are supposed to be matched by provided resources. As validation of the concept of elasticity metrics, the following three experiments, which are based on different elastic resource pools, are presented in detail, followed by an interpretation of the measurement results. Configuration parameters that directly influence the elasticity behaviour will be highlighted.

Experiments:

• Dynamic growing and shrinking of Java thread pools
• Variable CPU time slice distribution to virtual machines
• Dynamically adding/removing virtual machine instances to a performance relevant group of virtualized servers

The conducted experiments aim towards the development of a dedicated benchmark to measure resource elasticity metrics. By validating the approach on a reference cloud environment, namely an IBM System z, resource elasticity could probably be added as an intuitively comparable parameter for differentiation between cloud computing offerings.

2. Scalability

An execution platform that is not scalable within certain boundaries cannot be elastic at all. Scalability is the basis for the elasticity of a system. For this reason, scalability is introduced and discussed first in this chapter, whereas elasticity is then defined in the following chapter.

2.1. Problem Description

Scalability is a term that can be applied both to applications and to execution platforms. In this section, I introduce a more precise terminology for scalability and discuss existing approaches to measuring it. Scalability of computing systems is a crucial concern already at the design phase of software components. A lot of research has been done on how to achieve good scalability of software or hardware systems. This includes approaches for avoiding bottlenecks in a system's architecture that can slow the system down even when more resources are assigned to it. As outlined in the technical report [1], a system is considered scalable when the response time and throughput of the jobs processed by the system lie within certain acceptable intervals, even if the number of users increases over time. A scalable system that is adapted by a system administrator can achieve this only if the workload of the system is somehow predictable. For a scalable system, the productivity should stay more or less constant even if the number of users varies significantly. The system's overhead for organizing more sessions should grow at most linearly with the number of users, and not in a higher polynomial or even exponential function class. Due to system hardware or software design issues, scalability can only be provided within certain boundaries. Being aware of these boundaries is important when scaling a system automatically. For example, when a web page becomes popular, the number of users can change by orders of magnitude, and at high speed in both directions. In this case, manual resizing of a system would be ineffective due to high hardware investments and administration costs. Amazon's EC2 offers an Auto Scaling API [9] to its clients that can automatically adapt the number of virtual machine instances that share the same workloads. Amazon has not yet published figures on how quickly these changes are executed,
and just states the upper bound of 21 virtual machines as the limit where communication and synchronisation overheads start to increase. Means for describing the temporal aspects of automated scaling behaviour may be helpful when comparing different cloud platforms.

2.2. Definition of Scalability

The following definitions have been presented in section 2 of the technical report [1]. Application scalability is the property that an application maintains its performance goals/SLAs even when its workload increases (up to a certain upper bound). This upper workload bound defines a scalability range and highlights the fact that scalability is not endless: above a certain workload, the application will not be able to maintain its performance goals or SLAs. Application scalability is limited by the application's design and by the application's use of execution platform resources. In addition to a performance-aware implementation (efficient resource use and reuse, minimization of waiting and thrashing, etc.), application scalability means that the application must be able to make use of additional resources (e.g. CPU cores, network connections, threads in a thread pool) as the demand-driven workload increases. For example, an application which is only capable of actively using 2 CPU cores will not scale to an 8-core machine, as long as the CPU core is the limiting ("bottleneck") factor. Of course, to exercise its scalability, the application must be "supported" by the execution platform, which must be able to provide the resources needed by the application. This means that when establishing metrics for quantifying scalability, the scalability metric values are valid for a given amount of resources and a given amount/range of service demand. The speedup for the same service demand with additional resources, and the efficiency, can then be determined as measures of how well the application uses the additionally provided resources. Correspondingly, platform scalability is the ability of the execution platform to provide as many (additional) resources as needed (or explicitly requested) by an application. In our terminology, an execution platform comprises the hardware and/or software layers that an application needs to run. An example application can be an operating system (where the execution platform comprises hardware and possibly a hypervisor) or a web shop (where the execution platform comprises middleware, operating system and hardware). Other examples of execution platforms include IBM System z (for running z/OS and other operating systems), a cloud environment (e.g. IBM CloudBurst [13]) or a "stack" encompassing hardware and software (e.g. LAMP [14]). The execution platform can be visible as a single large "black box" (whose internal composition is not visible from outside) or as a set of distinguishable layers/components. There are two "dimensions" of platform scalability, and a system can be scalable in none, one or both of them: Scaling vertically or scaling up means adding more resources to a given platform node, such as additional CPU cores, bigger CPU time-slice shares or more memory, so that the platform node can handle a larger workload. When scaling up a single platform node, physical limits that restrict bandwidth, computational power etc. are often reached quite fast. Scaling horizontally or scaling out means adding new nodes (e.g. virtual machine instances or physical machines) to a cluster or distributed system so that the entire
system can handle bigger workloads. Depending on the type of application, the high I/O performance demands of individual instances that work on shared data often increase communication overheads and prevent substantial performance gains, especially when adding nodes at larger cluster sizes. In some scenarios, scaling horizontally may even result in performance degradation. Note that this definition of platform scalability does not include the temporal aspect: while scaling means that additional resources are requested and used, the definition of scalability does not specify how fast, how often and how significantly the needed resources are provisioned. Additionally, scalability is not a constant property: the state of the execution platform and the state of the application (and its workload) are not considered. In fact, scalability can depend on the amount of already provided resources, on the utilization of these resources, and on the amount of service demand (such as the number of requests per second). Fig. 2.1 sketches a simplified, synthetic example of how a fixed allocation of available resources (i.e. no platform scalability) means that the response time rises with increasing workload intensity and diminishes when the workload intensity does the same. In Fig. 2.1, the response time rises monotonically with the workload intensity (and falls monotonically while the workload intensity diminishes). The reality may be more complicated.

Figure 2.1.: Schematic example of a scenario where the execution platform is not scalable: the (idealized) correlation between application workload intensity and application response time

Fig. 2.2 in turn sketches how, on a scalable execution platform and with a fixed workload, the performance metric (response time) should be affected positively by additional resources. Note that in Fig. 2.2, the response time decreases monotonically with additional resources. Woodside and Jogalekar [7] established a scalability metric based on productivity. To define scalability, they use a scale factor k (which is not further explained in [7]) and observe the following three metrics:

• λ(k): throughput in responses/second at scale k,
• f(k): average value of each response, calculated from its quality of service at scale k,
• C(k): costs at scale k, expressed as a running cost (per second, to be uniform with λ)

Figure 2.2.: Schematic Example of a Scenario with a fixed Application Workload and a scalable execution platform: the (idealized) correlation between the amount of resources provided by the platform and the application response time

Productivity F(k) is calculated in [7] as the value delivered per second (throughput times the value of each response), divided by the cost per second: F(k) = λ(k) · f(k) / C(k). Then, Woodside and Jogalekar postulate that "if productivity of a system is maintained as the scale factor k changes, the system is regarded as scalable". Finally, the scalability metric ψ relating systems at two different scale factors k1 and k2 is defined as the ratio of their productivity figures: ψ(k1, k2) = F(k2) / F(k1). While this definition allows comparing scalability from the workload (or application) view, it is not possible to compare execution platforms, as the metric is specific to a given workload. Additionally, the definition assumes a "stable state" and does not consider the actual process of scaling, in which resource allocations are adapted. Therefore, the definition of elasticity provided here will not be based on the scalability definition from [7]. In the next chapter, a platform-centric definition of resource elasticity is presented which considers temporal and quantitative aspects of execution platform scaling.
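
As a compact restatement, the following sketch computes the productivity F(k) and the scalability ratio ψ(k1, k2) from measured figures; the method and variable names are mine, the formulas follow [7], and the numbers used in main are purely hypothetical.

    /** Productivity and scalability metric after Woodside and Jogalekar [7]. */
    public final class ScalabilityMetric {

        /** Productivity F(k) = lambda(k) * f(k) / C(k). */
        public static double productivity(double throughputPerSec, double valuePerResponse,
                                          double costPerSec) {
            return throughputPerSec * valuePerResponse / costPerSec;
        }

        /** Scalability psi(k1, k2) = F(k2) / F(k1); values close to (or above) 1 mean that
         *  productivity is maintained when moving from scale k1 to k2. */
        public static double scalability(double productivityAtK1, double productivityAtK2) {
            return productivityAtK2 / productivityAtK1;
        }

        public static void main(String[] args) {
            // Hypothetical example figures for two scale factors k1 and k2.
            double f1 = productivity(100.0, 0.90, 5.0);   // F(k1)
            double f2 = productivity(190.0, 0.85, 10.0);  // F(k2)
            System.out.printf("psi(k1,k2) = %.2f%n", scalability(f1, f2));
        }
    }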

3. Elasticity

The following definitions can also be found, in similar form, in our publication [1]. When service demand increases, elastic cloud platforms dynamically add resources (or make more resources available to a task). Thus, elasticity adds a dynamic component to scalability - but what does this dynamic component look like? On an ideal elastic platform, as the application workload intensity increases, the distribution of the response times of an application service should remain stable as additional resources are made available to the application. Such an idealistic view is shown by the synthetic example in Fig. 3.1, which also includes dynamic un-provisioning of resources as the application workload decreases. Note that the dynamic adaptation is a continuous (non-discrete) process in Fig. 3.1.

Figure 3.1.: Schematic Example of an (unrealistically) ideal elastic System with immediate and fully-compensating elasticity

However, in reality, resources are measured and provisioned in larger discrete units (e.g. one processor core, processor time slices, one page of main memory), so a continuous, idealistic scaling/elasticity cannot be achieved. On an elastic cloud platform, the performance metric (here: response time) will rise as the workload intensity increases until a certain threshold is reached, at which point the cloud platform will provide additional
resources. Once the application detects the additional resources and starts making use of them, the performance will recover and improve - for example, response times will drop. This means that in an elastic cloud environment with changing workload intensity, the response time is in fact not as stable as it was in Fig. 3.1. Now that we have determined a major property of elastic systems, the next question is: how can the elasticity of a given execution platform be quantified? The answer to this question is provided by Fig. 3.2 - notice that it reflects the previously mentioned fact that the performance increases at certain discrete points. To define and measure elasticity, it will be necessary to quantify the temporal and quantitative properties of those points at which performance is increased.

Figure 3.2.: Schematic Example of an Elastic System

3.1. A Definition of Elasticity

Changes in resource demands or explicit scaling requests trigger run time adaptations of the amount of resources that an execution platform provides to applications. The magnitude of these changes depends on the current and previous state of the execution platform, and also on the current and previous behaviour of the applications running on that platform. Consequently, elasticity is a multi-valued metric that depends on several run time factors. This is reflected by the following definitions, which are illustrated by Fig. 3.3: Elasticity of execution platforms consists of the temporal and quantitative properties of runtime resource provisioning and un-provisioning, performed by the execution platform; execution platform elasticity depends on the state of the platform and on the state of the platform-hosted applications. A reconfiguration point is a time point at which a platform adaptation (resource provisioning or un-provisioning) is processed by the system. Note that a reconfiguration point is different from (and later than) the corresponding triggering point (which is in most cases system-internal and starts the reconfiguration phase); also take into consideration that the effects of the reconfiguration may become visible some time after the reconfiguration point, since an application needs time to adapt to the changed resource availability. Besides this, it is important to know that while reconfiguration points and the time point of visibility of effects may be measurable, the triggering points may not be directly observable.

Figure 3.3.: Three aspects of the Proposed Elasticity Metric

3.2. Elasticity Metrics

There are several characteristics of resource elasticity, which (as already discussed above) are parametrised by the platform state/history, application state/history and workload state/history: The effect of reconfiguration is quantified by the amount of added/removed resources and thus expresses the granularity of possible reconfigurations/adaptations. The temporal distribution of reconfiguration points describes the density of reconfiguration points over a possible interval of a resource's usage amounts, or over a time interval, in relation to the density of changes in workload intensity. Provisioning time or reaction time is the time interval between the instant at which a reconfiguration has been triggered/requested and the instant at which the adaptation has been completed. An example of a provisioning time would be the time between the request for an additional thread and the instant of actually holding it. An example matrix describing the elasticity characteristics of a hypothetical cloud platform is given in Fig. 3.4. Each resource is represented by a vector of the three aforementioned key elasticity characteristics.

Figure 3.4.: Elasticity Matrix for an Example Platform
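
A possible data structure for such an elasticity matrix is sketched below; the field names, units and example values are assumptions for illustration and are not taken from Fig. 3.4.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Sketch of an elasticity matrix as in Fig. 3.4: one vector of the three key
     *  elasticity characteristics per resource. Units are illustrative assumptions. */
    public class ElasticityMatrix {

        static class ElasticityVector {
            final double effectOfReconfiguration;   // average amount of added/removed resource units
            final double reconfigurationDensity;    // reconfiguration points per workload change
            final double provisioningTimeMs;        // average reaction time in milliseconds

            ElasticityVector(double effect, double density, double provisioningTimeMs) {
                this.effectOfReconfiguration = effect;
                this.reconfigurationDensity = density;
                this.provisioningTimeMs = provisioningTimeMs;
            }
        }

        private final Map<String, ElasticityVector> rows = new LinkedHashMap<>();

        void put(String resource, ElasticityVector v) { rows.put(resource, v); }

        public static void main(String[] args) {
            ElasticityMatrix m = new ElasticityMatrix();
            // Hypothetical entries for a fictitious platform.
            m.put("threads", new ElasticityVector(1.0, 0.8, 35.0));
            m.put("cpuTimeSlice", new ElasticityVector(0.05, 0.5, 500.0));
        }
    }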

3.3. Direct and Indirect Measuring of Elasticity Metrics

In general, the effects of scalability are visible to the user/client via changing response times or throughput values at a certain scaling level of the system. The elasticity, on the other hand - namely the resource resizing actions - may not be directly visible to the client, due to their shortness or due to the client's limited access to the execution platform's state and configuration.

Therefore, measuring resource elasticity from a client's perspective by observing throughput and response times requires indirect measurements and approximations, whereas measuring resource elasticity on the "server side" (i.e. directly on the execution platform) can be more exact, if the execution platform provides information about held resources and about resource provisioning and un-provisioning. If the execution platform is not aware of its currently held resources, it is necessary to develop tools for measuring them in a fine-granular way. The ways of obtaining the resource amount a system is holding at a given moment in time can differ strongly between platforms and types of resources. This is one reason why a portable elasticity benchmark is hard to develop.

For elasticity measurements on any elastic system, it is necessary to load the system with a workload of variable intensity. The workload itself consists of small, independent workload elements that are supposed to run concurrently and are designed to stress mainly one specific resource type (like a Fibonacci calculation for CPU or an array sort for memory). "Independent workload element" means in this case that there is no interdependency between the workload elements that would require communication or synchronisation and therefore induce overheads. It is necessary to stress mainly the "resource under test" to avoid bottlenecks elsewhere in the system. Before starting the measurement on a specific platform, the basic workload elements should be calibrated without any concurrency on the resources, in such a way that one workload element needs approximately one resource entity for a certain time span. This workload element calibration is also necessary to provide comparability between different systems later on. While measuring, we keep track of the workload intensity as well as of the number of resource entities provided by the system. Having these values logged for a single experiment run, we calculate further indirect values representing the key elasticity metrics.

As the concepts of resource elasticity are validated in the following chapter using Java thread pools as virtual resources provided by a Java Virtual Machine, I introduce them here as a running example. Java thread pools are designed to grow and shrink dynamically in size, while still trying to reuse idle resources. In general, we differentiate between the following orders of values (which can be applied to the Java thread pool example):

1st order values - client side are all workload attributes and workload events that can be directly measured and logged precisely even on the client side, like timestamps, the number of workload elements submitted and finished, and their waiting and processing times. For the example of Java thread pools, client-side 1st order values would be:

• attributes: a workload element's start time, waiting time and processing time
• events: time stamp and values of any change in the numbers of submitted or finished workload elements

1st order values - execution system side include the above mentioned events and attributes, which are visible at the client side too, and in addition the direct, exact measurement of provided resource entities at a certain point in time. The execution system should be able to provide these values accurately; otherwise an additional measurement tool must run on the execution platform. In addition to the above mentioned client side 1st order values, we measure:

• resource events: time stamp and values of any change in the number of provided thread resources

2nd order values - execution system side are values that can be directly calculated from the 1st order values. They include the amount and time point of increases/decreases of workload or resource entities, as well as the rate of reconfigurations in relation to the rate of workload changes. Even though these values are not directly visible to a client, they can be considered precise due to their direct deduction from measurable values. For the Java thread pool example, 2nd order values are (a small computation sketch is given at the end of this subsection):

• workload intensity: time stamp and value of the currently executing workload elements, as the difference of submitted minus finished ones
• resource events: time interval between two successive measurement samples and amount of added or removed thread resources
• resizing ratio: number of changes in thread pool size in relation to changes in workload intensity within a certain time interval

3rd order values - execution system side are values that cannot be directly calculated from 1st or 2nd order values. These values need to be approximated in a suitable way. The system's provisioning time is the delay between an internal trigger event and the visibility of a resource reallocation. The time points of internal trigger events, and information about the execution of such an event, are system internal and often not logged. Suitable ways of approximation are discussed in detail in the next section. For the Java thread pool example, 3rd order values are:

• an approximation for the provisioning time of a thread pool size change

4th order values - execution system side are values that can be derived from system-internal log files that represent the state of the system; they can contain the time point and reason for a reconfiguration trigger event, information on resource provisioning actions, or workload rejections. When using these values, intrinsic knowledge about the system implementation is necessary, so portability of those metrics is not given. Access to these values would nevertheless be important for a trustworthy validation of our approach. For the Java thread pool example, 4th order values are:

• time points of trigger events and the thread pool state

Since we know details about the Java thread pool implementation, a trigger event for a new thread resource is thrown if the waiting queue is full and no thread is idle. A thread resource stays idle for a given timeout parameter and dies after that time if it could not be reused. Response times of the basic workload elements should be logged and split up into waiting and processing times. As explained before, the response time should stay within an acceptance interval given, for example, by an SLA. We define the waiting time of a workload element as the duration from feeding a workload task to the execution system until the system finally schedules the workload task and begins its processing. This definition of waiting time does not include any waiting times caused by interruptions due to hardware contention or system scheduling. Waiting and processing times are intuitively interpretable if the size of the tasks is not too small, so that the processing time values do not show high variability due to system scheduling influences. Processing times should not show high variations when a workload element is at least one order of magnitude
larger than the scheduling time slices and the workload's maximum concurrency level can still be mapped directly to physically concurrent resources. Waiting and processing times of the workload tasks mainly depend on the recent workload intensity, the system's scheduling and the actually available level of physical concurrency of the resource under test. When taking response time as a performance metric, it is necessary to exercise elasticity measurements only on elastic platforms where the level of concurrency within the workload can be covered by physically provided resources without hardware contention at the maximal possible system configuration (the upper bound for scaling). This approach is illustrated in an idealized way on the left side of Fig. 3.5: a suitable performance metric for the resource under test (like response time for CPU-bound tasks) declines, the system automatically reacts by provisioning more resources, and the performance metric then recovers again. The provisioning time can be approximated by the time interval from the point where the performance metric declines until the system reacts.
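
The sketch announced above shows how the 2nd order values could be derived from a logged series of measurement samples; the Sample layout is an assumption about the log format, not the actual tooling.

    import java.util.List;

    /** Sketch: deriving 2nd order values (workload intensity, resizing ratio) from
     *  logged 1st order events. The Sample layout is an assumed log format. */
    public class SecondOrderValues {

        static class Sample {
            final long timestampMs; final int submitted; final int finished; final int poolSize;
            Sample(long t, int s, int f, int p) { timestampMs = t; submitted = s; finished = f; poolSize = p; }
        }

        /** Workload intensity = tasks currently in execution = submitted - finished. */
        static int workloadIntensity(Sample s) {
            return s.submitted - s.finished;
        }

        /** Resizing ratio = number of pool-size changes / number of workload-intensity changes. */
        static double resizingRatio(List<Sample> log) {
            int resourceChanges = 0, workloadChanges = 0;
            for (int i = 1; i < log.size(); i++) {
                if (log.get(i).poolSize != log.get(i - 1).poolSize) resourceChanges++;
                if (workloadIntensity(log.get(i)) != workloadIntensity(log.get(i - 1))) workloadChanges++;
            }
            return workloadChanges == 0 ? 0.0 : (double) resourceChanges / workloadChanges;
        }
    }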

Figure 3.5.: Two different Approaches for Measuring Provisioning Times [15]

If we measure the elasticity of a virtual resource that shares physical resources with other tasks, a precise view of the correlation between cause and effect is no longer given when trying to interpret the characteristics of waiting and processing times. In this case, observing response times does not allow a direct and exact extraction of elasticity metrics. The provisioning times of an elastic system, which were defined as the delay between a trigger event and the visibility of a resource reallocation, cannot be measured directly without having access to system-internal log files and setting them in relation to the measured resource amounts. If a new workload task is added and cannot be served directly by the system, we assume that a trigger event for resizing is created within the execution platform. Not every trigger event results in a resource reallocation. In addition, the measurement log files on the execution platform side must keep track of any changes in resource amounts. As illustrated on the right side of Fig. 3.5, it is possible to find a mapping between a trigger event and its effect manually and intuitively. In reality this graph can look much more complicated, as we will see in the results section of the following chapter. Mappings can still be found manually and intuitively in most cases, but several special cases occur, such as resource reuse (visualized in the left part of Fig. 3.6). If the execution system works with waiting queues and these queues are longer than two elements, trigger events and effects can overlap. This results in ambiguous mappings, as can be seen on the right side of Fig. 3.6.

Figure 3.6.: Finding the right correlation between Trigger and Effect event for provisioning time extraction

This leads to the need for an exact algorithmic approach that finds a plausible cause-effect mapping by establishing several rules that must be fulfilled, e.g. that any reallocation of x resource entities must be preceded by at least x trigger events for which no change has been executed since the trigger's release. This algorithmic approach is discussed in more detail in Jóakim von Kistowski's bachelor thesis [15]. We also researched another approach for finding a suitable approximation of a cause-effect mapping by using already existing algorithms: Dynamic Time Warping (DTW) algorithms minimize the sum of distances between two time series [16, 17]. A time series is a table with two columns: the first contains a time stamp, the second a value that has been measured at that time. Our measurement log file contains a time stamp, a resource amount value and a workload intensity value per entry. Therefore we can consider these data as two time series, where it is important to have only one entry per change in the number of tasks in execution for the first time series, and per change in the number of resources for the second. The workload time series influences the resource amount time series. We consider only changes in workload intensity as triggers for changes in the resource amount time series. We are looking for a cause-effect mapping that enables us to extract provisioning times from these two time series of change events. How this could possibly be done using the DTW algorithm is discussed in the following section.

3.4. Extraction of Provisioning Time using Dynamic Time Warping Algorithm (DTW) for Cause Effect Mapping

The Dynamic Time Warping (DTW) algorithm calculates a similarity metric by computing the minimum sum of distances between two time series. The two time series may vary in time or speed in a non-linear way. The DTW algorithm outputs the induced mapping of measurement samples (time points) and the sum of distances between these mappings as a distance/similarity metric. This similarity metric (called the DTW distance) is very useful in the field of automated speech recognition for matching differently stressed words even at variable talking speeds. The DTW algorithm is an example of an algorithm using dynamic programming and backtracking [16, 17]. It is able to solve such a one-dimensional optimisation problem in polynomial time (normally with a complexity of O(n^2) in memory and time). A second dimension can be added to a time series by having two y-values per measurement sample. For two-dimensional time series, the described optimisation problem of finding the smallest distance between them is already NP-complete (also called planar warping) [16, 17]. Note that the DTW algorithm only takes the sequence of measurement samples of the two time series into account
and no additional values that may be related to a time stamp. In the case of elasticity measurements, every time stamp relates to a change event of the workload intensity or a change in the number of resource entities. Fig. 3.7 from [17] illustrates an example DTW mapping. The x-axis shows the number of measurement samples; the y-axis is not shown in this illustration. To understand the illustration, it helps to imagine two different y-axes, where the second y-axis has a constant offset to make the mapping visible. Unlike in our time series, there are measurement samples where the y-value does not change until the following sample. Note that the time stamp values themselves are irrelevant to the result of the DTW algorithm; only the y-values of the measurement points matter. There is no restriction that the two time series must contain the same number of samples, even though they do in the case illustrated in Fig. 3.7. The mappings do not cross other mappings. The time series must be of finite length for the DTW algorithm to be applicable. Every sample of one time series is mapped to at least one sample of the other time series. The DTW algorithm is deterministic.

Figure 3.7.: Illustration of a Dynamic Time Warping Result [17]

Figure 3.8.: A Cost Matrix for Fig. 3.7 with the Minimum-Distance Warp Path traced through it. [17]

To explain Fig. 3.8 I would like to cite from [17], section 2.1: "[The figure] shows an example of a cost matrix and a minimum-distance warp path traced through it from D(1,1) to D(|X|,|Y|). The cost matrix and the warp path [...] are for the same two time series [...]. The warp path is W = (1,1), (2,1), (3,1), (4,2), (5,3), (6,4), [...] (15,15), (16,16). If the warp path passes through a cell D(i, j) in the cost matrix, it means that the ith point in time series X is warped to the jth point in time series Y. Notice that where there are vertical sections of the warp path, a single point in time series X is warped to multiple points in time series Y, and the opposite is also true where the warp path is a horizontal line. Since a single point may map to multiple points in the other time series, the time series do not need to be of equal length. If X and Y were identical time series, the warp path through the matrix would be a straight diagonal line." In [16, 17] it is also explained in detail how linear complexity can be achieved when only a good approximation of the minimal distance, and not the very best mapping in every case, is required. The normal DTW algorithm calculates the minimal distance with a complexity of O(n^2). The algorithm achieving linear complexity is called the fast DTW algorithm (fDTW) and is useful if the data sets are massive or the processing should not consume too much energy or time. The fDTW algorithm takes a radius as an additional input parameter; only within this radius of measurement samples does the algorithm search for the best mapping. By keeping the radius a constant input that does not grow with the input size, linear complexity is achieved, although the optimum may not be found within this radius. The fDTW algorithm is still deterministic. Now we need to discuss whether a DTW or even an fDTW algorithm is appropriate for our case of extracting a mapping where one series represents the causes and the other the effects. A normal DTW algorithm can take a distance function as an input parameter, which adds more flexibility. The DTW algorithm by itself cannot differentiate between a cause and an effect series, which can lead to senseless "backward" mappings. It is important that every input value of each time series is a change event - on the resource side as well as on the workload side - and not just a measurement point where nothing happened. Normally a measurement log file contains a new data set every time a single value changes. These time series must therefore be stripped of all "meaningless no-change" time stamps before being passed to the DTW algorithm.
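
As an illustration of the mapping step, the following is a minimal sketch of the classic O(n·m) DTW (not the fDTW variant) with an absolute-difference cost, assuming both input series have already been reduced to change events as described above; the returned warp path is the candidate cause-effect mapping.

    import java.util.ArrayList;
    import java.util.List;

    /** Classic dynamic time warping over two value series, sketched for the
     *  cause-effect mapping of workload-change and resource-change events. */
    public class DynamicTimeWarping {

        /** Returns the warp path as index pairs (i, j), meaning a[i] is mapped to b[j]. */
        static List<int[]> warpPath(double[] a, double[] b) {
            int n = a.length, m = b.length;
            double[][] d = new double[n + 1][m + 1];
            for (int i = 0; i <= n; i++)
                for (int j = 0; j <= m; j++)
                    d[i][j] = Double.POSITIVE_INFINITY;
            d[0][0] = 0.0;
            // Fill the cost matrix: cumulative minimum distance up to each cell.
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    double cost = Math.abs(a[i - 1] - b[j - 1]);
                    d[i][j] = cost + Math.min(d[i - 1][j - 1], Math.min(d[i - 1][j], d[i][j - 1]));
                }
            }
            // Backtrack from (n, m) to (1, 1) along the minimum-cost predecessors.
            List<int[]> path = new ArrayList<>();
            int i = n, j = m;
            while (i > 0 && j > 0) {
                path.add(0, new int[] {i - 1, j - 1});
                double diag = d[i - 1][j - 1], up = d[i - 1][j], left = d[i][j - 1];
                if (diag <= up && diag <= left) { i--; j--; }
                else if (up < left) { i--; }
                else { j--; }
            }
            return path;
        }
    }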

3.5. Interpretation of Provisioning Time Values based on Extraction by DTW Algorithm

The mapping of measurement points given by DTW enables the approximation of provisioning times. These provisioning times are based on the DTW approximation and must therefore be interpreted very carefully. It is not yet validated that a DTW mapping is a good mapping for our cause-effect mapping problem. Not every mapping necessarily corresponds to a real provisioning action of the system that took place during the measurement. Several trigger events can have just a single visible effect. When applying DTW, we assume a reactive system in which any effect takes place after it was triggered. All triggers are system-internal events. A system could also have external triggers that induce foreseeable changes in resource demand which we do not know about by default, because they are domain specific (e.g. the opening hours of banking terminals). We observed that an elastic system can overreact and provide more resources than actually needed. For a reactive elastic execution platform this is only possible if the granularity
of resizings is too low. It is also possible that an elastic system implements features that try to intelligently foresee workload intensity changes (which then insert internal trigger events that are normally unknown). In this case, the system is no longer a purely reactive system. DTW can output negative provisioning times. These values could be interpreted as the above-mentioned intelligent workload forecasting, as external triggering, or simply as an overreaction of the system. Such backward mappings are not covered by the concept of cause-effect mapping; therefore it cannot be stated explicitly that negative provisioning times are meaningful, and they should be interpreted very carefully. If a high negative provisioning time is in fact meaningful for an elastic system with intelligent or external triggering, it would not result in observably worse response times, but in lower resource efficiency due to a lower utilisation rate. High positive values for provisioning times, in contrast, are linked to high resource utilisation and slower performance/response times. Taking these thoughts into account, one could say that provisioning times are the better, the smaller their absolute distance to zero is. A provisioning time of exactly zero would only be possible in a synthetic, idealised elastic system.
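
To make the sign convention concrete, the following small sketch shows how provisioning times could be read off a cause-effect mapping; it assumes a warp path in the form produced by the DTW sketch above and the timestamps of the mapped change events.

    import java.util.List;

    /** Sketch: provisioning times as signed differences between mapped trigger (workload
     *  change) and effect (resource change) timestamps. Positive values mean the resource
     *  change follows its trigger; negative values indicate overreaction, external or
     *  predictive triggering as discussed above. */
    public class ProvisioningTimes {

        static long[] provisioningTimesMs(long[] triggerTimestampsMs, long[] effectTimestampsMs,
                                          List<int[]> warpPath) {
            long[] times = new long[warpPath.size()];
            for (int p = 0; p < warpPath.size(); p++) {
                int i = warpPath.get(p)[0];   // index into the workload-change (trigger) series
                int j = warpPath.get(p)[1];   // index into the resource-change (effect) series
                times[p] = effectTimestampsMs[j] - triggerTimestampsMs[i];
            }
            return times;
        }
    }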

3.6. A Single-valued Elasticity Metric?

It is still an open question whether a single metric that captures the three aforementioned key elasticity characteristics (as in Fig. 3.4) is possible and meaningful. One challenge of defining such a metric would be to embed state dependencies, as well as to select a value range that would maintain the applicability of the metric to future, more powerful and more complex systems. A unified (single-valued) metric for elasticity of execution platforms could possibly be achieved by a weighted product of the three characteristics. To get intuitively understandable values, elasticity could be measured and compared on the basis of percentage values:

0% resource elasticity would stand for no existing reconfiguration points, high or infinite provisioning times (which would mean purely manual reconfiguration) and very large effects of resizing actions, e.g. new instantiations only when performance problems have already been discovered or reported by SLA monitoring.

Near 100% resource elasticity would hint towards a high density of reconfiguration points, small provisioning times and small effects. In the optimal case the virtualized system's usage appears to be constant, while the number of users and the resource demand vary. This could achieve optimal productivity concerning costs or energy consumption in a cloud environment.

If such a metric can be established, the elasticity of cloud platforms could easily be compared. A proactive system which implements intelligent resource provisioning techniques as described in [18, 19, 20, 21, 22] should exhibit better resource elasticity values than a simpler reactive system which triggers resource requests or releases by events and which does not perform workload prediction or analysis.
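
Purely as an illustration of the weighted-product idea mentioned above (the normalisation functions, weights and value ranges below are my own assumptions, not a metric proposed in this thesis):

    /** Illustrative sketch only: maps the three elasticity characteristics to a
     *  percentage via a weighted product. All normalisations are assumptions. */
    public class SingleValuedElasticitySketch {

        static double elasticityPercent(double reconfDensity, double meanAbsProvisioningTimeMs,
                                        double meanEffectSize,
                                        double wDensity, double wTime, double wEffect) {
            // Normalise each characteristic to (0, 1]: a higher density of reconfiguration
            // points is better; smaller provisioning times and smaller effects are better.
            double dScore = Math.min(1.0, reconfDensity);
            double tScore = 1.0 / (1.0 + meanAbsProvisioningTimeMs / 100.0);
            double eScore = 1.0 / (1.0 + meanEffectSize);
            // Weighted product, mapped to a percentage between 0% and 100%.
            return 100.0 * Math.pow(dScore, wDensity) * Math.pow(tScore, wTime) * Math.pow(eScore, wEffect);
        }
    }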

4. Elasticity Benchmark for Thread Pools

4.1. Variable Workload Generation

Intelligently designed workloads are extremely important when trying to observe elastic effects. A constant workload, for example, will never produce elastic effects on its own as long as the execution platform's usage is not changed by other workloads. In order to observe elastic effects, workloads have to push the boundary of what the already provisioned resources can offer them. Only then will a drop in performance, followed by an improvement after resources are added, become visible. One important aspect of designing workloads for elasticity benchmarking is to understand in which way the targeted execution platform scales and how the triggering events are released that may then lead to elastic behaviour. Existing benchmarks do not yet provide variable workloads designed specifically to force resource reallocation. Jóakim von Kistowski's bachelor thesis [15] explains in detail how we generate flexible workloads for elasticity measurements and which parameters can be used to design or influence such a workload.

4.2. Experiment Setup

The evaluation concept of resource elasticity can be applied to various kinds of resources, even to virtualized ones that are not mapped 1:1 to hardware resources. As a proof-of-concept, we researched the elasticity behaviour of Java thread pools in depth. Thread pools are an implementation of the pooling pattern, which is based on a collection (pool) of same-typed, interchangeable resources that are maintained continuously. In a resource pool, even when a given resource instance in the pool is not needed, it is not released immediately, because it is assumed that the instance may be needed in the near future: the runtime costs of releasing and (later) re-acquiring that resource are significantly higher than the costs of keeping an idle instance. Resource pools enable resource reuse to minimize the initialisation costs of new instances. The resources in a pool have to be managed efficiently, so that the management overhead does not exceed what initialisation and release of the same amount of resources would cost. After a certain time interval, a resource can be released when it has not been used - this "disposal delay" can often be set in implementations of the pool pattern. Beyond the "disposal delay", further configuration
options are the minimum pool size (often called "core size"), the maximum pool size and the length of the queue that resides "in front of" the pool; the pattern is illustrated in Figure 4.1.

Figure 4.1.: Illustration of the Thread Pool Pattern [15]

Thread pools are heavily used in databases, application servers and other middleware applications handling many concurrent requests. Similar to thread pools, connection pools (in DBMS drivers, such as JDBC) implement the pooling pattern with the same rationale. For this experiment, we use the default thread pool implementation provided by the Java SE platform API, and the threads have to perform non-communicating, CPU-bound, computation-only tasks in parallel. These tasks form a configurable, multi-phased workload which is described in [15]. The tasks consist of a carefully-defined Fibonacci computation, with randomly-chosen starting values (to prevent function inlining as constant values) and with evaluation of the computation result (to prevent dead code elimination). We are interested in situations with a "fully-busy" thread pool where the arrival of a new task and the finish of another task's execution are temporally close, to see whether the pool immediately "overreacts" by allocating a new thread instance and accepting the task from the queue, rather than waiting for a (short) time in the hope that another task will be finished. In line with the elasticity metrics defined in Sec. 3.1, we measure the decreases/increases of the thread pool size (effects of reconfiguration and their distribution) and approximate the temporal aspects of the adaptation (delays, provisioning duration, etc.). We use the term "jump size distribution" in the measurement results for the distribution of reconfiguration effects. At the beginning of each measurement, a warmup period of the task and measurement methods is executed to avoid effects of Java Just-In-Time compiler (JIT) optimisations and method inlining. We permitted the benchmark to allocate up to 900 MB of Java heap memory to minimize interference, and used the standard thread pool implementation of the Java platform API, found in the java.util.concurrent package. The core size of the thread pool is not equal to the minimum or initial pool size: even if the core size is set to 2, the JVM may initialize the pool to have 1 active thread. The option of prestarting the core pool changes this. For threads below the size of the core pool, no timeout feature for killing these threads is provided in the default
implementation. The official Java platform API documentation explains that when "a new task is submitted [...], and fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. If there are more than corePoolSize but less than maximumPoolSize threads running, a new thread will be created only if the queue is full" [23]. In the experiment implementation we use a core size of 1 and do not prestart the pool.

Our tooling includes functionality to calibrate the problem size of a single Fibonacci task so that it takes a certain duration (e.g. 50 ms) when running in isolation with no interruptions. The calibration allows us to run the same workload on several platforms with (approximately) the same task duration. The workload is composed of tasks by specifying inter-task arrival times, as well as by grouping tasks into batches (with separate specification of intra-batch and inter-batch arrival times). The measured workload is a configurable series of batches and is designed as follows: the initial batch size is 2 tasks, with an inter-task wait time of 10 ms. After an inter-batch wait time of 40 ms, the second batch with three tasks and the same inter-task wait time is started. For each subsequent batch, the number of tasks increases by one, the inter-batch wait time increases by 4 ms, and the inter-task wait time stays the same. The workload intensity reaches its peak when the batch contains 12 tasks, and afterwards decreases again. Each decrease is characterised by reducing the batch size by one and decreasing the inter-batch wait time by 4 ms. The last batch of the workload again contains two tasks. Recall that by design of the thread pool, the number of actively processed tasks is equal to or smaller than the number of threads in the pool. The design and configuration of the workload used for the following experiments is illustrated in more detail in Jóakim von Kistowski's bachelor thesis [15]. We implemented a feeder class that passes the tasks in the described way to the thread pool executor (a sketch of such a feeder is given below). A Java thread pool executor from the java.util.concurrent package offers the use of three different kinds of queues:

• Direct queues directly dispatch any task that is queued up. Since resources are allocated immediately with this choice, there is no room for variation in elasticity; thread resources are allocated as if they did not induce any system management overheads.

• Infinite queues can hold a theoretically endless number of tasks. With this kind of queue, a fixed number of threads (namely the core size) processes tasks concurrently. Using infinite queues forces the reuse of thread resources to the maximum, but it also prevents the thread pool from showing any resource elasticity.

• Finite queues can hold a fixed maximum number of tasks. When using a finite queue, every task is queued (unless a core thread is available) and taken from the queue in first-come-first-served order (FCFS/FIFO) when an idle thread is available. If a task cannot be enqueued, it is held in an additional buffer and a new thread is requested (trigger event). Tasks have to wait for variable durations before their execution begins, but elastic behaviour can be observed when using finite queues.

The queue length is a property of the waiting queue the thread pool executor is initialized with and has a direct influence on the measured elasticity values.
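
The following sketch illustrates the feeder and the batch-shaped workload described above; the class names, the simplified Fibonacci task and the loop structure are my own illustration of the stated parameters, not the actual implementation from [15].

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.ThreadLocalRandom;

    /** Illustrative sketch of the batch-based workload: batch size grows from 2 tasks to a
     *  peak of 12 and shrinks back to 2, with an inter-task wait of 10 ms and an inter-batch
     *  wait that starts at 40 ms and grows/shrinks by 4 ms per batch. */
    public class BatchWorkloadFeeder {

        /** Simplified CPU-bound task: iterative Fibonacci with a randomised problem size. */
        static final class FibonacciTask implements Runnable {
            private static volatile long sink;                              // consumes the result so it is not optimised away
            private final int n = 30 + ThreadLocalRandom.current().nextInt(10);
            @Override public void run() {
                long a = 0, b = 1;
                for (int i = 0; i < n; i++) { long t = a + b; a = b; b = t; }
                sink = b;
            }
        }

        static void feed(ExecutorService pool) throws InterruptedException {
            long interTaskWaitMs = 10, interBatchWaitMs = 40;
            for (int batchSize = 2; batchSize <= 12; batchSize++) {         // rising phase
                submitBatch(pool, batchSize, interTaskWaitMs);
                Thread.sleep(interBatchWaitMs);
                interBatchWaitMs += 4;
            }
            for (int batchSize = 11; batchSize >= 2; batchSize--) {         // falling phase
                interBatchWaitMs -= 4;
                submitBatch(pool, batchSize, interTaskWaitMs);
                Thread.sleep(interBatchWaitMs);
            }
        }

        private static void submitBatch(ExecutorService pool, int batchSize, long interTaskWaitMs)
                throws InterruptedException {
            for (int i = 0; i < batchSize; i++) {
                pool.submit(new FibonacciTask());
                Thread.sleep(interTaskWaitMs);
            }
        }
    }
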
The Java thread pool is set up with an array blocking queue from the java.util.concurrent package (which is an implementation of a finite queue). The queue length is varied between 2 and 10 for the different runs. The "disposal delay" of idle threads, also called the keep-alive time, is set to three different values between 10 ms and 250 ms. These two parameters directly influence the elasticity behaviour, whereas the core pool size of 2 threads and the maximum pool size of 100 threads are held constant, as they have no direct influence.

Once the maximum thread pool size has been reached, incoming tasks are rejected if all threads in the pool are busy and the queue is full. In our implementation, task rejections are logged and taken into account when interpreting the measurement results. On modern CPUs, the maximum pool size of 100 should be large enough to avoid task rejections, provided the CPU is not loaded by other tasks. We measure and record task arrival times and task finish times using a fine-granular wall-clock timer (selected in a platform-specific way using [24]). The high-resolution timer provided by the sun.misc.Perf class, which has an accuracy of 1 ns, can be used on supported Java VMs (this timer is supported in Oracle JDK 6, but cannot be used in the IBM Java 6 64-bit Virtual Machine running under z/OS). A separate "measurer" thread outside of the thread pool runs with maximum priority and records the state of the thread pool and of its queue. While the measurer runs in a loop without wait or sleep calls, only changes in the thread pool state are recorded, keeping the logs small. One of the challenges is capturing the changes in the thread pool state: this requires a tight measurer loop with short (or no) intervals between measurements, while eliminating longer pauses (e.g. those caused by Java garbage collection). During measurements, we made sure that our benchmark was the single major performance-demanding workload by keeping only a small base load on the machine. We observed platform-specific behaviour of the method Thread.sleep(ms, ns) of java.lang.Thread: while a JVM running on Mac OS X woke the thread up quite precisely after the specified time (below 1 ms), the same code executed in a JVM on a Windows 7 system did not meet the specified time. Values below 1 ms could not be achieved and were therefore not short enough for our measurements. For any JVM on a Windows system, we therefore use a busy-waiting approach that polls for changes as fast as possible to achieve the most accurate measurements.
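A minimal sketch of such a measurer loop is given below. The class layout and the log format are placeholders and not the exact benchmark code; in particular, System.nanoTime() stands in for the platform-specific timer selection described above, and only the pool size and active thread count are polled, whereas the actual measurer also records the queue state.

import java.util.concurrent.ThreadPoolExecutor;

// Sketch of a measurer thread: busy-polls the executor and logs only
// *changes* of the pool state, which keeps the log small.
class MeasurerSketch extends Thread {

    private final ThreadPoolExecutor pool;
    private volatile boolean running = true;

    MeasurerSketch(ThreadPoolExecutor pool) {
        this.pool = pool;
        setPriority(Thread.MAX_PRIORITY);       // run with maximum priority
    }

    @Override
    public void run() {
        int lastPoolSize = -1;
        int lastActive = -1;
        while (running) {                        // tight loop, no sleep/wait calls
            int poolSize = pool.getPoolSize();   // currently allocated threads
            int active = pool.getActiveCount();  // currently busy threads
            if (poolSize != lastPoolSize || active != lastActive) {
                long t = System.nanoTime();      // placeholder for the platform-specific timer
                System.out.println(t + ";" + poolSize + ";" + active);
                lastPoolSize = poolSize;
                lastActive = active;
            }
        }
    }

    void shutdown() {
        running = false;
    }
}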

4.3. Extraction of Elasticity Metrics

After every experiment run, we extract the aforementioned elasticity metrics from the measurement log file. Second-order values can be calculated directly. All resource resizings are counted and set in relation to the absolute number of changes in workload intensity, which yields the "resizing ratio". Every occurring jump size (differentiated by jump direction, up or down) is accumulated to obtain a distribution of jump sizes. For visualisation of the measurement data, we use JFreeChart to plot x-y charts and histograms. The plotted x-y charts show the course of selected discrete values over time, in milliseconds. The individual measurements ("dots") are connected by a line to improve readability, even though the measurements are, of course, non-continuous. The first chart of each experiment run plots the number of tasks in the executor and the corresponding resource amount; this plot is intuitively understandable and illustrates the elasticity behaviour. The second plot shows the difference between tasks and provided resources, to give an impression of the area between the time lines of task numbers and resource entities. Values above 0 occur when the resources cover the workload demand at that point in time, whereas values below 0 hint at a resource congestion. This plot visualises the speed of changes from up-scaling to down-scaling and vice versa (the frequency of crossing the x-axis), as well as the system's characteristic of whether it tends to save resources or provides new ones more generously (amplitude and symmetry of the deflections). JFreeChart histograms are plotted for the distributions of waiting times and response times, with a bin size of 10 ms.
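A minimal sketch of how the resizing ratio and the jump size distribution can be derived from the logged pool sizes is shown below; the data model (a plain list of recorded pool sizes) and the variable names are placeholders, and the number of workload intensity changes is assumed to be known from the workload design.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: derive the resizing ratio and the jump size distribution from the
// sequence of logged thread pool sizes (placeholder data model).
class MetricExtractionSketch {

    static void extract(List<Integer> poolSizes, int workloadIntensityChanges) {
        int resizings = 0;
        Map<Integer, Integer> jumpSizeDistribution = new TreeMap<Integer, Integer>();

        for (int i = 1; i < poolSizes.size(); i++) {
            int jump = poolSizes.get(i) - poolSizes.get(i - 1);  // > 0 up, < 0 down
            if (jump != 0) {
                resizings++;
                Integer count = jumpSizeDistribution.get(jump);
                jumpSizeDistribution.put(jump, count == null ? 1 : count + 1);
            }
        }

        // Resizing ratio: resource resizings relative to the number of
        // workload intensity changes (known by workload design).
        double resizingRatio = (double) resizings / workloadIntensityChanges;

        System.out.println("resizing ratio = " + resizingRatio);
        System.out.println("jump size distribution = " + jumpSizeDistribution);
    }
}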

To extract approximations of the provisioning times, we use the Java implementation of the fDTW algorithm provided by the authors of [17] on a Google Code server ("http://code.google.com/p/fastdtw/"), in version 1.0.1 from February 2011. We pass two time series to the fDTW time series constructor; they contain only the change events of the resource amount and of the number of concurrent tasks in the executor, respectively. As we are using a fast DTW algorithm, we calculate a suitable DTW radius as the maximum distance between resource and task entities. The algorithm then outputs the calculated DTW distance, which is defined as the minimum sum of distances between the mappings. Since this metric captures the similarity of two time series, we obtain another metric that quantifies elasticity. Because the number of workload intensity changes is constant by workload design, the overall count of resource resizings has to be smaller than or equal to the number of workload intensity changes. Therefore the number of summands of the minimum sum is the same even for runs on different platforms, which makes the DTW distance metric portable for our purposes. fDTW also outputs the warp path, which is the mapping of measurement points: every mapping is a vector [x, y], where x denotes the x-th measurement point in the first time series and y the y-th measurement point in the second time series. We look up the time stamps that belong to these indexes and calculate their differences. For every mapping output by fDTW, we thus obtain a single approximate time for a provisioning action; these provisioning times are then plotted as a JFreeChart histogram, again with a bin size of 10 ms (a sketch of this post-processing step is shown below). For better illustration of the mapping results, we additionally pass the time series to a DTW algorithm implemented in the statistics tool R, which is independent of the Java implementation used. To plot them we use the following R script:
library(dtw)
tasks
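The post-processing of the warp path into approximate provisioning times can be sketched as follows. The sketch assumes that the warp path has already been obtained from the fDTW implementation and converted into an array of index pairs; the array and method names are placeholders of ours, not the API of the fastdtw library.

// Sketch: turn warp-path index pairs into approximate provisioning times.
// taskTimes[i] = time stamp (ms) of the i-th change event in the task time series
// resTimes[j]  = time stamp (ms) of the j-th change event in the resource time series
// warpPath[k]  = {i, j}, one mapping [x, y] as output by the fDTW algorithm
class ProvisioningTimeSketch {

    static long[] extract(long[] taskTimes, long[] resTimes, int[][] warpPath) {
        long[] provisioningTimes = new long[warpPath.length];
        for (int k = 0; k < warpPath.length; k++) {
            int i = warpPath[k][0];   // index into the task time series
            int j = warpPath[k][1];   // index into the resource time series
            // The difference of the mapped time stamps approximates the duration
            // of one provisioning action.
            provisioningTimes[k] = resTimes[j] - taskTimes[i];
        }
        return provisioningTimes;     // subsequently binned into a 10 ms histogram
    }
}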
