Design of scalable socket-based multi-thread applications for High Performance Computing

MÁSTER EN COMPUTACIÓN DE ALTAS PRESTACIONES
PROYECTO FIN DE MÁSTER

Design of scalable socket-based multi-thread applications for High Performance Computing

Author: Jorge Docampo Carro
Directors: Sabela Ramos Garea, Guillermo López Taboada
A Coruña, June 23, 2014

Abstract

This work presents the design and implementation of an efficient mechanism for accessing sockets in multithread applications. The number of cores per processor keeps increasing and, in order to enhance scalability, applications demand languages that adapt easily to these multithread systems. Java is a suitable language for these environments thanks to its built-in multithread and network support. Moreover, the increase in resources of current nodes, with new architectures like NUMAdirect and accelerators like the Intel Xeon Phi, makes it possible to schedule several applications within a node, which makes intra-node communications through shared memory critical. In order to enable communication between different applications, Java offers two socket APIs: Java Net and Java NIO, both relying on the native system socket library. However, sharing a socket between threads becomes a bottleneck in terms of contention, introducing high latency jitter. Thus, to overcome this situation and provide predictable performance, this work presents an efficient arbitration system between threads and sockets using Disruptor, a framework that provides thread safety without loss of performance. The evaluation of the solution, using both synthetic benchmarks and a real-world application, shows performance improvements in terms of latency, throughput, scalability and predictability.

Keywords Java, Sockets, Multi-thread, High Performance Communications, High Performance Computing, Disruptor.


Contents

1 Introduction
  1.1 About this memory

2 State of the art
  2.1 Sockets
    2.1.1 Java Net
    2.1.2 Java NIO
    2.1.3 JFS / UFS
  2.2 Disruptor
  2.3 Other communication frameworks

3 Design of an efficient solution for scalable socket-based multi-thread applications
  3.1 Analysis
    3.1.1 Disruptor
    3.1.2 Sockets
  3.2 Experimental Configuration
  3.3 Micro-benchmarking
    3.3.1 Shared Memory Performance Results
    3.3.2 Distributed Memory Performance Results
  3.4 Comparison with a real-world application
    3.4.1 Design of the multithread server
    3.4.2 Comparison with Netty

4 Conclusions and future work

A Resumen en castellano

List of Figures

2.1 Behavior of the Disruptor with one producer
2.2 Examples of Storm topologies. Bolts with the same color indicate that they apply the same transformation.
3.1 Single-thread and multi-thread raw versions of the PingPong benchmark
3.2 Diagram of how the final PingPong multi-thread benchmark works
3.3 Results of the PingPong multi-thread micro-benchmarks
3.4 Results of the PingPong multi-thread micro-benchmarks with UFS
3.5 Global results of PingPong multi-thread micro-benchmark with java.net sockets
3.6 Global results of PingPong multi-thread micro-benchmark with UFS sockets
3.7 Results of the PingPong multi-thread micro-benchmarks in a distributed memory environment
3.8 Design and operation of the multithread server

List of Tables

3.1 Description of the testbed
3.2 Results of the benchmark with java.nio sockets and parkNanos workload
3.3 Results of the benchmark with java.nio sockets and an active workload
3.4 Results of the benchmark with UFS sockets and parkNanos workload
3.5 Results of the benchmark with UFS sockets and an active workload

Chapter 1

Introduction

Since its beginnings, Java has provided key characteristics that are appealing for High Performance Computing (HPC): built-in networking and multithreading support, an object-oriented paradigm, simplified memory management, platform independence, portability and security. However, the main drawback that prevented Java from being widely adopted within the HPC community was its poor performance compared to FORTRAN and C [1], the most popular programming languages in this field. This performance gap has narrowed dramatically over the last years thanks to improvements such as the introduction of Just-In-Time (JIT) compilers [2], which obtain optimized native code from Java bytecode; hence the use of Java in HPC has become a realistic [3] and interesting choice.

Current HPC applications in Java generally present several levels of parallelism: several JVMs running multithread applications. Sockets are the most widely used communication mechanism between distributed applications running in separate JVMs. Those applications include a growing number of threads in order to exploit all the cores available in modern CPUs. As the number of running threads grows, creating a socket for each pair of them is no longer an option, because it does not scale. Instead, it is common to share a single socket among all threads. However, sharing this socket between a high number of threads causes contention problems. As an example, a simple PingPong application doubles its 99th percentile latency when going from one to two threads and, when using eight threads, it is more than 10 times higher. The same increase can be observed in the standard deviation, which grows to almost 12 times its original value with eight threads. This is a problem when applications need predictable behavior, like real-time applications.

A solution to mitigate this contention is the introduction of queues as an arbitration system. Since queues suffer from concurrency problems, we propose the use of Disruptor. Disruptor eliminates these problems, working as an ideal arbitration system between threads and sockets.

1.1 About this memory

This memory is organized as follows:

• Introduction: Description of the project, including motivation and goals.

• State of the art: This chapter describes the current communication libraries available in Java, as well as the framework for efficient shared memory communications used in the design of our arbitration system. We also present other communication frameworks used in real-world applications.

• Design of an efficient solution: This chapter presents the analysis of the libraries and frameworks used, the design of our arbitration system between threads and sockets, and the implementation and results obtained by our solution with both synthetic benchmarks and real-world applications.

• Conclusions and future work: This chapter summarizes the main contributions of this project, conclusions and future work.

Chapter 2

State of the art

This chapter presents some of the most widely used communication techniques and frameworks in Java. First, the universal socket standard, with emphasis on the current Java libraries as well as low latency and high throughput socket implementations. Then, Disruptor, a high performance inter-thread communication library used for accessing shared resources. Finally, high performance frameworks, used as a baseline to evaluate the current performance level of the commercial market, are introduced.

2.1 Sockets

A network socket is an endpoint for inter-process communication inside a host or across a computer network. In order to control and use these sockets, the Operating System (OS) provides a socket API (Application Programming Interface), usually based on the Berkeley sockets standard. Among all communication domains, we will focus on the most important one: the IP (Internet Protocol) based families. The endpoints of these sockets are unambiguously identified by an IP address and a port number. The three types of IP sockets are:

• Datagram sockets: provide connectionless, unreliable communications with relatively small packets. They are transferred using the User Datagram Protocol (UDP).

• Stream sockets: provide sequenced, reliable, two-way connection-based byte streams, which are transferred using the Transmission Control Protocol (TCP).

• Raw sockets: provide direct access to IP packets without using a transport layer. They are scarcely used by user applications due to their simplicity, but they are typically employed by routing protocols.

Now we describe the two socket APIs provided in Java distributions: the Java Net API and the Java New I/O (NIO) API.

2.1.1 Java Net

Java Net has been the traditional networking interface of Java since its first release. It is defined in the java.net package [4], which includes four kinds of sockets: Socket, ServerSocket, DatagramSocket and MulticastSocket. Each of them provides an endpoint for one type of connection, covering the TCP and UDP protocols, as well as access to methods for creating, binding, accepting and connecting sockets in a similar way to the traditional socket API of the OS. The Java Net library delegates its functionality to the system socket library, calling native code through the Java Native Interface (JNI) [5], so there are no performance enhancements for data transfers, and there is no possibility of asynchronous operations over the sockets.
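As a brief illustration of the blocking model this API exposes, the following is a minimal echo server sketch; the port number and buffer size are arbitrary examples, not values taken from this work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal blocking echo server with java.net (illustrative sketch).
public class NetEchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                try (Socket client = server.accept()) {   // blocks until a client connects
                    InputStream in = client.getInputStream();
                    OutputStream out = client.getOutputStream();
                    byte[] buf = new byte[1024];
                    int n;
                    while ((n = in.read(buf)) != -1) {    // blocking read
                        out.write(buf, 0, n);             // echo the bytes back
                    }
                }
            }
        }
    }
}
```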

2.1.2 Java NIO

This implementation was born out of the need for intensive I/O operations, and it was included with Java SE 1.4 in the java.nio package [6]. The objectives of Java NIO are multiplexed and non-blocking communications, obtained through the following main abstractions: Buffers, which work as data containers; Channels, which represent connections to entities; and Selectors and SelectionKeys, which provide multiplexed non-blocking connections together with selectable channels. Although it still relies on JNI for the system calls that transfer data, it employs relatively modern facilities to achieve better scalability.
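To make the contrast with Java Net concrete, the sketch below shows a minimal non-blocking echo loop built on those abstractions; the port and buffer size are again arbitrary, and production code would also need OP_WRITE handling for partial writes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Minimal non-blocking echo server with java.nio (illustrative sketch).
public class NioEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        ByteBuffer buf = ByteBuffer.allocateDirect(1024);
        while (true) {
            selector.select();                           // blocks until a channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel ch = server.accept();  // new client connection
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel ch = (SocketChannel) key.channel();
                    buf.clear();
                    if (ch.read(buf) == -1) { ch.close(); continue; }
                    buf.flip();
                    ch.write(buf);                       // echo back what was read
                }
            }
        }
    }
}
```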

2.1.3 JFS / UFS

Java Fast Sockets (JFS) [7] [8] and Universal Fast Sockets (UFS) [9] are high performance socket implementations that focus on low latency and high throughput. They support communications through shared memory, InfiniBand and 10/40 Gigabit Ethernet using RoCE (RDMA over Converged Ethernet). These products are developed and supported by TORUS Software Solutions S.L. [10]. They replace the standard socket library transparently, bypassing the kernel and the IP stack, and they work over shared memory and high speed networks. JFS replaces the standard socket library of the Java Virtual Machine, whilst UFS works with all kinds of sockets and is not restricted to Java. Both of them minimize context switches, buffer copies and interruptions, which allows them to achieve great performance.

2.2 Disruptor

Disruptor [11] is a high performance inter-thread messaging library developed by LMAX and freely distributed on GitHub [12]. From a simplistic point of view, Disruptor consists of a ring buffer where one or more producers insert data in the form of events that will be processed by one or more consumers. Disruptor appeared as a solution to mitigate the high impact on latency caused by queues when data is exchanged between processes. This high latency has several causes derived from concurrency issues.

The first one is the cost of locks. Locks are needed to provide mutual exclusion, avoid race conditions, and ensure that changes occur in an ordered way. However, they are highly expensive due to the arbitration required when they are contended. This arbitration consists mainly of context switching and yielding control to the OS kernel, which suspends the threads until the lock is released. During this suspension, data stored in caches may be lost, which causes a big drop in performance. Handling race conditions at a fine granularity is complicated, so queues are usually implemented with a single lock on the put and get methods. This single lock represents a significant bottleneck for throughput and a large latency penalty.

The second cause is related to the memory hierarchy, especially the caches. In them, data is organized in lines that are typically between 32 and 256 bytes in size, the most common cache-line size being 64 bytes. If the size of the elements of the list is not a multiple of the cache-line size, it might provoke a phenomenon referred to as "false sharing": two threads write different elements that share one cache-line, so threads running in different cores keep stealing that cache-line from one another. This causes write contention that reduces performance dramatically.

The third one concerns the bounds of the queue. If it is unbounded (not constrained in size), then the producers (i.e., threads writing) can outpace the consumers (i.e., threads reading) and cause a failure by running out of memory. If it is bounded (constrained in size), it needs to be either array-backed (it relies on an array of limited size and, when it is full, no more data can be added) or to have its size actively tracked (when the variable tracking the size reaches its limit, it no longer accepts more data). Furthermore, there is a lot of contention on the head, tail and size variables of the queues. In most scenarios, queues tend to be full or close to empty due to differences in pace between consumers and producers; they are rarely balanced. Moreover, these variables often occupy the same cache-line, which causes huge write contention in multi-thread scenarios.

Finally, there is a problem regarding languages with garbage collectors, like Java: queues generate large amounts of garbage. To process the data, we have to allocate a queue node that holds it and, after processing, the node is removed from the queue, waiting to be freed by the garbage collector. If a program processes 1,000 messages per second, it generates 1,000 objects per second that will never be used again, resulting in calls to the garbage collector and the corresponding loss of performance.

Disruptor was designed with the goal of solving the problems associated with queues and concurrency. Its core is a pre-allocated bounded data structure in the form of a ring buffer. First of all, it gets rid of all the locks, which are substituted by mechanisms like CAS operations, memory barriers and busy spins (loops where flags are continuously checked). A CAS (Compare And Swap) operation is a special machine-code instruction that allows a word in memory to be conditionally set as an atomic operation. This CAS approach is significantly more efficient than locks because it does not require context switches to the kernel for arbitration. However, CAS operations are not free: the processor must lock its instruction pipeline to ensure atomicity, and it must employ a memory barrier to make the changes visible to other threads. CAS operations are available in Java through the classes of the java.util.concurrent.atomic package.

Besides the use of CAS operations to ensure ordered access to shared data, Disruptor uses memory barriers as well. Modern processors reorder instructions to increase performance; they only need to guarantee that the logic of the program stays the same. While this is enough for sequential programs, when threads share data it does not guarantee that all memory changes appear in order to all threads. Memory barriers are used to ensure that, in a specific section of code, all changes to memory are seen as ordered. There are three types of barriers depending on the operations that need to be ordered: read, write and full memory barriers.

Disruptor pre-allocates all slots in the ring buffer on start-up. Each slot works as a container for the data that will be processed. Garbage collection works best when objects are either very short-lived or almost immortal [13] [14]. With Disruptor pre-allocation, the slots in the ring are reused as long as the instance lives, making them effectively immortal and easing the work of the garbage collector. Moreover, as the data of the entries is allocated at the same time, it is likely to be stored in contiguous memory positions, which supports cache striding.

The sequencing mechanism is the core of how concurrency is managed by Disruptor. A Sequence is an object that provides a variable referring to a position in the ring buffer, together with concurrent methods for reading and writing this value. This variable is a simple counter and is guaranteed, through padding, not to share its cache-line with any other variables, which eliminates the problem of false sharing. The methods to access and modify this value are implemented using CAS operations and memory barriers included in the sun.misc.Unsafe package, which eliminates the disadvantages of locks. But the value of the sequence is not enough to access the ring buffer: we have to calculate the remainder with the ring size to know which position of the ring buffer it corresponds to. This operation can be expensive, but its cost can be minimized by making the ring size a power of 2, so that a bit mask can be used to perform the calculation efficiently.
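As a minimal sketch of this indexing trick (names are illustrative):

```java
// Power-of-two masking: maps an ever-increasing sequence to a ring slot.
// Equivalent to sequence % bufferSize, but computed with a single AND.
static int slotFor(long sequence, int bufferSize) {
    // Precondition: bufferSize is a power of two, e.g. 1024 (2^10),
    // so (bufferSize - 1) is an all-ones bit mask.
    return (int) (sequence & (bufferSize - 1));
}
```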

Figure 2.1 presents a basic use case for producers and consumers of the ring buffer. A ring buffer with one producer contains two values: (1) next, a long integer pointing to the first available position to write, and (2) a sequence called cursor, pointing to the last position ready to be processed. When the producer wants to insert new data in the ring (Figure 2.1(a)), it has to request a new position. This is done with the next() call, which returns the value of the next sequence and increases it; no extra measures for concurrency are taken, since only the producer reads and writes this variable. Then, the producer can claim this position and modify it. Finally, when the data has been pushed, it calls the publish() method, which updates the cursor sequence. This cursor can be read and written safely because only the one producer writes its value and the consumers just read it. In the multi-producer configuration of Disruptor the implementation differs, but the logic and the calls for the producer stay the same: essentially, the sequence now indicates the next position available to be written, and an auxiliary data structure indicates, for each position of the ring, whether the data is available to be processed.

Figure 2.1(b) shows how a consumer processes the data in the ring buffer. Each consumer has its own Sequence, which indicates the last processed position. With each consumer having its own sequence, producers can track them to prevent the ring from wrapping. Moreover, this allows consumers to coordinate work on the same entry. The consumer waits for the next position of the ring to be available through a call to waitFor(nextSequence). This method checks the cursor sequence, waiting until it surpasses the expected one; once this happens, it returns the current value of the cursor. Then, the consumer can process all the data from its nextSequence position up to the current value of the cursor. This allows the consumer to process a batch of events with a single wait, enabling it to regain pace when the producers emit a burst of events, increasing throughput and reducing latency. Once it has processed the whole batch, it updates its own sequence.

In the single-producer configuration, the consumers only read the value of the cursor and update a private variable, so there is no need for CAS operations and only memory barriers are used. In the multi-producer configuration, consumers have to check that the data is available to process, which only involves another memory barrier without any further mechanisms. Both the consumers' and the producers' logic is implemented as Runnable tasks. These tasks are executed by a pool of threads with different configurations using an Executor. The Executor configuration is further analyzed in Section 3.1.1.

(a) Event publishing from a producer

(b) Event processing from a consumer

Figure 2.1: Behavior of the Disruptor with one producer
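A minimal sketch of the single-producer publish protocol of Figure 2.1(a), using the raw RingBuffer API of Disruptor 3.x; ValueEvent is an illustrative event holder, not a Disruptor class.

```java
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.YieldingWaitStrategy;

// Sketch of the next()/publish() protocol described above.
public class PublishExample {
    static class ValueEvent {
        long value; // mutable, pre-allocated payload holder
    }

    public static void main(String[] args) {
        RingBuffer<ValueEvent> ring = RingBuffer.createSingleProducer(
                new EventFactory<ValueEvent>() {
                    public ValueEvent newInstance() { return new ValueEvent(); }
                }, 1024, new YieldingWaitStrategy());

        long seq = ring.next();       // claim the next slot of the ring
        try {
            ring.get(seq).value = 42; // fill the pre-allocated entry in place
        } finally {
            ring.publish(seq);        // advance the cursor so consumers can see it
        }
    }
}
```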

2.3 Other communication frameworks

This section describes some common communication frameworks that use sockets and/or Disruptor.

Storm

Storm [15] is a free and open source distributed real-time computation system that can be used with any programming language. In essence, it is a Big Data framework for real-time computation, while Hadoop [16] is for batch processing. It is based on topologies. A topology is a graph of computations where each node holds the logic for processing the data. All data is passed around in the form of streams, which are unbounded sequences of data. There are two different types of elements inside a topology, spouts and bolts, as shown in Figure 2.2. Spouts generate streams by pulling data from a source (e.g. Twitter or Facebook) and transforming it so that the bolts can operate on it, while bolts can generate new streams if necessary.

(a) Example of a topology with a simple transformation in 3 steps  (b) Another example with a different topology

Figure 2.2: Examples of Storm topologies. Bolts with the same color indicate that they apply the same transformation.

Data processing with the Storm framework has to be deployed and run in a cluster of machines, referred to as a Storm cluster, whose job, from an architectural point of view, is to run topologies. Storm clusters define two types of nodes: master and worker nodes. Master nodes run a daemon in charge of distributing the code across the cluster, scheduling and monitoring tasks. Worker nodes run a daemon that listens for work assignments and starts and stops worker processes accordingly; each of these workers executes a subset of the topology. Moreover, Storm uses ZooKeeper [17] between master and worker nodes to provide high reliability, tracking lost jobs and machines and restarting the job assignment.

Netty

Netty [18] is an asynchronous event-driven network application framework for the rapid development of maintainable high performance servers and clients. It provides an abstraction of the network layer that keeps its power and flexibility while removing the complexity of coding it directly, offering easy maintenance, high performance and scalability. Unlike other asynchronous network libraries, Netty does not require one thread to manage each request, which lets it handle more requests with lower memory requirements and less context switching, therefore improving scalability and performance. Moreover, Netty provides its own buffer API instead of using the NIO ByteBuffer to represent a sequence of bytes, performing transparent zero-copy, which increases throughput and lowers latency. It works using events and handlers: each received message is translated into an event, which is processed by a series of handlers inside a pipeline.
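As a hedged illustration of this event/handler model, the sketch below shows a minimal echo handler in the style of the Netty 4 API; note that the version evaluated later in this work is an older 3.x-era release whose class names differ.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Minimal echo handler: each received message arrives as an event on
// channelRead(), and the handler writes the same message back downstream.
public class EchoHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        ctx.writeAndFlush(msg); // pass the message back through the pipeline
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        cause.printStackTrace();
        ctx.close(); // close the channel on unexpected errors
    }
}
```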

ZeroMQ

ZeroMQ [19] (also known as ØMQ, 0MQ, or zmq) is an open source project that looks like an embeddable networking library but acts like a concurrency framework. It provides sockets that carry atomic messages across various transport layers such as in-process, inter-process, TCP, and multicast. It allows sockets to be connected in an N-to-N fashion with patterns like fan-out, pub-sub, task distribution, and request-reply. It is fast enough to be the fabric for clustered products. Its asynchronous I/O model provides scalability to multicore applications built as asynchronous message-processing tasks, and it runs on most operating systems. Its objective is to increase the performance of communications by removing complexity rather than by exposing new functionality.

Chapter 3

Design of an efficient solution for scalable socket-based multi-thread applications

This chapter presents the design of a solution for scalable socket-based multi-thread applications for HPC. The solution has the objective of reducing and stabilizing the latency of communications through sockets, which suffer from read and write contention when used by multiple threads. First, Section 3.1 presents the analysis of the communication frameworks used, Section 3.2 introduces the testbed used for the performance evaluation, Section 3.3 describes the proposed solution and the performance results for a network micro-benchmark, and Section 3.4 presents a high performance multi-thread and multi-socket server, comparing its performance with real applications.

3.1 Analysis

First, we present a thorough analysis of the libraries used in the construction of our efficient solution: Disruptor and sockets.

3.1.1 Disruptor

The analysis of Disruptor is structured in three parts. The first one involves choosing the right implementation of Disruptor. The second one focuses on which waiting strategy will be used on the consumer barrier. Finally, the third one involves the configuration of the Executor in charge of running the producer and consumer tasks.

There are two different methods to create a ring buffer depending on the number of producers: RingBuffer.createSingleProducer(...) and RingBuffer.createMultiProducer(...). This is key since, as explained in Section 2.2, Disruptor behavior heavily relies on this number. They differ essentially in the methods to access and manipulate the sequence: for a single producer, no special methods are needed to publish new data (except for a memory barrier that makes it visible to the consumers) but, for multiple producers, the CAS operations and memory barriers mentioned in Section 2.2 are used.

Regarding the strategy consumers use to wait for new data to process, we can select different methods when creating the ring buffer:

• BlockingWaitStrategy - It uses a lock and a condition variable. This strategy can be used when throughput and low latency are not as important as CPU resources.

• LiteBlockingWaitStrategy - It requires the same resources and can be used in the same environments as the BlockingWaitStrategy, but it adds an AtomicBoolean so the producer does not try to wake up consumers that are not waiting.

• BusySpinWaitStrategy - It uses a busy spin loop. This strategy spends CPU resources to avoid syscalls, which can introduce latency jitter. It is preferred when threads can be bound to specific CPU cores.

• SleepingWaitStrategy - It initially spins, then uses Thread.yield() (the executing thread is suspended and the CPU is given to some other runnable thread; the yielding thread waits until the CPU becomes available again), and eventually disables the current thread for the minimum number of nanoseconds the OS and the JVM allow. This strategy is a good tradeoff between performance and CPU resource usage, although latency spikes can occur after quiet periods.


• TimeoutBlockingWaitStrategy - It uses a lock and a condition variable, trying to predict the minimum waiting time. It is a good tradeoff between performance and CPU resources, but it depends on defining an accurate timeout value.

• YieldingWaitStrategy - It uses a call to Thread.yield() after an initial spin. It is a good tradeoff between performance and CPU resources without incurring significant latency spikes.

• PhasedBackoffWaitStrategy - It spins, then yields, and then waits using the configured fallback WaitStrategy, which can be any of the previous six strategies. This strategy can be used when throughput and low latency are not as important as CPU resource usage.

Finally, we have to choose which type of Executor will run the producer and consumer tasks. An object of the class Executor is in charge of executing new Runnable tasks, hiding the details of thread use, scheduling, etc. Instead of a plain Executor, we use the sub-interface ExecutorService, which also provides methods to manage termination and methods that can produce a Future object for tracking the progress of one or more asynchronous tasks. The following factory methods provide predefined implementations of the ExecutorService:

• newFixedThreadPool - It creates a thread pool that reuses a fixed number of threads. At any point, there cannot be more active threads processing tasks than specified when creating the ExecutorService. If additional tasks are submitted when all threads are active, they wait in the queue until a thread is available.

• newSingleThreadExecutor - It creates an Executor that uses a single worker thread. Tasks are guaranteed to be executed sequentially, and no more than one task will be active at any given time.

• newCachedThreadPool - It creates new threads as needed, but reuses previously created threads when they are available. If no thread is available, a new thread is created and added to the pool. These pools usually improve the performance of programs that execute many short-lived asynchronous tasks.

• newScheduledThreadPool - It creates a thread pool that can schedule commands to run after a given delay, or to execute periodically.

• newSingleThreadScheduledExecutor - Like newScheduledThreadPool, it creates an executor that can schedule commands to run after a given delay or to execute periodically, but single-threaded: tasks are guaranteed to be executed sequentially, and no more than one task will be active at any given time.

Additionally, the ExecutorService uses a ThreadFactory in order to create new threads. Disruptor provides its own implementation called DaemonThreadFactory, in which all threads are created as daemon threads: they do not prevent the JVM from exiting when the program finishes even if the thread is still running. This configuration is used because consumer threads are continuously waiting for new data to process and, although the program can finish because all data has been processed, consumers keep waiting at the barrier for new data. If they were not daemon threads, the ExecutorService would have to be shut down explicitly.

These two last parameters (WaitStrategy and Executor) offer predefined values but, if none of them suits the purpose of a given application, the user can create new ones by implementing the proper interfaces. The configurations used in the design of our solution vary depending on the use case: there are cases where we can sacrifice latency to have more threads processing data, and cases where communication latency has a higher priority. A minimal wiring of these choices is sketched below.
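The following sketch shows how these choices fit together, assuming the Disruptor 3.x API; MsgEvent and the empty handler body are illustrative placeholders, not classes from this work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.lmax.disruptor.BatchEventProcessor;
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.YieldingWaitStrategy;

// Wiring one consumer to a single-producer ring buffer with a yielding
// wait strategy and a cached thread pool (illustrative configuration).
public class ConfigExample {
    static class MsgEvent {
        byte[] data;
    }

    public static void main(String[] args) {
        RingBuffer<MsgEvent> ring = RingBuffer.createSingleProducer(
                new EventFactory<MsgEvent>() {
                    public MsgEvent newInstance() { return new MsgEvent(); }
                }, 1024, new YieldingWaitStrategy());

        BatchEventProcessor<MsgEvent> consumer = new BatchEventProcessor<MsgEvent>(
                ring, ring.newBarrier(), new EventHandler<MsgEvent>() {
                    public void onEvent(MsgEvent e, long seq, boolean endOfBatch) {
                        // process e.data here
                    }
                });
        ring.addGatingSequences(consumer.getSequence()); // prevent ring wrapping

        ExecutorService executor = Executors.newCachedThreadPool();
        executor.execute(consumer); // threads are created on demand and reused
    }
}
```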

3.1.2 Sockets

Regarding sockets, we use four different implementations. Since the problem of sharing a socket between many threads exists in both the java.net and java.nio packages, we test our solution with both of them. Moreover, since our goal is to obtain the best performance, we also use the high performance communication libraries JFS and UFS. Both of them can be used transparently, but only UFS is able to work with non-Java clients: JFS implements, through reflection, a mechanism to replace the underlying socket library of the JVM transparently, whereas UFS relies on preloading the library and transparently intercepting the calls to the underlying network library of the kernel. Our tests use UFS because it is the more general solution.

3.2 Experimental Configuration

The testbed for the performance evaluation is one node of the pluton cluster of the Computer Architecture Group of the University of A Coruña [20]. A full description of the node is provided in Table 3.1.

CPU Model                 2 x Intel Xeon E5-2660 Sandy Bridge-EP
CPU Speed/Turbo           2.20 GHz / 3 GHz
Cores per CPU             8
Threads per core          2
Cores/Threads per node    16/32 (hyperthreading)
Cache L1/L2/L3            32 KB / 256 KB / 20 MB
RAM                       64 GB DDR3 1600 MHz

Table 3.1: Description of the testbed

The benchmarks are generally run inside a single node, with the purpose of testing shared memory communications. We use one JVM per client and per server, launching several threads within each. The versions used are Disruptor 3.2.1, Netty 3.2.0-ALPHA4, Java OpenJDK 1.7.0_55 and UFS 1.2.0.

3.3 Micro-benchmarking

In this section, we present the design and results of a multi-thread PingPong benchmark. The goal of this benchmark is to measure the performance of communications in an environment where the sockets used to send and receive messages between JVMs are shared by a pool of threads.

Figure 3.1(a) shows how a single-thread PingPong benchmark works. A series of messages are exchanged between the JVMs, measuring the time that it takes to transfer them. A PingPongProcessor is defined as the task responsible for handling these communications: it calls the send and recv methods of the socket within a loop, in order to send and receive data. By measuring the time spent transferring each message, the benchmark presents the results in terms of latency. The raw multi-thread version of this benchmark is shown in Figure 3.1(b), in which multiple PingPongProcessor tasks run within a thread pool and use the same socket.

(a) Single-thread version

(b) Multi-thread version

Figure 3.1: Single-thread and multi-thread raw versions of the PingPong benchmark

Figure 3.2 presents the design of the benchmark using the proposed solution. The goal is to provide a communication API with send and recv methods, implemented in a communication library that uses Disruptor to access the sockets in order to reduce contention. The PingPongProcessor follows the same logic as before: it calls the send and recv methods within a loop but, instead of calling the socket methods, it calls the new ones. These new methods rely on a buffer where the data is placed (SendBuffer and RecvBuffer), which is implemented with Disruptor, and on a processor in charge of managing the buffer and the socket (PingProcessor and PongProcessor).

The send method works as follows. First, it claims the next position of the SendBuffer (using the next method); then, it gets the Event object in which it places the data; and finally it marks the object as ready to be processed (by calling the publish method). After the data has been published, the PingProcessor is in charge of calling the socket method with the data in the SendBuffer and, once the socket call returns, it notifies the PingPongProcessor that the operation has been completed. The behavior of the recv method is the same as that of the send method, exchanging SendBuffer and PingProcessor for RecvBuffer and PongProcessor, which now calls the recv method of the socket and copies the data into the buffer provided by the ring buffer.

Figure 3.2: Diagram of how the final PingPong multi-thread benchmark works

The benchmark uses the following parameter values (a sketch of the resulting send method follows this list):

• Size of message: 1024 bytes. It was chosen as high enough to transfer a significant amount of data to be processed.

• Ring buffer size: 1024 entries. Disruptor requires the size of the ring to be a power of two. After some preliminary executions, we concluded that 1024 is the minimum size that did not cause the buffer to become full, which would make the threads wait when inserting data into the buffers. We aim to minimize the use of memory without compromising performance.

• WaitStrategy: YieldingWaitStrategy, to minimize latency spikes (our goal is to stabilize the latency).

• Threadpool: CachedThreadPool, chosen after a series of experimental tests with all the thread pools. It showed the best performance thanks to its ability to create threads on demand, optimizing the use of resources.
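A minimal sketch of the send method described above, assuming sendBuffer is the Disruptor ring buffer behind the SendBuffer abstraction and MessageEvent is an illustrative event holder:

```java
// Sketch of the send() call of the communication API (names are illustrative).
public void send(byte[] payload) {
    long seq = sendBuffer.next();                 // claim the next SendBuffer slot
    try {
        MessageEvent event = sendBuffer.get(seq); // reuse the pre-allocated event
        event.setData(payload);                   // place the data in the slot
    } finally {
        sendBuffer.publish(seq);                  // mark it ready for the PingProcessor
    }
}
```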

3.3.1 Shared Memory Performance Results

Figures 3.3 and 3.4 present the results obtained for the PingPong benchmark, comparing the raw version with the new design using Disruptor. We launch 2 JVMs per node, both with a variable number of concurrent tasks per iteration, from a single one up to 16, each executed by a single thread, in order to appreciate the evolution of the results as we increase the parallel work. Figure 3.3 shows the latency results using the java.net library, and Figure 3.4 shows the latency results using UFS. Results with java.nio are not included because there is no need for scalability in the socket creation. The minimum, mean and 99th percentile of the latency are shown, as well as the standard deviation of the measurements, for different numbers of threads. The 99th percentile was chosen instead of the maximum value in order to remove the outliers from the analysis. Figures 3.5 and 3.6 show the same results combined in one plot.

Results with java.net show that the minimum latency increases (Figure 3.3(a)) up to 20 microseconds, as does the mean latency (Figure 3.3(b)), which increases up to 10%. However, the 99th percentile (Figure 3.3(c)) is significantly reduced, more than 9 times in the 16-thread scenario. Finally, the standard deviation values are much lower and more stable (Figure 3.3(d)): in the 8-thread scenario it goes from 10 times the mean value in the original version to half the mean in our design, i.e., 20 times lower overall.

As we predicted, the introduction of an arbitration system drastically reduces both the maximum values and the variation, at the cost of a slight increase in the minimum and mean values.

Results with UFS (Figure 3.4) show a large performance increase thanks to the improvement introduced by UFS sockets: latency results are reduced up to 1500 times. Moreover, in this case, the use of Disruptor reduces the mean latency by up to 50%, as shown in Figure 3.4(b). From the other latency values we can draw the same conclusion as in Figure 3.3: there is a large reduction in the maximum latencies and the standard deviation, which means that the results are much more stable and predictable.


(a) Minimum latency

(b) Mean latency

(c) Percentile 99

(d) Standard Deviation

Figure 3.3: Results of the PingPong multi-thread micro-benchmarks


(a) Minimum latency with UFS

(b) Mean latency with UFS

(c) Percentile 99 with UFS

(d) Standard Deviation with UFS

Figure 3.4: Results of the PingPong multi-thread micro-benchmarks with UFS


Figure 3.5: Global results of PingPong multi-thread micro-benchmark with java.net sockets

Figure 3.6: Global results of PingPong multi-thread micro-benchmark with UFS sockets

3.3.2 Distributed Memory Performance Results

Figure 3.7 presents the results of the execution of the PingPong benchmark in a distributed memory environment. Each of the JVMs is executed in a different node with the same characteristics as the one used for the shared memory test, and the interconnection network is 10 Gigabit Ethernet. The scale of the graphs is logarithmic due to the large difference between the values of the sequential scenario (1 thread) and the parallel ones. As we can see, despite the increase in the minimum latency (Figure 3.7(a)) caused by the overhead introduced by the disruptors, the mean, the 99th percentile and the standard deviation all decrease, meaning that our solution provides better performance and scalability, as well as more predictable results.


(a) Minimum latency

(b) Mean latency

(c) Percentile 99

(d) Standard Deviation

Figure 3.7: Results of the PingPong multi-thread micro-benchmarks in a distributed memory environment

3.4 Comparison with a real-world application

In this section, the proposed solution is used to design a scalable NIO server able to handle parallel workloads from multiple clients. Its performance is then compared with Netty, a real-world solution that implements a high performance scalable server.

3.4.1 Design of the multithread server

Figure 3.8 presents the design of a scalable multithread server, applying Disruptor as an arbitration system between the thread that manages the socket and the threads processing the workloads. The server has a main thread in charge of accepting the connections from the clients. When a client connects to the server, it creates a new instance of the class ChannelHandler. This instance accepts the messages received through the socket and puts them into the input ring buffer. From this ring buffer, a pool of workers processes the data and puts the responses into the output ring buffer, notifying the thread in charge of the socket communications that the data is ready to be sent. The notified thread takes all the data available in the output ring buffer and sends it back to the client.

Figure 3.8: Design and operation of the multithread server

The mechanism for distributing the data of the input ring buffer between the workers involves the use of a global sequence and a local sequence for each of the workers. The workers use the global sequence to get the next position of the ring buffer to process, using CAS operations and memory barriers to maintain consistency. When a worker processes a position of the buffer, it updates its own sequence. The thread managing the socket has to check the sequence values of all the workers in order to avoid overwriting an entry of the buffer that is still being processed by a worker; only memory barriers are needed for this, since each thread updates the value of its own sequence. The method for accessing the output ring buffer is the same as the one used in the PingPong benchmark. A sketch of this distribution mechanism follows.
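This is a hedged sketch of the distribution mechanism, using java.util.concurrent.atomic counters in place of Disruptor sequences; all names (Request, globalSeq, published, workerSeq) are illustrative, not classes from this work.

```java
import java.util.concurrent.atomic.AtomicLong;

class Request { byte[] data; } // illustrative request holder

// Illustrative worker loop over the input ring buffer. globalSeq is shared by
// all workers (atomic claim); workerSeq is this worker's last fully processed
// slot, which the socket thread gates on before reusing an entry.
final class Worker implements Runnable {
    private final AtomicLong globalSeq;   // next position to claim
    private final AtomicLong published;   // last entry made visible by the socket thread
    private final AtomicLong workerSeq;   // this worker's progress
    private final Request[] ring;         // power-of-two sized input ring

    Worker(AtomicLong globalSeq, AtomicLong published,
           AtomicLong workerSeq, Request[] ring) {
        this.globalSeq = globalSeq;
        this.published = published;
        this.workerSeq = workerSeq;
        this.ring = ring;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long claim = globalSeq.getAndIncrement();           // atomic claim of one slot
            while (published.get() < claim) { Thread.yield(); } // wait until data is ready
            Request request = ring[(int) (claim & (ring.length - 1))];
            process(request);                                   // application logic
            workerSeq.set(claim);                               // publish own progress
        }
    }

    private void process(Request request) { /* handle request, fill output ring */ }
}
```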

3.4.2 Comparison with Netty

In order to compare our solution with an established one, we use Netty, since it is a framework for deploying high performance scalable servers comparable to the one we have designed. Servers deployed with Netty consist of a single thread accepting incoming connections. When a connection is accepted, a new thread is created to manage it. On incoming messages, this thread generates events that are processed by a series of handlers, and any handler can send back a response if needed.

The performance comparison is done using the multi-client and multi-thread synthetic application described before. This application simulates a given number of clients working concurrently against the same server. Each of these clients is a multithread program where each thread sends data that has to be processed by the server, and each client connects to the server using its own socket. java.nio sockets are used because of the need for scalability and for asynchronous processing of the messages. This socket is shared between a number of threads, each of them performing send and receive calls on the socket. To simulate the behavior of real applications, we use a synthetic workload within the server that consists of a call to the LockSupport.parkNanos(..) method, which disables the current thread for thread scheduling purposes for up to the specified waiting time. We also use an active workload that consists of a loop of multiplication and addition operations (both are sketched below). Both workloads are tested with a configuration of 2 client applications, each running 16 threads.
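A minimal sketch of these two synthetic workloads; the durations and iteration counts come from the tables below, but the exact body of the active loop is an assumption, since the text only specifies multiplications and additions.

```java
import java.util.concurrent.locks.LockSupport;

public final class Workloads {
    // Passive workload: deschedules the calling thread for roughly 'micros' microseconds.
    static void parkWorkload(long micros) {
        LockSupport.parkNanos(micros * 1000L);
    }

    // Active workload: keeps the core busy with a multiply-add loop.
    static double activeWorkload(int iterations) {
        double acc = 1.0;
        for (int i = 0; i < iterations; i++) {
            acc = acc * 1.0000001 + 0.0000001;
        }
        return acc; // returned so the JIT cannot eliminate the loop
    }
}
```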

The number of threads running in the server changes continuously in order to exploit all the cores if necessary but, if no client is connected, only the thread accepting connections is running.

The parameters used for the server are:

• Size of message: 1024 bytes. As in the PingPong benchmark, it was chosen as a value high enough to transfer a significant amount of data in a real-world server-client application.

• Ring buffer size: 1024 entries. With the goal of minimizing the use of memory without compromising performance, experimental tests were carried out, yielding the specified value.

• WaitStrategy: As in the PingPong benchmark, our goal is to stabilize the latency, so we chose the YieldingWaitStrategy.

• Threadpool: Since all the workloads from a client need to be processed the same way, we chose the CachedThreadPool, so threads that have finished can be reused to process new workloads.

The values used in the performance comparison are the task time, the mean response time and the 99th percentile of the response time. A task represents the interval from when the data is sent to the server until it is received back, and the task time is calculated as the total execution time of the application divided by the total number of tasks, taking into account that tasks can be processed in parallel. The response time refers to the time that a working thread on the client waits for its data to be processed; since the outliers do not provide useful information, we use the 99th percentile to measure the maximum response time.

Table 3.2 shows the results for java.nio sockets and the parkNanos workload. Our solution introduces some overhead due to the use of the disruptors, in exchange for scalability and a better use of the CPU resources. But, when the workload increases, our solution is more reliable and predictable: in fact, with a workload of 5 milliseconds, our solution provides a task time more than 13 times lower.

The same results are observed for the mean response time of single tasks and for the 99th percentile of the response time.

Comparing with the active workload results in Table 3.3, we can see that, with no workload, our solution introduces some overhead due to the use of the disruptors. But when introducing a loop of 1,000,000 iterations, our solution provides up to 4 times less task time, a mean response time 4 times lower and a 99th percentile of the response time 55 times lower. Similar results are obtained when increasing the number of iterations 5 times. Results using UFS are shown in Tables 3.4 and 3.5 and, despite the reduction in communication latency, they behave in a similar manner.

                                                Workload (microseconds)
                                                none        500         5000
Task time                     Netty             8.80        287.60      2567.13
                              DisruptorServer   40.13       51.99       190.21
Mean response time            Netty             256.62      8373.24     72384.56
                              DisruptorServer   1127.76     1207.38     5667.04
99th percentile resp. time    Netty             1359.66     38369.96    1582257.50
                              DisruptorServer   32120.83    23262.89    24033.28

Table 3.2: Results of the benchmark with java.nio sockets and parkNanos workload

                                                Workload (iterations)
                                                none        1000000     5000000
Task time                     Netty             8.80        899.80      4397.59
                              DisruptorServer   40.13       238.80      1479.68
Mean response time            Netty             256.62      25726.46    126164.89
                              DisruptorServer   1127.76     4198.20     29949.29
99th percentile resp. time    Netty             1359.66     456163.32   3117547.50
                              DisruptorServer   32120.83    8237.14     57218.39

Table 3.3: Results of the benchmark with java.nio sockets and an active workload

                                                Workload (microseconds)
                                                none        500         5000
Task time                     Netty             6.53        279.02      2539.13
                              DisruptorServer   41.68       53.41       226.09
Mean response time            Netty             181.15      8348.02     74214.06
                              DisruptorServer   1190.03     1205.92     5845.94
99th percentile resp. time    Netty             1793.22     22577.49    1149894.69
                              DisruptorServer   216.25      20850.01    93250.90

Table 3.4: Results of the benchmark with UFS sockets and parkNanos workload

                                                Workload (iterations)
                                                none        1000000     5000000
Task time                     Netty             6.53        914.85      4504.47
                              DisruptorServer   41.68       291.19      1302.86
Mean response time            Netty             181.15      26298.21    130848.49
                              DisruptorServer   1190.03     3511.08     54230.90
99th percentile resp. time    Netty             1793.22     545423.37   2043248.69
                              DisruptorServer   216.25      4851.97     1135813.61

Table 3.5: Results of the benchmark with UFS sockets and an active workload

Chapter 4

Conclusions and future work

Currently, sockets are the main mechanism for communication between distributed applications. Moreover, the number of threads used in these applications keeps increasing in order to exploit the growing number of cores in modern CPUs. Java is a good alternative for this environment thanks to its built-in multithread and network support, in addition to the existence of high performance socket implementations such as JFS and UFS. However, the simple combination of these two elements, sockets and threads, can cause a loss of performance and scalability, either because of creating too many sockets, which causes scalability issues, or because of sharing the socket between threads, which causes contention problems.

In this work, we propose and evaluate a solution that solves the contention problems of sharing the socket between threads by including an arbitration mechanism to access the socket: Disruptor. This solution has been benchmarked using both Java network libraries (java.net and java.nio) and high performance sockets (UFS, JFS), in both synthetic and real-world environments. The evaluation of our solution consisted in the design and implementation of two benchmarks, covering synthetic and real environments, including a comparison with a real-world high performance scalable framework. Our results show that the proposed solution is able to decrease and stabilize the latency in the synthetic benchmark, and is able to exploit the parallelism in real applications by using all the available resources of the CPU. Our solution provides predictable performance and higher scalability.


This work can be extended with a deeper analysis of distributed environments with a higher number of threads per application. Moreover, this solution could be provided as a library exposing the send/recv methods presented for the PingPong benchmark and the management of servers and clients. The promising results obtained set the basis to develop a lighter ad-hoc arbitration mechanism, based on the ring buffer and lock-free algorithms, but removing the overhead introduced by the full Disruptor framework.

Appendix A

Resumen en castellano

Este trabajo presenta el diseño e implementación de un mecanismo multi-thread para el acceso eficiente a sockets. El número de núcleos por procesador sigue creciendo y, para aumentar la escalabilidad, las aplicaciones requieren lenguajes que se adapten fácilmente a estos sistemas multi-thread. Java es un lenguaje adecuado para estos entornos gracias a su soporte nativo multi-thread y de red. Además, el aumento de recursos de los nodos actuales, junto con nuevas arquitecturas como NUMAdirect y aceleradores como el Intel Xeon Phi, hacen posible ejecutar varias aplicaciones en un mismo nodo, convirtiendo las comunicaciones a través de memoria compartida en críticas. Con el objetivo de comunicar diferentes aplicaciones, Java ofrece dos APIs de sockets: Java Net y Java NIO, ambas apoyadas en la librería de sockets nativa del sistema. Además, existen implementaciones de alto rendimiento de la librería de sockets, como JFS y UFS, que nos proporcionan el rendimiento adecuado para estos entornos. Sin embargo, la simple combinación de sockets y threads puede provocar pérdidas de rendimiento y escalabilidad, debido tanto a la creación desmesurada de sockets, que causa problemas de escalabilidad, como a la compartición del socket entre threads, que causa problemas de contención. En consecuencia, para solucionar esta situación y proporcionar un rendimiento predecible, este trabajo presenta un sistema eficiente de arbitraje entre threads y sockets usando Disruptor, un framework que proporciona thread safety sin pérdida de rendimiento.

La evaluación de la solución propuesta consistió en el diseño e implementación de dos benchmarks, centrándonos tanto en entornos sintéticos como reales, incluyendo una comparación con un framework escalable de alto rendimiento de aplicación real. También fue probada con diferentes librerías de sockets para garantizar la portabilidad de la solución. Nuestros resultados muestran que la solución propuesta es capaz de disminuir y estabilizar la latencia en el benchmark sintético y de explotar el paralelismo en la aplicación en tiempo real usando todos los recursos disponibles en la CPU. La solución también proporciona un rendimiento más predecible y una mayor escalabilidad.

Como líneas futuras, este trabajo puede ser extendido mediante un análisis más profundo de entornos distribuidos con un mayor número de threads por aplicación. Además, esta solución puede ser proporcionada como una librería usando los métodos send/recv presentados en el benchmark sintético y el manejo de servidores y clientes. Los prometedores resultados obtenidos sientan las bases para el desarrollo de un sistema de arbitraje ad-hoc más ligero, aprovechando los beneficios del ring buffer y los algoritmos libres de bloqueos, eliminando la sobrecarga introducida por el uso del framework de Disruptor al completo.

Bibliography

[1] Blount, B. and Chatterjee, S., "An Evaluation of Java for Numerical Computing," in ISCOPE'98, Springer Verlag, pp. 35–46, 1998.

[2] Oracle Corporation, "The Java HotSpot Performance Engine Architecture."

[3] Taboada, G. L., Ramos, S., Expósito, R. R., Touriño, J., and Doallo, R., "Java in the High Performance Computing Arena: Research, Practice and Experience," Science of Computer Programming, vol. 17, no. 8, pp. 1709–1719, 2013.

[4] Oracle Corporation, "Summary of the java.net package, Java Platform, Standard Edition 7 API Specification." http://docs.oracle.com/javase/7/docs/api/java/net/package-summary.html. [Last visit June 2014].

[5] Liang, S., The Java Native Interface: Programmer's Guide and Specification. Addison-Wesley, 1999.

[6] Oracle Corporation, "Summary of the java.nio package, Java Platform, Standard Edition 7 API Specification." http://docs.oracle.com/javase/7/docs/api/java/nio/package-summary.html. [Last visit June 2014].

[7] "High Performance Java Fast Sockets (JFS) Home page." http://torusware.com/product/java-fast-sockets-jfs/. [Last visit June 2014].

[8] Taboada, G. L., Touriño, J., and Doallo, R., "Java Fast Sockets: Enabling High-speed Java Communications on High Performance Clusters," Computer Communications, vol. 31, no. 17, pp. 4049–4059, 2008.

[9] "High Performance Fast Sockets - Universal Fast Sockets (UFS) Home page." http://torusware.com/product/universal-fast-sockets-ufs/. [Last visit June 2014].

[10] "TORUS Software Solutions S.L." http://torusware.com/. [Last visit June 2014].

[11] Thompson, M., Farley, D., Barker, M., Gee, P., and Stewart, A., "Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads." http://lmax-exchange.github.com/disruptor/files/Disruptor-1.0.pdf, May 2011.

[12] "Disruptor Home Page on GitHub." http://lmax-exchange.github.io/disruptor/. [Last visit June 2014].

[13] Reitbauer, A., Enzenhofer, K., Grabner, A., Kopp, M., Pierzchala, S., and Wilson, S., Java Enterprise Performance. Compuware, 2012.

[14] "Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning." http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html. [Last visit June 2014].

[15] "Storm Home Page on Apache Project." http://storm.incubator.apache.org/. [Last visit June 2014].

[16] "Hadoop Home Page on Apache Project." http://hadoop.apache.org/. [Last visit June 2014].

[17] "Apache ZooKeeper Home Page." http://zookeeper.apache.org/. [Last visit June 2014].

[18] "Netty Home Page." http://netty.io/. [Last visit June 2014].

[19] "ZeroMQ Home Page." http://zeromq.org/. [Last visit June 2014].

[20] "Computer Architecture Group." http://gac.des.udc.es/. [Last visit June 2014].
