SCALABLE BULK DATA TRANSFER IN WIDE AREA NETWORKS

Nader Mohamed, Jameela Al-Jaroodi, Hong Jiang, David Swanson
Department of Computer Science and Engineering, University of Nebraska – Lincoln, Lincoln, NE 68588-0115, USA

Abstract: Bulk data transfer in wide area networks (WAN) requires scalable, high network bandwidth. In this paper, we identify a number of scalability limitations that prevent full utilization of the theoretical peak network bandwidth. In addition, we study and classify the different approaches offered to overcome some of the identified limitations and to increase network bandwidth among Grid components in a WAN. With these limitations in mind, we study and evaluate the scalability and flexibility of a UDP-based multiple-network-interface socket (MuniSocket) model in a WAN. The MuniSocket model is a middleware layer between distributed applications and the multiple networks and system resources available. MuniSocket utilizes existing system resources, network interface cards, and network links to provide a scalable, reliable, and high-bandwidth network solution for data-intensive distributed applications, thus eliminating most of the identified limitations.

Key words: scalable network bandwidth, network middleware, grid services, heterogeneous computing, parallel transfer

The International Journal of High Performance Computing Applications, Volume 17, No. 3, Fall 2003, pp. 237–248 © 2003 Sage Publications

1 Introduction

The computational Grid (Foster and Kesselman, 1999) has emerged as a collection of services and tools to facilitate a special type of distributed system. These systems have characteristics distinct from other systems: (1) the resources available to the system are geographically distributed; (2) the resources are heterogeneous, with various platform architectures, operating systems and devices; (3) the different components in the system are independent, with different access, management and accounting policies.

One of the main problems in developing infrastructure services for Grid computing is the heterogeneity in machine architectures, operating systems, and network resources. This heterogeneity forces services to be implemented at the middleware level so that they can be easily ported to and utilized by different system types. Implementing tools and services that need special kernel functions results in poor portability and utilization of such tools and services. Current Grid tools and services have different degrees of flexibility in dealing with heterogeneous systems and networks.

Another important issue in Grid computing is the scalability of services. Grid computing aims to utilize available distributed software and hardware resources on a large scale. This requires Grid services to be scalable in order to efficiently utilize the available resources. Grid applications usually require access to high performance machines, data storage systems, and high performance, reliable communications infrastructures. They include, for example, remote sensing and visualization applications, computation-intensive applications, distributed simulations, collaborative applications supporting distributed research groups, and distributed data analysis. One common but critical requirement of these applications is a network solution with low latency and high bandwidth to support the high volumes of data exchanged.
A cost-effective approach for such a solution is to aggregate multiple available low-yield but low-cost resources to achieve reasonably high performance collectively. One prime example of this approach is clusters, which provide high computing power by efficiently combining multiple low-cost commodity computing resources. Another example is the concurrent utilization of multiple low-bandwidth network interfaces to seamlessly provide applications with high collective bandwidth, as in MuniSocket (Multiple Network Interfaces Socket) (Mohamed et al., 2002). The main contributions of this paper are (1) the identification of the scalability limitations on increasing bandwidth among Data Grid components, (2) the classification of the approaches for increasing bandwidth among Data Grid components in WAN, and (3) the development and evaluation of a generic model, MuniSocket, that avoids most of these limitations.

In Mohamed et al. (2002), we proposed a UDP-based protocol to utilize the existing multiple network interfaces in cluster nodes to transfer large messages among the nodes. This protocol provides a dynamic load balancing mechanism to distribute the load among the available network resources to achieve high bandwidth. To make the UDP-based protocol a reliable transfer service, the protocol includes a reliability mechanism over the unreliable multiple transfer services and introduces a fault tolerant data transfer mechanism. Moreover, we have studied and evaluated the performance of the protocol on a cluster where each node has two fast Ethernet interfaces. In this paper, we further extend the MuniSocket model to provide scalable high bandwidth for Grid applications over WAN. We also evaluate the performance of MuniSocket over a simulated WAN to show its operation and performance on Grids with heterogeneous systems and network resources. However, we will not cover the reliability or load balancing mechanisms, which were covered in Mohamed et al. (2002).

In the rest of this paper, we start, in Section 2, with a discussion of the issues related to WAN bandwidth, such as scalability limitations and a classification of the different existing approaches for increasing network bandwidth among nodes in a WAN. Then, in Section 3, we introduce the MuniSocket architecture, characteristics, and protocol. Configuration, flexibility, scalability and performance of MuniSocket in a WAN are studied in Section 4. In Section 5 we discuss the advantages of the proposed model in WAN. Finally, Section 6 concludes the paper.

2 Bandwidth Issues in Wide Area Networks

Grid services aim to provide uniform access to the various Grid components in an efficient and user-friendly manner. Many services and tools have been introduced as integrated sets of services that are sometimes referred to as "middleware".
Some examples of these tools are Globus (Globus, 1998), Legion (Legion, 1997), SRB (Baru et al., 1998), and workbench systems such as Habanero (Habanero, 1996) and WebFlow (Akarsu et al., 1998). One of the major requirements of Grid computation is the availability of high bandwidth and low latency networks that can support huge data transfers efficiently. In an effort to provide high bandwidth and low latency solutions, many protocols, algorithms and techniques have been devised. The main driving forces behind such demands are the Grid applications, which are classified into the following five types (Foster and Kesselman, 1999):


• distributed supercomputing, in which many grid resources are used to solve large problems;

• high-throughput, in which grid resources are used to solve large numbers of small tasks;

• on-demand, in which grids are used to meet peak needs for computational resources;

• collaborative, in which grids are used to connect people;

• data-intensive, in which the focus is on coupling distributed data resources.

Most of these applications, particularly the data-intensive ones, require large data transfers, mostly in the form of message passing or file transfer. However, current transport protocols such as TCP and UDP impose many limitations that restrict the utilization of available bandwidth and reduce the performance of these applications. The bandwidth on a WAN is affected by many factors that prevent applications from reaching their theoretical peak performance during transmission. In this section, we identify and discuss some of the scalability limitations on bandwidth and then introduce a number of existing approaches taken to lessen or eliminate some of these limitations. The rest of the paper discusses the MuniSocket solution and how it further reduces many of these limitations.

2.1 LIMITATIONS ON SCALABILITY OF BULK DATA TRANSFERS

Achieving and sustaining peak bandwidth is theoretically possible; however, many factors contribute to limiting the performance below that peak level. These limitations collectively reduce the effective bandwidth available to the applications. Any solution intended to increase the effective bandwidth for applications on a WAN must address one or more of the following limitations without imposing a prohibitively high overhead.

2.1.1 Protocol Limitations. One of the most commonly used protocols is TCP, which imposes a number of limitations on the achievable bandwidth. First, the limitation of the window size (Automatic TCP, 2002) restricts the number of packets that can be sent before waiting for acknowledgments. Current research has shown that the window size has to be set equal to the bandwidth × round-trip time (RTT) product to achieve near peak bandwidth (Sato et al., 2001; Tierney, 2001). However, this cannot be done easily in practice, since bandwidth and delay can change dynamically over time on a WAN.
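To make the window-sizing rule concrete, the following sketch computes the bandwidth-delay product and requests matching TCP socket buffers. This is a minimal illustration; the 100 Mb/s bandwidth and 120 ms RTT figures are examples, and the operating system may clamp or round the requested sizes.

```python
import socket

def tune_tcp_buffers(sock, bandwidth_bps, rtt_s):
    """Size the TCP send/receive buffers to the bandwidth-delay product."""
    bdp_bytes = int(bandwidth_bps / 8 * rtt_s)  # (bits/s) * s -> bytes
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    return bdp_bytes

# Example: a 100 Mb/s path with a 120 ms round-trip time needs a ~1.5 MB window.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(tune_tcp_buffers(s, 100_000_000, 0.120))  # prints 1500000
```

Note that the kernel may silently cap these values (e.g. Linux limits requests to `net.core.wmem_max`/`net.core.rmem_max`), which is one reason static tuning is hard in practice.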
In addition, the buffer size has a significant impact on how well the flow/congestion control mechanisms work, and thus on how efficiently the bandwidth is utilized. Another factor that limits TCP performance is the high overhead associated with it. For UDP, an unreliable connectionless protocol without the window size problem of TCP, the limitation is in the form of lost, out-of-order, and delayed packets. Any solution based on UDP must accommodate these shortcomings and build reliability protocols into the solution. However, this may impose more overhead on the solution, thus making it less efficient.

2.1.2 Network Limitations. A WAN usually spans large geographical areas, thus the delays incurred by the long distances (propagation delay) become very significant (Lee and Stepanek, 2001; Sato et al., 2001). Applications sensitive to delays are most seriously affected, even if the available bandwidth is high. In addition, the bandwidth available on the various links along the paths can vary, depending on the technology used and the load on these links. This variation can severely limit the bandwidth achievable by the applications using these links. One way to reduce this limitation is to use multiple different paths through the networks to achieve an aggregate bandwidth equal to the sum of the achievable bandwidths on the paths used. In some cases, routers can be tuned to distribute the packets sent to the same destination over different routes to achieve better utilization of the links.

2.1.3 Network Interface Card Limitations. Network interfaces pose another limitation by restricting the bandwidth to the maximum available on the card. In addition, the addressing schemes used (mainly in TCP/IP) require applications to define the IP address and port number to be used and, once chosen, to use only that interface all the time. However, many machines, such as cluster nodes, now come with more than one network card.
The multiple cards can only provide limited bandwidth individually, and the currently available services do not provide a flexible and seamless way to utilize them collectively to achieve higher bandwidth. The availability of multiple interfaces on the same machine also creates an addressing problem. For example, in channel bonding (Beowulf, 2002) all bonded cards are given the same MAC address; therefore, they need to be connected to different switches to avoid MAC confusion at the switch. In addition, many applications require identifying the IP address for a connection, so it becomes difficult to use two or more cards at the same time.

2.1.4 Heterogeneity Limitations. Current WANs usually comprise machines and network infrastructure of diverse types and configurations. Machines may be of many different platforms, run different operating systems and have different devices such as network cards and I/O devices. The one factor currently common to all machines on a WAN is the TCP/IP protocol, giving rise to the need for services that allow the different machines and devices to be utilized efficiently. As an example, consider a scenario where a machine with a single gigabit Ethernet card is connected through a fat network (with high bandwidth) to another machine with multiple fast Ethernet cards. Such a network connection will limit the achievable bandwidth to that of a single fast Ethernet card, even though the theoretical peak bandwidth on either machine is much higher.

2.1.5 Implementation Limitations. Grid components have become increasingly powerful, with high-performance or multiple processing units, making it possible to employ advanced mechanisms that achieve high bandwidth by fully and efficiently utilizing the available processing capability. On the other hand, most current solutions for scaling the available bandwidth impose significant processing overhead on the system due to their lack of efficiency or their sequential implementation, thus becoming impractical. Many of the solutions are sequential or are not designed to fully utilize the available parallel processing capabilities, thus increasing the apparent overhead. Such implementations are limited to the capabilities of a single processor, while more processing units remain unutilized.

2.1.6 Other System Component Limitations. Even with machines providing multiple interfaces or a single high-bandwidth interface and high processing power, other components may limit the achievable network bandwidth out of each machine. The system bus, memory bandwidth, storage I/O, I/O bus and DMA are some examples of such components. If any of these components becomes a bottleneck, as a result of overloading/contention or simply lower capacity, the maximum achievable bandwidth of the machine will be bounded by the bottleneck component.
The techniques used to increase bandwidth out of a single machine will always be limited by the capabilities of the machine components. However, solutions that involve multiple machines can overcome these limitations by distributing the load among the available machines.

2.2 APPROACHES TO INCREASE BANDWIDTH IN WAN

Many research groups have attempted to solve the problem of low effective bandwidth using techniques such as parallel streams, striped disks and servers, tuning the TCP buffers, pipelining, prefetching and caching. Each approach deals with different limitations and improves the performance within the limits of other factors. The approaches taken can be classified as follows.

2.2.1 Tuning TCP Protocol Parameters. TCP imposes limitations on achievable bandwidth due to its overhead and the mechanisms of the window and buffer controls. It has been suggested to tune the different parameters and functions in the TCP protocol to facilitate more efficient operation. This can be achieved in different ways, such as modifying the TCP buffer and window size (increasing the size to match the bandwidth-delay product) (Lee et al., 2001; Tierney, 2001; Sato et al., 2001; Dickens et al., 2002; Dickens and Gropp, 2002). Another method is to use different acknowledgment schemes such as selective ACK (SACK; Dickens et al., 2002; Dickens and Gropp, 2002; Mathis et al., 1996; EHPDTH, 2003; SACK, 2003), selective negative ACK (SNACK), or a dynamic combination of both (Kettimuthu et al., 2002). This approach requires changes to the TCP protocol, which in many cases may not be accessible to the developer. Most of these techniques, however, cannot overcome other limitations such as the network, NIC, and processing power limitations.

2.2.2 Using Multiple Parallel TCP Streams. In this approach, multiple logical connections are established to transmit data at a higher rate such that the available bandwidth is saturated quickly. This technique helps overcome the window size limitation of TCP; however, the achievable bandwidth is bounded by the maximum bandwidth supported by the physical network interface. The multiple streams require opening multiple sockets with concurrent threads to control them, which imposes high processing overhead on the CPU. Examples of this technique are found in PSockets (Sivakumar et al., 2000), within the domain of satellite-based information systems (Ostermann et al., 1996), and in GridFTP (Allcock et al., 2001, 2002; GridFTP, 2000) within the Globus project (Globus, 1998), in addition to Sivakumar et al. (2000), Chen et al. (2001) and Lee et al. (2001). Moreover, DPSS (Lee et al., 2001; Tierney et al., 1999) and GridFTP provide striping across multiple servers, which enables them to overcome the NIC and system component limitations. One advantage of this approach is that it can be implemented at the middleware level, thus making it highly portable. However, using this technique on a single machine limits the bandwidth to the maximum available bandwidth on the NIC used, subject to the system component limitations.

2.2.3 Using the Striping Technique at Different Levels of the Communication Protocol Stack. This technique is well known in the context of storage systems, for example, the redundant arrays of inexpensive disks (RAID) architecture (Katz et al., 1989). In the context of networks, however, striping describes the aggregation of multiple networks to achieve higher bandwidth, hence higher throughput. A number of projects, such as IBM's PTM packets (Theoharakis and Guerin, 1993) and Stripe, an IP packet striping protocol (Adiseshu et al., 1996), use the striping technique (Brendan et al., 1995). A well-known implementation of striping is Ethernet channel bonding. Other examples are DPSS and GridFTP. The aim of the striping algorithms is to distribute varying-sized transmission units over multiple available connections. In addition, the striping algorithms try to achieve load balancing over multiple networks and fairness in serving packets and frames coming from higher layers. This technique distributes the load among multiple network interfaces (on a single machine or multiple machines) such that an aggregate bandwidth close to the sum of all available bandwidths is achieved. In addition, many striping techniques try to increase overall throughput as opposed to increasing the peak achievable bandwidth for a single stream. On the other hand, striping among multiple servers requires an efficient striping and access mechanism for the distributed data. Moreover, the technique does not guarantee utilizing the peak bandwidth of each network interface used. In addition, it is usually used on homogeneous systems and networks, as in channel bonding on Linux clusters.

2.2.4 Using UDP-Based Techniques. Many approaches have moved towards using UDP to avoid the limitations of TCP. However, this approach requires developing mechanisms for flow control and reliability, which impose some overhead. On the other hand, others have tried to avoid these costly mechanisms by confining their solutions to dedicated high bandwidth networks or quality of service (QoS) enabled networks (He et al., 2002).
This restriction allows for efficient bulk data transfer provided that the packet loss rate is kept at a minimum. One important advantage of using UDP is its low overhead and unrestricted window, enabling continuous transmission to saturate the network. Given proper flow control and reliability mechanisms, UDP-based protocols provide efficient data transfers. These techniques can be implemented at the middleware level, thus providing high portability.

2.2.5 Providing Alternative Transport Protocols. Some research groups have approached the problem by designing alternative protocols that replace TCP or UDP at the transport layer. The main advantage here is that the new protocols are not restricted by any limitations imposed by the standard protocols. Thus, they can be designed to provide the most efficient mechanisms for data transfer. However, this creates a compatibility problem, since all participating parties must support the new protocol. Two examples of alternative protocols are NETBLT (Clark et al., 1987), which provides mechanisms for control of congestion, packet loss and long delays, and the reliable data protocol (RDP; Velten et al., 1984; Partridge and Hinden, 1990), which provides reliable, connection-oriented and efficient bulk data transfer.

2.2.6 Other Approaches. Other less common techniques include creating a pipeline of packet fragments for store-and-forward networks (Wang et al., 1998). This technique reduces latency by overlapping the transmission of packet fragments over the different links.

2.3 THE MUNISOCKET SOLUTION

In general, the techniques discussed above aim at scaling the achievable bandwidth to support efficient bulk data transfers on WANs such as the computational Grid. Most approaches provide solutions that are limited to a single network interface, but a few others also explore the possibility of utilizing multiple interfaces per machine. However, this latter approach has been neither fully explored nor efficiently utilized. For near peak bandwidth performance, the techniques used must overcome most (if not all) of the limitations discussed. Most of the research projects discussed provide solutions to one or more of the limitations, but very few have attempted to solve all of them. GridFTP is one of the very few approaches that attempt to overcome as many limitations as possible, albeit at the expense of using additional machines and striping data among them. Our proposed MuniSocket model provides an efficient, scalable utilization of multiple physical network interfaces on a single machine to transfer large data messages in parallel, thus considerably increasing the bandwidth available to the application. MuniSocket can provide a scalable solution to increase the achievable bandwidth towards peak performance, subject only to the system resource capacities.
MuniSocket is a middleware solution that is close to the application, thus capable of making informed decisions regarding load distribution and control. It can also utilize single and multiple network interfaces depending on the system and application environment. In addition, MuniSocket is multithreaded, thus capable of utilizing all available processing power. The middleware level implementation provides yet another advantage by making the solution machine independent and portable, thus allowing its utilization over different platforms and on heterogeneous systems. Finally, MuniSocket provides a cost-effective and expandable solution to providing high bandwidth and load balancing among the interfaces for data-intensive distributed applications.

3 The MuniSocket Model

MuniSocket is a middleware layer that provides abstract network APIs to hide the low-level technical details of multiple networks from users. MuniSocket provides interfaces between the user applications, the operating system, and the communication networks, thus enabling distributed applications to communicate with one another. The MuniSocket API is primarily concerned with (1) naming and locating communication endpoints, where users specify the network addresses and logical port numbers, and (2) assigning them to references for future use. The MuniSocket API provides basic primitives, such as send and receive, to be used by distributed applications to communicate.

At the time of opening a socket, multiple communication channels can be bonded to MuniSocket. For each channel, the IP address of the local network interface, the remote IP address, and a remote logical port are defined. These channels can be separate physical channels or shared physical channels. An example of separate physical channels is a dedicated network linking a separate network card on the first machine to a separate network card on the second machine. An example of shared channels is a machine with two low-bandwidth NICs connected through a WAN to a machine with a high-bandwidth NIC. In this case, we can define two channels that start from separate NICs on the first machine and end at the same interface card on the second machine.

MuniSocket provides a scalable and reliable transport service on top of multiple network interfaces using the UDP protocol. Unreliable transport protocols usually do not provide data integrity, FIFO order, or delivery assurance for the sent packets. However, UDP provides an optional checksum field for checking data integrity. This field is utilized by some UDP socket implementations to ensure the data integrity of received packets. These implementations deliver a received packet to the higher layer if and only if it has no errors.
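As an illustration, a hypothetical use of such an API might look like the following sketch. The class and method names here are assumptions for illustration only, not the actual MuniSocket interface, and the addresses are made up.

```python
# Hypothetical sketch of a MuniSocket-style API; all names are illustrative.
class MuniSocket:
    def __init__(self):
        self.channels = []

    def bind_channel(self, local_ip, remote_ip, remote_port):
        # Bond one communication channel: the local interface address,
        # the remote address, and a remote logical port.
        self.channels.append((local_ip, remote_ip, remote_port))

    def send(self, data: bytes):
        ...  # fragment the message and stripe fragments across self.channels

    def receive(self, nbytes: int) -> bytes:
        ...  # reassemble received fragments into the user-supplied buffer

# Two channels from two local fast Ethernet NICs to the same remote
# interface (the "shared physical channel" case described above):
ms = MuniSocket()
ms.bind_channel("192.168.1.10", "10.0.0.5", 7001)
ms.bind_channel("192.168.2.10", "10.0.0.5", 7002)
```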
However, these implementations cannot guarantee packet delivery or the order of delivery. In this section, we assume a UDP implementation that will not deliver any packet with errors. For implementations that do not provide this semantic, this function can be provided by adding a data checksum field to the packet, to be processed by both the sender and the receiver.

MuniSocket is multithreaded, with one sending thread per channel, one receiving thread per channel, and fragment status vectors (FSVs) (see Figure 1). MuniSocket divides large messages generated by the application into fragments. Each fragment is encapsulated into a UDP datagram. Multiple threads are used to process the message and prepare the fragments for transmission. Each sending thread is responsible for handling the preparation and transmission of some fragments on a specific channel, while each receiving thread is responsible for handling received fragments and placing them in the appropriate place in a receiving buffer provided by the user.

Some components are added to provide reliable services, namely, two periodic tasks and a thread. The periodic tasks are the acknowledgment generator on the receiver side and the fragment timer on the sender side. The added thread is an acknowledgment receiver at the sender. The fragment timer is responsible for determining which fragments are to be resent due to timeouts. The FSVs are responsible for fragment sequencing on both the sender and receiver sides. The FSV maintains information about fragment status and provides a fragment sequence number to a sending thread.

During message transfer, each fragment can be in one of three states on the sender side and one of three states on the receiver side (see Figure 2). These states are maintained in the FSV by the socket threads and the periodic tasks. On the sender side, a fragment can be ready, in-transit, or acknowledged. When a sending thread starts processing a fragment, the fragment status changes from ready to in-transit. As soon as an acknowledgment is received for the fragment, the acknowledgment receiver changes the status of the fragment to acknowledged. On the receiver side, the fragment also has three states: in-transit, received, and acknowledged. As soon as a receiving thread receives a fragment, its state is changed to received. In addition, as soon as an acknowledgment is sent for the received fragment, the fragment status changes to acknowledged.

Fig. 1 The Reliable MuniSocket Architecture: the sender components (left) and the receiver components (right) are connected through two networks.
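The sender-side fragment lifecycle, together with the timer and acknowledgment behavior detailed below, can be sketched as follows. The class layout and names are illustrative assumptions, not the actual implementation.

```python
# Sketch of the sender-side fragment status vector (FSV); illustrative only.
READY, IN_TRANSIT, ACKNOWLEDGED = "ready", "in-transit", "acknowledged"

class SenderFSV:
    def __init__(self, num_fragments):
        self.state = [READY] * num_fragments
        self.counter = [0] * num_fragments  # timeout counters C_i

    def on_send(self, i):
        # A sending thread starts processing fragment i: ready -> in-transit.
        self.state[i] = IN_TRANSIT
        self.counter[i] = 0

    def on_ack(self, last_in_order_k, out_of_order=()):
        # Cumulative part: every fragment numbered <= k is acknowledged ...
        for i in range(last_in_order_k + 1):
            self.state[i] = ACKNOWLEDGED
        # ... plus any selectively reported out-of-order fragments (> k).
        for i in out_of_order:
            self.state[i] = ACKNOWLEDGED

    def timer_tick(self, i, timeout_multiple):
        # Periodic fragment-timer task: after T_j / p ticks with no
        # acknowledgment, mark the fragment ready for retransmission.
        if self.state[i] == IN_TRANSIT:
            self.counter[i] += 1
            if self.counter[i] >= timeout_multiple:
                self.state[i] = READY

fsv = SenderFSV(4)
for i in range(4):
    fsv.on_send(i)
fsv.on_ack(1, out_of_order=[3])        # fragments 0, 1 and 3 acknowledged
fsv.timer_tick(2, timeout_multiple=1)  # fragment 2 times out -> ready again
```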


Fig. 2 State Diagrams of the Reliable Fragment Transfer Protocol: the sender state diagram on the top and the receiver state diagram on the bottom.

The fragment timer runs periodically and deals with the timeout of each sent fragment. It can use a different timeout value for each network, optimized for that network. For example, network 1 has timeout T1, network 2 has timeout T2, and the fragment timer task runs every p units of time, where T1 and T2 are multiples of p. Each fragment i has a timeout counter Ci in the FSV, initialized to 0. When a fragment is sent, the fragment timer starts to increment the fragment counter Ci by 1 on each run. When the fragment's status changes to acknowledged, the send operation is complete. However, if Ci becomes greater than or equal to Tj/p before the status changes, the fragment timer changes the fragment status from in-transit back to ready. A sending thread can then process the timed-out fragment again.

The acknowledgment generator periodically generates an acknowledgment for all the fragments received during the current period and sends it alternately on each of the available networks. Included in each acknowledgment is the fragment number k of the last fragment of all contiguously (in order) received fragments. Thus, if the acknowledgment contains an in-order fragment number k, the status of all fragments numbered less than or equal to k is changed to acknowledged. Also contained in the acknowledgment are the fragment numbers, greater than k, of some out-of-order fragments received. The acknowledgment receiver on the sender side changes the status of all fragments indicated to acknowledged.

Load balancing and fault tolerance are achieved by letting each sending thread maintain a sending window that is used to implement flow and congestion control mechanisms between the sender and receiver threads. Load balancing is achieved by having threads connected to unloaded (or less-loaded) networks process more fragments, while threads connected to loaded networks are blocked for longer periods of time by the flow and congestion control mechanisms. The window sizes are changed dynamically based on the channel load. In addition, fault tolerance is achieved by allowing any sender thread connected to a stable network to send fragments while threads connected to unstable networks are blocked by the flow and congestion control mechanism. Acknowledgments are sent in a round-robin fashion among the available networks. Since each acknowledgment contains the fragment number of the last in-order fragment received, the penalty of losing an acknowledgment is minimized. More details about the protocol and its performance can be found in Mohamed et al. (2002).

4 The MuniSocket Configuration and Performance in WAN

This section presents the configuration and performance evaluation of MuniSocket on WAN. Experiments were conducted to evaluate the different scalability and flexibility aspects of MuniSocket. The prototype implementation of MuniSocket was evaluated by connecting sender and receiver pairs of MuniSocket to a UDP WAN simulator. In Section 4.1, we discuss different scenarios where MuniSocket can be utilized to increase the end-to-end network bandwidth. In Section 4.2 we describe the simulation environment, and in Section 4.3 we show the performance and overhead of MuniSocket.

4.1 MUNISOCKET CONFIGURATION

MuniSocket can be utilized in different scenarios to increase the end-to-end network bandwidth among Grid components. Different configurations of MuniSocket, as shown in Figure 3, can avoid the above-mentioned bandwidth scalability limitations, such as the protocol, network, NIC, and implementation limitations. The machines, their operating systems, NICs, and network links can be homogeneous or heterogeneous.

4.2 SIMULATION ENVIRONMENT

A prototype implementation of MuniSocket was evaluated using a UDP WAN simulator, called udpQueue, which runs on a Linux cluster. This simulator was developed specifically for evaluating MuniSocket on WAN. The udpQueue simulator allows actual implementations of UDP applications to be tested in a WAN environment. Different WAN properties, such as high propagation delay, limited delivery bandwidth, random transmission delay, and packet loss ratio, can be simulated by udpQueue. udpQueue consists of a receiver to receive UDP datagrams, a datagram queue to delay the datagrams, and a sender to send the UDP datagrams. After a UDP datagram is received, it is placed in the queue for later processing by the sender. Each udpQueue runs on a dedicated node in the cluster. Sets of udpQueues can be used to simulate multiple networks. In addition, multiple udpQueues can be connected together to simulate complex WAN scenarios. The WAN that udpQueue can simulate is limited by the cluster resources in terms of CPU power and networks. For example, a cluster with a gigabit Ethernet interconnect can simulate a WAN that connects two machines with gigabit Ethernet networks. Any network with lower bandwidth and higher transfer latency than gigabit Ethernet can also be simulated over the gigabit Ethernet connections.

To evaluate the performance gains of the proposed MuniSocket model, a number of experiments were conducted. All experiments were performed on Sandhills, a 24-node cluster. Each node contains two AthlonMP 1.2 GHz processors with 256 KB cache and 1 GB RAM. The nodes were equipped with two Fast Ethernet cards and one Gigabit Ethernet card. The experiments were designed to measure the transfer time and the effective bandwidth of MuniSocket on different WAN topologies and workload conditions. Two nodes were dedicated to the sender and receiver processes. Both processes use MuniSocket as the network API to communicate with the other process. The sender and receiver nodes were connected by a set of udpQueue simulators running on other cluster nodes.

4.3 PERFORMANCE RESULTS

In all experiments, a 120 ms delay was defined to be the propagation delay between the sender and the receiver. This amount simulates the propagation delay over a long distance, between USA and Europe, for example. In addition, a 256MB message was used in all experiments. As shown in Table 1, the configurations used represent the six different cases described in Section 4.1. The WAN simulator provided the required bandwidth and delay parameters. The efficiency of a given configuraSCALABLE BULK DATA TRANSFER IN WAN

243

Fig. 3

Different scenarios for utilizing and configuring MuniSocket to scale the available network bandwidth.

tion was calculated relative to the peak effective bandwidth measured in all the networks used. For example, in case 3, when two fast Ethernet cards were used, the effective bandwidth measured for the two cards (182.81) was divided by the sum of peak effective bandwidth on both cards (186.54). The CPU utilization represents the processing involved during the send operation. A low CPU utilization implies that MuniSocket does not require much more processing power. In addition, the 244

COMPUTING APPLICATIONS

last column shows the performance with the same configuration, but with three percent packet loss rate. The simulator simulates an average of 3% packet loss ratio in random patterns throughout the transmission period. Each packet has an equal probability of 97% of being delivered and 3% of being lost. This random loss pattern was selected because it has the worst-case effect on the data transfer performance. However, patterns with contiguous areas of packet losses interspersed with long

Table 1 Different measurements of MuniSocket performance in different configurations.

Scenario | Peak Effective BW (Mbps) | Efficiency (relative to peak effective BW) | Sender CPU Utilization per proc. | Effective BW with 3% packet loss (Mbps)
Case 1: fast Ethernet interface cards and fat network (> 100 Mbps) | 93.27 | 100.00% | 10.33% | 87.95
Case 2: gigabit Ethernet cards with dedicated 155 Mbps network | 143.26 | 100.00% | 14.03% | 132.86
Case 3: two fast Ethernet cards per machine, each connected to a separate LAN; LANs connected to a fat WAN | 182.81 | 98.00% | 20.13% | 166.95
Case 4: one fast Ethernet card and one gigabit Ethernet card per machine; LANs connected by a dedicated 100 Mbps line for fast Ethernet and a dedicated 155 Mbps line for gigabit Ethernet | 224.37 | 94.86% | 21.25% | 202.44
Case 5: two fast Ethernet cards at the sender and one gigabit Ethernet card at the receiver; machines connected through a fat WAN via fast and gigabit Ethernet LANs | 184.19 | 98.74% | 20.45% | 166.95
Case 6: two fast Ethernet cards and one gigabit Ethernet card per machine; fast Ethernet cards connected to separate LANs and then through a fat WAN; gigabit Ethernet LANs connected by a dedicated 155 Mbps network | 300.26 | 91.04% | 33.75% | 267.67

intermediate perfect transmission periods, which are typical of data transfer in WANs, have less effect on the performance. This is due to the acknowledgment mechanism: sending and processing acknowledgments during contiguous perfect transmission incurs less overhead in MuniSocket, since most acknowledgments then contain only the number of the last fragment of all contiguously (in-order) received fragments.

In general, MuniSocket has been shown to support the different possible configurations of homogeneous and heterogeneous networks and to achieve consistently good bandwidth utilization. In addition, MuniSocket efficiently overcomes some of the bandwidth scalability limitations discussed earlier. However, the scalability of the model is bounded by other system resources. For example, in case 6, three sending threads are needed for the three cards, but only two processors are available, resulting in lower efficiency than in cases 3, 4 and 5.
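The acknowledgment mechanism described above, a cumulative acknowledgment of the last in-order fragment k plus the numbers of any out-of-order fragments beyond k, can be sketched as follows. The function names and the fragment numbering from 1 are illustrative assumptions, not taken from the MuniSocket source.

```python
def build_ack(received):
    """Receiver side: summarize the set of received fragment numbers as
    (k, out_of_order), where k is the last in-order fragment received and
    out_of_order lists the received fragment numbers greater than k."""
    k = 0
    while k + 1 in received:
        k += 1
    return k, sorted(n for n in received if n > k)

def apply_ack(status, k, out_of_order):
    """Sender side: mark every fragment up to k, and every listed
    out-of-order fragment, as acknowledged."""
    for n in status:
        if n <= k or n in out_of_order:
            status[n] = "acknowledged"
```

Because every acknowledgment repeats the cumulative value k, a later acknowledgment covers everything an earlier, lost one reported, which is why the penalty for losing an acknowledgment is small.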

5 Discussion

MuniSocket's design as a middleware service to increase bandwidth offers several advantages in terms of flexibility and scalability for bulk data transfer in WAN, as follows.

• More NICs can be added to the sender and the receiver machines to avoid the NIC bottleneck (limited bandwidth). These interface cards can be homogeneous or heterogeneous, and can be connected to a single high-bandwidth (fat) WAN or to multiple separate networks. MuniSocket can handle multiple interface cards simultaneously, thus providing high aggregate bandwidth to the applications, given enough system resources.

• More network lines can be utilized between the sender and receiver to increase the total network bandwidth between them. The network links used can be homogeneous or heterogeneous, and can be dedicated lines, shared WAN paths, logical channels with dedicated bandwidths, or combinations of these.

• MuniSocket provides load balancing among the connected heterogeneous interfaces and networks to adjust to varying available bandwidths and latencies.

• MuniSocket is implemented as a middleware layer and does not require changes to the kernel. It can be developed as a library of functions (or classes) and is therefore easily portable to heterogeneous systems.


• MuniSocket for WAN is based on UDP, which is readily available on all operating systems and most types of networks. Adopting UDP enables MuniSocket to be used on heterogeneous systems connected by heterogeneous networks. In addition, UDP/IP is an integral part of the Internet infrastructure, thereby allowing MuniSocket to be used on nodes connected over the Internet.

• MuniSocket provides reliability and fault tolerance mechanisms for reliable bulk data transfer on top of the unreliable UDP on the existing network interfaces and lines.

• MuniSocket is multithreaded so that it can utilize existing multiprocessors to perform its tasks with minimum delay. This helps reduce the system limitations.

• MuniSocket overcomes the limitation imposed by the TCP window size, which restricts bulk data transfer in WANs, by using UDP with more efficient reliability mechanisms.

• The MuniSocket overhead is reasonably small relative to large-message transfer time. Unlike multiple-streaming approaches that require many threads, MuniSocket needs only two threads, a sending and a receiving thread, for each defined channel. The number of channels usually corresponds to the number of physical network interfaces.

• MuniSocket can be used as part of other solutions, such as GridFTP, to avoid the limitations of NICs. In addition, since it does not require changes in the kernel, it can be used in parallel with other Grid tools.

6 Conclusion

The first important contribution of this paper is the identification of the different limitations on bandwidth scalability. These limitations are imposed by, or result from, protocol limitations, network limitations, network interface limitations, implementation limitations, system component limitations, and heterogeneity constraints. They cause the effective bandwidth to be much lower than the theoretical peak bandwidth of the network. Moreover, exploring the existing solutions for increasing the available bandwidth for Grid applications in this study has led to classifying them into the following five groups: (1) TCP enhancements; (2) parallel TCP; (3) striping (data or servers); (4) using UDP; (5) replacing the TCP and UDP protocols by specialized alternatives.

In this paper, we have also explored and studied the flexibility and scalability of MuniSocket for bulk data transfer among distributed applications on WANs. MuniSocket was shown to overcome most of the identified limitations on scalability, such as protocol, network, network interface, processing power and heterogeneity constraints. MuniSocket has demonstrated good performance in different system configurations by achieving high aggregate bandwidths that are close to the peak performances of the available networks and interfaces. In addition, it allows applications to simultaneously utilize the different interfaces to achieve desirable results. Although MuniSocket was implemented on top of UDP, any other suitable transport protocol can be used. An analytical model to predict MuniSocket performance on WANs will be investigated and developed in our future work. In addition, we have started to investigate other reliability protocols to further enhance MuniSocket's performance.

ACKNOWLEDGMENTS

This project was partially supported by a National Science Foundation grant (EPS-0091900) and a Nebraska University Foundation grant, for which we are grateful. We would also like to thank the other members of the secure distributed information (SDI) group and the Research Computing Facility at the University of Nebraska-Lincoln for their continuous help and support. In addition, we would like to thank the reviewers for their valuable suggestions and constructive recommendations, which helped improve the paper's technical quality and presentation.

AUTHOR BIOGRAPHIES

Nader Mohamed obtained his BSc in Electrical Engineering from the University of Bahrain, Bahrain, in 1992 and his MSc in Computer Science from Western Michigan University, Kalamazoo, Michigan, USA, in 1998. He is currently a PhD candidate in the Department of Computer Science and Engineering, University of Nebraska-Lincoln, Nebraska, USA. His research interests include network middleware, computer networks, cluster and grid computing, and object-oriented distributed systems.
Jameela Al-Jaroodi obtained her BSc in Computer Science from the University of Bahrain, Bahrain, and her MSc in Computer Science from Western Michigan University, Kalamazoo, Michigan, USA, in 1998. She is currently a PhD candidate in the Department of Computer Science and Engineering, University of Nebraska-Lincoln, Nebraska, USA. Her research interests include system middleware, parallel and distributed systems, object-oriented parallel and distributed programming models, computer security, computer networks, and cluster and grid computing.

Hong Jiang received a BSc degree in Computer Engineering in 1982 from Huazhong University of Science and Technology, Wuhan, China; an MASc degree in Computer Engineering in 1987 from the University of Toronto, Toronto, Canada; and a PhD degree in Computer Science in 1991 from Texas A&M University, College Station, Texas, USA. Since August 1991 he has been at the University of Nebraska-Lincoln, Lincoln, Nebraska, USA, where he is Associate Professor and Vice Chair in the Department of Computer Science and Engineering. His present research interests are computer architecture, parallel/distributed computing, computer storage systems and parallel I/O, performance evaluation, middleware, networking, and computational engineering. He has over 50 publications in major journals and international conferences in these areas, and his research has been supported by DOD and NSF. He is a member of the ACM, the IEEE Computer Society, ACM SIGARCH, and ACM SIGCOMM.

David Swanson received a PhD in physical (computational) chemistry from the University of Nebraska-Lincoln (UNL) in 1995, after which he worked as an NSF-NATO postdoctoral fellow at the Technical University of Wroclaw, Poland, in 1996, and subsequently as a National Research Council Research Associate at the Naval Research Laboratory in Washington, DC, from 1997 to 1998. In early 1999 he returned to UNL, where he has coordinated the Research Computing Facility and currently serves as an Assistant Research Professor in the Department of Computer Science and Engineering. The Office of Naval Research, the National Science Foundation, and the State of Nebraska have supported his research in areas such as large-scale parallel simulation and distributed systems.

REFERENCES

Adiseshu, H., Parulkar, G., and Varghese, G. 1996. A reliable and scalable striping protocol. Computer Communication Review, 26(5): 131–141.

Akarsu, E., Fox, G. C., Furmanski, W., and Haupt, T. 1998.
WebFlow – high-level programming environment and visual authoring toolkit for high performance distributed computing. In ACM/IEEE Conference on Supercomputing, Orlando, IEEE Computer Society. CD-ROM.

Allcock, W., Bester, J., Bresnahan, J., Chervenak, A., Liming, L., and Tuecke, S. 2001. Draft GridFTP protocol. Internet Draft. http://www-fp.mcs.anl.gov/dsl/GridFTP-ProtocolRFC-Draft.pdf.

Allcock, W., Bester, J., Bresnahan, J., Chervenak, A., Liming, L., Meder, S., and Tuecke, S. 2002. GridFTP protocol specification. GGF GridFTP Working Group Document.

Automatic TCP Window Tuning and Applications. 2002. http://dast.nlanr.net/Projects/Autobuf_v1.0/autotcp.html.

Baru, C., Moore, R., Rajasekar, A., and Wan, M. 1998. The SDSC storage resource broker. In H. Johnson, ed., Proceedings of CASCON'98, Toronto.

Beowulf Ethernet channel bonding web page. 2002. http://www.beowulf.org/software/bonding.html.

Traw, C. B. S., and Smith, J. 1995. Striping within the network subsystem. IEEE Network, 9(4): 22–29.

Channel bonding benchmark results. 2002. http://www.fos.su.se/compchem/jazz/bond.html.

Chen, J., Akers, W., Chen, Y., and Watson, W. 2001. Java parallel secure stream for Grid computing. In Computing in High Energy and Nuclear Physics (CHEP'01), Beijing, China.

Clark, D., Lambert, M., and Zhang, L. 1987. NETBLT: a bulk data transfer protocol. RFC 998. http://www.armware.dk/RFC/rfc/rfc998.html.

Dickens, P., and Gropp, W. 2002. An evaluation of object-based data transfers on high performance networks. In High Performance Distributed Computing (HPDC'02), Edinburgh, IEEE Computer Society, pp. 255–264.

Dickens, P., Gropp, W., and Woodward, P. 2002. High performance wide area data transfers over high performance networks. In International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems with IPDPS'02, Fort Lauderdale, IEEE Computer Society. CD-ROM.

EHPDTH: Enabling High Performance Data Transfers on Hosts (Notes for Users and System Administrators). 2003. http://www.psc.edu/networking/perf_tune.html#intro.

Foster, I., and Kesselman, C. 1999. The Grid: blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco, CA.

Globus Team. 1998. The Globus Metacomputing Project. http://www.globus.org.

GridFTP: Universal Data Transfer for the Grid, The Globus Project white paper. 2000. The University of Chicago and The University of Southern California. http://www.globus.org/datagrid/deliverables/C2WPdraft3.pdf; and GridFTP Update. 2002. http://www.globus.org/datagrid/deliverables/GridFTP-Overview-200201.pdf.

Habanero Web page. 1996.
http://www.isrl.uiuc.edu/isaac/Habanero/.

He, E., Leigh, J., Yu, O., and DeFanti, T. A. 2002. Reliable Blast UDP: predictable high performance bulk data transfer. In IEEE Cluster, Chicago, IEEE Computer Society, pp. 317–324.

Katz, R., Gibson, G., and Patterson, D. 1989. Disk system architectures for high performance computing. Proceedings of the IEEE, 77(12): 1842–1858.

Kettimuthu, R., Hegde, S., Allcock, W. E., and Bresnahan, J. 2002. Appropriateness of transport mechanisms in data grid middleware. In Proceedings of the 15th Annual Supercomputing Conference, Baltimore, MD. Poster presentation.

Lee, C., and Stepanek, J. 2001. On future global grid communication performance. In 10th Heterogeneous Computing Workshop with IPDPS'01, San Francisco, IEEE Computer Society. CD-ROM.


Lee, J., Gunter, D., Tierney, B., Allcock, W., Bester, J., Bresnahan, J., and Tuecke, S. 2001. Applied techniques for high bandwidth data transfers across wide area networks. In Computing in High Energy and Nuclear Physics (CHEP'01), Beijing, China.

Legion Web page. 1997. http://www.cs.virginia.edu/~legion/.

Marzullo, K., Ogg, M., Ricciardi, A., Amoroso, A., Calkins, F., and Rothfus, E. 1996. NILE: wide-area computing for high energy physics. In SIGOPS Conference, New York. ACM.

Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A. 1996. TCP selective acknowledgement options. RFC 2018. http://www.faqs.org/rfcs/rfc2018.html.

Mohamed, N., Al-Jaroodi, J., Jiang, H., and Swanson, D. 2002. A user-level socket layer over multiple physical network interfaces. In 14th International Conference on Parallel and Distributed Computing and Systems, Cambridge, MA. IASTED, pp. 810–815.

Ostermann, S., Allman, M., and Kruse, H. 1996. An application-level solution to TCP's satellite inefficiencies. In Workshop on Satellite-Based Information Services (WOSBIS), New York.

Partridge, C., and Hinden, R. 1990. Version 2 of the reliable data protocol (RDP). RFC 1151.

SACK: list of SACK implementations. 2003. http://www.psc.edu/networking/all_sack.html.


Sato, H., Morita, Y., Karita, Y., and Watase, Y. 2001. Data transfer over the long fat networks. In Computing in High Energy and Nuclear Physics (CHEP'01), Beijing, China.

Sivakumar, H., Bailey, S., and Grossman, R. 2000. PSockets: the case for application-level network striping for data intensive applications using high speed wide area networks. In High-Performance Network and Computing Conference (SC2000), Dallas, IEEE/ACM. CD-ROM.

Theoharakis, V., and Guerin, R. 1993. SONET OC-12 interface for variable length packets. In Second International Conference on Computer Communications and Networks, San Diego, CA.

Tierney, B. L. 2001. TCP tuning guide for distributed application on wide area networks. Usenix ;login:. http://www-didc.lbl.gov/tcp-wan-perf.pdf.

Tierney, B. L., Crowley, B., Holding, M., Hylton, J., and Drake, F. 1999. A network-aware distributed storage cache for data intensive environments. In IEEE High Performance Distributed Computing Conference (HPDC-8), Redondo Beach, CA. http://www-didc.lbl.gov/DPSS/.

Velten, D., Hinden, R., and Sax, J. 1984. Reliable data protocol. RFC 908. http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc0908.html.

Wang, R., Krishnamurthy, A., Martin, R., Anderson, T., and Culler, D. 1998. Modeling communication pipeline latency. In SIGMETRICS'98, Madison, WI. ACM.
