GNET-1: Gigabit Ethernet Network Testbed

Y. Kodama, T. Kudoh, R. Takano, H. Sato, O. Tatebe and S. Sekiguchi
Grid Technology Research Center (GTRC)
National Institute of Advanced Industrial Science and Technology (AIST)
Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan
E-mail: [email protected]

Abstract

GNET-1 is a fully programmable network testbed. It provides functions such as wide area network emulation, network instrumentation, traffic shaping, and traffic generation at gigabit Ethernet wire speeds by programming the core FPGA. GNET-1 is a powerful tool for developing network-aware grid software. It is also a network monitoring and traffic-shaping tool that provides high-performance communication over wide area networks. This paper describes several sample uses of GNET-1 and presents its architecture.

1. Introduction

Gigabit networks have come into use as wide area networks (WANs) in recent years. However, when a WAN has a large delay, it is difficult for a single application to use the gigabit bandwidth effectively. Several protocols, such as HighSpeed TCP [1], Scalable TCP [2], and FAST TCP [3], and middleware such as GridFTP [4] have been developed to solve this problem. An actual network is not practical as a platform for developing such protocols and middleware: conditions on a real network vary due to competing streams and link failures. Although the software must eventually run on the actual network, during development the network should be stable, and its characteristics, such as bandwidth and delay, should be freely changeable. Another reason a network testbed is needed is that software under development must not have harmful effects on a production network. We developed an experimental network testbed, GNET-1, to meet these goals. GNET-1 is aimed mainly at network emulation, network instrumentation, and network control. It emulates a network with specified latency and bandwidth as a platform for developing software.


It measures network statistics, such as bandwidth and frame counts, at a very fine resolution of less than 1 millisecond. It also controls output bandwidth and provides network flow control to improve the performance of actual network-aware applications. For network emulation, several software tools exist, such as NIST Net [5]. However, software tools suffer from nondeterministic scheduling and large delays, especially when traffic is high or multiple functions run simultaneously. GNET-1 is a hardware tool that supports wire-rate transfer in a hardware pipeline. For network measurement, many tools have been produced, but each has a specific function, and multiple tools are needed for multiple measurements. GNET-1 is programmable and can easily change its functions according to what the user wants to measure. Following this introduction, the remainder of the paper is organized as follows. Section 2 states the aims we had when developing GNET-1. Section 3 describes sample uses of GNET-1. Section 4 presents the architecture of GNET-1 and its functions. Section 5 summarizes the paper.

2. Aims of GNET-1

We developed an experimental network tool, GNET-1, to realize an ideal network platform for developing network-aware software. GNET-1 can change network parameters such as bandwidth, delay, and bit error rate without any harmful effect on a real network. Figure 1 shows a photograph of GNET-1 and its control PC. GNET-1 has four 1000Base-SX optical Gigabit Ethernet (GbE) ports. Currently the ports are paired, so GNET-1 can emulate two gigabit network links, changing link speed, link delay, and link quality freely. For example, suppose that two PC clusters exist and their network switches are connected via GNET-1.


Figure 1. Photograph of GNET-1 (width: 19 inches, height: 1U; four 1000BASE-SX ports; a USB connection to the control SNMP server on a PC)

Figure 2. Measuring a delay between ports (PC0 sends a ping request to PC1 via CH0 and CH1, through the router or switch under test, and on via CH2 and CH3; the reply returns the same way)

If the link delay is set to zero, the two clusters communicate as if they were in the same computer room. If the link delay is set to 4 milliseconds, they communicate as if connected by a metropolitan area network. If the link delay is set to 100 milliseconds, they communicate as if connected by a wide area network, for example between Japan and the U.S. over a transpacific link. As these examples show, GNET-1 can emulate any network, from a local area network to a wide area network; it makes Internet-scale experiments available in a computer room.

GNET-1 is very useful for developing network-aware software. When software is evaluated on an actual network and its performance is poor or unstable, the blame is often shifted to network instability. Sometimes this is true, but sometimes the software has bugs. On an actual network it is difficult to reproduce the same conditions, which is why such shifting of blame occurs. To evaluate performance quantitatively, and to debug errors, it is essential that conditions be reproducible. GNET-1 can reproduce the same network conditions as often as needed.

GNET-1 also measures network statistics such as bandwidth and frame counts at a very fine resolution. Network bandwidth is usually measured in switches or routers as an average over several minutes, but congestion cannot be detected from average bandwidth alone: the bandwidth of a stream is not steady, and conflicts between the peaks of streams cause congestion. As the number of streams increases, such conflicts may increase. Since GNET-1 can measure network bandwidth at a resolution finer than 1 millisecond, it can detect when these conflicts occur.
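To make the measurement argument concrete, the following is a minimal host-side sketch, not GNET-1's interface (its measurement runs inside the FPGA): binning a packet trace into 100-microsecond intervals exposes bursts that a long-term average hides. The trace contents and bin width here are illustrative assumptions.

```c
/* A minimal sketch (not GNET-1's API): bin a packet trace into fine
 * intervals and compare the peak interval rate with the long-term
 * average.  A multi-minute average can sit far below link capacity
 * while individual sub-millisecond bins are saturated. */
#include <stdio.h>

#define NPKT    4
#define BIN_SEC 0.0001            /* 100 us bins, a rate GNET-1 supports */
#define NBINS   10000             /* covers 1 second of trace            */

struct pkt { double t_sec; int bytes; };

int main(void) {
    /* hypothetical trace: a short burst at t = 0, then sparse traffic */
    struct pkt trace[NPKT] = {
        {0.00001, 1500}, {0.00002, 1500}, {0.50000, 1500}, {0.99999, 1500}
    };
    static double bins[NBINS];    /* zero-initialized byte counts per bin */
    double total = 0, peak = 0, span = 1.0;   /* 1 second observed */

    for (int i = 0; i < NPKT; i++) {
        int b = (int)(trace[i].t_sec / BIN_SEC);
        if (b >= 0 && b < NBINS) bins[b] += trace[i].bytes;
        total += trace[i].bytes;
    }
    for (int b = 0; b < NBINS; b++)
        if (bins[b] > peak) peak = bins[b];

    printf("average over %.0f s: %.3f Mbps\n", span, total * 8 / span / 1e6);
    printf("peak %.0f-us bin:   %.3f Mbps\n", BIN_SEC * 1e6,
           peak * 8 / BIN_SEC / 1e6);
    return 0;
}
```

The burst at t = 0 shows up as a 240 Mbps bin while the one-second average is under 0.05 Mbps; at GNET-1's sub-millisecond resolution, the conflicting peaks of real streams become visible in the same way.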

Figure 3. Errors and variances of delay in NIST Net and GNET-1

We are now using GNET-1 to develop grid middleware: GridMPI [8] and Gfarm [9]. GridMPI provides users an environment for running MPI applications efficiently on the Grid. GridMPI has a layer called the latency-aware communication topology, which optimizes communication under non-uniform latency and hides the various lower-level communication libraries. GNET-1 is very useful for developing such software across multiple network environments with different delays. Gfarm is designed for global petascale data-intensive computing. It provides a global parallel file system with online petascale storage, scalable I/O bandwidth, and scalable parallel processing, and it can exploit local I/O in a grid of clusters with tens of thousands of nodes. GNET-1 is very useful for improving performance on the high bandwidth-delay product networks described in Section 3.4.

3. Sample uses of GNET-1

3.1. Comparison with NIST Net

GNET-1 can measure one-way network delay between ports accurately, in 32-nanosecond increments.


Figure 5. Network emulation

Figure 4. Bandwidth in NIST Net and GNET-1

Figure 2 shows the experimental setup for measuring a delay. The network equipment to be measured, such as a router or a switch, is connected to CH1 and CH2 of GNET-1. When PC0 runs a ping command to PC1, packets are transferred from PC0 to PC1 via CH0, CH1, CH2, and CH3. GNET-1 counts 32-nanosecond clock ticks from the time a packet is transmitted on CH1 until it is transmitted on CH3; it likewise counts the ticks while reply packets are transferred from CH2 to CH0. If CH1 and CH2 are connected directly, the delay of GNET-1 itself is measured.

GNET-1 emulates network delay with a hardware pipeline, while NIST Net emulates it in software. To insert a delay, GNET-1 attaches a timestamp to each received frame, stores the frame in off-chip memory, and transmits it once the specified delay has elapsed since reception. When no delay is set, GNET-1's port-to-port delay is about 1.1 microseconds. The delay is specifiable from 0 to 1.6 seconds in 100-nanosecond increments. If the delay is less than 134 milliseconds, all frames are transferred at wire rate with no drops.

We compared the network delay emulated by NIST Net and by GNET-1; Figure 3 shows the results. We used NIST Net version 2.0.11 on a 2-way SMP PC with Xeon 2.4 GHz processors, three Intel PRO/1000 NICs, and RedHat 9.0. The x-axis is the delay specified for the emulator; the y-axis is the difference between the specified delay and the measured delay. We measured the delay a hundred times for each parameter, and show the average and a min-max bar. The figure shows that the errors of GNET-1 are very small, with almost no variance. In contrast, the error of NIST Net is about 100 microseconds, and its variance is also about 100 microseconds. [5] states that the variance in NIST Net's latency is due mainly to its use of the 8192 Hz realtime clock interrupt. Since GNET-1 emulates delay with a hardware pipeline, the inserted delay has only a small overhead and no variance.

We also compared bandwidth through GNET-1 with bandwidth through NIST Net; Figure 4 shows the results, measured with iperf over 60 seconds. The x-axis is the socket buffer size given to iperf; the y-axis is the bandwidth reported by iperf. The figure shows that the bandwidth through GNET-1 exceeds 930 Mbps with a sufficient socket buffer, even when the delay is set to 1 millisecond or 10 milliseconds. In contrast, the bandwidth through NIST Net with delay saturates at about 850 Mbps, although it exceeds 930 Mbps with no delay. This shows that GNET-1 supports wire-rate transfer.
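As a mental model of the delay insertion described above, here is a software sketch only; GNET-1 implements this as a Verilog pipeline with frames held in off-chip SRAM, and the queue size, tick handling, and function names below are our assumptions.

```c
/* Software model of hardware delay insertion: stamp each frame on
 * arrival, hold it in a FIFO, and release it only when the configured
 * delay has elapsed.  Delay is counted in 100 ns ticks, matching
 * GNET-1's configuration granularity. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define QUEUE_LEN 4096   /* illustrative; real capacity is SRAM-bound */

struct frame { uint64_t rx_tick; const uint8_t *data; size_t len; };

static struct frame queue[QUEUE_LEN];
static size_t head, tail;            /* FIFO indices            */
static uint64_t delay_ticks;         /* set from the control PC */

/* ingress path: timestamp the frame and enqueue it */
static void on_receive(uint64_t now, const uint8_t *data, size_t len) {
    queue[tail++ % QUEUE_LEN] = (struct frame){ now, data, len };
}

/* egress path: transmit frames whose delay has elapsed */
static void on_tick(uint64_t now, void (*tx)(const uint8_t *, size_t)) {
    while (head != tail &&
           now - queue[head % QUEUE_LEN].rx_tick >= delay_ticks) {
        struct frame *f = &queue[head++ % QUEUE_LEN];
        tx(f->data, f->len);
    }
}

static void tx_print(const uint8_t *d, size_t n) {
    (void)d;
    printf("tx %zu bytes\n", n);
}

int main(void) {
    static const uint8_t payload[64];
    delay_ticks = 40000;             /* 4 ms at 100 ns per tick  */
    on_receive(0, payload, sizeof payload);
    on_tick(39999, tx_print);        /* too early: nothing sent  */
    on_tick(40000, tx_print);        /* delay elapsed: frame out */
    return 0;
}
```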

3.2. Evaluation of TCP/IP protocols with bit errors

We evaluated several protocols on a network with bit errors using GNET-1. Two PCs are connected through GNET-1 as in Figure 5. GNET-1 inserts a delay, generates bit errors, and measures the bandwidth precisely. GNET-1 generates random bit errors; the error rate can be set from 0 to 3% in steps of 10^-11. It generates 64-bit random numbers using a linear feedback shift register whose initial value can be set. To calculate the actual bit error rate, it counts the transferred bytes and the generated bit errors. The precise bandwidth measurement function continuously records the number of transmitted bytes over an arbitrary measurement interval from 100 microseconds to 8.5 seconds; it can also record the number of received bytes, or the number of received or transmitted frames.

We compared three protocols: standard TCP/IP, HighSpeed TCP [1], and ssthr-fix. HighSpeed TCP was proposed by Floyd in RFC 3649 for high-throughput communication; we used the Web100 project patches [6] as its implementation. TCP/IP uses a congestion window (cwnd) to avoid congestion. If congestion is detected, TCP/IP shrinks the cwnd, and then increases it as acknowledgments are received. Initially the cwnd grows exponentially; this is called slow start. When the cwnd exceeds the slow-start threshold (ssthr), control changes and the cwnd is increased linearly; this is called congestion avoidance, and in standard TCP/IP the cwnd then grows very slowly. HighSpeed TCP increases the cwnd during congestion avoidance faster than standard TCP/IP does. The ssthr-fix is our own TCP/IP patch for this experiment: it fixes the slow-start threshold at the bandwidth-delay product, to hide the network delay.

We ran iperf, a bandwidth measurement benchmark, for 100 seconds between the two PCs connected via GNET-1. GNET-1 generated bit errors at a specified rate and precisely measured the bandwidth at 100-millisecond intervals. We measured the bandwidth three times for each parameter; Table 1 shows the averages. Figure 6 shows the bandwidth over time with a bit error rate of 10^-10. At that error rate, the bandwidth of standard TCP/IP drops to about one quarter: Figure 6 shows that the bandwidth halves every time a bit error occurs, with little recovery before the next error. In contrast, the bandwidth of HighSpeed TCP and ssthr-fix is better than that of standard TCP/IP, recovering quickly after each bit error. However, at a bit error rate of 10^-8, the bandwidth is below 155 Mbps even when slow-start mode is used throughout.
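The bit-error generator is specified only as a 64-bit linear feedback shift register with a settable seed; the sketch below fills in the rest with our own assumptions, namely a known maximal-length tap set and a threshold comparison that turns the LFSR output into a per-bit error probability.

```c
/* Illustrative bit-error injection with a 64-bit Galois LFSR.  The
 * taps (x^64 + x^63 + x^61 + x^60 + 1, a maximal-length polynomial)
 * are an assumption; the paper does not state GNET-1's polynomial. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static uint64_t lfsr = 0xACE1ACE1ACE1ACE1ULL;   /* settable seed */

static uint64_t lfsr_next(void) {
    uint64_t lsb = lfsr & 1u;
    lfsr >>= 1;
    if (lsb) lfsr ^= 0xD800000000000000ULL;     /* taps 64,63,61,60 */
    return lfsr;
}

/* Flip each bit with probability `rate` (GNET-1: 0 to 3e-2 in 1e-11
 * steps) by comparing the LFSR output against a 64-bit threshold. */
static void corrupt(uint8_t *buf, size_t nbits, double rate) {
    uint64_t threshold = (uint64_t)(rate * 18446744073709551615.0);
    for (size_t i = 0; i < nbits; i++)
        if (lfsr_next() < threshold)
            buf[i / 8] ^= (uint8_t)(1u << (i % 8));  /* flip bit i */
}

int main(void) {
    uint8_t frame[1500] = {0};
    corrupt(frame, sizeof frame * 8, 1e-3);     /* exaggerated for demo */
    int flips = 0;
    for (size_t i = 0; i < sizeof frame; i++)   /* count flipped bits   */
        for (uint8_t b = frame[i]; b; b >>= 1)
            flips += b & 1;
    printf("%d of %zu bits flipped\n", flips, sizeof frame * 8);
    return 0;
}
```

Counting transferred bits and flipped bits in the same pass is how the actual error rate can be computed, as GNET-1 does in hardware.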


Table 1. Bandwidth with bit errors

bit error rate    standard TCP/IP    HighSpeed TCP    ssthr-fix
0                    962 Mbps           962 Mbps       962 Mbps
10^-10               262 Mbps           630 Mbps       771 Mbps
10^-9                 71 Mbps           130 Mbps       383 Mbps
10^-8                 17 Mbps            22 Mbps       155 Mbps

Figure 6. Bandwidth over time with RTT 100 ms and bit error rate 10^-10, for standard TCP/IP, HighSpeed TCP, and ssthr-fix

Figure 7. Measurement of one-way delay (the numbers mark where steps 1-5 of Section 3.3 are processed; GPS receivers discipline the clocks at both ends of the Internet path)

3.3. Evaluation of one-way delay



We measured detailed one-way delay on an actual wide area network using GNET-1. Figure 8 shows the network we measured. It was prepared between Japan and the U.S. for the Bandwidth Challenge of SC2003, an international conference held in Phoenix in November 2003. Two GNET-1 systems were connected, one at each end of the network, as shown in Figure 7. The sender GNET-1 was connected to the router at the SC2003 site; the receiver GNET-1 was connected to the router at AIST and to the PC that stored the received packets. The measurement proceeded as follows, with the numbers in Figure 7 marking where each step is processed.

1. The clocks of both GNET-1 systems were adjusted by GPS. GNET-1 can receive GPS data such as the date, time, latitude, longitude, altitude, and the state of the GPS satellites. It disciplines its internal clock by GPS to microsecond precision and produces timestamps in the 64-bit format of RFC 1305. (Unfortunately, since the conference venue could not receive GPS signals, the sender clock was adjusted only at the beginning of the experiment, using a portable GPS unit, so the measured delay at the end of the experiment was slightly smaller than at the beginning. The effect is small enough not to disturb the jitter evaluation over the short periods involved.)

2. The sender GNET-1 generates frames consisting of a specified header and specified data. The data includes the transmit time as a 64-bit RFC 1305 timestamp adjusted by GPS, a frame sequence number, random padding up to the specified frame length, and the parity of the payload. In this experiment the MTU of the frames was 1500. The transfer bandwidth is varied by changing the interval of frame generation.

3. The receiver GNET-1 checks the received frames. If a frame matches the specified source and destination MAC addresses, it appends the receive time, in the same format, after the transmit time and truncates the remaining data. Because only frame headers are forwarded, the forwarded traffic is only about 40 Mbps even when frames with an MTU of 1500 arrive at wire rate. The modified frames are then forwarded to the receiver PC.

4. The receiver PC stores all frames from the GNET-1 to disk.

5. After all the data is stored, it is processed statistically to calculate the one-way delay and jitter.
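The statistical processing in step 5 is straightforward given the stored records. The sketch below is our reconstruction, not the authors' code: it converts the 64-bit RFC 1305 timestamps (32 bits of seconds, 32 bits of binary fraction) to seconds and computes the mean one-way delay and standard deviation reported in Table 2.

```c
/* Our reconstruction of the step-5 post-processing: per-frame one-way
 * delay is receive time minus transmit time, both 64-bit RFC 1305
 * (NTP-format) timestamps; jitter is the standard deviation. */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

static double ntp_to_sec(uint64_t ts) {
    /* upper 32 bits: whole seconds; lower 32 bits: fraction of 2^32 */
    return (double)(ts >> 32) + (double)(ts & 0xFFFFFFFFu) / 4294967296.0;
}

int main(void) {
    /* hypothetical {transmit, receive} records extracted from frames */
    uint64_t rec[][2] = {
        {0x0000000100000000ULL, 0x0000000112000000ULL},  /* ~70.3 ms */
        {0x0000000200000000ULL, 0x0000000212100000ULL},  /* ~70.6 ms */
    };
    int n = sizeof rec / sizeof rec[0];
    double sum = 0, sumsq = 0;

    for (int i = 0; i < n; i++) {
        double d = ntp_to_sec(rec[i][1]) - ntp_to_sec(rec[i][0]);
        sum += d;
        sumsq += d * d;
    }
    double mean = sum / n;
    double stddev = sqrt(sumsq / n - mean * mean);
    printf("delay %.3f ms, std. dev. %.3f ms\n", mean * 1e3, stddev * 1e3);
    return 0;
}
```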


Figure 8. Transpacific network testbed

Table 2. One-way delay and jitter

network              duration (sec)   bandwidth (Mbps)   delay (msec)   std. dev. (msec)
SC2003 → LA → AIST        60                100              71.304           0.010
                                            300              71.285           0.010
                                            500              71.268           0.013
                                            700              71.451           0.249
                                            900              85.021           8.106
SC2003 → NY → AIST        10                100             151.182           0.005
                                            300             151.176           0.005
                                            500             151.171           0.006
                                            700             151.169           0.006
                                            900             151.172           0.004


Table 2 shows the results of the measurement. We measured one-way delay on two networks: the APAN/TransPAC line via Los Angeles (shown as LA in the table) and the SuperSINET line via New York (NY). The capacity of both lines on the transpacific route is 2.4 Gbps; the bottleneck is the line from the Tsukuba WAN to AIST, whose capacity is 1 Gbps. The table shows the one-way delay and its standard deviation as the traffic bandwidth was varied from 100 Mbps to 900 Mbps in 200 Mbps increments. The one-way delay is 71 milliseconds on the LA line and 151 milliseconds on the NY line, and the deviation of the delay on the NY line is smaller than that of the LA line. Figure 9 shows the distribution of the one-way delay for some regions of the measured data.

3.4. Stable network transfer with multiple high-bandwidth streams

GNET-1 can control its output bandwidth to emulate a network whose capacity is smaller than 1 Gbps, or one whose link is shared with other streams. Depending on the frame size, it can reduce the output bandwidth from 1 Gbps down to 23 Mbps when the MTU is 1500, and down to 120 Mbps when the MTU is 9000. This function is very useful for pacing output bandwidth to avoid congestion in the network. TCP/IP adapts its output bandwidth to the bottleneck link; this is called self-clocking. Self-clocking assumes that acknowledgments arrive at equal intervals, but this assumption breaks when the amount of data transferred per unit time is controlled at a router, as with credit-based flow control [10]. We observed network bandwidth oscillating between the peak bandwidth and 0 on an actual network with large delay. When a single stream is transferred over a network, such oscillation is not a problem; when multiple streams are transferred over the network, it becomes a big problem.


For instance, suppose that two 500 Mbps streams are transferred over a 1 Gbps network. If self-clocking does not adapt for each stream, a 1 Gbps output phase and a no-output phase alternate, and the average output becomes 500 Mbps. When the 1 Gbps phases of the two streams collide, the combined output exceeds the network capacity and some frames are lost, which decreases the bandwidth greatly. GNET-1 instead controls the output bandwidth by adjusting the inter-frame gap (IFG) between frames, so the output bandwidth does not exceed the specified value over any time interval longer than one frame interval. Therefore, if GNET-1 holds both streams under 500 Mbps, it can guarantee that the total bandwidth never exceeds 1 Gbps, achieving stable communication without packet loss.

We evaluated this effect on an actual wide area network. Figure 8 shows the network used: the Trans-Pacific Grid Datafarm testbed between Japan and the U.S., constructed for the Bandwidth Challenge of SC2003, an international conference held in Phoenix in November 2003. The APAN/TransPAC Los Angeles line (LA) is 2.4 Gbps, with a round-trip latency of 141 milliseconds. The APAN/TransPAC Chicago line (Chicago) is a 644 Mbps ATM line whose peak TCP/IP bandwidth is about 500 Mbps, with a round-trip latency of 250 milliseconds.
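The relationship between the inter-frame gap and the output rate is simple proportion. In the sketch below, the assumptions that the gap is expressed in octet times and that an 8-byte preamble precedes each frame are ours, not the paper's; with a gap counter topping out near 64K octets, the model reproduces the roughly 23 Mbps floor for MTU 1500 quoted above.

```c
/* Back-of-the-envelope model of IFG-based pacing (our assumptions,
 * not GNET-1's register layout): each frame occupies
 * preamble + frame + gap octet times on the wire, so
 *   rate = line_rate * frame / (preamble + frame + gap). */
#include <stdio.h>

#define LINE_RATE 1e9    /* gigabit Ethernet, bits per second */
#define PREAMBLE  8.0    /* octets */

/* rate achieved with a given frame size and gap, in bits per second */
static double rate_bps(double frame_octets, double gap_octets) {
    return LINE_RATE * frame_octets / (PREAMBLE + frame_octets + gap_octets);
}

/* gap (in octet times) required to pace frames to target_bps */
static double gap_for(double frame_octets, double target_bps) {
    return LINE_RATE * frame_octets / target_bps - frame_octets - PREAMBLE;
}

int main(void) {
    double frame = 1518;                      /* MTU 1500 + header/CRC */
    printf("500 Mbps needs a gap of %.0f octets\n", gap_for(frame, 500e6));
    printf("gap of 65535 octets gives %.1f Mbps\n",
           rate_bps(frame, 65535) / 1e6);     /* about the 23 Mbps floor */
    return 0;
}
```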

Figure 9. One-way delay

Figure 10. Control of output bandwidth (GNET-1 paces the NY, Chicago, LA3, LA2, and LA1 streams between routers across the Internet)

The SuperSINET New York line (NY) is 4.8 Gbps; we used only 1 Gbps of it, with a round-trip latency of 285 milliseconds. The total network bandwidth is thus 3.9 Gbps. We used three Gigabit Ethernet links for the LA line, called LA1, LA2, and LA3, and one Gigabit Ethernet link each for the Chicago and NY lines. GNET-1 units controlled the output bandwidth of these five Gigabit Ethernet links as shown in Figure 10. We transferred data from disks to disks, from the U.S. site to the Japan sites. Figure 11 shows the data transfer performance using Gfarm when files were transferred over multiple streams between Japan and the U.S. We used jumbo frames with an MTU of 6000. First, GNET-1 limited the output bandwidth of LA1, LA2, LA3, Chicago, and NY to 800, 750, 800, 500, and 930 Mbps, respectively. But the LA line suffered packet loss and its bandwidth was unstable, so at 300 seconds we decreased the LA1 bandwidth to 780 Mbps and increased NY to 950 Mbps. After that, the bandwidth was stable. We achieved a total bandwidth of 3.78 Gbps, that is, 97% of the peak bandwidth, in a stable manner.


Figure 11. Total bandwidth on the transpacific testbed

4. Architecture of GNET-1

Figure 12 shows a block diagram of GNET-1. GNET-1 has four 1000Base-SX optical Gigabit Ethernet (GbE) ports. (The new version of GNET-1 has four GBIC ports and can select a 1000Base-T port.) The GbE ports are connected to a large-scale FPGA, a Xilinx XC2V6000, which includes 76K logic cells, 2.5 Mbit of memory, and 824 user I/O pins. GNET-1 has four SRAM ports; each SRAM has 144 Mbit of capacity and supports simultaneous reads and writes at more than 1 Gbps each. GNET-1 is connected to the control PC by USB 1.1, and the control PC sets and gets GNET-1's parameters. GNET-1 also has two MICTOR ports for connecting a logic analyzer to observe internal signals, and a GPS (Global Positioning System) port for obtaining a precise clock.

The most attractive feature of GNET-1 is that its functions are reconfigurable: design a control circuit for the desired function, load it into the FPGA, and the function is available. The circuits in the FPGA consist mainly of four GbE MACs, four FIFO memory controllers, and other control circuits. The GbE MACs run at a 125 MHz clock, the FIFO memory controllers at 62.5 MHz, and the other circuits at 31.25 MHz. Since most functions run at 31.25 MHz, the timing of the user circuits is not tight. We designed the circuits in the Verilog hardware description language and synthesized them using Xilinx ISE and XST; creating the configuration data takes about one hour.

Since GNET-1's control circuitry is a large-scale FPGA, functions can easily be added or modified. Some other functions already implemented on GNET-1 follow; all functions are described on the GNET-1 web page [7]. The current gate usage of the FPGA, including the full set of functions, is about 44%, so more functions can be implemented.

a) Network emulation

a-1) Frame discard
This function discards the frame whose transmission begins immediately after the discard is indicated. The timing of the frame discard is measured along with the bandwidth.



a-2) Tail drop
This function specifies the FIFO size used to drop received frames at the tail.

b) Network measurement

b-1) Frame trace
This function traces up to the first 128 octets of each transferred frame, with the traced position within the frame specified in 4-octet units. It can trace headers for 1.5 seconds when frames with an MTU of 1500 are transferred at wire rate.

b-2) Clock-level measurement
This function traces signals in the FPGA on each clock tick, up to 2M samples, using Xilinx's ChipScope and Agilent's Trace Port Analyzer.

5. Conclusion

GNET-1 is a useful tool for developing latency-aware software. This paper has shown actual uses of GNET-1, such as network emulation, network measurement, and network control, and has presented its architecture. We will continue to improve GNET-1's functions, implementing flow control techniques such as credit-based flow control [10] and Random Early Detection (RED) [11].


We will also implement a switch function for the four input/output ports. In addition, we are developing a new network tool supporting 10 Gigabit Ethernet, named GNET-10. The first version of GNET-10 has two 10 GbE ports and two 512-MByte DRAMs.

Figure 12. Block diagram of the GNET-1 board

Acknowledgments

This research was partially supported by an NEDO Grant-in-Aid for private-sector fundamental technology, "Research on Large-Scale and Reliable Servers," as well as a grant from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan through the NAREGI (National Research Grid Initiative) Project.

References

[1] S. Floyd, "HighSpeed TCP for Large Congestion Windows," Internet draft, draft-floyd-tcp-highspeed-02.txt, 2003. http://www.icir.org/floyd/hstcp.html.

[2] Scalable TCP. http://www-lce.eng.cam.ac.uk/~ctk21/scalable.

[3] FAST TCP. http://netlab.caltech.edu/FAST/.

[4] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, and S. Tuecke, "Data Management and Transfer in High Performance Computational Grid Environments," Parallel Computing Journal, Vol. 28, No. 5, pp. 749-771, May 2002.

[5] M. Carson and D. Santay, "NIST Net – A Linux-based Network Emulation Tool," ACM Computer Communication Review, Vol. 33, No. 3, pp. 111-126, July 2003.

[6] M. Mathis, J. Heffner, and R. Reddy, "Web100: Extended TCP Instrumentation for Research, Education and Diagnosis," ACM Computer Communication Review, Vol. 33, No. 3, July 2003.

[7] http://www.gtrc.aist.go.jp/gnet/.

[8] http://www.gridmpi.org/.

[9] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), pp. 102-110, 2002.

[10] H. T. Kung and R. Morris, "Credit-Based Flow Control for ATM Networks," IEEE Network, March/April 1995.

[11] S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Transactions on Networking, Vol. 1, No. 4, pp. 397-413, August 1993.
