NEXPReS is an Integrated Infrastructure Initiative (I3), funded under the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° RI-261525.

Title: Test report of transparent local/remote application programming interface for storage elements
Date: 8th August 2012
Version: 1.4

Author: Jimmy Cullen
Co-Authors: Neal Jackson, Ralph Spencer

Summary:

Deliverable 8.7

Document Log

Version  Date        Summary of Changes                                            Authors
1.0      2012-07-16  Initial version                                               J Cullen, N Jackson, R. Spencer
1.1      2012-07-26  Updated local test results                                    J Cullen
1.2      2012-07-30  Updated with 12hr and remote test results                     J Cullen
1.3      2012-07-31  Updated discussion and conclusions                            J Cullen, N Jackson, R. Spencer
1.4      2012-08-08  Corrected data rates used and corrected typographical errors  J Cullen


Contents

Summary
Introduction
Software
    vlbi-streamer
    UDPmon
    Jive5ab
Tests performed
    Local testing
    Remote testing
Results and Analysis
    Local tests
        Data Rates
        System load
        CPU Utilisation
        Data sizes
    12 hour test
        Data rates
        Load
        CPU utilisation
        Data on disk
    Remote Testing
        Initial phase
        Second phase
Discussion and Conclusion
References


Summary

The vlbi-streamer software is designed to save data received from a network to multiple hard disks at high data rates. It has been designed with the needs of VLBI experiments at its core. This deliverable reports on several tests that have been run to evaluate the performance, reliability and effectiveness of the software against its stated aims. Three test situations were used, involving machines in the UK, Finland, Sweden and the Netherlands. Measurements were made of data rates, system load, CPU utilisation and the amount of data written to disk. Tests in Manchester examined the effect of varying certain input parameters on the measured values, and a simulation of a long-running VLBI experiment at a high data rate was performed. Using the machines in Finland, Sweden and the Netherlands, tests were run to simulate a VLBI experiment: mock VLBI datasets were recorded at the observatories at a high data rate, then streamed from the observatories to JIVE at a lower rate. The test results show that the software is very capable of writing data received from the network to disk at the highest rates available. Through the intelligent implementation and management of ring buffers, vlbi-streamer reads and writes data in blocks whose size incurs a low system overhead per transaction, allowing efficient operation at high data rates. Over international links, tests show that vlbi-streamer can be used to reliably send and receive datasets using modest system resources, allowing multiple streams to operate simultaneously. vlbi-streamer is an important development that will become a very useful tool in VLBI operations.


Introduction

Very Long Baseline Interferometry (VLBI) observations generate large datasets at each of the observatories, which need to be brought together at a single correlator for processing. Electronic VLBI (eVLBI) makes use of computer networks to stream data in real time from the observatory to the correlator for processing, but the majority of observations are written to file and stored on magnetic hard disk drives. The Mark 5 VLBI Data System [1] series of recorders is an international-standard device used to write VLBI data to disk. The machine contains two banks of 8 disks, with a maximum size of 1 TeraByte (TB) per disk, giving a maximum of 16 TB of storage space. The maximum write rate is reported as 2 Gigabits per second (Gbps). The Mark 5 comprises a mixture of Commercial Off The Shelf (COTS) hardware and bespoke equipment to allow recording data at these high speeds and in such volumes.

With the progression in capability of COTS computing equipment, it is now possible to match the performance of the Mark 5 with entirely COTS equipment, which lowers the cost of recording data and makes it accessible to more observatories. Deliverable 8.2 "Hardware design document for simultaneous I/O storage elements" [2] describes COTS equipment which can be used to provide storage space for VLBI data. These computers, known as flexbuff machines, provide ample storage space for VLBI observations. To make maximum use of these machines, Aalto University Metsähovi Radio Observatory has developed software that allows data to be read from or written to the flexbuff, details of which can be found in deliverable 8.4 "Design document of transparent local/remote application programming interface for storage elements". The software, called vlbi-streamer [3], was written with the intention of allowing VLBI data to be written to disk at high speed, allowing more sensitive observations to be made.
Deliverable 8.4 details the vlbi-streamer software, so here we only briefly describe its design and functionality. vlbi-streamer is written in C and is designed to reliably read and write data to and from hard disk drives and network cards at high speeds. Rather than using the hard disks cooperatively in a RAID array, they are mounted on the host operating system individually. This allows greater control over the disks and removes redundancy. Each disk has its own ring buffer, used as an intermediate store for data between the hard disk and the network interface. For each disk there are two threads: one to move data from the NIC to the ring buffer, and one to move data from the ring buffer to the hard disk. Two main criteria for the software are that it performs reliably and is capable of working at high data rates.

At Jodrell Bank Observatory three machines matching the specification given in deliverable 8.2 have been built, and are detailed in table 1. jbnexpres1 and jbnexpres2 were purchased for network testing for NEXPReS Work Package 6, and jbnexpres3 was purchased for Work Package 8.
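The per-disk two-thread design described above can be sketched as follows. This is a minimal illustration only: the real vlbi-streamer is written in C with preallocated buffers, and the class and function names here are invented for the sketch.

```python
import queue
import threading

class DiskWriter:
    """Sketch of vlbi-streamer's per-disk scheme: one thread fills a
    bounded ring buffer from the network, a second drains it to disk.
    (Illustrative only; names are hypothetical.)"""

    def __init__(self, path, slots=64):
        self.path = path
        self.ring = queue.Queue(maxsize=slots)  # bounded ring buffer

    def receive(self, packets):
        # Thread 1: move data from the NIC into the ring buffer.
        for pkt in packets:
            self.ring.put(pkt)        # blocks if the buffer is full
        self.ring.put(None)           # sentinel: stream finished

    def drain(self):
        # Thread 2: move data from the ring buffer onto the disk.
        with open(self.path, "wb") as f:
            while (block := self.ring.get()) is not None:
                f.write(block)

def record(path, packets):
    w = DiskWriter(path)
    rx = threading.Thread(target=w.receive, args=(packets,))
    wr = threading.Thread(target=w.drain)
    rx.start(); wr.start()
    rx.join(); wr.join()
```

Because the buffer is bounded, a slow disk applies back-pressure to the receiving thread rather than exhausting memory, which is the property that makes the asynchronous design stable.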


Hostname: jbnexpres1
Hardware Type: Tower
Operating system: Debian Squeeze, kernel 2.6.32-5-amd64
CPU: AMD Phenom™ II X6 1090T Processor
Motherboard: Asus Crosshair IV Formula
Drive controller: mobo controller
Number of Drives: 1 (system disk)
Network connectivity: Chelsio Communications Inc T310 10GbE Single Port Adapter
Functional capacity: ~2TB

Hostname: jbnexpres2
Hardware Type: 4U rackmount
Operating system: Debian Squeeze, kernel 3.2.16
CPU: AMD Phenom™ II X6 1090T Processor
Motherboard: Asus Crosshair IV Formula
Drive controller: mobo controller (6 drives) + 2 x LSI MegaRAID 9240-8i controllers (16 drives)
Number of Drives: 22 + 1 (system disk)
Network connectivity: Chelsio Communications Inc T310 10GbE Single Port Adapter
Functional capacity: ~40TB
Drive Interface: SATA III

Hostname: jbnexpres3
Hardware Type: 4U rackmount
Operating system: Debian Squeeze, kernel 3.2.16
CPU: 2 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Motherboard: Supermicro X8DTH-6F
Drive controller: mobo controller (4 drives) + 4 x LSI MegaRAID 9240-8i controllers (32 drives)
Number of Drives: 36 + 1 (system disk)
Network connectivity: Chelsio Communications Inc T420 10GbE Dual Port Optical Adapter
Functional capacity: ~70TB
Drive Interface: SATA III

Table 1. Hardware specification for flexbuff machines in Manchester.


Software

vlbi-streamer

The vlbi-streamer software is in active development, and therefore improvements to the core functionality and the incorporation of new features are ongoing. The software code is hosted on a Google project site [3] and uses git for versioning. The first release of the code was on 17th April 2012, and a major update with a redesign of the core functionality was released on 31st May 2012. Releases of the software have been made at regular intervals in the form of gzipped archives; however, to get the latest code a clone of the appropriate branch of the source repository is needed. The testing of the software was performed in two phases, local and remote. For the local testing, the software used was cloned from the repository on Thursday 19th July; for the remote testing, on Thursday 26th July.

UDPmon

UDPmon [4] is a software application that allows measurement of network characteristics using the User Datagram Protocol. It generates random data to populate datagrams, mimicking the payload of real VLBI experimental datagrams. UDPmon is normally used on a pair of machines (server/client); through a series of control and then data interactions, precise measurements of the PC hardware and network characteristics are derived. A version of the software has been developed in which the server or client program can run independently and collect statistics about its operation. In its send mode, UDPmon simulates the data output of a radio telescope, and is therefore ideal for use in testing VLBI recording equipment.

Jive5ab

As part of NEXPReS Work Package 5, JIVE have developed the Jive5ab software [5], a replication in software of the functionality of the Mark 5A and 5B hardware machines used to record VLBI data to a series of hard disks. The software is designed to be compiled and run on any modern PC. It can be used to generate UDP datagrams containing the same fill pattern as the Mark 5B and send these to a network address; this mode can be used in VLBI recording simulations.


Tests performed

Testing of the vlbi-streamer was led by The University of Manchester and split into two sections:
i. local testing to evaluate performance at high speeds, and
ii. remote testing involving disparate machines to replicate expected usage patterns.

Local testing

At The University of Manchester there is a 10 Gbps fibre connection between the Schuster Building (the physics building on the main campus) and the Jodrell Bank Observatory, located approximately 30 kilometres to the south. jbnexpres1 was located in the Schuster Building, and jbnexpres2 and jbnexpres3 were located at JBO. UDPmon was used on jbnexpres1 to generate UDP datagrams of 8232 bytes, which were sent to jbnexpres3, where vlbi-streamer wrote the data to disk. The packet size was chosen to simulate a packet with 8192 bytes of VLBI data plus a 40 byte VLBI Data Interchange Format (VDIF) header. To test the performance of the vlbi-streamer software, a range of data rates was generated on jbnexpres1, and on jbnexpres3 vlbi-streamer was run with a range of disk counts and maximum memory sizes. To quantify performance, the load, memory usage, percentage of CPU usage by the programme, and the received packets and bytes were measured. Each test was run for 30 minutes. The total amount of data written to disk in each test was also measured. Tests were run at 4, 8 and 10 Gbps, with a maximum of 12, 16 or 20 GB of RAM and 8, 16, 18, 32 or 36 disks. At each data rate the tests were run from this simple bash script:

#!/bin/bash
number_of_disks=( 8 16 18 32 36 )
max_RAM=( 12 16 20 )
for a in "${number_of_disks[@]}"
do
    for b in "${max_RAM[@]}"
    do
        ./if_bytes.py eth7 > if_bytes_vlbistreamer_test_$a$b.txt &
        ./vlbistreamer -d $a -A $b -p 8232 -m r -s 7900 -v test_$a$b 1800 > vlbistreamer_test_$a$b.txt
        sync
        echo 3 > /proc/sys/vm/drop_caches
    done
done

For each number of disks, the tests used increasing amounts of RAM up to the maximum before increasing the number of disks. It is important to bear this test sequence in mind when interpreting the results.


To simulate a long-running VLBI session, a further test was run using all 36 disks and 20 GB of RAM.

Remote testing Flexbuff machines in Finland (Metsähovi Radio Observatory), Sweden (Onsala Space Observatory) and The Netherlands (Joint Institute for VLBI in Europe) were used to test the performance of the vlbi-streamer software in a simulation of a real usage pattern. Using PCs in Onsala and Metsähovi, Jive5ab generated 8232 byte size datagrams containing Mark5B fill pattern and transmitted them to flexbuff machines in the local station, where vlbi-streamer was running and wrote the data to disk. The inter packet delay used was 16 microseconds, which produced a data rate of 4 Gbps.
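The quoted 4 Gbps follows directly from the packet size and the inter-packet delay. A small helper (not part of the tested software) makes the arithmetic explicit:

```python
def data_rate_gbps(packet_bytes, inter_packet_delay_s):
    """Data rate implied by sending fixed-size packets at a fixed interval."""
    return packet_bytes * 8 / inter_packet_delay_s / 1e9

# 8232-byte packets every 16 microseconds, as used at the stations:
rate = data_rate_gbps(8232, 16e-6)   # ~4.1 Gbps
```

The same calculation with the 32 microsecond delay used later in the second phase gives roughly 2.1 Gbps per station.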

Figure 1. Initial remote testing. Data generated locally is sent to vlbi-streamer at 4 Gbps.

In Metsähovi, the flexbuff with hostname watt was used, which comprises the same components as jbnexpres3 (see table 1). All 36 disks and 10 GB of RAM were used when recording the data. In Onsala, the flexbuff machine comprises similar components to jbnexpres2, except that it has 24 disks available to save data to. Problems were found with two of the disks, so 22 disks and 10 GB of RAM were used when recording the data. Onsala is connected to JIVE via a static light path, and Metsähovi is connected to JIVE via a routed Ethernet connection. On the flexbuff machines in Onsala and Metsähovi, vlbi-streamer was then started in send mode, and the recorded Jive5ab data was sent to a flexbuff machine at JIVE, which comprises the same components as jbnexpres3. When sending the data, again a maximum of 10 GB of RAM was used, with an inter-packet delay of 32 microseconds, resulting in a data rate of 2 Gbps from each station. At JIVE, two vlbi-streamer sessions were started on the flexbuff machine, each recording to all 36 disks and using 10 GB of RAM.

Figure 2. In the second phase, data recorded at the observatories is streamed simultaneously to JIVE at a lower data rate than it was recorded at.

The Linux kernel uses the concept of pages as the basic unit of memory, with a default page size of 4096 bytes. The kernel also uses virtual memory as an abstraction, acting as a logical layer between application memory requests and physical memory. The mapping of virtual to physical memory addresses is cached by the Translation Lookaside Buffer (TLB). When an application uses large amounts of memory the workload of the TLB increases; if the page size is increased, however, large amounts of virtual memory can be mapped with a smaller TLB overhead. The typical hugepage size is 2048 kilobytes. vlbi-streamer offers the use of hugepages, and the remote tests used this feature.
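The benefit is easy to quantify. For a buffer allocation of the size used in the remote tests (10 GB, treated here as 10 GiB for round numbers, which is an assumption for illustration), the number of page mappings required drops by a factor of 512:

```python
PAGE = 4096              # default x86-64 page size (bytes)
HUGEPAGE = 2048 * 1024   # typical hugepage size (bytes)

def pages_needed(buffer_bytes, page_size):
    """Number of page mappings needed to cover a buffer of this size."""
    return -(-buffer_bytes // page_size)   # ceiling division

buf = 10 * 2**30                       # a 10 GiB ring-buffer allocation
small = pages_needed(buf, PAGE)        # 2,621,440 mappings
huge = pages_needed(buf, HUGEPAGE)     #     5,120 mappings
```

Fewer mappings means fewer TLB misses while the buffers are being filled and drained at line rate, which is why the feature matters at these data rates.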


Results and Analysis

Local tests

One of the options supported by vlbi-streamer is verbose mode, in which detailed information about the operation of the software is written to standard output every second. Sample output is shown below:

----------------------------------------
Net Send/Receive completed: 9437Mb/s
HD Read/write completed 9261Mb/s
Dropped 0 Incomplete 0 Time 10s
Ringbuffers: Free: 65, Busy: 14, Loaded: 0
Recpoints: Free: 3, Busy: 13, Loaded: 0
----------------------------------------

For each of the tests, verbose mode was used and the output saved to file. A python script was written to parse the output and create time series plots, and also to examine the variation in some metrics.
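The parsing script itself is not reproduced in this report. The following is a minimal sketch of how the verbose output could be parsed; the regex and field names are assumptions based only on the sample above, not the script actually used.

```python
import re

# Match the network and disk rate lines of one verbose-mode block.
LINE = re.compile(
    r"Net Send/Receive completed:\s*(?P<net>\d+)Mb/s\s*"
    r"HD Read/write completed\s*(?P<disk>\d+)Mb/s"
)

def parse_verbose(text):
    """Extract (network, disk) rate pairs in Mb/s from verbose output."""
    return [(int(m["net"]), int(m["disk"])) for m in LINE.finditer(text)]

sample = """----------------------------------------
Net Send/Receive completed: 9437Mb/s
HD Read/write completed 9261Mb/s
Dropped 0 Incomplete 0 Time 10s
----------------------------------------"""
rates = parse_verbose(sample)   # [(9437, 9261)]
```

One pair per one-second block yields the time series from which the plots and histograms below were generated.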

Data Rates

Figure 3 shows an example plot of data rates on the network and hard disks, and figures 4 and 5 show histograms of the distributions of these values. These three plots were generated from the 30 minute test at 10 Gbps, writing to 36 disks and using 20 GB of RAM. These plots are typical of all plots created from the tests, and show that the standard deviation of the data rate at the network card is much smaller than that of the write-to-disk rate. The small standard deviation of the network received data corresponds to UDPmon on jbnexpres1 sending data at regularly spaced intervals, which is analogous to the constant bit rate data generated at radio telescopes. It is because of this constant flow of data that UDP was chosen over transport protocols which provide reliable delivery through retransmission [6]. The relatively large standard deviation in write-to-disk data rates is a consequence of vlbi-streamer's use of the ring buffers. Rates below the mean occur when the amount of data taken from the network and written to ring buffers is greater than the amount written from ring buffers to disk, and the opposite holds for rates above the mean. The distribution is approximately symmetric about the mean, which shows that the asynchronous read-from-network and write-to-disk process of the software is well designed, efficient and stable.


It was noted that the values reported by the vlbi-streamer instance receiving the data do not match those reported by UDPmon sending the data. In the example output above, vlbi-streamer reports data arriving on the network at 9.437 Gbps, yet UDPmon reported sending data at 9.896 Gbps.

Figure 3. Data received on the network and written to disk as reported by vlbi-streamer.


Figure 4. Histogram of distribution of data rates reported at the network card.

Figure 5. Histogram of distribution of write-to-disk data rates.


In addition to the vlbi-streamer verbose output, another python script was written which logged several system metrics each second. Below is an example of the script's output:

eth7
RX packets per second:  153527
RX errors per second:   0
RX dropped per second:  0
RX overruns per second: 0
TX packets per second:  1
TX errors per second:   0
TX dropped per second:  0
TX overruns per second: 0
RX bytes per second:    1270888228
TX bytes per second:    594
One minute load:        11.94
Five minute load:       12.36
Fifteen minute load:    12.40
Total memory:           24556556
Used memory:            2876948
Free memory:            21679608
Buffers:                17448
Cached memory:          8632
Percent CPU usage:      54.0

As with the vlbi-streamer output, this information was saved to file for each test, and again a python script was written to plot various time series. In addition to individual tests, data from all tests at each data rate were recorded cumulatively and plotted. To investigate the disparity between send and receive data rates, the bytes received at the network card were measured for each test independently of the vlbi-streamer software. Figure 6 shows the received bytes per second for the duration of all tests at 10 Gbps. We can see a trend of increasing minimum and mean values as the tests progress, which should not occur for a constant bit rate stream. More puzzling are the values themselves: 1.26 × 10^9 bytes per second equates to 10.08 Gbps, which is clearly above line speed. The most likely explanation of this phenomenon is incorrect timing. The python script used to collect these statistics issues a one second sleep command after collecting and processing the data. The time taken to collect and process the data is added to the one second sleep, so the period over which the statistics are collected is greater than one second. As the tests progressed, the time taken to collect and process the statistics gradually became longer, which is reflected in the increasing trend in the data. Although the time taken to run the script is small, at these high data rates the effects are noticeable. This shows the importance of writing software in a compiled language, as vlbi-streamer is, where timing and speed are imperative.
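The effect can be illustrated numerically. Assuming, purely for illustration, a true line rate of 1.25 × 10^9 bytes per second (10 Gbps) and roughly 17 ms of per-sample processing time, a naive sleep-then-measure loop over-reports the rate by exactly the ratio of the actual to the assumed interval:

```python
def apparent_rate(true_rate_bps, processing_s, assumed_interval_s=1.0):
    """Rate reported when bytes accumulate over sleep + processing time
    but the total is divided by the assumed one-second interval."""
    actual_interval = assumed_interval_s + processing_s
    return true_rate_bps * actual_interval / assumed_interval_s

line_rate = 1.25e9                                  # 10 Gbps in bytes/s
observed = apparent_rate(line_rate, processing_s=0.017)
# ~1.27e9 "bytes per second", above line rate, as seen in figure 6
```

Scheduling each sample against a fixed deadline (start time plus n seconds) instead of sleeping for a fixed duration after the work would remove both the inflation and the drift.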


Figure 6. Bytes received at the NIC during all tests run at 10 Gbps.

System load

Figures 7, 8 and 9 show the evolution of the 1, 5 and 15 minute load for the 4, 8 and 10 Gbps tests respectively. The regular decreases in 1 minute load are the boundaries between individual tests. As explained earlier, for each data rate the tests were run with increasing amounts of RAM and then increasing numbers of disks. In figure 7 we can see that for the tests at 4 Gbps the load was highest for the early tests, then dramatically reduced from test 5 (16 disks, 16 GB RAM) to test 6 (16 disks, 20 GB RAM). A similar but smaller effect can be seen in the tests with 8 disks. From test 6 onwards there is a small, relatively constant increase in load, ending with a load below that of the early tests. In contrast, the loads shown in figures 8 and 9 are small for the first three tests (8 disks) and increase markedly for test 4 (16 disks, 12 GB RAM). In many respects both plots are similar, which is not surprising: they show higher load in the middle tests than the final tests, and share similar load values. Load measures the average number of runnable processes over a defined time period: a load of 1 indicates that a CPU core was occupied 100% of the time, and higher values indicate that there were too many processes for the core to execute, so some had to wait. All tests at the same data rate were run from a single script without any delay between runs. This means that the load measured for every test apart from the first will be influenced by the previous test's load; however, given the length of each test, these effects are negligible even for the 15 minute average.


Figure 7. Load values for all consecutive tests run at 4 Gbps.

Figure 8. Load values for all consecutive tests run at 8 Gbps.


Figure 9. Load values for all consecutive tests run at 10 Gbps.

CPU Utilisation

Figures 10, 11 and 12 plot the percentage CPU utilisation of the vlbi-streamer software for all tests at 4, 8 and 10 Gbps respectively. Boundaries between individual tests are easily identified by the vertical lines, where CPU usage momentarily decreases, in some cases to zero, after a test has finished, and then increases, in some cases to more than 160%, at the start of a new test. After the initial spike, usage quickly decreases to a stable value, which for all tests at 4 Gbps was between 40 and 60%. In figures 11 and 12 we see that the stable CPU usage was around 50% for the tests with 8 disks, and significantly higher, between 80 and 100%, for the other tests. This is comparable with the load patterns seen in figures 8 and 9; however, this correlation is absent at 4 Gbps.


Figure 10. Percentage CPU utilisation at 4 Gbps.

Figure 11. Percentage CPU utilisation at 8 Gbps.


Figure 12. Percentage CPU utilisation at 10 Gbps.

Data sizes

Each of the local tests ran vlbi-streamer for 1800 seconds, and the data stream ran for the entire duration of the test; therefore the amount of data we should expect to record is simply a function of the data rate. Table 2 shows the expected amount of data on disk for each data rate.

Data Rate (Gbps)    Expected amount of data (GB)
4.1122              861.7
8.1956              1717.4
9.8947              2073.4

Table 2. Expected size of data on disk at various data rates.
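The figures in table 2 can be reproduced from the data rates and the 1800 second test duration; note that the values labelled "GB" are binary gigabytes (GiB, 2^30 bytes), which is how they come out of the arithmetic:

```python
GIB = 2**30

def expected_gib(rate_gbps, duration_s=1800):
    """Data expected on disk for a constant-rate stream of the given
    duration, in GiB (table 2's 'GB' values match GiB, not 10^9 bytes)."""
    return rate_gbps * 1e9 * duration_s / 8 / GIB

for rate in (4.1122, 8.1956, 9.8947):
    print(round(expected_gib(rate), 1))   # 861.7, 1717.4, 2073.4
```

The same function with duration_s=43200 reproduces the 12 hour expectation discussed later.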

Figures 13, 14 and 15 show the total amount of data on disk for each test at 4, 8 and 10 Gbps respectively. In all three figures we can see that the amount of data recorded with 8 disks is below that recorded with higher numbers of disks. Figure 13 shows that at 4 Gbps there are small fluctuations in the amount of data written to disk, except when using 36 disks. At the 8 and 10 Gbps data rates, apart from the tests using 8 disks, the small fluctuations in the amount of data written to disk are also present, but are not visible on the plots because of the scale; they are consistently within a few GB for all tests. It is believed that most of these fluctuations are due to rounding errors: the data on each disk was rounded to the nearest GB, and when multiplied by the large number of disks over which the data is spread this can have a large effect on the total data size. The values for data on disk are consistently higher than those predicted, which can be explained by the file containers, directories and other filesystem metadata.

Figure 13. Data size on disk for various disk and maximum amount of RAM values at receive rate of 4 Gbps.


Figure 14. Data size on disk for various disk and maximum amount of RAM values at receive rate of 8 Gbps.

Figure 15. Data size on disk for various disk and maximum amount of RAM values at receive rate of 10 Gbps.


12 hour test

Data was sent from jbnexpres1 using UDPmon to jbnexpres3 at 8 Gbps for 12 hours and recorded onto 36 disks with 20 GB of RAM.

Data rates

Figure 16 shows the instantaneous network and write-to-disk data rates for the 12 hour test, and figures 17 and 18 plot histograms of these values. We can see a single strong peak in the network histogram, while the write-to-disk data rate appears to follow a normal distribution. The mean values of the two histograms are in close agreement with one another.

Figure 16. Network and write to disk data rates for the 12 hour test.


Figure 17. Histogram of data received on the network interface during the 12 hour test.

Figure 18. Histogram of the write to disk data rates for the 12 hour test.


Load

The load on the machine is plotted in figure 19. Although variable, it has a range of approximately 2 with a lower bound of 10, and is comparable with the load seen in the 30 minute tests (see figure 8).

Figure 19. System load during the 12 hour recording at 8 Gbps using 36 disks and 20 GB of RAM.

CPU utilisation

As with the shorter tests, the percentage of CPU usage by the software spiked initially to just under 100%, but then fell rapidly to approximately 72%, where it remained fairly constant until the test completed.

Data on disk

For this test the data rate was the same as in the 30 minute tests, close to 8 Gbps (see table 2). Therefore an expected 4.4256 × 10^13 bytes of data would be sent to the vlbi-streamer software for writing to disk. After the test was complete, the total amount of data saved to disk was found to be 4.4273 × 10^13 bytes, which is extremely close to the expected value. The slight excess can be explained by file containers, directories and other filesystem metadata.


This level of correspondence shows that the software is very capable of operating at high data rates for extended periods of time and reliably stores data to disk for processing at some later point. The total storage capacity of the machine is 70 TB, and this test showed that writing data to more than half of the total capacity in one continuous session is feasible and reliable.

Remote Testing

Initial phase

Data was generated on machines local to the flexbuffs in Metsähovi and Onsala at 4 Gbps using Jive5ab (see figure 1). The flexbuff machines then recorded that data using 36 (Metsähovi) and 22 (Onsala) disks and 10 GB of RAM. Unlike the local tests performed in Manchester, these recording sessions used the hugepages feature of the software.

Data rates

Figures 20 and 21 show the network and hard disk data rates in Metsähovi and Onsala respectively, as reported by vlbi-streamer. Table 3 shows the mean and standard deviation of these rates.

                           Metsähovi            Onsala
                           Network   Disk       Network   Disk
Mean value (Mbps)          4044.96   4045.23    4071.48   4071.68
Standard deviation (Mbps)  3.79      222.36     174.42    601.08

Table 3. Mean and standard deviation values of data received and written to disk on the flexbuff machines in Metsähovi and Onsala.

Using the figures and table, we can see that the receipt of data in Metsähovi was much more regular than in Onsala. There are several spikes in the network data in figure 21, some of which show data rates above the capabilities of the hardware, so these must be spurious. The spikes in the network data are often accompanied by larger spikes in the write-to-disk data rates, which suggests caching of data is being performed somewhere.


Figure 20. Data rates on flexbuff machine during the initial phase of remote testing in Metsähovi.

Figure 21. Data rates on flexbuff machine during the initial phase of the remote testing at Onsala.


Load

Figures 22 and 23 show the load on the machines during the three hour recording. The Onsala machine has a smaller range of values and a lower mean load. These values are consistent with those seen in the local tests (see figure 7).

Figure 22. Load on Metsähovi flexbuff while receiving data at 4 Gbps.


Figure 23. Load on Onsala flexbuff while receiving data at 4 Gbps.

CPU utilisation

Unlike the local tests, where percentage CPU usage spiked initially, the percentage CPU use on Metsähovi's flexbuff machine rose steadily to just under 28% and remained constant throughout the test. The flexbuff machine in Onsala similarly did not display the initial spike, but rose to over 70%, as shown in figure 24.

Figure 24. Percentage CPU use by vlbi-streamer on Onsala's flexbuff.


Data on disk

There was a difference in the amount of data written to disk at the two sites, as reflected in table 3. Table 4 shows the expected and actual data sizes. The units are kilobytes, as this is the default unit returned by the Linux du command, which was used to calculate the data sizes on disk.

Site        Measured data rate (Gbps)   Expected data (kilobytes)   Data on disk (kilobytes)
Metsähovi   4.23824                     5587523438                  5621095068
Onsala      4.25626                     5611280273                  5652972264

Table 4. Expected and actual data sizes.

The settings for Jive5ab were the same at both sites, yet the data rate received by each flexbuff differed slightly. Possible reasons for this include differences in network topology and equipment, and differences in PC networking hardware, e.g. network cards from different manufacturers. The difference is small, with the Onsala flexbuff receiving data ~0.4% faster than Metsähovi. This slight difference in data rates means we should expect to find more data recorded to disk in Onsala than in Metsähovi, which is indeed the case, in line with the faster data rate. In Metsähovi the data on disk is 0.6% larger, and in Onsala 0.7% larger, than the input data stream. This increase in size can be explained by file containers, directories and other filesystem metadata.
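The expected-data figures in table 4 follow from multiplying the measured rate by the recording duration and converting to the kilobytes reported by du. A minimal sketch of that calculation, assuming the three-hour recording duration described above:

```python
# Sketch of the expected-data calculation behind Table 4.
DURATION_S = 3 * 3600  # three-hour recording, as in the first phase


def expected_kilobytes(rate_gbps: float, seconds: int = DURATION_S) -> int:
    bytes_total = rate_gbps * 1e9 / 8 * seconds  # bits/s -> total bytes over the run
    return round(bytes_total / 1024)             # du reports 1 KB = 1024 bytes


for site, rate, on_disk in [("Metsähovi", 4.23824, 5621095068),
                            ("Onsala",    4.25626, 5652972264)]:
    exp = expected_kilobytes(rate)
    excess = (on_disk / exp - 1) * 100
    print(f"{site}: expected {exp} KB, on disk {on_disk} KB (+{excess:.1f}%)")
```

Running this reproduces the expected sizes in table 4 and the 0.6% and 0.7% excesses quoted above.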

Second phase

In the second phase of the test, the data stored at Metsähovi and Onsala were sent simultaneously, at 2 Gbps each, to a flexbuff at JIVE, where two instances of vlbi-streamer were set to record. Each instance wrote to all 36 disks and was allocated 10 GB of RAM.

Data rates

Figures 25 and 26 show the data rates of recorded data sent from Metsähovi and Onsala respectively to JIVE. The scale of the graphs is strongly affected by the ring buffers filling at a very high rate initially. After this phase the rates settle into the familiar pattern of steady network throughput and cached read/write disk data rates. Figures 27 and 28 show the received data rates for the simultaneous vlbi-streamer recordings.


Figure 25. Data rates for the sending of recorded data from Metsähovi to JIVE.

Figure 26. Data rates for the sending of recorded data from Onsala to JIVE.


Figure 27. Data rates for data received at JIVE from Metsähovi.

Figure 28. Data rates for data received at JIVE from Onsala.


Figure 29. Network data throughput for the switch to which the JIVE flexbuff is connected.

Figure 29 is a Cacti-generated plot of the data throughput seen at the switch in JIVE through which the JIVE flexbuff connects to the Onsala and Metsähovi machines. The continuous block on the right represents the transfer of data from Metsähovi and Onsala; the other regions represent preparatory tests. Table 5 shows the mean and standard deviation of the data transfer rates from the flexbuffs at Metsähovi and Onsala to JIVE's flexbuff. The mean values match very closely and the standard deviations follow the patterns observed in earlier tests.

                    Mean value (Mbps)       Standard deviation (Mbps)
                    Network     Disk        Network     Disk
Metsähovi           1960.86     1960.89        0.62     157.38
Onsala              1960.70     1960.95       36.36     247.88
Metsähovi - JIVE    1960.13     1960.13        3.03     156.89
Onsala - JIVE       1959.58     1959.75        5.80     156.02

Table 5. Data rates for the simultaneous transfers.

Load

The data rates are calculated per vlbi-streamer instance; the load, however, is a system-wide property and therefore includes contributions from all running processes. The load on the sending machines was steady at approximately 4 for the Onsala flexbuff and 4.5 for the Metsähovi flexbuff. Figure 30 shows the load on the flexbuff receiving the data, which remained relatively constant throughout the transfers. This shows that the workload of receiving the simultaneous streams is easily handled by the machine; indeed, compared with the local tests using 36 disks, the load is slightly lower than for a single stream at 4 Gbps.
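The load figures quoted here are system-wide averages across all cores. A minimal sketch of normalising them to a per-core value, assuming the 16 logical cores these machines present and the standard Unix getloadavg interface (the 4.5 figure is the Metsähovi sender's load quoted above):

```python
import os


def per_core_load(load_avg: float, logical_cores: int) -> float:
    """Normalise a system-wide load average to a per-core value."""
    return load_avg / logical_cores


# A system load of ~4.5 on a 16-logical-core flexbuff is only ~0.28 per core.
print(f"{per_core_load(4.5, 16):.2f} per core")

# A live reading on Linux/Unix (illustrative usage, not from the tests):
one_min, five_min, fifteen_min = os.getloadavg()
print(f"current 1-min load per core: {per_core_load(one_min, os.cpu_count() or 1):.2f}")
```

This is the same normalisation used in the discussion section to argue that compute resources were rarely exhausted.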


Figure 30. Load on JIVE's flexbuff whilst receiving and recording two independent streams.

CPU utilisation

On the sending machines, percentage CPU usage spiked initially to 170% in Metsähovi and 220% in Onsala, but quickly dropped on both machines to a stable 100%. This represents one core of the CPU being employed constantly to process the send requests. On older single-core architectures this would have made it difficult for other processes to get CPU time, but on modern multicore processors it is acceptable behaviour; indeed, these machines have 6 or 8 physical cores (16 logical cores) available. In JIVE, only the higher of the pair of vlbi-streamer processes was logged, and it is plotted in figure 31. Compared with the local tests performed at 4 Gbps (figure 10), the combined value for the two processes is higher; however, this is because two independent processes are handling the data, even though the aggregate rate is the same.


Figure 31. Percentage CPU use by vlbi-streamer.

Data on disk

The sizes on disk of the data recorded in Metsähovi and Onsala, and of the copies sent to JIVE, are given in table 6.

Site        Data size on disk at observatory (KB)   Data size on disk in JIVE (KB)
Metsähovi   5621095068                              5619539588
Onsala      5652972264                              5652979724

Table 6. Data sizes on disk.

We can see that the copies of the datasets sent to JIVE are extremely close in size to the originals at the observatories.
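The closeness of the copies can be checked directly from the table 6 figures. The helper below is purely illustrative; the sizes are in kilobytes as reported by du, and the megabyte figure quoted later in the discussion uses 1000 KB per MB:

```python
def match_stats(original_kb: int, jive_kb: int) -> tuple[float, int]:
    """Return (percentage match, absolute difference in KB) for two du sizes."""
    match = min(original_kb, jive_kb) / max(original_kb, jive_kb) * 100
    return match, abs(jive_kb - original_kb)


# Sizes taken from table 6.
for site, orig, jive in [("Metsähovi", 5621095068, 5619539588),
                         ("Onsala",    5652972264, 5652979724)]:
    match, diff_kb = match_stats(orig, jive)
    print(f"{site}: {match:.2f}% match, difference {diff_kb} KB")
```

For Metsähovi this gives the 99.97% match quoted in the discussion, and for Onsala a difference of 7460 KB, i.e. the ~7.46 MB figure.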


Discussion and Conclusion

The testing described in this report was designed to exercise many facets of the vlbi-streamer software's capabilities and to determine whether it is well suited to writing VLBI data to hard disk reliably and at high speed. Specifically, the purposes of the tests were:

 local tests – measure software and hardware performance through data rates, load, percentage CPU usage and the amount of data written to disk for various disk and RAM configurations
 12 hour test – assess performance in a simulation of a long-running VLBI session recording data at a very high data rate
 remote test – simulate a VLBI session in which two stations record data at high data rates, then send the recordings to JIVE for processing at lower data rates

In all three tests similar measurements were made of the host machines, allowing comparisons to be drawn. The software used to generate the test data streams was chosen because it was designed to simulate VLBI data.

Test results show that vlbi-streamer is capable of reliably writing data at the highest network speeds achievable, that it adapts well to different hardware, and that it is easy to use. A clear understanding of the data patterns has allowed the software's design to be tailored to the needs of VLBI data. At the core of the software, the ring-buffer design and data transport threads allow efficient use of system resources, which permits the software to operate so effectively at such high speeds.

The local tests served as a baseline against which performance comparisons can be made. The data rates reported by UDPmon do not match those reported by vlbi-streamer, with UDPmon consistently reporting higher rates. This mismatch requires further investigation; one area of suspicion is the inclusion of protocol headers in the reported rates. From table 5, however, we can clearly see that vlbi-streamer is self-consistent, with output rates matching input rates extremely closely.

vlbi-streamer's design is centred on ring buffers used as temporary storage for data between hard disk and network card. To make efficient data transfers to and from hard disks, data must be read or written in large blocks, which requires intermediate storage. The histograms of network and disk data rates are a useful indicator of ring-buffer use. Data is received on the network card at a constant rate, as evidenced by the small standard deviation in the received network data rates. In contrast, the standard deviation of the write-to-disk data rates is large, which shows that the ring buffers are being used as flexible temporary storage. Data received over the network is held in the ring buffer until the buffer contains enough data to make an efficient write to disk; during sending, data is read from disk in large blocks into the buffers and from there sent to the network card at high speed.
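As an illustration only (this is not the actual vlbi-streamer implementation, and the packet and block sizes are arbitrary), the receive-side buffering pattern just described can be sketched as:

```python
# Illustrative sketch of the ring-buffer idea: small network packets
# accumulate in a buffer and are flushed to "disk" only in large blocks.
PACKET_SIZE = 8192         # hypothetical VLBI packet payload size
WRITE_BLOCK = 1024 * 1024  # flush to disk in 1 MiB blocks


class RingBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def receive(self, packet):
        """Append one network packet; fail if the disk cannot keep up."""
        if len(self.data) + len(packet) > self.capacity:
            raise OverflowError("buffer full: disk cannot keep up")
        self.data += packet

    def drain_block(self):
        """Return one large block for a disk write, or None if not enough data."""
        if len(self.data) < WRITE_BLOCK:
            return None
        block = bytes(self.data[:WRITE_BLOCK])
        del self.data[:WRITE_BLOCK]
        return block


buf = RingBuffer(capacity=16 * WRITE_BLOCK)
written = 0
for _ in range(300):              # simulate a burst of incoming packets
    buf.receive(b"\x00" * PACKET_SIZE)
    block = buf.drain_block()
    if block is not None:
        written += len(block)     # stands in for one large write() to disk

print(f"wrote {written} bytes in {written // WRITE_BLOCK} block(s)")
```

The many small receives produce only a few large writes, which is exactly the behaviour reflected in the small network and large disk standard deviations in the tables above.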


Comparing the local tests at 4 Gbps with the initial phase of the remote test in Metsähovi, we see that the percentage CPU usage is significantly lower in Metsähovi (28%) than in Manchester (42%). We believe this difference is due to the use of hugepages in the remote tests, which carry a lower processing overhead. Not all architectures and operating systems support hugepages, but where possible we recommend their use.

The load created on the flexbuffs by vlbi-streamer is high, always being above 1. To understand why, it is useful to consult the man page of the uptime command: "System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk." Given the load graphs from the local tests, we can say that the load created by vlbi-streamer is a complex mixture of runnable and uninterruptable processes. At 4 Gbps, tests show higher load for small disk counts than for large ones, which suggests I/O-access load; at 8 and 10 Gbps, tests with more disks show higher load, suggesting threads waiting for CPU time.

The machine used for the local tests has dual quad-core Intel processors supporting hyper-threading, where two logical cores are presented to the OS for each physical core, giving a total of 16 cores. The load average is an aggregate across all cores, so to obtain a per-core load average we must divide the system load average by 16. From the plots we can see that the per-core load average rarely exceeds 1, and therefore for the majority of tests there were compute resources to spare. A further interesting test would be to run other processes on a flexbuff at the same time as vlbi-streamer, to see how they interact and affect one another.
A good candidate for this test would be to have the flexbuff process the incoming data at the same time as recording it.

From the second phase of the remote test, we can see that the load and CPU usage for two simultaneous vlbi-streamer instances are low. The percentage CPU usage of the software has been shown to spike occasionally at the beginning of the tests to over 100%, but it rapidly decreases to below 100%. On multicore CPUs this is not a problem, as 100% utilisation represents a single core being occupied full time. In general CPU usage is low, showing that the software is not CPU intensive.

The amount of data written to disk was used as a guide to how faithfully the received data was recorded. A packet-loss monitoring script was written and used in the remote test; it showed minor packet loss in the final seconds of the transfer to JIVE. This is also reflected in the data sizes given in table 6, where the transferred datasets match in size to an accuracy of 99.97% in Metsähovi and differ by only 7.46 MB out of 5.65 TB for the Onsala dataset. These results demonstrate the accuracy with which the software can transfer large datasets reliably and quickly over international links.


One possible reason for the discrepancy between the transfers is the routes the data took. Onsala is connected to JIVE over a light path with only one router on the path, whereas Metsähovi is connected to JIVE through a routed Ethernet network with 9 routers on the link. It is possible that some of the data loss occurs because of the additional routing needed to deliver the data. The data recorded to disk is expected to be fractionally larger than the amount of data transmitted because of the filesystem, vlbi-streamer configuration files and the method of measuring data on disk. A further factor to consider is that the measured data rate is a single instantaneous measurement which, as the plots demonstrate, has a small standard deviation associated with it and could therefore be a source of error.

In summary, the vlbi-streamer software has been tested thoroughly in several situations and found to be a reliable, easy-to-use and accurate tool for recording and transmitting VLBI data on flexbuff machines. The test results show that the software adapts well to the hardware provided and makes efficient use of those resources. This software will become a useful resource for the astronomy community for recording data onto COTS-based equipment.


References

[1] http://www.haystack.mit.edu/tech/vlbi/mark5/index.html
[2] http://www.jive.nl/dokuwiki/lib/exe/fetch.php?media=nexpres:2011-02-28_wp8-d8.2.pdf
[3] http://code.google.com/p/vlbi-streamer/
[4] http://www.hep.man.ac.uk/u/rich/net/index.html
[5] http://www.jive.nl/nexpres/doku.php?id=nexpres:wp5:jive5ab&s[]=jive5ab
[6] Kershaw, Hughes-Jones, "A study of constant bit-rate data transfer over TCP/IP", Future Generation Computer Systems 26 (2010) 128–134.

