Performance Evaluation of Breadth-First Search on Intel Xeon Phi

Alexander Frolov, Elizaveta Golovina, and Alexander Semenov

OAO “NICEVT”, Varshavskoe shosse, 125, 117587 Moscow, Russia
{frolov,golovina,semenov}@nicevt.ru
WWW home page: http://www.dislab.org

Abstract. Breadth-First Search (BFS) is one of the most important kernels in graph computing. It is the main kernel of the Graph500 rating that evaluates the performance of large supercomputers and multiprocessor nodes in terms of traversed edges per second (TEPS). In this paper we present the results of a BFS performance evaluation on the recently released high-performance Intel Xeon Phi coprocessor. We examine the previously proposed Queue-based and Read-based approaches to BFS implementation. We also apply several optimization techniques, such as manual loop unrolling and prefetching, that significantly improve performance on Intel Xeon Phi. On a representative graph set Intel Xeon Phi 7120P demonstrates 178 % maximal and 137 % average speedup as compared to the Intel Xeon E5-2660 processor. We achieved 4366 MTEPS on Intel Xeon Phi 7120P for the graph with scale 25 and have the 89th place on the November 2013 Graph500 list. This is the fourth place among research teams in the class of single node x86-based systems.

Keywords: Intel Xeon Phi, Breadth-First Search, graph algorithms

1 Introduction

Large-scale graph processing is a relatively new and fast-growing application area in HPC. It is usually characterized by large datasets stored in the memory of compute nodes and by low spatial and temporal locality of memory accesses. This renders inefficient many hardware and software optimizations designed for regular access patterns, such as hardware prefetching, multilevel data cache hierarchies, TLBs, and DDR burst mode operations.

Heterogeneous computing by means of GPUs and many-core coprocessors such as Nvidia Kepler, AMD FireStream, and Intel Xeon Phi has become widespread in HPC, especially in large supercomputers holding top positions of the Top500 list. For example, the current No. 1 system (as of November 2013), the Tianhe-2 supercomputer, has 48 K Intel Xeon Phi coprocessors. GPUs were originally designed for applications that are well suited to a stream architecture. However, significant efforts have recently been made to optimize irregular problems, such as graph processing, to achieve high performance on GPUs. The Intel Xeon Phi coprocessor has a many-core multithreaded architecture and is more versatile than GPUs. At first glance, however, high performance can only be achieved by using its 512-bit vector instructions, and vectorization of graph applications seems difficult.

In this paper we present a performance evaluation of several Breadth-First Search algorithms implemented on the Intel Xeon Phi coprocessor and the Intel Xeon Sandy Bridge-EP processor.

2 Breadth-First Search Algorithms

Breadth-First Search (BFS) is an important building block of many graph applications. BFS is the main kernel of the Graph500 [1] rating that evaluates the performance of large supercomputers and multiprocessor nodes. Starting from the source vertex, the frontier expands outwards during each step, visiting all of the vertices at the same depth before visiting any at the next depth. Vertices at the same depth are called a level.

We investigate two approaches to BFS parallelization on two different multicore architectures: Intel Sandy Bridge and Intel Many Integrated Cores (MIC). We consider several algorithms in each approach:

1. Queue-based approach
   (a) naive algorithm
   (b) block algorithm
2. Read-based approach
   (a) top-down algorithm
   (b) hybrid algorithm

The first approach, called Queue-based, is based on queue-type data structures for workload distribution and represents a conventional technique of multithreaded programming [2, 3]. The second, called Read-based, is based on iterative reading of an array containing the level numbers of graph vertices and has been proposed for the GPU architecture [2]. These approaches differ fundamentally in how the graph is processed in parallel, and, as we show further, this is reflected in the obtained performance results.

In all implementations we store the graph as an adjacency matrix in Compressed Row Storage (CRS) format. All algorithms are implemented using C++ and OpenMP.
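For concreteness, the following is a minimal sketch of a CRS graph structure of the kind assumed by the C++ sketches later in the paper; the type and field names are ours and are not taken from the authors' code.

#include <cstdint>
#include <vector>

// CRS (Compressed Row Storage) adjacency structure: the neighbors of vertex v
// occupy neighbors[rowsIndices[v] .. rowsIndices[v+1]-1].
struct GraphCRS {
    int64_t numVertices = 0;
    std::vector<int64_t> rowsIndices;  // size numVertices + 1
    std::vector<int64_t> neighbors;    // size = number of edges
};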

2.1 Queue-based Approach

We examine two algorithms representing the Queue-based approach: a naive Queue-based algorithm and a block Queue-based algorithm.

In the naive Queue-based algorithm vertex numbers of the current level are stored in the Q array, while vertex numbers of the next level are added to the Qnext array. Each vertex in Q is processed to determine whether it has any unvisited neighbors by testing all of its neighbors. All unvisited neighbors are marked as visited in the marked array and added to Qnext to be processed at the next level. Then Q and Qnext are swapped and the next level is processed. Maintaining consistency of Qnext requires the atomic operation __sync_fetch_and_add to avoid interference with other threads when adding a new vertex to Qnext. The algorithm is presented in Fig. 1.

In order to reduce the use of the atomic operation __sync_fetch_and_add, in the block Queue-based algorithm [4] each thread reserves a portion of Qnext of size k for adding vertices of the next level. When the portion is full, the thread reserves another portion using the same atomic operation. This optimization reduces atomic operation usage by a factor of k; a sketch of this blocked reservation is given after Fig. 1.

Qcounter = 1                 // initialization of Q vertices counter
Q[0] = r                     // Q array initialization
marked[r] = 1                // mark source vertex r
while Qcounter > 0           // while the current level is not empty
    Qnext_counter = 0        // zero the next level counter
    #pragma omp parallel     // parallel level processing
    for all vertex in Q do
        // for all vertices w which are neighbors of vertex
        for all w : (vertex, w) in E do
            if marked[w] == 0 then                                  // if w is unvisited
                Qnext[__sync_fetch_and_add(Qnext_counter, 1)] = w   // add w to Qnext
                marked[w] = 1                                       // mark w
            end if
        end for
    end for
    // switch to the next level: swap Q and Qnext, Qcounter = Qnext_counter
    swap(Q, Qnext)
end while

Fig. 1: Naive Queue-based algorithm
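As an illustration of the blocked reservation described above, here is a hedged C++/OpenMP sketch of one level expansion. It reuses the GraphCRS structure sketched in Section 2; the portion size k, all identifiers and the sentinel padding are our assumptions rather than the authors' implementation.

// One level expansion of the block Queue-based algorithm. Each thread reserves
// a portion of k slots in qNext with a single atomic fetch-and-add and then
// fills that portion without further atomics. The unused tail of a thread's
// last portion is padded with -1 and the caller is expected to skip -1 entries;
// qNext must be allocated with enough slack for this padding.
int64_t expandLevelBlocked(const GraphCRS& g,
                           const std::vector<int64_t>& q, int64_t qCount,
                           std::vector<int64_t>& qNext,
                           std::vector<char>& marked,
                           int64_t k = 64) {                  // assumed portion size
    int64_t qNextCount = 0;
    #pragma omp parallel
    {
        int64_t blockStart = 0, blockUsed = k;                // per-thread reservation state
        #pragma omp for nowait
        for (int64_t i = 0; i < qCount; ++i) {
            int64_t vertex = q[i];
            for (int64_t e = g.rowsIndices[vertex]; e < g.rowsIndices[vertex + 1]; ++e) {
                int64_t w = g.neighbors[e];
                if (marked[w] == 0) {                         // benign race, as in Fig. 1
                    marked[w] = 1;
                    if (blockUsed == k) {                     // portion exhausted: reserve a new one
                        blockStart = __sync_fetch_and_add(&qNextCount, k);
                        blockUsed = 0;
                    }
                    qNext[blockStart + blockUsed++] = w;
                }
            }
        }
        for (; blockUsed < k; ++blockUsed)                    // pad the unfinished last portion
            qNext[blockStart + blockUsed] = -1;
    }
    return qNextCount;                                        // counts padding entries as well
}

As in Fig. 1, the non-atomic test-and-set on marked[] may occasionally enqueue a vertex twice; this costs a little redundant work at the next level but does not affect the correctness of the traversal.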

2.2 Read-based Approach

For the Read-based approach we examine two algorithms: a top-down algorithm and a hybrid bottom-up algorithm. Originally the Read-based approach was developed for GPUs. The main idea of this approach is to use a single array called levels for workload distribution. The size of levels equals the number of vertices in the graph. Each element of levels contains the level number of the corresponding vertex, or −1 for unvisited vertices. At each level the levels array is scanned, vertices belonging to the current level are detected and processed, that is, their neighbor lists are read and the next level number is stored in levels for unvisited neighbors. This algorithm is called the top-down algorithm (Fig. 2).

Another clever idea for BFS implementation is proposed in [5]. It was observed that many real-world graphs (such as social networks) have the following feature: when a considerable part of the search is done, at some levels only a small number of unvisited vertices is left, i.e. a large amount of vertex processing is useless. For these levels it is efficient to use a bottom-up search: the neighbors of all unvisited vertices are analyzed, and if any such neighbor is at the current level, then that neighbor is an ancestor of the vertex in the search tree, and there is no need to analyze the other neighbors of the vertex. Single level processing using the bottom-up algorithm is presented in Fig. 3.

We propose a hybrid bottom-up algorithm. In this algorithm we use the top-down Read-based algorithm for some levels (unlike [5], where the Queue-based approach is used for the top-down part), and for the other levels we use the bottom-up algorithm; a sketch of the resulting level-by-level driver is given after Fig. 3.

 1  curLevel = 0                     // level number initialization
 2  levels[r] = curLevel             // source vertex r will be processed at level 0
 3  levelVertsCount = 1              // number of vertices at the current level
 4  while levelVertsCount > 0        // while there are any vertices at the current level
 5      levelVertsCount = 0
 6      // parallel level processing
 7      #pragma omp parallel for reduction(+:levelVertsCount)
 8      for all vertex in V do       // for all graph vertices
 9          // ignore vertices that are not at the current level
10          if levels[vertex] == curLevel then
11              // for all vertices w that are neighbors of vertex
12              for all w : (vertex, w) in E do
13                  if levels[w] == -1 then        // if w is unvisited
14                      levels[w] = curLevel + 1   // mark w for the next level
15                      levelVertsCount = levelVertsCount + 1
16                  end if
17              end for
18          end if
19      end for
20      curLevel = curLevel + 1
21  end while

Fig. 2: Top-down algorithm, Read-based approach
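To make the memory access pattern concrete, here is a hedged C++/OpenMP sketch of one top-down level over the GraphCRS structure from Section 2; it mirrors Fig. 2, and all identifiers are ours.

// One top-down level: scan levels[] sequentially, expand frontier vertices,
// and mark their unvisited neighbors with the next level number.
// Returns the number of vertices assigned to the next level.
int64_t expandLevelTopDown(const GraphCRS& g, std::vector<int64_t>& levels,
                           int64_t curLevel) {
    int64_t levelVertsCount = 0;
    #pragma omp parallel for reduction(+:levelVertsCount) schedule(static)
    for (int64_t vertex = 0; vertex < g.numVertices; ++vertex) {
        if (levels[vertex] != curLevel) continue;        // sequential scan of levels[]
        for (int64_t e = g.rowsIndices[vertex]; e < g.rowsIndices[vertex + 1]; ++e) {
            int64_t w = g.neighbors[e];
            if (levels[w] == -1) {                       // random access to levels[w]
                levels[w] = curLevel + 1;                // concurrent writers store the same value
                ++levelVertsCount;
            }
        }
    }
    return levelVertsCount;
}

As in Fig. 2, two threads may race on levels[w], but they write the same value, so the traversal stays correct; the returned count only drives the termination test.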


// parallel level processing
#pragma omp parallel for reduction(+:levelVertsCount)
for all vertex in V do                       // for all graph vertices
    if levels[vertex] == -1 then             // if vertex is unvisited
        // for all vertices w that are neighbors of vertex
        for all w : (vertex, w) in E do
            if levels[w] == curLevel then        // if w is at the current level
                levels[vertex] = curLevel + 1    // mark vertex for the next level
                levelVertsCount = levelVertsCount + 1
                break                            // quit the inner for loop
            end if
        end for
    end if
end for

Fig. 3: Level processing using the bottom-up algorithm
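The paper does not spell out how levels are assigned to the two kernels, so the following driver is only a plausible sketch: it switches to the bottom-up kernel when the frontier becomes a large fraction of the graph, with an arbitrarily chosen threshold. expandLevelBottomUp is assumed to implement Fig. 3 analogously to expandLevelTopDown above.

#include <algorithm>

// Hybrid BFS driver (sketch): run levels top-down while the frontier is small
// and bottom-up while it is large. The 1/16 threshold is an assumption.
int64_t bfsHybrid(const GraphCRS& g, int64_t root, std::vector<int64_t>& levels) {
    std::fill(levels.begin(), levels.end(), -1);
    levels[root] = 0;
    int64_t curLevel = 0;
    int64_t levelVertsCount = 1;
    while (levelVertsCount > 0) {
        if (levelVertsCount > g.numVertices / 16)
            levelVertsCount = expandLevelBottomUp(g, levels, curLevel);
        else
            levelVertsCount = expandLevelTopDown(g, levels, curLevel);
        ++curLevel;
    }
    return curLevel;    // number of levels processed
}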

3 Performance Analysis

Table 1 provides configuration details of the hardware platforms used for the BFS performance evaluation. The first is a single-socket Intel Xeon Sandy Bridge-EP server platform and the other two are the Intel Xeon Phi 5110P (Phi-5110P) and Intel Xeon Phi 7120P (Phi-7120P) coprocessor add-on cards. Both Phi-5110P and Phi-7120P are implementations of the same coprocessor core; Phi-7120P was released in the second quarter of 2013, six months later than Phi-5110P. Phi-7120P has an increased frequency enabled by Intel Turbo Boost Technology and twice as much memory as Phi-5110P.

Performance of the BFS algorithms is measured in Millions of Traversed Edges per Second (MTEPS), i.e. the number of edges traversed by the search divided by the BFS running time. We use both Intel Xeon Phi platforms in native mode, i.e. computation is performed exclusively on the Intel Xeon Phi without using the host CPU.

3.1 Performance Evaluation

The performance of the investigated algorithms on a Uniform Random graph with 134 M vertices and average vertex degree 8 on Sandy Bridge-EP and on Phi-5110P is presented in Fig. 4 and Fig. 5. The performance of the naive Queue-based algorithm on Phi-5110P is 20 times lower than on Sandy Bridge-EP when a single thread is used. Thus obtaining high performance on Intel Xeon Phi is only possible with a large number of threads. However, the scalability of the naive and block Queue-based algorithms is very poor despite the reduction of atomic operation usage in the latter.

Read-based algorithms show significantly better scalability and performance than the Queue-based algorithms, as shown in Fig. 4 and Fig. 5.

Table 1: System specifications

                            Sandy Bridge-EP          Phi-5110P           Phi-7120P
Intel Model                 Xeon E5-2660             Xeon Phi 5110P      Xeon Phi 7120P
CPU speed, GHz              2.2                      1.05                1.238
Number of sockets           1                        1                   1
Number of cores             8                        60                  61
Number of threads in core   2                        4                   4
Data caches size            64 KB* / 2 MB* / 20 MB   32 KB* / 512 KB*    32 KB* / 512 KB*
Memory size, GB             32                       8                   16
Memory type                 DDR3                     GDDR5               GDDR5
Memory bandwidth, GB/s      51                       352                 352
Memory latency, ticks       200                      300                 350

* – per one core

[Figure: performance in MTEPS vs. number of threads for the simple, block, read, read+prefetch, hybrid, hybrid+prefetch and hybrid+prefetch+relabel algorithms]
Fig. 4: BFS performance of Sandy Bridge-EP on Uniform Random graph with 134 M vertices and average vertex degree 8

This improvement can be explained by several advantages of the Read-based approach over the Queue-based one. First, the Read-based algorithms do not use atomic operations, which severely limit performance scalability on a large number of cores and threads. Second, the Q and Qnext arrays are not used, so memory usage is reduced and, as a result, the cache is used more efficiently. Finally, the spatial locality of the memory access pattern in the Read-based approach is much higher than in the Queue-based approach [2]. Indeed, at each level of the Read-based algorithms the levels array is read sequentially, and the packed CRS neighbors array is accessed with monotonically increasing indices with possible jumps over contiguous memory portions.

[Figure: performance in MTEPS vs. number of threads for the naive, block, top-down, top-down+prefetch, hybrid, hybrid+prefetch and hybrid+prefetch+relabel algorithms]
Fig. 5: BFS performance of Phi-5110P on Uniform Random graph with 134 M vertices and average vertex degree 8

A sequential access pattern permits better utilization of memory bandwidth, raises cache efficiency, lets the hardware prefetcher work effectively, and reduces the number of TLB misses. The threads also share the CRS neighbors array better, since it is not loaded into the cores' caches repeatedly. It may seem that the Read-based algorithms perform redundant work by processing the whole levels array at each level, but since sequential memory bandwidth is very high and the processing time of vertices that are not at the current level is negligible, the overhead is very small.

However, random access is still present in the Read-based approach. In the inner loop of the top-down algorithm (Fig. 2) the levels array is indexed by the vertex w taken from the CRS neighbors array, and for many graphs w is effectively random. A memory bandwidth comparison for sequential (vectorized) access and random access on Sandy Bridge-EP and Phi-5110P is presented in Tab. 2. The results for vectorized memory access are taken from [6] for the SE10P coprocessor (a prerelease MIC card); all other results were obtained using DISBench [7].

Table 2 shows that a random access memory pattern is very expensive, for the following reasons. When a stream of random addresses is issued and only one data word is requested from each cache line, the efficiency of the integrated DRAM controller is greatly reduced. First, the amount of useful data transmitted over the memory bus shrinks. Second, additional commands have to be issued to the memory chips to handle a stream of non-consecutive addresses.

The hybrid algorithm considerably speeds up the processing of some levels, and as a result we achieve even better performance, see Fig. 4 and Fig. 5. However, random access is present in the hybrid algorithm as well.

Table 2: Memory bandwidth (in GB/s) for read and write operations on Sandy Bridge-EP and Phi-5110P for sequential and random access patterns

                    Sequential access       Random access
                    Read        Write       Read        Write
Sandy Bridge-EP     42          19          3.3         2.2
Phi-5110P           183         160         3.8         3.4

3.2 Optimizations

To optimize the performance of our algorithms on Intel Xeon Phi we had to detect the bottlenecks; in the first place it could be either the bandwidth or the latency limitation of random access. We introduced manual loop unrolling in the top-down algorithm (Fig. 2) and manual prefetching of levels[w] into cache using the _mm_prefetch intrinsic. Together these techniques (the top-down+prefetch algorithm) provide a 2.1 times increase of single thread performance on Phi-5110P, and 1.52 times for 240 threads, see Fig. 5. This improvement shows that the bottleneck of the top-down algorithm on Intel Xeon Phi is the memory latency of random access. For the top-down+prefetch algorithm the bottleneck is most likely the maximal rate of random accesses, which is determined by the memory bandwidth. At the same time, on Sandy Bridge-EP the performance of the top-down+prefetch algorithm equals the performance of the top-down algorithm; in other words, manual loop unrolling and data prefetching showed no effect. We assume that this can be explained by the high quality of the code generated by the Intel C compiler. We applied the same optimizations to the hybrid algorithm and call the result the hybrid+prefetch algorithm. A sketch of the prefetched inner loop is given at the end of this subsection.

Another possible way to increase performance is to improve data locality of accesses to the levels array (for example, the access levels[w] in the top-down algorithm, Fig. 2). It can be done by preprocessing the adjacency matrix into band form with the reverse Cuthill-McKee algorithm [8]. As a result, since the rows of the matrix in the Read-based approach are processed sequentially, in accordance with the sequential processing of the levels array, the cache hit rate for accesses to the levels array increases. The neighbor adjacency lists are also sorted to reduce TLB misses. We call the hybrid+prefetch algorithm with the preprocessing described above hybrid+prefetch+relabel; its performance on Phi-5110P is 12 % higher than that of the hybrid+prefetch algorithm.

We used the libhugetlbfs library for large page support on Phi-5110P, but it gave no performance gain.
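The following is a hedged sketch of how the inner neighbor loop of the top-down kernel can be grouped and prefetched; the unroll factor, the prefetch scheme and the hint are our assumptions (the evaluation in Sect. 3.3 tries unroll factors of 2, 4, 8 and 16), and the exact header and hint may differ for the MIC native compiler.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0 (header assumed)

// Prefetched inner loop over the neighbors of one frontier vertex: the levels[]
// entries of the next group of neighbors are prefetched while the current group
// is processed. The constant-bound inner loops can be fully unrolled by the
// compiler or expanded by hand.
const int64_t U = 4;                               // assumed unroll factor
int64_t begin = g.rowsIndices[vertex];
int64_t end   = g.rowsIndices[vertex + 1];
int64_t e     = begin;
for (; e + U <= end; e += U) {
    for (int64_t j = 0; j < U && e + U + j < end; ++j)           // prefetch next group
        _mm_prefetch((const char*)&levels[g.neighbors[e + U + j]], _MM_HINT_T0);
    for (int64_t j = 0; j < U; ++j) {                            // process current group
        int64_t w = g.neighbors[e + j];
        if (levels[w] == -1) { levels[w] = curLevel + 1; ++levelVertsCount; }
    }
}
for (; e < end; ++e) {                                           // remainder loop
    int64_t w = g.neighbors[e];
    if (levels[w] == -1) { levels[w] = curLevel + 1; ++levelVertsCount; }
}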

3.3 Performance Comparison

We evaluated the performance of Intel Xeon Phi and Intel Xeon Sandy Bridge-EP using Uniform Random graphs, RMAT [9] graphs with (A, B, C) = (0.45, 0.25, 0.15), default Graph500 Kronecker graphs [1] and default SSCA2 graphs [10], see Tab. 3. Random-k and RMAT-k below denote Uniform Random and RMAT graphs with average degree k.

The performance of the best algorithms (hybrid+prefetch+relabel or hybrid+prefetch) on Phi-7120P (244 threads) and Sandy Bridge-EP (16 threads) on four graph types is presented in Fig. 6. The X axis shows the number of vertices in the graph. For each graph type on Phi-7120P the best performance among variants with manual loop unrolling of 2, 4, 8 and 16 is given. On Sandy Bridge-EP the performance rapidly increases for smaller data sizes until the data no longer fit in the cache, after which the performance gradually drops. Therefore Sandy Bridge-EP is not very efficient for data-intensive problems with large datasets. On the other hand, the performance on Phi-7120P slowly increases and at some point outperforms Sandy Bridge-EP. Intel Xeon Phi is better suited to massive parallelism and the use of high memory bandwidth.

Intel Xeon Phi 7120P has 16 GB of memory compared to 8 GB on Intel Xeon Phi 5110P. An overall performance comparison of Sandy Bridge-EP, Phi-5110P and Phi-7120P on the maximal graphs fitting in the 8 GB memory of Phi-5110P is presented in Fig. 7. The maximal, average and minimal performance on Phi-5110P and Phi-7120P is 134 %, 98 %, 71 % and 165 %, 121 %, 89 % of the Sandy Bridge-EP performance, respectively. An overall performance comparison of Sandy Bridge-EP and Phi-7120P on the maximal graphs fitting in 16 GB of memory is presented in Fig. 8. The maximal, average and minimal performance on Phi-7120P is 178 %, 137 % and 100 % of the Sandy Bridge-EP performance. The increased frequency and memory capacity of Phi-7120P compared to Phi-5110P make Intel Xeon Phi 7120P rather attractive.

We achieved a 4366 MTEPS Graph500 performance result on Intel Xeon Phi 7120P on the graph with scale 25, and have the 89th place on the November 2013 list. There is no other Intel Xeon Phi system on this Graph500 list, and this is the fourth place among the research teams in the class of single node x86-based systems. A performance comparison of single node x86-based systems of different research teams is presented in Table 4.


[Figure: performance in MTEPS vs. SCALE (the number of vertices in the graph is 2^SCALE) for Random-8, Random-32, RMAT-8 and RMAT-32 graphs on Sandy Bridge-EP (SB) and Phi-7120P (MIC-7120P)]
Fig. 6: Performance comparison of Intel Xeon Sandy Bridge-EP and Intel Xeon Phi 7120P on Uniform Random and RMAT graphs of various sizes

[Figure: performance of Phi-5110P and Phi-7120P relative to Sandy Bridge-EP (in %) on the graphs random-8-27, random-32-25, random-64-24, RMAT-8-26, RMAT-32-24, RMAT-64-23, SSCA2-25 and graph500-25]
Fig. 7: Performance comparison of Intel Xeon Sandy Bridge-EP, Intel Xeon Phi 5110P and Intel Xeon Phi 7120P on maximal graphs fitting in 8 GB memory of Phi-5110P

[Figure: performance of Phi-7120P relative to Sandy Bridge-EP (in %) on the graphs random-8-28, random-32-26, random-64-25, RMAT-8-27, RMAT-32-25, RMAT-64-24, SSCA2-26 and graph500-25]
Fig. 8: Performance comparison of Intel Xeon Sandy Bridge-EP and Intel Xeon Phi 7120P on maximal graphs fitting in 16 GB memory of Phi-7120P

Table 3: Graphs used for evaluation

Abbreviation   Graph            # Vertices   # Edges     Degree   Directed   Generator
random-k-n     Uniform Random   2^n          k * 2^n     k        N          Own
RMAT-k-n       RMAT             2^n          k * 2^n     k        Y          [11]
SSCA2-25       SSCA2            2^25         267.8 M     8.0      Y          [12]
SSCA2-26       SSCA2            2^26         720.1 M     10.7     Y          [12]
graph500-25    Kronecker        2^25         1047.2 M    31.2     N          [1]

Table 4: Graph500 performance (November 2013) of single node x86-based systems for different research teams

Organization                                System                                Graph500 Scale   GTEPS
Chuo University                             4x Intel(R) Xeon(R) CPU E5-4650       27               31.6
University of Tsukuba                       Xeon E5-2650 v2, GeForce GTX TITAN    25               17.2
National University of Defense Technology   SMP, x86-based (?)                    24               9.7
DISLab, NICEVT / svetcorp.net               Intel Xeon Phi 7120P                  25               4.4

4 Conclusion

In this paper we studied two different approaches to Breadth-First Search (BFS) implementation: the conventional Queue-based approach and the stream-type Read-based approach. We showed experimentally that on the Intel Xeon Sandy Bridge-EP processor and the Intel Xeon Phi coprocessor better performance is obtained with the Read-based algorithms. The Read-based approach is characterized by the absence of atomic operations, intensive use of high memory bandwidth for sequential access, and the presence of random access. Several optimization techniques, such as manual loop unrolling and prefetching, were applied to the Read-based algorithms; they significantly improved performance on Intel Xeon Phi while showing no significant effect on Intel Xeon Sandy Bridge-EP.

Generally, on Intel Xeon Phi BFS performance increases as the graph size increases, in contrast to Intel Xeon Sandy Bridge-EP. Intel Xeon Phi 7120P has 16 GB of memory compared to 8 GB on Intel Xeon Phi 5110P. On the 8 largest graphs fitting in 8 GB the average performance of Intel Xeon Phi 5110P and Intel Xeon Phi 7120P is 98 % and 121 % of the Intel Sandy Bridge-EP performance. On the 8 largest graphs fitting in the 16 GB of Intel Xeon Phi 7120P the average performance is 137 %. The maximal performance of Intel Xeon Phi 7120P relative to Intel Xeon Sandy Bridge-EP is 178 %, the minimal is 100.31 %.

To the best of our knowledge, as of February 2014 we are the first to publish BFS performance results on Intel Xeon Phi. In [4] BFS scalability results on the 32-core Intel Xeon Phi prototype called Knights Ferry are presented; however, that paper does not contain absolute performance results or a comparison with traditional Intel Xeon processors.

On the Graph500 benchmark we achieved 4366 MTEPS on Intel Xeon Phi 7120P on the graph with scale 25, and have the 89th place on the November 2013 list. There is no other Intel Xeon Phi system on this list, and this is the fourth place among the research teams in the class of single node x86-based systems.

The authors would like to thank the Svet Computers [13] company for providing the IntellectDigital SciPhi 470 desktop supercomputer with the Intel Xeon Phi 7120P coprocessor.

References

1. Graph500 benchmark. URL: http://www.graph500.org
2. Hong, S., Oguntebi, T., Olukotun, K.: Efficient Parallel Graph Exploration on Multi-Core CPU and GPU. In: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, pp. 78–88 (2011)
3. Agarwal, V., Petrini, F., Pasetto, D., Bader, D.: Scalable Graph Exploration on Multicore Processors. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2010)
4. Saule, E., Catalyurek, U.: An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture. In: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 1629–1639 (2012)
5. Beamer, S., Asanović, K., Patterson, D.: Direction-Optimizing Breadth-First Search. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 10 pages (2012)
6. Saule, E., Kaya, K., Catalyurek, U.: Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi. arXiv:1302.1078, 5 Feb 2013. URL: http://gdcms.sysu.edu.cn/docs/20130329144208946523.pdf
7. Frolov, A., Gilmendinov, M.: DISBench: Benchmark for Memory Performance Evaluation of Multicore Multiprocessors. Accepted to the 12th International Conference PaCT 2013, St. Petersburg, Russia, September 30 – October 4, 2013
8. Cuthill, E., McKee, J.: Reducing the Bandwidth of Sparse Symmetric Matrices. In: Proceedings of the 1969 24th National Conference, pp. 157–172 (1969)
9. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A Recursive Model for Graph Mining. SIAM Data Mining (2004)
10. Bader, D., Madduri, K.: Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors. In: The 12th International Conference on High Performance Computing (HiPC 2005), pp. 465–476 (2005)
11. Bader, D., Madduri, K.: SNAP: Small-World Network Analysis and Partitioning: An Open-Source Parallel Graph Framework for the Exploration of Large-Scale Networks. In: International Parallel and Distributed Processing Symposium (IPDPS) (2008)
12. Bader, D., Madduri, K.: GTgraph: A Synthetic Graph Generator Suite. URL: http://www.cse.psu.edu/~madduri/software/GTgraph/ (2006)
13. Svet Computers company. URL: http://svetcorp.net
