June 11, 2013


1 Basics

2 Naive dense matrix multiplication

3 Naive dense Gaussian Elimination

4 Cache-oblivious dense Gaussian Elimination

5 Features of the library


Preconditions I

- Using dense matrices with unsigned 64-bit integer (uint64) entries.
- Computing in F_p, p some prime < 2^16.
- We compared the following set of parallel schedulers:
  1. pthread (in other words, scheduling by hand),
  2. OpenMP (sometimes together with pthread),
  3. Intel TBB (using lambda expressions),
  4. XKAAPI (in particular, the C interface KAAPIC).



Note: The implemented algorithms are deliberately not optimized, in order to keep their influence on the scheduler comparison as low as possible.
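Since p < 2^16 and entries are stored reduced, a product of two entries fits comfortably in 64 bits. A minimal sketch of the scalar arithmetic this implies (function names are illustrative, not the library's API):

```cpp
#include <cassert>
#include <cstdint>

// Scalar arithmetic in F_p for p < 2^16: reduced operands are < 2^16,
// so a product fits in 64 bits and a single % suffices -- no overflow
// handling is needed with uint64 entries.
inline uint64_t addmod(uint64_t a, uint64_t b, uint64_t p) {
    return (a + b) % p;
}

inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (a % p) * (b % p) % p;
}
```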



Preconditions II

Results presented were computed on the HPAC compute server (NUMA, 64 cores with hyperthreading):
- 8 Intel Xeon E5-4620 cores @ 2.20 GHz
- L1 cache: 32 KB
- L2 cache: 256 KB
- shared L3 cache: 16 MB
- 96 GB RAM

Also tested on:
- 48-core (real cores) AMD Magny Cours NUMA machine,
- 4-core (8 with hyperthreading) Intel Sandy Bridge.


Tested algorithms

1. Naive dense matrix multiplication
2. Dense Gaussian elimination:
   (a) naive implementation (with and without pivoting)
   (b) cache-oblivious implementation (GEP by Chowdhury and Ramachandran, without pivoting)




Naive dense matrix multiplication

We compared several variants of parallelized for loops:
- 1-dimensional vs. 2-dimensional parallel loops
- For Intel TBB we also compared its integrated partitioners:
  - auto partitioner: splits work to balance load
  - affinity partitioner: improves the choice of CPU affinity
  - simple partitioner: recursively splits a range until it is no longer divisible (grain size is critical)
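The 1D/2D distinction can be made concrete with a hedged sketch of the schoolbook kernel in the OpenMP collapse(2) style (names and data layout are assumptions, not the library's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Naive C = A*B over F_p for row-major dense matrices (A is m x k,
// B is k x n). The 2-dimensional variant parallelizes the i*j iteration
// space via collapse(2); dropping "collapse(2)" gives the 1-dimensional
// outer-loop variant. Without OpenMP the pragma is ignored and the code
// simply runs sequentially.
void matmul_2d(const std::vector<uint64_t>& A, const std::vector<uint64_t>& B,
               std::vector<uint64_t>& C, std::size_t m, std::size_t n,
               std::size_t k, uint64_t p) {
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            uint64_t sum = 0;
            for (std::size_t l = 0; l < k; ++l)
                sum = (sum + A[i * k + l] * B[l * n + j]) % p;
            C[i * n + j] = sum;
        }
}
```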


Timings

[Figure: real time in seconds vs. number of threads (1, 2, 4, 8, 16, 32, 64); benchmark bench-4a7a7e230bef0495ee882549092f0e33, naive matrix multiplication, uint64, matrix dimensions 6000 x 5000 and 5000 x 7000. Series: raw sequential; pthread 1D; OpenMP collapse(1) outer loop; OpenMP collapse(1) inner loop; OpenMP collapse(2); KAAPIC 1D; KAAPIC 2D; Intel TBB 1D and 2D, each with auto, affinity and simple partitioners.]

GFLOPS/sec

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark bench-4a7a7e230bef0495ee882549092f0e33, naive matrix multiplication, uint64, 6000 x 5000 and 5000 x 7000; same scheduler series as the timings figure.]



Naive dense Gaussian Elimination

Compared to naive multiplication we saw a different behaviour:
- KAAPIC, OpenMP and Intel TBB are in the same range.
- OpenMP behaves a bit worse when it comes to hyperthreading.
- The pthread implementation slows down due to the lack of a real scheduler.
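The independence the parallel loops exploit can be seen in a minimal sketch of naive elimination over F_p (an illustrative reconstruction: no pivoting, prime p, inverse via Fermat's little theorem; not the benchmarked source):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// a^e mod p by square-and-multiply; for prime p, a^(p-2) is the inverse.
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    a %= p;
    while (e) {
        if (e & 1) r = r * a % p;
        a = a * a % p;
        e >>= 1;
    }
    return r;
}

// In-place naive Gaussian elimination of an n x n row-major matrix over
// F_p, no pivoting (all pivots assumed nonzero). The updates of rows
// k+1..n-1 within one elimination step are independent, which is exactly
// what the compared parallel-for constructs distribute across threads.
void naive_gep(std::vector<uint64_t>& A, std::size_t n, uint64_t p) {
    for (std::size_t k = 0; k < n; ++k) {
        uint64_t inv = powmod(A[k * n + k], p - 2, p);
        #pragma omp parallel for
        for (std::size_t i = k + 1; i < n; ++i) {
            uint64_t m = A[i * n + k] * inv % p;
            for (std::size_t j = k; j < n; ++j)
                A[i * n + j] = (A[i * n + j] + p - m * A[k * n + j] % p) % p;
        }
    }
}
```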


Timings

[Figure: real time in seconds vs. number of threads (1-64); benchmark test-naive-gep-hpac-talk, naive GEP, uint64, matrix dimensions 8192 x 8192. Series: raw sequential; pthread 1D; OpenMP collapse(1) outer loop; KAAPIC 1D; Intel TBB 1D with auto, affinity and simple partitioners.]

GFLOPS/sec

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark test-naive-gep-hpac-talk, naive GEP, uint64, 8192 x 8192; same series as the timings figure.]

Speedup

[Figure: speedup vs. number of threads (1-64); benchmark test-naive-gep-hpac-talk, naive GEP, uint64, 8192 x 8192; same series as the timings figure.]



Cache-oblivious dense Gaussian Elimination

Implemented I-GEP from [CR10]. Basic ideas are:
- Assume a matrix of dimensions 2^k x 2^k.
- Do not consider pivoting.
- Recursively split the matrix into 4 same-sized parts of dimensions 2^(k-1) x 2^(k-1).
- Stop the recursion once the parts fit in cache.

Cache-oblivious dense Gaussian Elimination

Needs a bit of global bookkeeping (inverse pivots, etc.).




Cache-oblivious dense Gaussian Elimination

Differences to the naive approach:
- The base cases are not parallelized.
- There are no parallel for loops.
- Instead we need to use recursive task scheduling:
  - pthread: no scheduling, threads left unbound
  - OpenMP: parallel sections (real tasks should be available in OpenMP 4.0)
  - KAAPIC: kaapic_spawn
  - Intel TBB: invoke
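The spawning pattern shared by these backends can be sketched with the OpenMP parallel sections variant (an illustrative divide-and-conquer reduction, not the I-GEP recursion itself; kaapic_spawn and tbb::parallel_invoke are the analogous calls in the other backends):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Divide-and-conquer skeleton in the task-spawning style: two
// independent subproblems are spawned as OpenMP sections, and below a
// cutoff a sequential base case runs. Without OpenMP the pragmas are
// ignored and the recursion simply runs sequentially.
uint64_t rec_reduce(const uint64_t* a, std::size_t n) {
    if (n <= 1024) {                 // base case: sequential kernel
        uint64_t s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    uint64_t left = 0, right = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        left = rec_reduce(a, n / 2);               // independent subtask 1
        #pragma omp section
        right = rec_reduce(a + n / 2, n - n / 2);  // independent subtask 2
    }
    return left + right;
}
```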


Timings

[Figure: real time in seconds vs. number of threads (1-64); benchmark test-co-gep-hpac-talk, cache-oblivious GEP, uint64, matrix dimensions 8192 x 8192. Series: raw sequential; pthread 1D; OpenMP parallel sections; KAAPIC spawn; Intel TBB invoke.]

GFLOPS/sec

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark test-co-gep-hpac-talk, cache-oblivious GEP, uint64, 8192 x 8192; same series as the timings figure.]

Speedup

[Figure: speedup vs. number of threads (1-64); benchmark test-co-gep-hpac-talk, cache-oblivious GEP, uint64, 8192 x 8192; same series as the timings figure.]



Features of the library

- Detection of available parallel schedulers
- User-friendly interface to add new algorithms easily: for example, one can easily drop in ATLAS, OpenBLAS, PLASMA, etc.
- Easy-to-use and highly customizable Python-based benchmarking tools, including plotting functionality
- Publicly available: https://github.com/ederc/F4RT

Features of the library

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark bench-35adccead66ea99653a407c5a66039e3, tiled GEP, double, matrix dimensions 32768 x 32768, OpenBLAS/GotoBLAS.]

Features of the library

[Figure: GFLOPS per second vs. number of doubling steps (0-5); benchmark bench-5f898c444ab6510f97b907dfe30ec69b, tiled GEP, double, starting at 1024 x 1024 with dimensions doubled in each step, 32 threads, OpenBLAS/GotoBLAS.]

Features of the library

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark bench-5ce3357af4f8f3b6cf377a6eabd0f2db, tiled GEP, double, matrix dimensions 32768 x 32768, PLASMA.]

Features of the library

[Figure: GFLOPS per second vs. number of doubling steps (0-5); benchmark bench-f0ee92bdc4b86593fa79cffc9c29099c, tiled GEP, double, starting at 1024 x 1024 with dimensions doubled in each step, 32 threads, PLASMA.]


Features of the library

- Detection of available parallel schedulers
- User-friendly interface to add new algorithms easily: for example, one can easily drop in ATLAS, OpenBLAS, PLASMA, etc.
- Easy-to-use and highly customizable Python-based benchmarking tools, including plotting functionality
- Publicly available: https://github.com/ederc/LA-BENCHER

Bibliography

[PL13] E. Agullo et al. PLASMA Users’ Guide: Parallel Linear Algebra Software for Multicore Architectures, Version 2.0.

[OB13] Z. Xianyi, W. Qian and Z. Chothia. OpenBLAS, http://xianyi.github.com/OpenBLAS.

[CR10] R. A. Chowdhury and V. Ramachandran. The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation.

[WP04] R. C. Whaley and A. Petitet. Minimizing Development and Maintenance Costs in Supporting Persistently Optimized BLAS.

[WPD01] R. C. Whaley, A. Petitet and J. J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project.

[WD99] R. C. Whaley and J. J. Dongarra. Automatically Tuned Linear Algebra Software.