June 11, 2013


1 Basics

2 Naive dense matrix multiplication

3 Naive dense Gaussian Elimination

4 Cache-oblivious dense Gaussian Elimination

5 Features of the library


Preconditions I

- Using dense matrices with unsigned 64-bit integer (uint64) entries.
- Computing in F_p, p some prime < 2^16.
- We compared the following set of parallel schedulers:
  1. pthread (in other words, scheduling by hand),
  2. OpenMP (sometimes together with pthread),
  3. Intel TBB (using lambda expressions),
  4. XKAAPI (in particular, the C interface KAAPIC).



Note: The implemented algorithms are deliberately not optimized, in order to keep their influence on the scheduler comparison as low as possible.
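Since p < 2^16 and entries are stored reduced, a product of two entries fits comfortably in 64 bits. A minimal sketch of the scalar arithmetic this implies (function names are illustrative, not the library's API):

```cpp
#include <cassert>
#include <cstdint>

// Scalar arithmetic in F_p for p < 2^16: reduced operands are < 2^16,
// so a product fits in 64 bits and a single % suffices -- no overflow
// handling is needed with uint64 entries.
inline uint64_t addmod(uint64_t a, uint64_t b, uint64_t p) {
    return (a + b) % p;
}

inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (a % p) * (b % p) % p;
}
```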



Preconditions II

Results presented were computed on the HPAC compute server (NUMA, 64 cores with hyperthreading):
- 8 Intel Xeon E5-4620 cores @ 2.20 GHz
- L1 cache: 32 KB
- L2 cache: 256 KB
- shared L3 cache: 16 MB
- 96 GB RAM

Also tested on:
- 48-core (real cores) AMD Magny Cours NUMA machine,
- 4-core (8 with hyperthreading) Intel Sandy Bridge.


Tested algorithms

1. Naive dense matrix multiplication
2. Dense Gaussian elimination:
   (a) naive implementation (with and without pivoting)
   (b) cache-oblivious implementation (GEP by Chowdhury and Ramachandran, without pivoting)




Naive dense matrix multiplication

We compared several variants of parallelized for loops:
- 1-dimensional vs. 2-dimensional parallel loops
- For Intel TBB we also compared its integrated partitioners:
  - auto partitioner: splits work to balance load
  - affinity partitioner: improves the choice of CPU affinity
  - simple partitioner: recursively splits a range until it is no longer divisible (grain size is critical)
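The 1D/2D distinction can be made concrete with a hedged sketch of the schoolbook kernel in the OpenMP collapse(2) style (names and data layout are assumptions, not the library's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Naive C = A*B over F_p for row-major dense matrices (A is m x k,
// B is k x n). The 2-dimensional variant parallelizes the i*j iteration
// space via collapse(2); dropping "collapse(2)" gives the 1-dimensional
// outer-loop variant. Without OpenMP the pragma is ignored and the code
// simply runs sequentially.
void matmul_2d(const std::vector<uint64_t>& A, const std::vector<uint64_t>& B,
               std::vector<uint64_t>& C, std::size_t m, std::size_t n,
               std::size_t k, uint64_t p) {
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            uint64_t sum = 0;
            for (std::size_t l = 0; l < k; ++l)
                sum = (sum + A[i * k + l] * B[l * n + j]) % p;
            C[i * n + j] = sum;
        }
}
```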


Timings

[Figure: real time in seconds vs. number of threads (1, 2, 4, 8, 16, 32, 64); benchmark bench-4a7a7e230bef0495ee882549092f0e33, naive matrix multiplication, uint64, matrix dimensions 6000 x 5000 and 5000 x 7000. Series: raw sequential; pthread 1D; OpenMP collapse(1) outer loop; OpenMP collapse(1) inner loop; OpenMP collapse(2); KAAPIC 1D; KAAPIC 2D; Intel TBB 1D and 2D, each with auto, affinity and simple partitioners.]

GFLOPS/sec

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark bench-4a7a7e230bef0495ee882549092f0e33, naive matrix multiplication, uint64, 6000 x 5000 and 5000 x 7000; same scheduler series as the timings figure.]



Naive dense Gaussian Elimination

Compared to naive multiplication we saw a different behaviour:
- KAAPIC, OpenMP and Intel TBB are in the same range.
- OpenMP behaves a bit worse when it comes to hyperthreading.
- The pthread implementation slows down due to the lack of a real scheduler.
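The independence the parallel loops exploit can be seen in a minimal sketch of naive elimination over F_p (an illustrative reconstruction: no pivoting, prime p, inverse via Fermat's little theorem; not the benchmarked source):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// a^e mod p by square-and-multiply; for prime p, a^(p-2) is the inverse.
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    a %= p;
    while (e) {
        if (e & 1) r = r * a % p;
        a = a * a % p;
        e >>= 1;
    }
    return r;
}

// In-place naive Gaussian elimination of an n x n row-major matrix over
// F_p, no pivoting (all pivots assumed nonzero). The updates of rows
// k+1..n-1 within one elimination step are independent, which is exactly
// what the compared parallel-for constructs distribute across threads.
void naive_gep(std::vector<uint64_t>& A, std::size_t n, uint64_t p) {
    for (std::size_t k = 0; k < n; ++k) {
        uint64_t inv = powmod(A[k * n + k], p - 2, p);
        #pragma omp parallel for
        for (std::size_t i = k + 1; i < n; ++i) {
            uint64_t m = A[i * n + k] * inv % p;
            for (std::size_t j = k; j < n; ++j)
                A[i * n + j] = (A[i * n + j] + p - m * A[k * n + j] % p) % p;
        }
    }
}
```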


Timings

[Figure: real time in seconds vs. number of threads (1-64); benchmark test-naive-gep-hpac-talk, naive GEP, uint64, matrix dimensions 8192 x 8192. Series: raw sequential; pthread 1D; OpenMP collapse(1) outer loop; KAAPIC 1D; Intel TBB 1D with auto, affinity and simple partitioners.]

GFLOPS/sec

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark test-naive-gep-hpac-talk, naive GEP, uint64, 8192 x 8192; same series as the timings figure.]

Speedup

[Figure: speedup vs. number of threads (1-64); benchmark test-naive-gep-hpac-talk, naive GEP, uint64, 8192 x 8192; same series as the timings figure.]



Cache-oblivious dense Gaussian Elimination

Implemented I-GEP from [CR10]. Basic ideas are:
- Assume a matrix of dimensions 2^k x 2^k.
- Do not consider pivoting.
- Recursively split the matrix into 4 same-sized parts of dimensions 2^(k-1) x 2^(k-1).
- Stop the recursion once the parts fit in cache.

Cache-oblivious dense Gaussian Elimination

Needs a bit of global bookkeeping (inverse pivots, etc.).




Cache-oblivious dense Gaussian Elimination

Differences to the naive approach:
- The base cases are not parallelized.
- There are no parallel for loops.
- Instead we need to use recursive task scheduling:
  - pthread: no scheduling, threads left unbound
  - OpenMP: parallel sections (real tasks should be available in OpenMP 4.0)
  - KAAPIC: kaapic_spawn
  - Intel TBB: invoke
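The spawning pattern shared by these backends can be sketched with the OpenMP parallel sections variant (an illustrative divide-and-conquer reduction, not the I-GEP recursion itself; kaapic_spawn and tbb::parallel_invoke are the analogous calls in the other backends):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Divide-and-conquer skeleton in the task-spawning style: two
// independent subproblems are spawned as OpenMP sections, and below a
// cutoff a sequential base case runs. Without OpenMP the pragmas are
// ignored and the recursion simply runs sequentially.
uint64_t rec_reduce(const uint64_t* a, std::size_t n) {
    if (n <= 1024) {                 // base case: sequential kernel
        uint64_t s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    uint64_t left = 0, right = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        left = rec_reduce(a, n / 2);               // independent subtask 1
        #pragma omp section
        right = rec_reduce(a + n / 2, n - n / 2);  // independent subtask 2
    }
    return left + right;
}
```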


Timings

[Figure: real time in seconds vs. number of threads (1-64); benchmark test-co-gep-hpac-talk, cache-oblivious GEP, uint64, matrix dimensions 8192 x 8192. Series: raw sequential; pthread 1D; OpenMP parallel sections; KAAPIC spawn; Intel TBB invoke.]

GFLOPS/sec

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark test-co-gep-hpac-talk, cache-oblivious GEP, uint64, 8192 x 8192; same series as the timings figure.]

Speedup

[Figure: speedup vs. number of threads (1-64); benchmark test-co-gep-hpac-talk, cache-oblivious GEP, uint64, 8192 x 8192; same series as the timings figure.]



Features of the library

- Detection of available parallel schedulers
- User-friendly interface to add new algorithms easily: for example, one can easily drop in ATLAS, OpenBLAS, PLASMA, etc.
- Easy-to-use and highly customizable Python-based benchmarking tools, including plotting functionality
- Publicly available: https://github.com/ederc/F4RT

Features of the library

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark bench-35adccead66ea99653a407c5a66039e3, tiled GEP, double, matrix dimensions 32768 x 32768, OpenBLAS/GotoBLAS.]

Features of the library

[Figure: GFLOPS per second vs. number of doubling steps (0-5); benchmark bench-5f898c444ab6510f97b907dfe30ec69b, tiled GEP, double, starting at 1024 x 1024 with dimensions doubled in each step, 32 threads, OpenBLAS/GotoBLAS.]

Features of the library

[Figure: GFLOPS per second vs. number of threads (1-64); benchmark bench-5ce3357af4f8f3b6cf377a6eabd0f2db, tiled GEP, double, matrix dimensions 32768 x 32768, PLASMA.]

Features of the library

[Figure: GFLOPS per second vs. number of doubling steps (0-5); benchmark bench-f0ee92bdc4b86593fa79cffc9c29099c, tiled GEP, double, starting at 1024 x 1024 with dimensions doubled in each step, 32 threads, PLASMA.]


Features of the library

- Detection of available parallel schedulers
- User-friendly interface to add new algorithms easily: for example, one can easily drop in ATLAS, OpenBLAS, PLASMA, etc.
- Easy-to-use and highly customizable Python-based benchmarking tools, including plotting functionality
- Publicly available: https://github.com/ederc/LA-BENCHER

Bibliography

[PL13] E. Agullo et al. PLASMA Users’ Guide: Parallel Linear Algebra Software for Multicore Architectures, Version 2.0.

[OB13] Z. Xianyi, W. Qian and Z. Chothia. OpenBLAS, http://xianyi.github.com/OpenBLAS.

[CR10] R. A. Chowdhury and V. Ramachandran. The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation.

[WP04] R. C. Whaley and A. Petitet. Minimizing Development and Maintenance Costs in Supporting Persistently Optimized BLAS.

[WPD01] R. C. Whaley, A. Petitet and J. J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project.

[WD99] R. C. Whaley and J. J. Dongarra. Automatically Tuned Linear Algebra Software.