Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems

FluiDyna GmbH Lichtenbergstraße 8 D-85748 Garching b. München www.fluidyna.com

Dr. Bjoern Landmann


Content
• Brief overview of the company and motivation for GPU computing
• Library Culises – current status
• Example results
• Current and future development


Company Overview – Area of expertise
• Complete package
  – HPC hardware: rackmount servers, clusters, workstations, GPUs
  – CFD consulting
  – HPC software


Company Overview – CFD Consulting Examples
• Automotive: car-truck passing maneuver
  – Steady simulation (one snapshot only): small cluster → weeks to months of simulation time; medium cluster (512 CPU cores) → about a week
• Pharmaceutics: stirred-tank bioreactors
  – Unsteady simulation (multiphase flow): small cluster → several weeks of simulation time; medium cluster → about a week


Company Overview – HPC Software based on GPU Computing
• Motivation for GPU-accelerated CFD
  – Shorter development cycles
  – Larger models → increased accuracy
  – (Automated) optimization
  – … many more …
• LBultra – Lattice-Boltzmann method: speedup of 20x comparing a single GPU and a CPU (4 cores)
  – Stand-alone version
  – Plugin for design suite
• Culises – library for accelerated CFD on hybrid GPU-CPU systems


Library Culises – Interface to application
• Implemented as a dynamic library
• Application interface
  – Only the solution of the expensive linear system(s) is offloaded from the CPUs to the GPUs
  – Assembly of the linear system(s) remains on the CPUs
  – E.g. established coupling with OpenFOAM®: easy-to-conduct, script-based installation
(A hypothetical coupling sketch follows the OpenFOAM® note below.)


OpenFOAM® is a free, open-source CFD software package with a large user base across most areas of engineering and science. It offers an extensive range of features for solving complex fluid flows involving chemical reactions, turbulence, and heat transfer.
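To make the coupling pattern concrete, here is a hypothetical host-side sketch: the application keeps matrix assembly on the CPU and delegates only the solve. The type CsrSystem, the entry point culisesSolveCsr, and its signature are illustrative assumptions, not the actual Culises API.

```cpp
// Hypothetical host-side coupling sketch. The application assembles the
// linear system Ax = b on the CPU and delegates only the solve to the
// GPU library. CsrSystem and culisesSolveCsr are illustrative names and
// signatures, NOT the actual Culises API.
#include <vector>

struct CsrSystem {
    int n;                      // number of unknowns (grid cells)
    std::vector<int> rowPtr;    // CSR row offsets, size n+1
    std::vector<int> colInd;    // CSR column indices
    std::vector<double> val;    // matrix coefficients
    std::vector<double> b;      // right-hand side
};

// Assumed entry point of the dynamic solver library: upload the system,
// solve on the GPU, download x. (Hypothetical; provided by the library.)
int culisesSolveCsr(const CsrSystem& sys, std::vector<double>& x,
                    double tolerance, int maxIter);

void pressureCorrectionStep(const CsrSystem& pressureEqn, std::vector<double>& p) {
    // Assembly has already happened on the CPU (e.g. inside OpenFOAM);
    // only the expensive linear solve is offloaded.
    culisesSolveCsr(pressureEqn, p, /*tolerance=*/1e-6, /*maxIter=*/1000);
    // p now holds the pressure solution and is consumed by the CPU code as before.
}
```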


Library Culises – Schematic overview
• OpenFOAM® (1.7.1 / 2.0.1 / 2.1.0): MPI-parallelized CPU implementation based on domain decomposition
• Culises: solves the linear system(s) on multiple GPUs
• Data flow per processor partition (CPU 0/1/2 ↔ GPU 0/1/2):
  – OpenFOAM assembles the linear system Ax = b on each CPU partition
  – Interface: cudaMemcpy(..., cudaMemcpyHostToDevice) uploads the system; cudaMemcpy(..., cudaMemcpyDeviceToHost) returns the solution x
  – Culises solvers on the GPU: PCG, PBiCG, AMGPCG
• MPI-parallel assembly of the system matrices remains on the CPUs
(A minimal transfer sketch follows below.)
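The slide names cudaMemcpy with cudaMemcpyHostToDevice/DeviceToHost as the interface. Below is a minimal sketch of that traffic using the standard CUDA runtime API; the CSR storage format is an assumption, since the slides do not specify the matrix layout, and error handling is omitted for brevity.

```cpp
// Minimal sketch of the host<->device traffic named on the slide, assuming
// a CSR system of dimension n with nnz nonzeros already assembled on the CPU.
#include <cuda_runtime.h>

void transferAndSolve(int n, int nnz,
                      const int* rowPtr, const int* colInd, const double* val,
                      const double* b, double* x) {
    int *dRowPtr, *dColInd;
    double *dVal, *dB, *dX;
    cudaMalloc(&dRowPtr, (n + 1) * sizeof(int));
    cudaMalloc(&dColInd, nnz * sizeof(int));
    cudaMalloc(&dVal, nnz * sizeof(double));
    cudaMalloc(&dB, n * sizeof(double));
    cudaMalloc(&dX, n * sizeof(double));

    // Interface, CPU -> GPU: copy the matrix and the right-hand side.
    cudaMemcpy(dRowPtr, rowPtr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dColInd, colInd, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, val, nnz * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, n * sizeof(double), cudaMemcpyHostToDevice);

    // ... run the GPU solver (PCG / PBiCG / AMGPCG) on the device data ...

    // Interface, GPU -> CPU: copy the solution back.
    cudaMemcpy(x, dX, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dRowPtr); cudaFree(dColInd); cudaFree(dVal);
    cudaFree(dB); cudaFree(dX);
}
```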


Library Culises – Solvers available
• State-of-the-art solvers for linear systems
  – Multi-GPU capable
  – Single or double precision (only DP results are shown)
• Krylov subspace methods
  – Conjugate Gradient and Bi-Conjugate Gradient methods for symmetric and non-symmetric system matrices
  – Preconditioning:
    • Jacobi (DiagonalPCG)
    • Incomplete Cholesky (DICPCG)
    • Algebraic multigrid (AMGPCG)
• Stand-alone multigrid method under development
(A serial sketch of the Jacobi-preconditioned CG scheme follows below.)
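For illustration, here is a serial sketch of the Jacobi (diagonal) preconditioned CG scheme behind the DiagonalPCG name. This is textbook PCG, not Culises source code; the GPU version parallelizes exactly these vector and sparse matrix-vector operations.

```cpp
// Jacobi (diagonal) preconditioned CG for a symmetric positive definite
// CSR matrix; serial CPU code for clarity. Assumes every row stores its
// diagonal entry and that x holds an initial guess (e.g. all zeros).
#include <cmath>
#include <vector>

// y = A*x for a CSR matrix
static void spmv(int n, const std::vector<int>& rowPtr,
                 const std::vector<int>& colInd, const std::vector<double>& val,
                 const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            s += val[k] * x[colInd[k]];
        y[i] = s;
    }
}

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void diagonalPCG(int n, const std::vector<int>& rowPtr,
                 const std::vector<int>& colInd, const std::vector<double>& val,
                 const std::vector<double>& b, std::vector<double>& x,
                 double tol, int maxIter) {
    std::vector<double> invDiag(n), r(n), z(n), p(n), Ap(n);
    for (int i = 0; i < n; ++i)                    // M^{-1} = 1/diag(A)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            if (colInd[k] == i) invDiag[i] = 1.0 / val[k];

    spmv(n, rowPtr, colInd, val, x, Ap);           // r = b - A*x
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxIter && std::sqrt(dot(r, r)) > tol; ++it) {
        spmv(n, rowPtr, colInd, val, p, Ap);
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
        double rzNew = dot(r, z);
        for (int i = 0; i < n; ++i) p[i] = z[i] + (rzNew / rz) * p[i];
        rz = rzNew;
    }
}
```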


Library Culises – Parallel approach
• 1-1 link between MPI process/rank and GPU
  → CPU partitioning equals GPU partitioning
  → at peak CPU performance the GPUs are under-utilized
• Bunching of MPI ranks required → n-1 linkage option (cf. MPI_Comm_size(comm, &size))
• GPUDirect
  – Peer-to-peer data exchange (CUDA 4.1 IPC)
  – Hidden directly in the MPI implementation (release candidates: OpenMPI, MVAPICH2)
(A rank-to-GPU mapping sketch follows the diagram below.)


[Diagram: CPU 0/1/2 mapped to GPU 0/1/2, contrasting 1-1 linkage with 3-1 bunching]
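A minimal sketch of how such a rank-to-GPU mapping can be set up with standard MPI and CUDA runtime calls; the bunching-factor computation is an illustrative assumption, not the Culises implementation.

```cpp
// Rank-to-GPU mapping sketch. With a 1-1 linkage each MPI rank gets its
// own GPU; with an n-1 linkage (e.g. 3-1) several ranks share one device.
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // ranksPerGpu = 1 gives the 1-1 linkage; 3 gives the 3-1 bunching.
    const int ranksPerGpu = (size + deviceCount - 1) / deviceCount;
    cudaSetDevice((rank / ranksPerGpu) % deviceCount);

    // ... assemble the local system on the CPU, hand it to the GPU solver ...

    MPI_Finalize();
    return 0;
}
```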


Example results – Setup
• Amdahl's law and theoretical maximum speedup, with $f$ the fraction of the computation that is ported to the GPU and $a$ the acceleration of that fraction on the GPU:

$$s = \frac{1}{(1-f) + f/a}, \qquad s_{\max} = \lim_{a \to \infty} s(a) = \frac{1}{1-f}, \qquad E = \frac{s}{s_{\max}}$$

[Plot: speedup $s$ versus $f$ for $a = 5$, $a = 10$, $a = 15$, and $a \to \infty$]

• Example: on the CPU the solution of the linear system consumes 80% of the total CPU time, so $f = 0.8$. With $a = 10$: $s_{\max} = 5$, $s = 3.57$, $E = 0.71$ (reproduced by the helper below).
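A small helper that reproduces the numbers of this example; the formulas are from the slide, the code itself is purely illustrative.

```cpp
// Reproduces the Amdahl example from this slide.
#include <cstdio>

int main() {
    const double f = 0.8;                  // fraction offloaded to the GPU
    const double a = 10.0;                 // acceleration of that fraction
    const double s    = 1.0 / ((1.0 - f) + f / a);
    const double smax = 1.0 / (1.0 - f);   // limit for a -> infinity
    std::printf("s = %.2f, s_max = %.2f, E = %.2f\n", s, smax, s / smax);
    // prints: s = 3.57, s_max = 5.00, E = 0.71
    return 0;
}
```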


Example results – Setup
• CFD solver: OpenFOAM® 2.0.1 / 2.1.0
• Fair comparison:
  – Best linear solver on the CPU vs. best linear solver on the GPU
    • Krylov: preconditioned Conjugate Gradient method
    • Multigrid method
  – Needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, …)
  – Same convergence criterion: specified tolerance of the residual
• Hardware configuration: Tyan board with
  – 2 CPUs: Intel Xeon X5650 @ 2.67 GHz
  – 8 GPUs: Tesla 2070 (6 GB)

¹ Semi-Implicit Method for Pressure-Linked Equations


Example results – Automotive: DrivAer
• Generic car shape model (DrivAer geometry)
• Incompressible flow – simpleFoam solver
  – SIMPLE¹ method: pressure-velocity coupling; the Poisson equation for the pressure yields the linear system solved by Culises
  – k-ω SST turbulence model
• 2 computational grids
  – 3 million grid cells (sequential runs)
  – 22 million grid cells (parallel runs)
• Solver control (OpenFOAM®) via config files; the CPU setup uses DIC-preconditioned PCG, the Culises setup replaces it with the GPU solver:

    // CPU
    solvers
    {
        p
        {
            solver          PCG;
            preconditioner  DIC;
            tolerance       1e-6;
            ...
        }
    }

    // GPU (Culises)
    solvers
    {
        p
        {
            solver          PCGGPU;
            preconditioner  AMG;
            tolerance       1e-6;
            ...
        }
    }

¹ Semi-Implicit Method for Pressure-Linked Equations


Example results – DrivAer, 3M grid cells
• Single CPU vs. single CPU+GPU
  – Converged solution (4000 time steps)
  – Validation: comparison of results, DICPCG on the CPU vs. AMGPCG on the GPU
  [Images: flow-field comparison – DICPCG, single CPU vs. AMGPCG, single CPU+GPU]
• Memory requirement
  – AMGPCG: 40% of 6 GB for 3M cells; 1M cells require 0.80 GB → a Tesla 2070 (6 GB) holds up to ~7.5M cells
  – DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → a Tesla 2070 holds up to ~23M cells


Example results – DrivAer, 3M grid cells
• Speedup with a single GPU, with $s = \frac{1}{(1-f) + f/a}$, $s_{\max} = \frac{1}{1-f}$, and $E = s/s_{\max}$:

Solver CPU | Solver GPU | Fraction f | Speedup s | Theoretical maximum s_max | GPU acceleration of linear solver a | Efficiency E
GAMG¹ | AMGPCG | 0.55 | 1.56 | 2.22 | 3.36 | 68%
DICPCG | DiagonalPCG | 0.78 | 2.7 | 4.55 | 5.8 | 60%
DiagonalPCG | DiagonalPCG | 0.87 | 4.9 | 7.7 | 11.6 | 64%

¹ GAMG: generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face areas


Example results – DrivAer, 3M grid cells
• Performance with multiple GPUs
• Strong scaling: multiple CPUs + GPUs (1-1 linkage)
  – Scaling of the total code vs. the number of CPUs and GPUs
  – Scaling of the linear solver vs. the number of CPUs and GPUs

[Figure: simulation time and scaling of the AMGPCG solver – total time, linear-solver time, total scaling, and linear-solver scaling plotted over # of CPUs = # of GPUs (1 to 6)]


Example results – DrivAer, 3M grid cells
• Speedup by adding multiple GPUs (1-1 linkage):

Total speedup s:

Solver CPU vs. solver GPU | 1 CPU +1 GPU | 2 CPUs +2 GPUs | 4 CPUs +4 GPUs | 6 CPUs +6 GPUs
GAMG vs. AMGPCG | 1.56 | 1.64 | 1.29 | 1.27
DICPCG vs. DiagonalPCG | 2.7 | 1.49 | 1.20 | 1.45
DiagonalPCG vs. DiagonalPCG | 4.9 | 2.84 | 1.79 | 2.03

Linear-solver speedup a:

Solver CPU vs. solver GPU | 1 CPU +1 GPU | 2 CPUs +2 GPUs | 4 CPUs +4 GPUs | 6 CPUs +6 GPUs
GAMG vs. AMGPCG | 3.36 | 3.06 | 2.38 | 2.13
DICPCG vs. DiagonalPCG | 5.8 | 1.95 | 1.46 | 1.84
DiagonalPCG vs. DiagonalPCG | 11.6 | 4.14 | 2.39 | 2.80

• Example: the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than when running on 2 CPUs only.


Example results – DrivAer, 22M grid cells
• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈ 90%)

[Figure: simulation time and scaling over # of CPUs = # of GPUs (3 to 8) – GAMG on CPUs only (dashed) vs. AMGPCG on CPUs+GPUs (solid), showing total time, linear-solver time, and the corresponding scaling curves]


Example results – DrivAer, 22M grid cells
• Speedup by adding multiple GPUs (1-1 linkage), GAMG solver vs. AMGPCG solver:

 | 3 CPUs +3 GPUs | 4 CPUs +4 GPUs | 6 CPUs +6 GPUs | 8 CPUs +8 GPUs
Speedup s | 1.56 | 1.58 | 1.54 | 1.42
Speedup linear solver a | 3.4 | 2.81 | 2.91 | 2.33
Fraction f | 0.60 | 0.59 | 0.57 | 0.50
Theoretical max speedup s_max | 2.50 | 2.43 | 2.33 | 2.00
Efficiency E | 62% | 65% | 66% | 71%

• Utilization is not yet optimal → further optimization under development: n-1 linkage between CPUs and GPUs


Example results – Multiphase flow: ship hull
• LTSInterFoam solver
  – Steady solution via the local time-stepping (LTS) method
  – Volume-of-fluid (VoF) method
  – Pressure linear system → Culises
• 4M grid cells:

Solver CPU | Solver GPU | Fraction f | Speedup s | Theoret. maximum speedup s_max | GPU acceleration linear solver a | Efficiency E
DICPCG | DiagonalPCG | 0.43 | 1.54 | 1.75 | 4.91 | 88%
DiagonalPCG | DiagonalPCG | 0.55 | 2.12 | 2.22 | 8.66 | 95%


Example results – Heat transfer: heated room
• buoyantPimpleFoam solver
  – Unsteady PISO¹ method
  – Pressure linear system → Culises
• 4M grid cells:

Solver CPU | Solver GPU | Fraction f | Speedup s | Theoret. maximum speedup s_max | GPU acceleration linear solver a | Efficiency E
DICPCG | DiagonalPCG | 0.72 | 2.45 | 3.57 | 6.11 | 69%
DiagonalPCG | DiagonalPCG | 0.80 | 3.59 | 5.00 | 9.90 | 72%

¹ Pressure-Implicit with Splitting of Operators


Example results – Process industry: flow molding
• pisoFoam solver
  – Unsteady
  – Pressure linear system → Culises
• 500K grid cells:

Solver CPU | Solver GPU | Fraction f | Speedup s | Theoret. maximum speedup s_max | GPU acceleration linear solver a | Efficiency E
DICPCG | DiagonalPCG | 0.84 | 2.65 | 6.25 | 3.6 | 42%
DiagonalPCG | DiagonalPCG | 0.94 | 6.9 | 16.7 | 10.4 | 42%


Example results – Pharmaceutics: generic bioreactor
• interFoam solver
  – Unsteady
  – VoF method
  – Pressure linear system → Culises
  [Images: liquid surface; shaking device (off-centered spindle)]
• 500K grid cells:

Solver CPU | Solver GPU | Fraction f | Speedup s | Theoret. maximum speedup s_max | GPU acceleration linear solver a | Efficiency E
GAMG | AMGPCG | 0.53 | 1.44 | 2.12 | 2.59 | 68%
DiagonalPCG | DiagonalPCG | 0.81 | 3.00 | 5.26 | 5.94 | 57%


Summary
• Speedup categorized by application, obtained from (averaged) single-GPU test cases; the baseline of 1 is plain OpenFOAM® on the CPU:

Application | Speedup s | Acceleration a (linear solver) | Efficiency E
automotive | 1.6 | 3.4 | 65%
multiphase | 1.9 | 6.8 | 91%
heat transfer | 3 | 8 | 70%
pharmaceutics | 2.22 | 4.27 | 63%
process industry | 4.7 | 7 | 42%


Future Culises features – Under development
• Stand-alone multigrid solver
• Multi-GPU usage and scalability
  – Optimized load balancing via n-1 linkage between CPUs and GPUs
  – Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers (see the sketch below)
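For illustration, a sketch of device-to-device exchange with the standard CUDA peer-to-peer runtime calls; the halo-exchange framing is an assumption, as the slides do not detail the transfer pattern.

```cpp
// Sketch of CUDA peer-to-peer access between two GPUs over PCIe, the
// mechanism behind the planned optimized data exchange. These are real
// CUDA runtime calls; error handling is omitted for brevity.
#include <cuda_runtime.h>

void exchangeHalo(int gpuA, int gpuB, double* bufOnA, double* bufOnB, size_t n) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, gpuA, gpuB);
    if (canAccess) {
        cudaSetDevice(gpuA);
        cudaDeviceEnablePeerAccess(gpuB, 0);   // one-time setup per device pair
    }
    // Copies directly GPU-to-GPU; the runtime falls back to staging
    // through host memory if peer access is unavailable.
    cudaMemcpyPeer(bufOnB, gpuB, bufOnA, gpuA, n * sizeof(double));
}
```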


Questions?
