Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems
FluiDyna GmbH Lichtenbergstraße 8 D-85748 Garching b. München www.fluidyna.com
Dr. Bjoern Landmann
Content
• Brief overview of the company and motivation for GPU computing
• Library Culises – current status
• Example results
• Current and future development
Company Overview - Area of expertise
• Complete package:
– HPC hardware
– CFD consulting
– HPC software
• Hardware portfolio: rackmount servers, clusters, workstations, GPUs
Company Overview - CFD Consulting Examples
• Automotive: car-truck passing maneuver
– Steady simulation (one snapshot only)
– Small cluster → weeks/months of simulation time; medium cluster (512 CPU cores) → about a week
• Pharmaceutics: stirred-tank bioreactors
– Unsteady simulation (multiphase flow)
– Small cluster → several weeks of simulation time; medium cluster → about a week
Company Overview - HPC software based on GPU computing
• Motivation for GPU-accelerated CFD:
– Shorter development cycles
– Larger models → increased accuracy
– (Automated) optimization
– … many more …
• LBultra – Lattice-Boltzmann method: speedup of 20x comparing a single GPU with a CPU (4 cores); available as a stand-alone version and as a plugin for a design suite
• Culises – library for accelerated CFD on hybrid GPU-CPU systems
Library Culises - Interface to application
• Implemented as a dynamic library
• Application interface
– Only the solution of the expensive linear system(s) is transferred from the CPUs to the GPUs
– Assembly of the linear system(s) remains on the CPUs
– E.g. established coupling with OpenFOAM®; easy-to-conduct script-based installation
• OpenFOAM is a free, open-source CFD software package with a large user base across most areas of engineering and science. It offers an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer.
Library Culises - Schematic overview
• OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition
• MPI-parallel assembly of the system matrices remains on the CPUs
Library Culises - Solvers available
• State-of-the-art solvers for linear systems
– Multi-GPU capable
– Single or double precision (only DP results are shown)
• Krylov subspace methods
– Conjugate Gradient or Bi-Conjugate Gradient method for symmetric and non-symmetric system matrices
– Preconditioning:
• Jacobi (DiagonalPCG)
• Incomplete Cholesky (DICPCG)
• Algebraic multigrid (AMGPCG)
• Stand-alone multigrid method under development
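As an illustration of the Krylov methods listed above, a Jacobi-preconditioned Conjugate Gradient iteration (the scheme behind a DiagonalPCG-style solver) can be sketched in a few lines. This is a generic NumPy sketch for illustration, not Culises code:

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, max_iter=1000):
    """Conjugate Gradient with Jacobi (diagonal) preconditioning for an SPD matrix A."""
    M_inv = 1.0 / np.diag(A)          # Jacobi preconditioner: inverse of the diagonal
    x = np.zeros_like(b)
    r = b - A @ x                     # initial residual
    z = M_inv * r                     # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:   # convergence criterion on the residual
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD test system (1D Poisson-like matrix)
A = np.diag([2.0] * 5) + np.diag([-1.0] * 4, 1) + np.diag([-1.0] * 4, -1)
b = np.ones(5)
x = jacobi_pcg(A, b)
```

The same skeleton applies to the other preconditioners; only the application of the preconditioner (here the element-wise multiply by `M_inv`) changes.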
Library Culises - Parallel approach
• 1-1 linkage between MPI process/rank and GPU (number of ranks obtained via MPI_Comm_size(comm, &size))
– CPU partitioning equals GPU partitioning
– At peak CPU performance this leads to under-utilization of the GPUs
• Bunching of MPI ranks required → n-1 linkage option
[Diagram: 1-1 linkage maps CPU 0/1/2 to GPU 0/1/2; 3-1 linkage bunches three CPU ranks onto a single GPU.]
• GPUDirect
– Peer-to-peer data exchange (CUDA 4.1 IPC)
– Hidden directly in the MPI implementation; release candidates: OpenMPI, MVAPICH2
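The rank bunching behind the n-1 linkage can be sketched as a simple rank-to-GPU map. This is a hypothetical helper for illustration, not the Culises API:

```python
def gpu_for_rank(rank, n_ranks, n_gpus):
    """Bunch consecutive MPI ranks onto one GPU (n-1 linkage).
    With n_ranks == n_gpus this degenerates to the 1-1 linkage."""
    ranks_per_gpu = -(-n_ranks // n_gpus)  # ceiling division
    return rank // ranks_per_gpu

# 3-1 linkage: nine ranks share three GPUs
mapping_3to1 = [gpu_for_rank(r, 9, 3) for r in range(9)]  # [0,0,0,1,1,1,2,2,2]
# 1-1 linkage: three ranks, three GPUs
mapping_1to1 = [gpu_for_rank(r, 3, 3) for r in range(3)]  # [0, 1, 2]
```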
Example results - Setup
• Amdahl's law and theoretical maximum speedup: with f the fraction of the computation that is ported to the GPU and a the acceleration on the GPU,

  s = 1 / ((1 - f) + f/a),   s_max = lim(a→∞) s(a) = 1 / (1 - f),   efficiency E = s / s_max

• Example: on the CPU, the solution of the linear system consumes 80% of the total CPU time, so f = 0.8. With a = 10: s_max = 5, s = 3.57, E = 0.71.
[Figure: speedup s over fraction f for a → ∞ and a = 15, 10, 5.]
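The worked example above can be checked with a few lines (a generic sketch; the function names are ours, not part of Culises):

```python
def amdahl_speedup(f, a):
    """Overall speedup when a fraction f of the runtime is accelerated by a factor a."""
    return 1.0 / ((1.0 - f) + f / a)

def max_speedup(f):
    """Limit of the speedup for a -> infinity."""
    return 1.0 / (1.0 - f)

f, a = 0.8, 10.0
s = amdahl_speedup(f, a)   # ~3.57
s_max = max_speedup(f)     # 5.0
E = s / s_max              # ~0.71
```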
Example results - Setup
• CFD solver: OpenFOAM® 2.0.1/2.1.0
• Fair comparison: best linear solver on the CPU vs best linear solver on the GPU
– Krylov: preconditioned Conjugate Gradient method
– Multigrid method
– Needs considerable tuning of the solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, …)
– Same convergence criterion: specified tolerance of the residual
1. SIMPLE: Semi-Implicit Method for Pressure-Linked Equations
Example results - Automotive: DrivAer
• Generic car shape model (DrivAer geometry)
• Incompressible flow – simpleFoam solver
– SIMPLE¹ method for pressure-velocity coupling; the linear system of the Poisson equation for the pressure is solved by Culises
– k-ω SST turbulence model
• 2 computational grids
– 3 million grid cells (sequential runs)
– 22 million grid cells (parallel runs)
• Solver control (OpenFOAM®) via config files:

solvers
{
    p
    {
        solver          PCG;
        preconditioner  DIC;
        tolerance       1e-6;
        ...
    }
}

1. SIMPLE: Semi-Implicit Method for Pressure-Linked Equations
Example results - DrivAer, 3M grid cells
• Single CPU vs single CPU+GPU
– Converged solution (4000 time steps)
– Validation: comparison of results, DICPCG on the CPU vs AMGPCG on the GPU
[Figure: flow-field comparison, DICPCG (single CPU) vs AMGPCG (single CPU+GPU).]
• Memory requirement
– AMGPCG: 40% of 6 GB; 1M cells require 0.80 GB → Tesla 2070: 7.5M cells
– DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → Tesla 2070: 23M cells
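The per-card capacities above follow directly from the stated memory footprints (a one-line check using the 6 GB card memory quoted above):

```python
def max_cells_millions(gpu_mem_gb, gb_per_million_cells):
    """Largest grid (in millions of cells) fitting into the given GPU memory."""
    return gpu_mem_gb / gb_per_million_cells

cap_amg = max_cells_millions(6.0, 0.80)   # 7.5  -> AMGPCG
cap_diag = max_cells_millions(6.0, 0.26)  # ~23  -> DiagonalPCG
```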
Example results - DrivAer, 3M grid cells
• Speedup with a single GPU: s = 1/((1 - f) + f/a), s_max = 1/(1 - f), efficiency E = s/s_max

| Solver CPU  | Solver GPU  | Fraction f | Speedup s | GPU acceleration linear solver a | Theoretical maximum speedup s_max | Efficiency E |
| GAMG¹       | AMGPCG      | 0.55       | 1.56      | 3.36                             | 2.22                              | 68%          |
| DICPCG      | DiagonalPCG | 0.78       | 2.7       | 5.8                              | 4.55                              | 60%          |
| DiagonalPCG | DiagonalPCG | 0.87       | 4.9       | 11.6                             | 7.7                               | 64%          |

1. GAMG: generalized geometric-algebraic multigrid solver (geometric agglomeration based on grid face areas)
Example results - DrivAer, 3M grid cells
• Performance with multiple GPUs (AMGPCG solver)
• Strong scaling: multiple CPUs+GPUs (1-1 linkage)
– Scaling of the total code versus # of CPUs and # of GPUs
– Scaling of the linear solver versus # of CPUs and # of GPUs
[Figure: simulation time and scaling versus # of CPUs = # of GPUs (1-6), showing total time, linear-solver time, total scaling, and linear-solver scaling.]
Example results - DrivAer, 3M grid cells
• Speedup by adding multiple GPUs (1-1 linkage):

| Speedup total s            | +1 GPU | +2 GPUs | +4 GPUs | +6 GPUs |
| GAMG vs AMGPCG             | 1.56   | 1.64    | 1.29    | 1.27    |
| DICPCG vs DiagonalPCG      | 2.7    | 1.49    | 1.20    | 1.45    |
| DiagonalPCG vs DiagonalPCG | 4.9    | 2.84    | 1.79    | 2.03    |

| Speedup linear solver a    | +1 GPU | +2 GPUs | +4 GPUs | +6 GPUs |
| GAMG vs AMGPCG             | 3.36   | 3.06    | 2.38    | 2.13    |
| DICPCG vs DiagonalPCG      | 5.8    | 1.95    | 1.46    | 1.84    |
| DiagonalPCG vs DiagonalPCG | 11.6   | 4.14    | 2.39    | 2.80    |

• Example: the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than when running on 2 CPUs only.
Example results - DrivAer, 22M grid cells
• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈ 90%)
• GAMG on CPUs only (dashed) vs AMGPCG on CPUs+GPUs (solid)
[Figure: simulation time and scaling versus # of CPUs = # of GPUs (3-8), each for total time and linear-solver time, with CPU-only and CPU+GPU curves.]
Example results - DrivAer, 22M grid cells
• Speedup by adding multiple GPUs (1-1 linkage), GAMG solver vs AMGPCG solver:

| # of CPUs / # of GPUs added     | 3 CPUs +3 GPUs | 4 CPUs +4 GPUs | 6 CPUs +6 GPUs | 8 CPUs +8 GPUs |
| Speedup s                       | 1.56           | 1.58           | 1.54           | 1.42           |
| Speedup linear solver a         | 3.4            | 2.81           | 2.91           | 2.33           |
| Fraction f                      | 0.60           | 0.59           | 0.57           | 0.50           |
| Theoretical max speedup s_max   | 2.50           | 2.43           | 2.33           | 2.00           |
| Efficiency E                    | 62%            | 65%            | 66%            | 71%            |

• Utilization is not yet optimal; further optimization via n-1 linkage between CPUs and GPUs is under development.
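The last two rows of the table above follow from Amdahl's law given the measured fraction f (a minimal check; note that the measured s = 1.56 in the 3-CPU column lies below the Amdahl prediction 1/((1 - f) + f/a) ≈ 1.73):

```python
def max_speedup(f):
    """Theoretical maximum speedup s_max = 1/(1 - f), i.e. the limit a -> infinity."""
    return 1.0 / (1.0 - f)

def efficiency(s, f):
    """Efficiency E = s / s_max = s * (1 - f)."""
    return s * (1.0 - f)

# Column "3 CPUs + 3 GPUs": measured f = 0.60, s = 1.56
s_max = max_speedup(0.60)    # 2.5
E = efficiency(1.56, 0.60)   # ~0.62 -> 62%
```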
Example results - Multiphase flow: ship hull
• LTSInterFoam solver
– Steady, using a local time-stepping method
– Volume-of-fluid (VoF) method
– Linear system of the pressure solver → Culises
• 4M grid cells:

| Solver CPU  | Solver GPU  | Fraction f | Speedup s | GPU acceleration linear solver a | Theoret. maximum speedup | Efficiency E |
| DICPCG      | DiagonalPCG | 0.43       | 1.54      | 4.91                             | 1.75                     | 88%          |
| DiagonalPCG | DiagonalPCG | 0.55       | 2.12      | 8.66                             | 2.22                     | 95%          |
Example results - Heat transfer: heated room
• buoyantPimpleFoam solver
– Unsteady PISO¹ method
– Linear system of the pressure solver → Culises
• 4M grid cells:

| Solver CPU  | Solver GPU  | Fraction f | Speedup s | GPU acceleration linear solver a | Theoret. maximum speedup | Efficiency E |
| DICPCG      | DiagonalPCG | 0.72       | 2.45      | 6.11                             | 3.57                     | 69%          |
| DiagonalPCG | DiagonalPCG | 0.80       | 3.59      | 9.90                             | 5.00                     | 72%          |

1. PISO: Pressure-Implicit with Splitting of Operators
Example results - Process industry: flow molding
• pisoFoam solver
– Unsteady
– Linear system of the pressure solver → Culises
• 500K grid cells:

| Solver CPU  | Solver GPU  | Fraction f | Speedup s | GPU acceleration linear solver a | Theoret. maximum speedup | Efficiency E |
| DICPCG      | DiagonalPCG | 0.84       | 2.65      | 3.6                              | 6.25                     | 42%          |
| DiagonalPCG | DiagonalPCG | 0.94       | 6.9       | 10.4                             | 16.7                     | 42%          |
Example results - Pharmaceutical: generic bioreactor
• interFoam solver
– Unsteady
– Volume-of-fluid (VoF) method
– Linear system of the pressure solver → Culises
• 500K grid cells
[Figure: bioreactor with liquid surface and shaking device (off-centered spindle).]

| Solver CPU  | Solver GPU  | Fraction f | Speedup s | GPU acceleration linear solver a | Theoret. maximum speedup | Efficiency E |
| GAMG        | AMGPCG      | 0.53       | 1.44      | 2.59                             | 2.12                     | 68%          |
| DiagonalPCG | DiagonalPCG | 0.81       | 3.00      | 5.94                             | 5.26                     | 57%          |
Summary
• Speedup categorized by application, obtained from (averaged) single-GPU test cases; the OpenFOAM® CPU-only baseline corresponds to a speedup and acceleration of 1:

| Application      | Speedup | Acceleration (linear solver) | Efficiency |
| automotive       | 1.6     | 3.4                          | 65%        |
| multiphase       | 1.9     | 6.8                          | 91%        |
| heat transfer    | 3.0     | 8                            | 70%        |
| pharmaceutics    | 2.22    | 4.27                         | 63%        |
| process industry | 4.7     | 7                            | 42%        |
Future Culises features - Under development
• Stand-alone multigrid solver
• Multi-GPU usage and scalability
– Optimized load balancing via n-1 linkage between CPUs and GPUs
– Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers
Questions?