Enabling High Performance Computational Dynamics in a Heterogeneous Hardware Ecosystem
Assistant Prof. Dan Negrut
Simulation-Based Engineering Lab, Department of Mechanical Engineering, University of Wisconsin–Madison
Toward High-Performance Computing Support for the Simulation and Planning of Robot Contact Tasks
June 27, 2011
Acknowledgements
People:
Alessandro Tasora – University of Parma, Italy
Mihai Anitescu – Argonne National Lab
Lab Students:
Hammad Mazhar, Toby Heyn, Andrew Seidl, Arman Pazouki, Hamid Ansari, Justin Madsen, Naresh Khude, Makarand Datar, Dan Melanz, Markus Schmid
Financial support
National Science Foundation, Young Investigator Career Award; FunctionBay, S. Korea; NVIDIA; Microsoft; Caterpillar
Talk Overview
Overview of the engineering problems of interest
Large-scale Multibody Dynamics
Problem formulation, solution method, and parallel implementation
Overview of Heterogeneous Computing Template (HCT)
Numerical Experiments
Validation efforts
Conclusions
Computational Multibody Dynamics
Simulation generated in ADAMS
Multi-Physics… Fluid-Solid Interaction: Navier-Stokes + Newton-Euler.
Computational Dynamics
Rover Mobility on Granular Terrain
Wheeled/tracked vehicle mobility on granular terrain
Also interested in scooping and loading granular material
Simulation in Chrono::Engine
Frictional Contact Simulation [Commercial Solution]
Model Parameters:
Spheres: 60 mm diameter, mass 0.882 kg
Forces: smoothing (penalty) with stiffness of 1e5, force exponent of 2.2, damping coefficient of 10.0, and penetration depth of 0.1
Simulation length: 3 seconds
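For orientation, here is a minimal sketch of how a smoothed penalty (DEM-style) normal force built from parameters like these can be evaluated; the linear ramp used to smooth the damping term and the function name are illustrative assumptions, not the exact formula used by the commercial code.

```cuda
#include <cstdio>
#include <cmath>

// Sketch of a smoothed penalty (DEM-style) normal contact force, using the slide's
// parameters: stiffness k = 1e5, force exponent e = 2.2, maximum damping c = 10.0,
// full damping reached at penetration depth d = 0.1. The linear damping ramp is an
// illustrative assumption.
__host__ __device__ double penaltyNormalForce(double pen,     // penetration depth
                                              double penRate, // penetration rate
                                              double k, double e,
                                              double cMax, double dFull)
{
    if (pen <= 0.0) return 0.0;            // bodies are not in contact
    double stiffTerm = k * pow(pen, e);    // nonlinear spring term
    double ramp = fmin(pen / dFull, 1.0);  // smooth activation of damping
    double dampTerm = cMax * ramp * penRate;
    double f = stiffTerm + dampTerm;
    return f > 0.0 ? f : 0.0;              // contact force is repulsive only
}

int main() {
    // Example: 1 mm penetration, closing at 0.05 m/s.
    printf("Fn = %f\n", penaltyNormalForce(0.001, 0.05, 1e5, 2.2, 10.0, 0.1));
    return 0;
}
```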
Frictional Contact: Two Different Approaches Considered
Discrete Element Method (DEM) - draws on a “smoothing” (penalty) approach
Lots of heuristics; slow; general purpose; used in ADAMS
DVI-based (Differential Variational Inequalities)
A set of differential equations combined with inequality constraints; fast (stable for significantly larger integration step-sizes); less general purpose; used widely in computer games
The Modeling Component
Equations of Motion: Multibody Dynamics
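The equations of motion on this slide appear as an image; as a hedged reconstruction, the standard differential variational inequality form used in this line of work is (notation assumed: generalized positions $q$, velocities $v$, mass matrix $M$, applied forces $f$, bilateral constraints $\Psi$ with multipliers $\lambda$, contact gap functions $\Phi_i$, and contact Jacobians $D_{i,n}, D_{i,u}, D_{i,w}$ with impulse multipliers $\gamma$):

$$
\begin{aligned}
\dot{q} &= L(q)\,v\\
M\,\dot{v} &= f(t,q,v) + \sum_{i}\bigl(\gamma_{i,n} D_{i,n} + \gamma_{i,u} D_{i,u} + \gamma_{i,w} D_{i,w}\bigr) + \Bigl(\tfrac{\partial \Psi}{\partial q}\Bigr)^{T}\lambda\\
\Psi(q,t) &= 0,\qquad 0 \le \Phi_i(q)\ \perp\ \gamma_{i,n} \ge 0,\qquad \text{Coulomb friction at each contact } i
\end{aligned}
$$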
Traditional Discretization Scheme
[Equation image: discretized equations of motion at time step l, involving positions, speeds, the mass matrix, applied forces, and reaction impulses, plus the contact complementarity condition and a stabilization term.]
Coulomb 3D friction model
(Stewart & Trinkle, 1996)
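In the impulse notation used later (normal impulse $\gamma_n$, tangential impulses $\gamma_u, \gamma_w$, friction coefficient $\mu$, tangent directions $D_u, D_w$, relative velocity $v$ at the contact), the Coulomb model amounts to a cone constraint plus the maximum dissipation principle; a hedged sketch:

$$
\sqrt{\gamma_u^2+\gamma_w^2}\ \le\ \mu\,\gamma_n,\qquad
(\gamma_u,\gamma_w)\;=\;\operatorname*{argmin}_{\sqrt{\bar{\gamma}_u^2+\bar{\gamma}_w^2}\,\le\,\mu\,\gamma_n}\; v^{T}\bigl(\bar{\gamma}_u D_u+\bar{\gamma}_w D_w\bigr)
$$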
Relaxed Discretization Scheme Used
[Equation image: the same discretization, with a relaxation term added to the contact complementarity condition.]
(Anitescu & Tasora, 2008)
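The scheme itself is shown as an image; a hedged reconstruction following the cited reference, with $h$ the step size, $l$ the time step index, $\tfrac{1}{h}\Phi_i$ the stabilization term, and the $-\mu_i\sqrt{(\cdot)^2+(\cdot)^2}$ term being the relaxation that distinguishes this scheme from the traditional one:

$$
\begin{aligned}
M\bigl(v^{(l+1)}-v^{(l)}\bigr) &= h\,f\bigl(t^{(l)},q^{(l)},v^{(l)}\bigr)+\sum_i\bigl(\gamma_{i,n} D_{i,n}+\gamma_{i,u} D_{i,u}+\gamma_{i,w} D_{i,w}\bigr)\\
0\ \le\ \frac{1}{h}\,\Phi_i\bigl(q^{(l)}\bigr)+D_{i,n}^{T}v^{(l+1)}-\mu_i\sqrt{\bigl(D_{i,u}^{T}v^{(l+1)}\bigr)^{2}+\bigl(D_{i,w}^{T}v^{(l+1)}\bigr)^{2}}\ &\perp\ \gamma_{i,n}\ \ge\ 0\\
q^{(l+1)} &= q^{(l)}+h\,L\bigl(q^{(l)}\bigr)v^{(l+1)}
\end{aligned}
$$

The friction impulses $(\gamma_{i,u}, \gamma_{i,w})$ are still determined by the maximum dissipation principle; without the relaxation term the scheme reduces to the traditional Stewart–Trinkle discretization.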
The Cone Complementarity Problem (CCP)
First-order optimality conditions lead to a Cone Complementarity Problem
Introduce the convex hypercone...
... and its polar hypercone:
The CCP assumes the following form: find γ such that the cone complementarity conditions below hold.
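A hedged statement of those conditions, with $\Upsilon = \Upsilon_1 \times \cdots \times \Upsilon_{n_c}$ the product of the contact friction cones, $\Upsilon^{\circ}$ its polar cone, and $N$, $r$ the Schur-complement matrix and vector assembled from the discretized equations of motion:

$$
\Upsilon \ni \gamma \ \perp\ -\bigl(N\gamma + r\bigr) \in \Upsilon^{\circ}
$$

that is, γ lies in the cone, −(Nγ + r) lies in the polar cone, and the two are orthogonal.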
Putting Things in Perspective…
Three key points led to above algorithm:
The friction model is posed as an optimization problem
Work with velocities and impulses rather than accelerations and forces
The contact complementarity expression is altered to lead to a CCP
Implementation
The method outlined above is implemented using two loops:
Outer loop – runs the time stepping
Inner loop – CCP algorithm (solves the CCP at each time step)
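A minimal host-side sketch of this two-loop structure; the function names below are illustrative stand-ins, not the actual Chrono::Engine API.

```cuda
#include <cstdio>

// Illustrative stand-ins for the actual routines; names are assumptions.
void detectContacts()       { /* parallel collision detection                 */ }
void assembleCCP()          { /* build the CCP data (N, r) from current state */ }
double ccpSweep()           { /* one pass over all contacts; returns residual */ return 0.0; }
void advanceState(double h) { /* update positions using the new velocities    */ }

int main() {
    const double h = 1e-3;       // integration step size
    const int nSteps = 1000;     // outer loop: time stepping
    const int maxIter = 150;     // inner loop: CCP iterations per step
    const double tol = 1e-6;     // CCP residual tolerance

    for (int step = 0; step < nSteps; ++step) {   // OUTER loop
        detectContacts();
        assembleCCP();
        for (int it = 0; it < maxIter; ++it) {    // INNER loop: CCP solve
            if (ccpSweep() < tol) break;
        }
        advanceState(h);
    }
    printf("simulated %d steps\n", nSteps);
    return 0;
}
```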
Granular Dynamics: How Parallel Computing is Leveraged
1. Parallel collision detection
2. (Body parallel) Force kernel
3. (Contact parallel) Contact preprocessing kernel
4. (Contact parallel) CCP contact kernel
5. (Constraint parallel) CCP constraint kernel
6. (Reduction-slot parallel) Velocity reduction kernel
7. (Body parallel) Body velocity update kernel
8. (Body parallel) Time integration kernel
The inner loop is the CCP sweep over kernels 4–7; the outer loop runs kernels 1–8 once per time step. A sketch of a representative body-parallel kernel follows.
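As a concrete illustration of the "body parallel" pattern, a sketch of a velocity update kernel: one thread per body, adding the external-force impulse and the accumulated contact impulses to the body velocity. The data layout and names are assumptions.

```cuda
// One thread per body: v <- v + M^{-1} * (h * f + accumulated contact impulses).
// Flat 3*nBodies arrays; the data layout and names are illustrative.
__global__ void bodyVelocityUpdate(int nBodies, double h,
                                   const double* invMass,    // 1 per body
                                   const double* force,      // 3 per body
                                   const double* contactImp, // 3 per body (after reduction)
                                   double* vel)              // 3 per body
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nBodies) return;
    double im = invMass[i];
    for (int k = 0; k < 3; ++k) {
        int idx = 3 * i + k;
        vel[idx] += im * (h * force[idx] + contactImp[idx]);
    }
}
```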
Inner Loop (CCP Algorithm)
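The inner loop is an iterative fixed-point scheme over the contacts; its key per-contact operation is the projection of the candidate impulse onto the friction cone. A sketch of that projection (the standard closed-form projection onto a second-order cone) and of a contact-parallel sweep that uses it; variable names and the assumption that the per-contact residual is precomputed are illustrative.

```cuda
#include <cmath>

// Project a candidate contact impulse (gn, gu, gw) onto the friction cone
// { sqrt(gu^2 + gw^2) <= mu * gn } -- closed-form second-order cone projection.
__host__ __device__ void projectOntoFrictionCone(double mu,
                                                 double& gn, double& gu, double& gw)
{
    double t = sqrt(gu * gu + gw * gw);   // tangential magnitude
    if (t <= mu * gn) return;             // already inside the cone
    if (mu * t <= -gn) {                  // inside the polar cone: project to apex
        gn = gu = gw = 0.0;
        return;
    }
    double gnNew = (mu * t + gn) / (mu * mu + 1.0);  // project onto cone surface
    double scale = (t > 0.0) ? mu * gnNew / t : 0.0;
    gn = gnNew;
    gu *= scale;
    gw *= scale;
}

// Contact-parallel sweep: gamma <- Project( gamma - omega * eta * (N*gamma + r) ),
// with the per-contact residual (N*gamma + r) assumed precomputed in 'resid'.
__global__ void ccpContactKernel(int nContacts, double omega,
                                 const double* mu, const double* eta,
                                 const double* resid, // 3 per contact
                                 double* gamma)       // 3 per contact
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nContacts) return;
    double gn = gamma[3*i]     - omega * eta[i] * resid[3*i];
    double gu = gamma[3*i + 1] - omega * eta[i] * resid[3*i + 1];
    double gw = gamma[3*i + 2] - omega * eta[i] * resid[3*i + 2];
    projectOntoFrictionCone(mu[i], gn, gu, gw);
    gamma[3*i] = gn;  gamma[3*i + 1] = gu;  gamma[3*i + 2] = gw;
}
```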
Large Scale Granular Dynamics
Numerical solution can leverage parallel computing
CPU vs. GPU – Flop Rate [GFlop/sec]
[Chart: peak flop rate (single and double precision) vs. year, 2003–2010, comparing NVIDIA Tesla 8-, 10-, and 20-series GPUs with Intel Nehalem and Westmere 3 GHz CPUs.]
CPU vs. GPU – Memory Bandwidth [GB/sec]
[Chart: peak memory bandwidth vs. year, 2003–2010, comparing NVIDIA Tesla 8-, 10-, and 20-series GPUs with Intel Nehalem and Westmere 3 GHz CPUs.]
Mixing 40,000 Spheres on the GPU
300K Spheres in Tank [parallel on the GPU]
1.1 Million Rigid Spheres [parallel on the GPU]
A Heterogeneous Computing Template for Computational Dynamics
Heterogeneous Cluster
Second fastest cluster at University of Wisconsin-Madison
Computation Using Multiple CPUs [DEM solution]
True Heterogeneous Computing
Use 1 GPU per MPI process
Each process uses its GPU to perform the collision detection task during each time step
As before, the CPU takes care of communication between sub-domains
State data is copied to the GPU, CD is performed, and collision data is copied back
The GPU is used as an accelerator/co-processor
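A minimal sketch of this "one GPU per MPI process" pattern: each rank binds to a device, ships its sub-domain state to the GPU for collision detection, and copies the contact data back before the CPU-side exchange. Everything other than the standard MPI/CUDA calls (kernel, sizes, array layout) is illustrative.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Illustrative stand-in for the GPU collision detection stage.
__global__ void collisionDetectionKernel(const double* state, int nBodies,
                                         int* contacts, int* numContacts) { /* ... */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    cudaSetDevice(rank % nDevices);        // one GPU per MPI process

    int nBodies = 8000;                    // bodies in this sub-domain (example)
    std::vector<double> state(6 * nBodies, 0.0);

    double* dState = nullptr; int* dContacts = nullptr; int* dNumContacts = nullptr;
    cudaMalloc((void**)&dState, state.size() * sizeof(double));
    cudaMalloc((void**)&dContacts, 4 * nBodies * sizeof(int));
    cudaMalloc((void**)&dNumContacts, sizeof(int));

    // Per time step: copy state in, run CD on the GPU, copy contact data back.
    cudaMemcpy(dState, state.data(), state.size() * sizeof(double), cudaMemcpyHostToDevice);
    collisionDetectionKernel<<<(nBodies + 255) / 256, 256>>>(dState, nBodies,
                                                             dContacts, dNumContacts);
    cudaDeviceSynchronize();
    int nContacts = 0;
    cudaMemcpy(&nContacts, dNumContacts, sizeof(int), cudaMemcpyDeviceToHost);

    // ... CPU-side MPI exchange of shared-body data between sub-domains goes here ...

    cudaFree(dState); cudaFree(dContacts); cudaFree(dNumContacts);
    MPI_Finalize();
    return 0;
}
```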
Demonstration: 16,000 bodies, 2 sub-domains
CPU:
CPU+GPU: 9.43 hrs
Heterogeneous Computing Template: Five Major Components
Computational Dynamics requires
Domain decomposition
Proximity computation
Inter-domain data exchange
Numerical algorithm support
Post-processing (visualization)
HCT represents the library support and associated API that capture this five-component abstraction
Typical Simulation Results…
LEFT: Infinity norm of the residual vs. iteration index in the CCP solution
The convergence rate (the slope of the curve) decreases as the iteration index increases.
RIGHT: Infinity norm of the CCP residual after r_max iterations, as a function of granular material depth (number of spheres stacked on top of each other).
Searching for Better Methods
Frictionless case (bound constraints in place)
Gauss-Jacobi (CE)
Projected conjugate gradient (ProjCG)
Gradient projected conjugate gradient (GPCG)
Gradient projected MINRES (GPMINRES)
Friction case (cone constraints - ongoing)
Newton’s Method for large bound-constrained problems
Uses re-parameterization to handle friction cones (replace with bound constraints)
Numerical Experiments
Test Problem: 40,000 bodies ⇒ 157,520 contacts; frictionless
Test Problem (MATLAB)

Method     Iterations   Final Residual Norm   γmin      γmax     Time [sec]
CE         1000         6.11e-2               0.0       2.0598   1849.5
ProjCG     1002         5.6344e-4             0.0       2.2286   1235.6
GPCG       1600         1.0675e-4             0.0       2.6349   382.3644
GPMinres   1100         9.5239e-5             0.0       2.3090   238.0744
PCG        1000         2.4053e-4             -1.1116   2.5254   27.9686
GMRES      1000         4.5315e-5             -1.1635   2.5227   736.3007
MINRES     1000         1.6979e-5             -1.1316   2.5253   41.5790
Proximity Computation
GPU Collision Detection (CD)
30,000-foot perspective:
Carry out spatial partitioning of the volume occupied by the bodies
Place bodies in bins (cubes, for instance)
Follow up by brute force search for all bodies touching each bin
Embarrassingly parallel
Basic Idea: Search for Contacts in Different Bins in Parallel
Example: 2D collision detection, bins are squares
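A sketch of the binning idea using Thrust primitives from the CUDA toolkit: assign each sphere to the bin containing its center, sort sphere indices by bin id, and then each bin can be processed by an independent thread for the brute-force pair test. The bin size, grid dimensions, and data layout are illustrative; the real implementation also handles spheres that span multiple bins.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <cstdio>

// Maps a sphere index to the id of the bin containing its center.
struct BinId {
    double binSize;
    int nx, ny;
    const double *x, *y, *z;
    __host__ __device__ int operator()(int i) const {
        int ix = (int)(x[i] / binSize);
        int iy = (int)(y[i] / binSize);
        int iz = (int)(z[i] / binSize);
        return (iz * ny + iy) * nx + ix;   // flattened 3D bin index
    }
};

int main() {
    int n = 1000;  // toy problem: n sphere centers already on the device (zeros here)
    thrust::device_vector<double> x(n, 0.0), y(n, 0.0), z(n, 0.0);
    thrust::device_vector<int> binOf(n), sphereIdx(n);
    thrust::sequence(sphereIdx.begin(), sphereIdx.end());

    // 1. Compute the bin id of each sphere center.
    BinId f{0.1, 64, 64,
            thrust::raw_pointer_cast(x.data()),
            thrust::raw_pointer_cast(y.data()),
            thrust::raw_pointer_cast(z.data())};
    thrust::transform(sphereIdx.begin(), sphereIdx.end(), binOf.begin(), f);

    // 2. Sort sphere indices by bin id so each bin's spheres are contiguous;
    //    a bin-parallel kernel can then brute-force test the pairs inside each bin.
    thrust::sort_by_key(binOf.begin(), binOf.end(), sphereIdx.begin());

    printf("binned %d spheres\n", n);
    return 0;
}
```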
Ellipsoid-Ellipsoid CD: Results
[Charts: GPU vs. CPU speedup and detection time (seconds) as functions of the number of contacts, up to 1,500,000 contacts.]
Speedup – GPU vs. CPU (Bullet library) [results reported are for spheres]
GPU: NVIDIA Tesla C1060; CPU: AMD Phenom II Black X4 940 (3.0 GHz)
[Chart: speedup vs. number of contacts, up to 6 million contacts.]
Parallel Implementation: Number of Contacts vs. Detection Time [results reported are for spheres]
[Chart: detection time (sec) vs. number of contacts, up to 24 million contacts.]
Multiple-GPU Collision Detection
Assembled Quad GPU Machine
Processor: AMD Phenom II X4 940 Black
Memory: 16 GB DDR2
Graphics: 4x NVIDIA Tesla C1060
Power supply 1: 1000 W
Power supply 2: 750 W
SW/HW Setup
[Diagram: the main data set resides in 16 GB of RAM on a quad-core AMD microprocessor; four OpenMP threads (0–3) each drive one Tesla C1060 GPU (GPU 0–3, 4 GB memory and 30,720 threads per GPU) through CUDA, with results gathered back on the host.]
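A sketch of the setup in the diagram above: four OpenMP threads, each bound to one GPU via cudaSetDevice, each processing its slice of the data set. The slicing, sizes, and kernel are placeholders.

```cuda
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the per-slice collision detection kernel.
__global__ void collisionDetectionSlice(const double* state, int n) { /* ... */ }

int main() {
    const int nGPUs = 4;            // quad Tesla C1060 configuration
    const int nBodies = 4000000;    // illustrative total problem size
    const int perGPU = nBodies / nGPUs;

    // Main data set lives in host RAM.
    double* hostState = new double[(size_t)6 * nBodies]();

    #pragma omp parallel num_threads(nGPUs)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);         // OpenMP thread tid drives GPU tid

        double* dState = nullptr;
        size_t bytes = (size_t)6 * perGPU * sizeof(double);
        cudaMalloc((void**)&dState, bytes);

        // Each thread copies its slice, runs CD on its GPU, then copies results back.
        cudaMemcpy(dState, hostState + (size_t)6 * perGPU * tid, bytes,
                   cudaMemcpyHostToDevice);
        collisionDetectionSlice<<<(perGPU + 255) / 256, 256>>>(dState, perGPU);
        cudaDeviceSynchronize();

        cudaFree(dState);
    }
    delete[] hostState;
    printf("processed %d bodies on %d GPUs\n", nBodies, nGPUs);
    return 0;
}
```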
Results – Contacts vs. Time
[Chart, quad Tesla C1060 configuration: collision detection time (sec) vs. number of contacts, up to 6 billion contacts.]
Conclusions
Work aimed at enabling high-fidelity discrete models using a physics-based approach
The approach draws on a CPU + GPU parallel computing strategy that leverages a Heterogeneous Computing Template
Accomplishments to date
Billion-body parallel collision detection
Parallel solution of the cone complementarity problem with about 12 million unknowns
Early validation results are encouraging
Aiming at billion-body simulations
Ongoing/Future Work
Massively parallel linear algebra for the solution of the CCP
More general collision detection code
Multiphysics:
Fluid-solid interaction; electrostatics
Thank You.
Validation.
Simulation is doomed to succeed. (Rod Brooks, roboticist)
Model → Simulate → Validate
Validation at “microscale” – University of Wisconsin-Madison
Work in progress
Validation at “macroscale” – University of Parma, Italy
Flat Hopper Tests
Video recording from a test (a case that starts from high crystallization)
Flat Hopper Tests
3D rendering from a simulation (4x slower than real-time)
Flat Hopper Tests
Comparison: experimental vs. simulated
Experimental
Simulated
Validation at Microscale
Sand flow rate measurements
Approx. 40K bodies
Glass beads, diameter 100–500 microns
Experimental Setup
[Photo callouts: CPU connection, disruptor, beads, nanopositioner controller, load cell, translational stage, nanopositioner]
Flow Measurement, 500 micron Spheres
Flow Simulation, 500 micron Spheres
Flow Measurement Results, 3 mm Gap Size [plot: weight [N] vs. time [sec]]
Flow Measurement Results, 2.5 mm Gap Size [plot: weight [N] vs. time [sec]]
Flow Measurement Results, 2 mm Gap Size [plot: weight [N] vs. time [sec]]
Flow Measurement Results, 1.5 mm Gap Size [plot: weight [N] vs. time [sec]]
Validation Experiment: Repose Angle
Experiment vs. simulation: repose angle φ = 19.5° for µ = 0.39
Validation Experiment: Flow and Stagnation
Validation, Flow and Stagnation
Validation, Flow and Stagnation
Spherical Decomposition
Represent complex geometry as a union of spheres
Fast parallel collision detection on GPU
Allows non-convex geometry
NOTE: Used only for CD; the dynamics is performed using the original geometry
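A sketch of how the sphere decomposition can be consumed by a sphere-based CD stage: each decomposition sphere carries the id of its parent body, the narrow phase is plain sphere-sphere overlap, and pairs whose spheres belong to the same body are skipped. The structure and names are assumptions.

```cuda
#include <cmath>

// A decomposition sphere: position, radius, and the rigid body it belongs to.
struct DecompSphere {
    double x, y, z, r;
    int bodyId;
};

// Narrow-phase test between two decomposition spheres. Spheres belonging to the
// same parent body are skipped; any contact found is reported against the parent
// bodies (the dynamics itself still uses the original geometry, per the slide).
__host__ __device__ bool sphereSphereContact(const DecompSphere& a,
                                             const DecompSphere& b,
                                             double& penetration)
{
    if (a.bodyId == b.bodyId) return false;  // same body: ignore
    double dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    double dist = sqrt(dx * dx + dy * dy + dz * dz);
    penetration = (a.r + b.r) - dist;
    return penetration > 0.0;
}
```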
Examples…
Chain model
10 links, 7,797 spheres per link
Plow model
31,791 spheres in the plow blade model; 15,000 spheres representing the terrain
Numerical Results: Pebble Bed Nuclear Reactor
Two types of tests were run
On the GPU: CD on an NVIDIA 8800 GT, CCP on an NVIDIA Tesla C870
On the CPU: single-threaded, quad-core Intel Xeon E5430 at 2.66 GHz
The reactor contains spheres that flow out of the bottom nozzle and are recycled to the top of the reactor
Performed simulations with 16K, 32K, 64K, and 128K bodies
Results: Average Duration for Taking One Integration Step [Δt = 0.01 s]
Pebble Bed Reactor Demo, GPU vs. CPU
[Chart: average total time per step vs. number of spheres in the pebble bed reactor (16,000–128,000), with linear fits: GPU y = 6E-05x − 0.51 (R² = 0.9936); CPU y = 0.0007x − 6.3447 (R² = 0.996).]
Algorithmic Challenges: Gauss-Jacobi Doesn’t Scale Well
Convergence stalls for certain classes of problems
Very large problems
Problems where many bodies are stacked on top of each other
This is an inherent problem with Gauss-Jacobi and pertains to the propagation of information
Granular dynamics problems have an intrinsic degree of redundancy
Ellipsoid-Ellipsoid CD: Visualization
Example: Ellipsoid-Ellipsoid CD

[Figure: two ellipsoids ε1 and ε2 with closest points P1 and P2, surface normals n1 and n2, the connecting vector c, and parameters α1, α2 in the (x, y, z) frame.]

With A the rotation matrix, R = diag(r1, r2, r3) the semi-axes, b the translation of the ellipsoid's center, M = A R² Aᵀ, and λ² = ¼ nᵀ M n (n: normal direction):

$$ d = P_1 - P_2 = \left(\frac{1}{2\lambda_1} M_1 + \frac{1}{2\lambda_2} M_2\right) c + (b_1 - b_2) $$

$$ \frac{\partial d}{\partial \alpha_i} = \frac{\partial P_1}{\partial \alpha_i} - \frac{\partial P_2}{\partial \alpha_i}, \qquad \frac{\partial^2 d}{\partial \alpha_i\,\partial \alpha_j} = \frac{\partial^2 P_1}{\partial \alpha_i\,\partial \alpha_j} - \frac{\partial^2 P_2}{\partial \alpha_i\,\partial \alpha_j} $$

$$ \frac{\partial P}{\partial \alpha_i} = \left(\frac{1}{2\lambda} M - \frac{1}{8\lambda^3} M c c^T M\right) \frac{\partial c}{\partial \alpha_i} $$

$$ \begin{aligned} \frac{\partial^2 P}{\partial \alpha_i\,\partial \alpha_j} = {} & \left(-\frac{1}{8\lambda^3} M + \frac{3}{32\lambda^5} M c c^T M\right)\!\left(c^T M \frac{\partial c}{\partial \alpha_j}\right)\frac{\partial c}{\partial \alpha_i} \\ & - \frac{1}{8\lambda^3}\left[\left(c^T M \frac{\partial c}{\partial \alpha_i}\right) M + M c \left(\frac{\partial c}{\partial \alpha_i}\right)^{T} M\right] \frac{\partial c}{\partial \alpha_j} \\ & + \left(\frac{1}{2\lambda} M - \frac{1}{8\lambda^3} M c c^T M\right)\frac{\partial^2 c}{\partial \alpha_i\,\partial \alpha_j} \end{aligned} $$
Going Back to Practical Problem of Interest
Tracked Vehicle on Granular Terrain…
Spherical Decomposition Steps
1: Take shoe element
2: Cubit
3: Spherical Padding
Track Components: 1,594,908 spheres per track
Granular Terrain Model
Represent terrain as a collection of discrete particles
Match the terrain surface profile
Capture changing granularity with depth
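A sketch of one way to realize these three bullets: fill the volume under a surface-profile function with layers of particles whose radius grows with depth. The profile function, grid spacing, and growth law are illustrative assumptions, not the actual terrain generator.

```cuda
#include <vector>
#include <cmath>
#include <cstdio>

struct Particle { double x, y, z, r; };

// Height of the terrain surface at (x, y) -- illustrative profile.
double surfaceHeight(double x, double y) { return 0.05 * sin(x) * cos(y); }

// Particle radius as a function of depth below the surface -- illustrative law.
double radiusAtDepth(double depth) { return 0.025 + 0.01 * depth; }

int main() {
    std::vector<Particle> terrain;
    const double xMin = 0.0, xMax = 4.0, yMin = 0.0, yMax = 4.0, zBottom = -1.0;

    // March down from the surface, column by column, placing particles in layers.
    for (double x = xMin; x <= xMax; x += 0.06) {
        for (double y = yMin; y <= yMax; y += 0.06) {
            double zTop = surfaceHeight(x, y);        // match the surface profile
            double z = zTop;
            while (z > zBottom) {
                double r = radiusAtDepth(zTop - z);   // coarser particles with depth
                terrain.push_back({x, y, z - r, r});
                z -= 2.0 * r;                         // next layer below
            }
        }
    }
    printf("generated %zu terrain particles\n", terrain.size());
    return 0;
}
```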
Track Simulation 1 Parameters:
• Driving speed: 1.0 rad/sec
• Length: 12 seconds
• Time step: 0.005 sec
• Computation time: 18.5 hours
• Particle radius: 0.027273 m
• Terrain: 284,715 particles
Track Simulation 2 Parameters:
• Driving speed: 1.0 rad/sec
• Length: 10 seconds
• Time step: 0.005 sec
• Computation time: 17.8 hours
• Particle radius: 0.025 ± 0.0025 m
• Terrain: 467,100 particles
Results: Track ‘Footprint’
Results: Positions Track Simulation 1