Today's Agenda:
• Introduction
• GPU Architecture
• CUDA Primer
• My First GPU Ray Tracer
Ray Tracing for Games
Introduction
Supercomputing for the Masses*
A GPU offers substantially more compute power than a CPU:
• NVidia TitanX: 6.6 TFLOPS
• AMD Radeon R9 Fury: 8.6 TFLOPS
• Intel Xeon D-1540 (8 cores): 256 GFLOPS ($680; April 2016)
*: Supercomputing for the Masses, Rob Farber in Dr. Dobb's, 2008: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/207200659
Introduction
Supercomputing for the Masses
GPUs also have substantially more bandwidth:
• AMD R9 Fury: 512 GB/s
• NVidia TitanX: 336.5 GB/s
• Intel Xeon: 118 GB/s
Ray tracing requires compute power as well as bandwidth.
Introduction
Supercomputing for the Masses
GPUs have a unique way of dealing with latencies:
• CPUs rely on caches to reduce average memory access time.
• GPUs rely on massive parallelism to hide latencies.
The CPU approach works well if memory access is somewhat coherent. The GPU approach works well if there is sufficient parallelism (memory access coherence is irrelevant).
Introduction
Supercomputing for the Masses
And finally, GPUs are optimized for graphics. Although ray tracing is a GPGPU task, we still benefit:
• Texture filtering hardware
• Very fast sin/cos/tan and square root
• Efficient conversion between float and int
• Efficient interpolation and clamping
• Fast interop with OpenGL / DirectX
The GPU is a perfect match for ray tracing… but it comes with some peculiarities.
Today's Agenda:
• Introduction
• GPU Architecture
• CUDA Primer
• My First GPU Ray Tracer
GPU Architecture
[Diagram: a CPU with a few cores sharing a cache and RAM, next to a GPU with many multiprocessors and its own RAM.]

CPU:
• Small number of cores
• Optimized for generic tasks
• Hyperthreading: two threads per core

GPU:
• Small number of cores ("multiprocessors")
• Optimized for parallel tasks
• Many threads per core, grouped in warps
GPU Architecture
Warp:
• 32 threads, running the same instructions in lock step (SIMT).
• In case of delays, the stalled warp is swapped for another warp.
• A single multiprocessor manages several warps (up to 64).
• Switching between warps is governed by the hardware.
• Each warp (active or not) has its own registers; switching is "instant".
• A modern GPU can execute 4 warps simultaneously (while 60 wait).
• A modern GPU can have up to 24 multiprocessors.

24 multiprocessors × 64 warps × 32 threads = 49,152 threads in flight.
GPU Architecture
Feeding the Beast
How do we feed such a processor sufficient work?
We feed it many identical tasks. For a ray tracer:
• One thread per pixel.
GPU Architecture
GPU Memory Model
NVidia Maxwell architecture*:
Registers: 65,536 per multiprocessor, 1-cycle access time
GPU Architecture
GPU Memory Model
Consequences of the memory model:
1. Caches are either very small (L1) or slow (L2).
2. This is compensated by "shared memory", which we have to manage manually.
3. The memory hierarchy is (at least partially) explicit, rather than implicit as on the CPU.
4. We have to trade registers per thread for number of threads: at 2048 threads per multiprocessor, each thread can use only 32 registers (a single float4 is four registers). Beyond this count, "register spilling" occurs.
⇒ It's probably better to feed the GPU small programs.
⇒ We have to be really careful when spending memory.
Today's Agenda:
• Introduction
• GPU Architecture
• CUDA Primer
• My First GPU Ray Tracer
CUDA Primer
Today's Agenda:
• Introduction
• GPU Architecture
• CUDA Primer
• My First GPU Ray Tracer
TOTAL RECAP
Lecture 1a
• Game development
• Game architecture
• The Template
• Tick
• Realtime
• Actors
• World state
• Scene graph
• Data ownership
• Killing a scene graph
Lecture 1b
• Rasterization: limitations
• Millions of LOC
• The attraction of ray tracing
• RT state of the art
• RT ingredients
• Intersections
• Basic RT algorithm
• Building a ray tracer in a day
Lecture 2
• The Art of Optimization
• Bottlenecks & scalability
• Measure!
• The God algorithm
• High level optimization
• Low level optimization
• Data centric & caching
• Data locality
• Thread level parallelism
• Instruction level parallelism
Lecture 3
• RT: Optimizing high level
• Acceleration: grids
• Acceleration: nested grids
• BVH
• BVH data layout
• BVH traversal
• Packet traversal
• Binned BVH building
• Ray / box intersection
Lecture 4
• SIMD
• SSE
• _mm_rsqrt_ps
• AoS & SoA
• Vectorization
• Masking: _mm_cmplt_ps
• SIMD in ray tracing
Lecture 5
• Top-level BVH
• Geometry classification
• Static, rigid, deforming
• Refitting
• Agglomerative clustering
• Ray transform
• Efficient traversal
• Ray coherence
• Ray packets for shadows
• More optimizations…
• Multithreading
Lecture 6
• Whitted-style
• Ray optics
• Physical basis
• Snell, Fresnel
• Rendering equation
• Monte-Carlo integration
• Distributed ray tracing
• Motion blur, depth of field
Lecture 7
• Sampling
• Stratification
• Explicit light paths
• Importance
• Cosine-weighted
• Probability Density Function
• Resampled Importance
• Light array
Lecture 8
• GPU Architecture
• GPGPU
• CUDA
• My First GPU Ray Tracer