Introduction to GPU Computing. Mike Clark, NVIDIA Developer Technology Group
Outline (Today)
- Motivation
- GPU Architecture
- Three ways to accelerate applications
- ...
CPU
- Optimized for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU
- Optimized for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation
Small Changes, Big Speed-up
[Diagram: the application code is split in two. The compute-intensive functions are parallelized on the GPU, while the rest of the sequential code continues to run on the CPU.]
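The split above can be sketched with a minimal CUDA SAXPY example. This is the standard illustration of offloading one compute-intensive loop, not code from the talk; the array size and launch configuration are arbitrary:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// The compute-intensive function, moved to the GPU: y = a*x + y (SAXPY).
// Each thread handles one array element -- the data-parallel pattern.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;                       // copies in GPU global memory
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Only this call changes; the rest of the sequential code stays on the CPU.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);         // 2*1 + 2 = 4.000000

    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}
```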
GPUs Accelerate Science
- 146X: Medical Imaging (U of Utah)
- 36X: Molecular Dynamics (U of Illinois, Urbana)
- 18X: Video Transcoding (Elemental Tech)
- 50X: Matlab Computing (AccelerEyes)
- 100X: Astrophysics (RIKEN)
- 149X: Financial Simulation (Oxford)
- 47X: Linear Algebra (Universidad Jaime)
- 20X: 3D Ultrasound (Techniscan)
- 130X: Quantum Chemistry (U of Illinois, Urbana)
- 30X: Gene Sequencing (U of Maryland)
NVIDIA GPU Roadmap: Increasing Performance/Watt
[Chart: sustained double-precision GFLOPS per watt by generation, rising from Tesla (2008) through Fermi (2010) and Kepler (2012) to Maxwell (2014), climbing toward roughly 16 GFLOPS/W.]
GPU Architecture
GPU Architecture: Two Main Components

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 6 GB
- Bandwidth currently up to 177 GB/s for Quadro and Tesla products
- ECC on/off option for Quadro and Tesla products

Streaming Multiprocessors (SMs)
- Perform the actual computations
- Each SM has its own: control units, registers, execution pipelines, caches

[Die diagram: the SM array surrounded by the host interface, GigaThread scheduler, shared L2 cache, and six DRAM interfaces.]
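Because global memory is visible to the host, its size and theoretical bandwidth can be queried at runtime. A small sketch using the CUDA runtime's device-properties API (output depends on the installed GPU; the bandwidth formula assumes DDR signaling, i.e. two transfers per clock):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Global memory size, as visible from the host.
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);

    // Theoretical bandwidth: memory clock (kHz) x bus width (bits -> bytes) x 2 for DDR.
    double gbps = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Peak memory bandwidth: %.0f GB/s\n", gbps);
    return 0;
}
```

For a Fermi-class Tesla board this calculation lands near the 177 GB/s figure quoted above.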
GPU Architecture – Fermi: Streaming Multiprocessor (SM)
- 32 CUDA cores per SM
  - 32 fp32 ops/clock
  - 16 fp64 ops/clock
  - 32 int32 ops/clock
- 2 warp schedulers
  - Up to 1536 threads resident concurrently
- 4 special-function units
- 64 KB shared memory + L1 cache
- 32K 32-bit registers