Introduction to GPU Computing Mike Clark, NVIDIA Developer Technology Group

Outline

Today:
- Motivation
- GPU architecture
- Three ways to accelerate applications

Tomorrow:
- QUDA: QCD on GPUs

Why GPU Computing?

[Figure: two charts covering 2003-2010. Left: peak GFlops/sec, single and double precision, NVIDIA GPUs versus x86 CPUs (Tesla 20-series versus 3 GHz Nehalem and Westmere). Right: peak memory bandwidth in GBytes/sec, NVIDIA GPUs (ECC off) versus x86 CPUs.]

Stunning Graphics Realism (Crysis © 2006 Crytek / Electronic Arts)

Lush, Rich Worlds (© id Software)

Incredible Physics Effects (Hellgate: London © 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.)

Core of the Definitive Gaming Platform (Full Spectrum Warrior: Ten Hammers © 2006 Pandemic Studios, LLC. All rights reserved. © 2006 THQ Inc. All rights reserved.)

Add GPUs: Accelerate Science Applications

[Figure: a CPU paired with a GPU; an N-body simulation running on the GPU versus the CPU.]

Low Latency or High Throughput?

CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution.

GPU: optimized for data-parallel, throughput computation; an architecture tolerant of memory latency; more transistors dedicated to computation.

Small Changes, Big Speed-up

[Figure: the application code is split: the compute-intensive functions move to the GPU, where they are parallelized, while the rest of the sequential code keeps running on the CPU, and the two processors work together.]
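A minimal sketch of this pattern in CUDA C, assuming a hypothetical saxpy hot spot; everything around the kernel launch stays ordinary sequential CPU code:

    // The compute-intensive function becomes a GPU kernel:
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // The CPU code launches it where the old loop used to be
    // (d_x and d_y are device pointers set up elsewhere):
    void run_hotspot(int n, float a, const float *d_x, float *d_y)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
        cudaDeviceSynchronize();  // the sequential CPU code resumes here
    }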

GPUs Accelerate Science

- 146X: Medical Imaging, U of Utah
- 36X: Molecular Dynamics, U of Illinois, Urbana
- 18X: Video Transcoding, Elemental Tech
- 50X: Matlab Computing, AccelerEyes
- 100X: Astrophysics, RIKEN
- 149X: Financial Simulation, Oxford
- 47X: Linear Algebra, Universidad Jaime
- 20X: 3D Ultrasound, Techniscan
- 130X: Quantum Chemistry, U of Illinois, Urbana
- 30X: Gene Sequencing, U of Maryland

NVIDIA GPU Roadmap: Increasing Performance/Watt

[Figure: sustained double-precision GFLOPS per watt, 2008-2014, rising across the architecture generations Tesla (2008), Fermi (2010), Kepler (2012), and Maxwell (2014), on a vertical scale running to 16 GFLOPS per watt.]

GPU Architecture

GPU Architecture: Two Main Components

Global memory:
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 6 GB
- Bandwidth currently up to 177 GB/s for Quadro and Tesla products
- ECC on/off option for Quadro and Tesla products

Streaming Multiprocessors (SMs):
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches

[Figure: die-level block diagram showing the host interface, GigaThread engine, L2 cache, six DRAM interfaces, and the array of SMs.]
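A minimal sketch of inspecting this global memory from host code, using only the standard CUDA runtime API:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t free_bytes, total_bytes;
        cudaMemGetInfo(&free_bytes, &total_bytes);  // queries device global memory
        printf("GPU global memory: %zu MB total, %zu MB free\n",
               total_bytes >> 20, free_bytes >> 20);
        return 0;
    }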

GPU Architecture – Fermi: Streaming Multiprocessor (SM)

- 32 CUDA cores per SM: 32 fp32 ops/clock, 16 fp64 ops/clock, 32 int32 ops/clock
- 2 warp schedulers: up to 1536 threads resident concurrently
- 4 special-function units
- 64 KB shared memory + L1 cache
- 32K 32-bit registers

[Figure: Fermi SM block diagram: instruction cache; two schedulers, each with a dispatch unit; register file; 32 cores; 16 load/store units; 4 special-function units; interconnect network; 64K configurable cache/shared memory; uniform cache.]
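The 64 KB shared memory / L1 split is configurable per kernel from host code; a minimal sketch, assuming a Fermi-class device and a hypothetical kernel that stages data through shared memory:

    // Kernel that stages data through per-block shared memory.
    __global__ void stage(const float *in, float *out)
    {
        __shared__ float tile[256];  // per-block scratch in shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = tile[threadIdx.x];
    }

    // Host side: request the larger shared-memory split for this kernel.
    void configure(void)
    {
        cudaFuncSetCacheConfig(stage, cudaFuncCachePreferShared);
    }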

Kepler SM versus Fermi SM

[Figure: side-by-side block diagrams of the Kepler SM and the Fermi SM. The Kepler SM shows an instruction cache; four warp schedulers, each with two dispatch units; a 65,536 x 32-bit register file; a large array of CUDA cores interleaved with load/store units and special-function units; an interconnect network; 64 KB shared memory / L1 cache; and a uniform cache. An inset of a single CUDA core shows its dispatch ports, operand collector, ALU, and result queue.]

3 Ways to Accelerate Applications

- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility
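As a taste of the directives approach, a minimal OpenACC sketch (a hypothetical saxpy loop; an OpenACC-capable compiler is assumed, and the pragma is the only change to the CPU code):

    // The directive asks the compiler to parallelize the loop on the GPU;
    // the loop body itself is unchanged C.
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }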

Libraries: Easy, High-Quality Acceleration

- Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming
- "Drop-in": many GPU-accelerated libraries follow standard APIs, enabling acceleration with minimal code changes
- Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
- Performance: NVIDIA libraries are tuned by experts

Some GPU-accelerated Libraries

- NVIDIA cuBLAS
- NVIDIA cuFFT
- NVIDIA cuSPARSE
- NVIDIA cuRAND
- NVIDIA NPP
- Vector Signal Image Processing
- GPU Accelerated Linear Algebra
- Matrix Algebra on GPU and Multicore
- Sparse Linear Algebra
- C++ STL Features for CUDA
- IMSL Library
- ArrayFire Matrix Computations
- Building-block Algorithms for CUDA
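As one concrete example, a minimal cuFFT sketch: an in-place 1D complex transform on data already resident in GPU memory (the setup is hypothetical; error checking omitted):

    #include <cufft.h>

    void fft_forward(cufftComplex *d_data, int n)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);                // one transform of length n
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward FFT
        cufftDestroy(plan);
    }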

3 Steps to a CUDA-accelerated Application

Step 1: Substitute library calls with equivalent CUDA library calls
    saxpy( … )  ->  cublasSaxpy( … )

Step 2: Manage data locality
    with CUDA:   cudaMalloc(), cudaMemcpy(), etc.
    with CUBLAS: cublasAlloc(), cublasSetVector(), etc.

Step 3: Rebuild and link the CUDA-accelerated library
    nvcc myobj.o -lcublas
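Putting the three steps together, a minimal sketch using the legacy cuBLAS helpers named above (host arrays x and y are assumed to exist; error checking omitted):

    #include <cublas.h>

    void gpu_saxpy(int n, float a, const float *x, float *y)
    {
        float *d_x, *d_y;
        cublasInit();

        // Step 2: manage data locality -- copy x and y into GPU memory
        cublasAlloc(n, sizeof(float), (void **)&d_x);
        cublasAlloc(n, sizeof(float), (void **)&d_y);
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

        // Step 1: the substituted library call
        cublasSaxpy(n, a, d_x, 1, d_y, 1);

        // Copy the result back and release GPU memory
        cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();
    }

    // Step 3: rebuild and link against the library:  nvcc myobj.o -lcublas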

Drop-In Acceleration (Step 1)
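A minimal sketch of the call-site substitution, assuming a 1M-element problem and device-resident vectors d_x and d_y:

    #include <cublas.h>

    // Hypothetical CPU BLAS routine being replaced.
    void saxpy(int n, float a, const float *x, int incx, float *y, int incy);

    void call_site(float *x, float *y, float *d_x, float *d_y)
    {
        int N = 1 << 20;  // 1M elements (size assumed for illustration)

        // Before: CPU BLAS call on host arrays.
        saxpy(N, 2.0f, x, 1, y, 1);

        // After: drop-in cuBLAS call with the same shape, on device arrays.
        cublasSaxpy(N, 2.0f, d_x, 1, d_y, 1);
    }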