Author: Aubrey Scott

7/14/10

Heterogeneous Computing -> Fusion

Phil Rogers
AMD Corporate Fellow

1 | Heterogeneous Computing -> Fusion | June 2010

Definitions

• Heterogeneous Computing
  – A system composed of two or more compute engines with significant structural differences
  – In our case, a low latency x86 CPU and a high throughput Radeon GPU
• Fusion
  – Bringing together two or more components and joining them into a single unified whole
  – In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power

AMD Balanced Platform Advantage

CPU is ideal for scalar processing
• Out of order x86 cores with low latency memory access
• Optimized for sequential and branching algorithms
• Runs existing applications very well

GPU is ideal for parallel processing
• GPU shaders optimized for throughput computing
• Ready for emerging workloads
• Media processing, simulation, natural UI, etc.

[Diagram labels: Serial/Task-Parallel Workloads; Graphics Workloads; Other Highly Parallel Workloads]

Three Eras of Processor Performance

Single-Core Era
• Enabled by: Moore’s Law; Voltage Scaling; MicroArchitecture
• Constrained by: Power; Complexity
• [Chart: Single-thread Performance vs. Time; marker "we are here"]

Multi-Core Era
• Enabled by: Moore’s Law; Desire for Throughput; 20 years of SMP arch
• Constrained by: Power; Parallel SW availability; Scalability
• [Chart: Throughput Performance vs. Time (# of Processors); marker "we are here"]

Heterogeneous Systems Era
• Enabled by: Moore’s Law; Abundant data parallelism; Power efficient GPUs
• Temporarily constrained by: Programming models; Communication overheads
• [Chart: Targeted Application Performance vs. Time (Data-parallel exploitation); marker "we are here"]


Emerging Application Spaces

Category: Massive Data Mining
• Characteristics: Full 64b addressing; Huge data sets; New data types
• Application Examples: Image, Video, Audio processing; Pattern analytics and search

Category: Natural User Interfaces
• Characteristics: Massive “behind-the-scenes” computing
• Application Examples: Face and gesture recognition; Real time video & audio proc; Physical world interpretation

Category: Visualization
• Characteristics: Advanced rendering; Interactive physics
• Application Examples: Multi-layered Graphics; Holographic Displays; Scientific visualization & CAD; Next generation Gaming

Category: Cloud + Client Applications
• Characteristics: Seamless responsiveness; Workload partitioning
• Application Examples: Next generation browsers; HTML5 Apps with Native Code from JavaScript

GPU SP ALU Performance
[Chart: series HD5870, HD4870, CPU]

GPU DP ALU Performance
[Chart: series HD5870, HD4870, CPU]

GPU BW Performance expectations over time
[Chart: vertical axis 0 to 300; series HD5870, HD4870, CPU]


GPU Computing Efficiency Trend
[Chart: GFLOPS/W over time; highlighted value 14.47 GFLOPS/W]

Thread Processors
• 5-way VLIW Architecture
  – 4 Stream Cores and 1 Special Function Stream Core
  – Separate Branch Unit
• All 5 cores co-issue
  – Scheduling across the cores is done by the compiler
  – Each core delivers a 32-bit result per clock
  – Thread Processor writes 5 results per clock

SIMD Engines
• Each SIMD Unit includes:
  – 16 Thread Processors (80 shader cores) + 32KB Local Data Share
  – Its own Thread Sequencer which operates a shared set of threads
  – A dedicated fetch unit with an 8KB L1 cache

ATI Radeon™ HD 5870 Compute Architecture
• 20 SIMD Engines
  – 1600 shader cores
• Ultra-Threaded Dispatch Processor
• Instruction and Constant Caches
• Memory Export Buffer
• Fetch path with multi-level caches
• Global Data Store
• Diagram shows 2 SIMD Engines
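The core counts on the Thread Processor, SIMD Engine, and Compute Architecture slides can be cross-checked with a little arithmetic. The sketch below multiplies out the per-level counts quoted on the slides. It then estimates peak single-precision throughput under two assumptions that are not stated on these slides: an 850 MHz engine clock, and one multiply-add per core per clock counted as 2 FLOPs. The function and constant names are illustrative only.

```python
# Counts quoted on the slides.
SIMD_ENGINES = 20
THREAD_PROCESSORS_PER_SIMD = 16
CORES_PER_THREAD_PROCESSOR = 5  # 4 stream cores + 1 special function stream core

# Assumptions (NOT stated on the slides).
ENGINE_CLOCK_MHZ = 850          # assumed HD 5870 engine clock
FLOPS_PER_CORE_PER_CLOCK = 2    # assumed: one multiply-add counted as 2 FLOPs


def cores_per_simd():
    """Shader cores in one SIMD engine."""
    return THREAD_PROCESSORS_PER_SIMD * CORES_PER_THREAD_PROCESSOR


def shader_cores():
    """Shader cores across the whole chip."""
    return SIMD_ENGINES * cores_per_simd()


def peak_sp_gflops():
    """Peak single-precision GFLOPS under the assumptions above."""
    return shader_cores() * FLOPS_PER_CORE_PER_CLOCK * ENGINE_CLOCK_MHZ / 1000


if __name__ == "__main__":
    print("cores per SIMD engine:", cores_per_simd())
    print("total shader cores:", shader_cores())
    print("estimated peak SP GFLOPS:", peak_sp_gflops())
```

Under these assumptions the totals agree with the 80 shader cores per SIMD engine and 1600 shader cores quoted on the slides, and the throughput estimate lands on the 2720 SP GFlops figure used elsewhere in the deck.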


TeraScale 2 Architecture – Radeon HD 5870
[Diagram]

Memory Hierarchy
• Distributed Memory Controller
  – Optimized for latency hiding and memory access efficiency
  – GDDR5 memory at 150 GB/s
• Up to 272 billion 32-bit fetches/second
• Up to 1 TB/sec L1 texture fetch bandwidth
• Up to 435 GB/sec between L1 & L2
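The fetch-rate and bandwidth figures on the Memory Hierarchy slide appear to be related by a simple unit conversion. The sketch below shows this using only numbers from the slide, plus the fact that a 32-bit fetch is 4 bytes. The names are illustrative, and treating the L1 bandwidth figure as "fetch rate times fetch size" is an interpretation rather than something the slide states.

```python
# Figures quoted on the Memory Hierarchy slide.
FETCHES_BILLION_PER_S = 272   # "Up to 272 billion 32-bit fetches/second"
GDDR5_GB_PER_S = 150          # "GDDR5 memory at 150GB/s"
L1_L2_GB_PER_S = 435          # "Up to 435 GB/sec between L1 & L2"

BYTES_PER_FETCH = 32 // 8     # a 32-bit fetch is 4 bytes


def l1_fetch_bandwidth_gb_per_s():
    """Bandwidth implied by the peak fetch rate, in GB/s."""
    return FETCHES_BILLION_PER_S * BYTES_PER_FETCH


def bandwidth_ratio_to_dram(gb_per_s):
    """How many times larger a cache-level bandwidth is than the GDDR5 figure."""
    return gb_per_s / GDDR5_GB_PER_S


if __name__ == "__main__":
    l1 = l1_fetch_bandwidth_gb_per_s()
    print("implied L1 fetch bandwidth (GB/s):", l1)
    print("L1 vs GDDR5:", round(bandwidth_ratio_to_dram(l1), 2))
    print("L1<->L2 vs GDDR5:", round(bandwidth_ratio_to_dram(L1_L2_GB_PER_S), 2))
```

The implied figure is a little over 1 TB/s, consistent with the slide's "up to 1 TB/sec". The ratios give a rough sense of how much more bandwidth each cache level offers than off-chip memory.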

Comparative Stats on ATI Radeon HD 5870 GPU

                      AMD Opteron™    ATI Radeon™    ATI Radeon™    One Year
                      Model 2435      HD 4870        HD 5870        Difference
Die Size              346 mm²         263 mm²        334 mm²        1.27x
Transistors           904 million     956 million    2.15 billion   2.25x
Memory Bandwidth      12.8 GB/s       115 GB/s       153 GB/s       1.33x
SP GFlops             124.8           1200           2720           2.25x
DP GFlops             62.4            240            544            2.25x
ALUs                  54              800            1600           2x
Board Power* (Idle)   15.5 W          90 W           27 W           0.3x
Board Power* (Max)    115 W           160 W          188 W          1.17x

* Based on internal AMD testing

Yesterday’s Chip Designs Won’t Do
• 105 million transistors @130nm
  – Compute tasks including video decode
• 110 million transistors @150nm
  – 2D and 3D gaming
  – Nascent video processing
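The "One Year Difference" column and the efficiency figure can be reproduced from the HD 4870 and HD 5870 columns of the comparison table. The sketch below does that, using only values from the table. The dictionary keys and function names are illustrative. Dividing SP GFlops by maximum board power is an assumption about how the 14.47 GFLOPS/W figure on the efficiency slide was derived.

```python
# Values copied from the comparison table (HD 4870 vs. HD 5870).
HD4870 = {"die_mm2": 263, "transistors_m": 956, "mem_bw_gb_s": 115,
          "sp_gflops": 1200, "dp_gflops": 240, "alus": 800,
          "idle_w": 90, "max_w": 160}
HD5870 = {"die_mm2": 334, "transistors_m": 2150, "mem_bw_gb_s": 153,
          "sp_gflops": 2720, "dp_gflops": 544, "alus": 1600,
          "idle_w": 27, "max_w": 188}


def one_year_ratios():
    """HD 5870 value divided by HD 4870 value, per table row."""
    return {key: HD5870[key] / HD4870[key] for key in HD4870}


def sp_gflops_per_watt(gpu):
    """Single-precision GFLOPS per watt at maximum board power (assumed basis)."""
    return gpu["sp_gflops"] / gpu["max_w"]


if __name__ == "__main__":
    for key, ratio in one_year_ratios().items():
        print(f"{key}: {ratio:.3f}x")
    print(f"HD 4870: {sp_gflops_per_watt(HD4870):.2f} GFLOPS/W")
    print(f"HD 5870: {sp_gflops_per_watt(HD5870):.2f} GFLOPS/W")
```

On this basis the HD 5870 works out to about 14.47 GFLOPS/W, matching the efficiency slide. The SP and DP ratios compute to roughly 2.27 rather than the 2.25x shown in the table, which may simply reflect rounding on the slide.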


Today We Are Evolving
• 758 million transistors @45nm
  – Multi-tasking
  – Most compute tasks
• 2.15 billion transistors @40nm
  – 3D OS
  – Multi-panel HD gaming
  – Full HD video and audio

Tomorrow Will Amaze
• ~1 billion transistors @32nm in one design
• APU: Fusion of CPU & GPU compute power within one processor
• High-bandwidth I/O
• Significantly enhances active/resting battery life

AMD Fusion™ APUs Fill the Need

x86 CPU owns the Software World
• Windows, MacOS and Linux franchises
• Thousands of apps
• Established programming and memory model
• Mature tool chain
• Extensive backward compatibility for applications and OSs
• High barrier to entry

GPU Optimized for Modern Workloads
• Enormous parallel computing capacity
• Outstanding performance-per-watt-per-dollar
• Very efficient hardware threading
• SIMD architecture well matched to modern workloads: video, audio, graphics

Fusion APUs: Putting it all together
[Chart: Programmer Accessibility (Unacceptable, Experts Only, Mainstream) vs. Throughput Performance. Curves: Microprocessor Advancement; GPU Advancement. Labels: High Performance Task Parallel Execution; Power-efficient Data Parallel Execution; Graphics Driver-based programs; OCL/DC Driver-based programs; System-level Programmable]


PC with Discrete GPU
[Diagram]

Fusion APU Based PC
[Diagram]

Two x86 Cores Tuned for Target Markets
• “Bulldozer”
• “Bobcat”

Heterogeneous Computing: Next-Generation Software Ecosystem
• Increase ease of application development
• Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages
• Drive new features into industry standards


Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific (cross-platform limiters)
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface

Vendor neutral (cross-platform enablers)
[Logos]

OpenCL™ and DirectX® 11 DirectCompute

How will developers choose?
• DirectX® 11 DirectCompute
  – Easiest path to add compute capabilities to existing DirectX applications
  – Windows Vista® and Windows® 7 only
• OpenCL™
  – Ideal path for new applications porting to the GPU for the first time
  – True multiplatform: Windows®, Linux®, MacOS
  – Natural programming without dealing with a graphics API
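To give a feel for what "natural programming without dealing with a graphics API" means, here is a minimal pure-Python emulation of the data-parallel style these compute APIs expose: a small kernel function runs once per element, identified by an index. This is not OpenCL or DirectCompute code and uses no real API. `run_ndrange` and `saxpy_kernel` are hypothetical names, and the sequential loop stands in for work that a GPU would run in parallel.

```python
def saxpy_kernel(gid, a, x, y, out):
    """Body executed by one work-item: out[gid] = a * x[gid] + y[gid]."""
    out[gid] = a * x[gid] + y[gid]


def run_ndrange(kernel, global_size, *args):
    """Hypothetical launcher: invoke the kernel once per index.

    A real device would run these invocations in parallel across its
    shader cores; this emulation simply loops over them in order.
    """
    for gid in range(global_size):
        kernel(gid, *args)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    out = [0.0] * len(x)
    run_ndrange(saxpy_kernel, len(x), 2.0, x, y, out)
    print(out)
```

The point of the sketch is the shape of the program: the kernel is written as ordinary per-element arithmetic on arrays, with no vertices, textures, or shaders involved.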

The Benefits of Fusion
• Unparalleled processing capabilities in mobile form factors
• Shared memory for the CPU and GPU
  – Eliminates copies, increasing performance
  – Reduces dispatch overhead
  – Lower latency from the GPU to memory
  – Power efficient design
• Enables architectural innovations between CPU, GPU and the Memory System
• Scalable architecture that can target a broad range of platforms from mobile to data center

The Fusion Opportunity
• A new architectural and performance balance point for computing
• A new machine target for research
• A high volume opportunity for new algorithms, new workloads and new applications
  – The deployment opportunity is especially strong in the consumer marketplace
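The claim that shared memory "eliminates copies, increasing performance" can be illustrated with a toy cost model. The sketch below compares an offload that must copy data across a link before and after the kernel with one that operates on shared memory in place. All numbers in the usage example are hypothetical and not taken from the slides. The model also assumes the same kernel time in both cases and ignores dispatch overhead, latency, and any overlap of copy with compute.

```python
def discrete_offload_time_s(bytes_in, bytes_out, kernel_s, link_gb_per_s):
    """Time for copy-in + kernel + copy-out over a link of the given bandwidth."""
    copy_s = (bytes_in + bytes_out) / (link_gb_per_s * 1e9)
    return copy_s + kernel_s


def shared_memory_offload_time_s(kernel_s):
    """Time when CPU and GPU share memory, so no copies are needed."""
    return kernel_s


def copy_overhead_fraction(bytes_in, bytes_out, kernel_s, link_gb_per_s):
    """Fraction of the discrete offload time spent purely on copies."""
    total = discrete_offload_time_s(bytes_in, bytes_out, kernel_s, link_gb_per_s)
    return (total - kernel_s) / total


if __name__ == "__main__":
    # Hypothetical workload: 1 GB in, 1 GB out, 0.1 s kernel, 8 GB/s link.
    args = (1e9, 1e9, 0.1, 8.0)
    print("discrete:", discrete_offload_time_s(*args), "s")
    print("shared:  ", shared_memory_offload_time_s(0.1), "s")
    print("copy share of discrete time:", round(copy_overhead_fraction(*args), 3))
```

In this model, the shorter the kernel is relative to the data it touches, the larger the share of time spent on copies, and so the more a shared-memory design stands to gain.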
