Heterogeneous Computing – A system composed of two or more compute engines with significant structural differences – In our case, a low-latency x86 CPU and a high-throughput Radeon GPU
Phil Rogers, AMD Corporate Fellow
Fusion – Bringing together two or more components and joining them into a single unified whole – In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power
1 | Heterogeneous Computing -> Fusion | June 2010
AMD Balanced Platform Advantage

CPU is ideal for scalar processing
• Out-of-order x86 cores with low-latency memory access
• Optimized for sequential and branching algorithms
• Runs existing applications very well
• Workloads: serial/task-parallel workloads

GPU is ideal for parallel processing
• GPU shaders optimized for throughput computing
• Ready for emerging workloads
• Media processing, simulation, natural UI, etc.
• Workloads: graphics workloads, other highly parallel workloads

Three Eras of Processor Performance

Single-Core Era
• Enabled by: Moore’s Law, voltage scaling, microarchitecture
• Constrained by: power, complexity
• [Chart: single-thread performance vs. time, with a "we are here" marker]

Multi-Core Era
• Enabled by: Moore’s Law, desire for throughput, 20 years of SMP architecture
• Constrained by: power, parallel SW availability, scalability
• [Chart: throughput performance vs. time (# of processors), with a "we are here" marker]

Heterogeneous Systems Era
• Enabled by: Moore’s Law, abundant data parallelism, power-efficient GPUs
• Temporarily constrained by: programming models, communication overheads
• [Chart: targeted application performance vs. time (data-parallel exploitation), with a "we are here" marker]

[One of the three charts also carries a "?" label.]
7/14/10
Emerging Application Spaces

Category: Massive Data Mining
  Characteristics: full 64-bit addressing; huge data sets; new data types
  Application examples: image, video, audio processing; pattern analytics and search

Category: Natural User Interfaces
  Characteristics: massive “behind-the-scenes” computing
  Application examples: face and gesture recognition; real-time video & audio processing; physical world interpretation

Category: Next generation browsers
  Characteristics: seamless responsiveness; workload partitioning
  Application examples: HTML5 apps with native code from JavaScript

GPU SP ALU Performance
GPU DP ALU Performance
GPU BW Performance
[Chart: performance expectations over time; vertical axis from 0 to 300; series labeled HD4870, HD5870 and CPU]
GPU Computing Efficiency Trend
[Chart: GPU computing efficiency in GFLOPS/W; labeled value 14.47 GFLOPS/W]

Thread Processors
• 5-way VLIW architecture
• 4 Stream Cores and 1 Special Function Stream Core
• Separate Branch Unit
• All 5 cores co-issue
• Scheduling across the cores is done by the compiler
• Each core delivers a 32-bit result per clock
• Thread Processor writes 5 results per clock
SIMD Engines
Each SIMD Unit includes:
• 16 Thread Processors (80 shader cores) + 32KB Local Data Share
• Its own Thread Sequencer, which operates a shared set of threads
• A dedicated fetch unit with an 8KB L1 cache
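The per-unit figures above can be tied to the headline numbers quoted elsewhere in the deck (1600 ALUs, 2720 SP GFlops, 544 DP GFlops, 188 W maximum board power, 14.47 GFLOPS/W). The following is a rough sketch of that arithmetic, not AMD's own derivation. It assumes an 850 MHz engine clock, one multiply-add (2 flops) per stream core per clock, and double precision at one fifth of the single-precision rate; none of those three values appears on the slides.

```python
# Peak-throughput arithmetic for the Radeon HD 5870, as a consistency check.
# Taken from the slides: 16 thread processors per SIMD, 5 stream cores per
# thread processor, 1600 ALUs in total, 188 W maximum board power.
# Assumptions (NOT on the slides): 850 MHz engine clock, one multiply-add
# (2 flops) per stream core per clock, DP at 1/5 of the SP rate.

THREAD_PROCESSORS_PER_SIMD = 16
STREAM_CORES_PER_THREAD_PROCESSOR = 5
TOTAL_ALUS = 1600
MAX_BOARD_POWER_W = 188

ENGINE_CLOCK_MHZ = 850           # assumption
FLOPS_PER_CORE_PER_CLOCK = 2     # assumption: one multiply-add per clock
SP_TO_DP_DIVISOR = 5             # assumption

cores_per_simd = THREAD_PROCESSORS_PER_SIMD * STREAM_CORES_PER_THREAD_PROCESSOR
simd_engines = TOTAL_ALUS // cores_per_simd

# MHz * flops per clock * cores gives MFLOPS; divide by 1000 for GFLOPS.
sp_gflops = TOTAL_ALUS * FLOPS_PER_CORE_PER_CLOCK * ENGINE_CLOCK_MHZ / 1000
dp_gflops = sp_gflops / SP_TO_DP_DIVISOR
gflops_per_watt = sp_gflops / MAX_BOARD_POWER_W

print(f"shader cores per SIMD: {cores_per_simd}")
print(f"implied SIMD engines:  {simd_engines}")
print(f"peak SP GFLOPS:        {sp_gflops:.0f}")
print(f"peak DP GFLOPS:        {dp_gflops:.0f}")
print(f"SP GFLOPS per watt:    {gflops_per_watt:.2f}")
```

Under these assumptions the sketch reproduces the 80 shader cores per SIMD stated above and lands on the deck's 2720, 544 and 14.47 figures, which suggests the efficiency number is peak single-precision throughput divided by maximum board power.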
TeraScale 2 Architecture – Radeon HD 5870

Memory Hierarchy
• Distributed memory controller
• Optimized for latency hiding and memory access efficiency
• GDDR5 memory at 150 GB/s
• Up to 272 billion 32-bit fetches/second
• Up to 1 TB/s L1 texture fetch bandwidth
• Up to 435 GB/s between L1 & L2
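As a quick sanity check on the memory hierarchy figures, the fetch rate can be converted to bytes per second and compared with the other levels. This is only a sketch: it assumes the "1 TB/sec" figure is the 272 billion 32-bit fetches expressed as bandwidth, which the slide does not state explicitly.

```python
# Convert the quoted fetch rate to bandwidth and compare hierarchy levels.
# All three input figures come from the slide; treating the fetch rate and
# the "1 TB/sec" figure as the same quantity is an assumption.

FETCHES_PER_SECOND = 272e9      # 32-bit fetches per second
BYTES_PER_FETCH = 4             # 32 bits
L1_L2_GB_PER_S = 435
GDDR5_GB_PER_S = 150

l1_fetch_gb_per_s = FETCHES_PER_SECOND * BYTES_PER_FETCH / 1e9

print(f"L1 fetch bandwidth: {l1_fetch_gb_per_s:.0f} GB/s")
print(f"L1 fetch vs L1-L2:  {l1_fetch_gb_per_s / L1_L2_GB_PER_S:.2f}x")
print(f"L1-L2 vs GDDR5:     {L1_L2_GB_PER_S / GDDR5_GB_PER_S:.2f}x")
```

The result is 1088 GB/s, which is in line with the "up to 1 TB/sec" bullet, and each level nearer the shader cores offers a few times the bandwidth of the level behind it.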
Comparative Stats on ATI Radeon HD 5870 GPU

                      AMD Opteron™    ATI Radeon™    ATI Radeon™     One Year
                      Model 2435      HD 4870        HD 5870         Difference
Die Size              346 mm²         263 mm²        334 mm²         1.27x
Transistors           904 million     956 million    2.15 billion    2.25x
Memory Bandwidth      12.8 GB/s       115 GB/s       153 GB/s        1.33x
SP GFlops             124.8           1200           2720            2.25x
DP GFlops             62.4            240            544             2.25x
ALUs                  54              800            1600            2x
Board Power* (Idle)   15.5 W          90 W           27 W            0.3x
Board Power* (Max)    115 W           160 W          188 W           1.17x

* Based on internal AMD testing

Yesterday’s Chip Designs Won’t Do
• 110 million transistors @150nm: 2D and 3D gaming, nascent video processing
• 105 million transistors @130nm: compute tasks including video decode
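The "One Year Difference" column of the comparative stats table (Radeon HD 4870 to HD 5870) can be recomputed from the two Radeon columns. The sketch below does only that; the dictionary keys are illustrative names chosen here, and the values are copied from the table.

```python
# Recompute the HD 4870 -> HD 5870 ratios from the comparative stats table.
# Keys are illustrative names; values are the table entries.

hd4870 = {
    "die_mm2": 263, "transistors_millions": 956, "mem_bw_gb_s": 115,
    "sp_gflops": 1200, "dp_gflops": 240, "alus": 800,
    "idle_w": 90, "max_w": 160,
}
hd5870 = {
    "die_mm2": 334, "transistors_millions": 2150, "mem_bw_gb_s": 153,
    "sp_gflops": 2720, "dp_gflops": 544, "alus": 1600,
    "idle_w": 27, "max_w": 188,
}

# Ratio of the newer part to the older part for every row.
ratios = {key: hd5870[key] / hd4870[key] for key in hd4870}

for key, ratio in ratios.items():
    print(f"{key:22s} {ratio:.2f}x")
```

Most rows match the printed column after rounding. The recomputed SP and DP GFlops ratios come out closer to 2.27x than the printed 2.25x, so the slide may have rounded those entries or reused the transistor ratio.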
Today We Are Evolving
• 758 million transistors @45nm: multi-tasking, most compute tasks
• 2.15 billion transistors @40nm: 3D OS, multi-panel HD gaming, full HD video and audio

Tomorrow Will Amaze
• ~1 billion transistors @32nm in one design
• APU: Fusion of CPU & GPU compute power within one processor
• High-bandwidth I/O
• Significantly enhances active/resting battery life
AMD Fusion™ APUs Fill the Need

x86 CPU owns the Software World
• Windows, MacOS and Linux franchises
• Thousands of apps
• Established programming and memory model
• Mature tool chain
• Extensive backward compatibility for applications and OSs
• High barrier to entry

GPU Optimized for Modern Workloads
• Enormous parallel computing capacity
• Outstanding performance-per-watt-per-dollar
• Very efficient hardware threading
• SIMD architecture well matched to modern workloads: video, audio, graphics

Fusion APUs: Putting it all together
[Chart: Microprocessor Advancement and GPU Advancement plotted against Programmer Accessibility (Unacceptable, Experts Only, Mainstream). Labels: Graphics Driver-based programs; OCL/DC Driver-based programs; System-level Programmable; High Performance Task Parallel Execution; Power-efficient Data Parallel Execution]
PC with Discrete GPU

Fusion APU Based PC

Two x86 Cores Tuned for Target Markets
• “Bulldozer”
• “Bobcat”

Heterogeneous Computing: Next-Generation Software Ecosystem
• Increase ease of application development
• Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages
• Drive new features into industry standards
Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific: cross-platform limiters
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface

Vendor neutral: cross-platform enablers

OpenCL™ and DirectX® 11 DirectCompute: How will developers choose?

DirectX® 11 DirectCompute
• Easiest path to add compute capabilities to existing DirectX applications
• Windows Vista® and Windows® 7 only

OpenCL™
• Ideal path for new applications porting to the GPU for the first time
• True multiplatform: Windows®, Linux®, MacOS
• Natural programming without dealing with a graphics API
The Benefits of Fusion
• Unparalleled processing capabilities in mobile form factors
• Shared memory for the CPU and GPU
  • Eliminates copies, increasing performance
  • Reduces dispatch overhead
  • Lower latency from the GPU to memory
  • Power efficient design
• Enables architectural innovations between CPU, GPU and the memory system
• Scalable architecture that can target a broad range of platforms from mobile to data center

The Fusion Opportunity
• A new architectural and performance balance point for computing
• A new machine target for research
• A high volume opportunity for new algorithms, new workloads and new applications
• The deployment opportunity is especially strong in the consumer marketplace
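The "eliminates copies" point under The Benefits of Fusion can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not a measurement: the function name `offload_seconds`, the 8 GB/s link bandwidth, the 1 GB payload and the 50 ms kernel time are all hypothetical values chosen for illustration, and the model ignores dispatch overhead and any difference in memory bandwidth between the two designs.

```python
# Toy model: time for one GPU offload with and without a CPU<->GPU copy.
# Every number below is hypothetical and chosen only for illustration.

def offload_seconds(bytes_moved, kernel_seconds, link_gb_per_s=None):
    """Estimate wall time for one offload.

    link_gb_per_s is the bandwidth of the CPU-to-GPU link used for copies.
    Passing None models shared memory, where no copy is needed.
    """
    if link_gb_per_s is None:
        return kernel_seconds
    copy_seconds = bytes_moved / (link_gb_per_s * 1e9)
    return kernel_seconds + copy_seconds

BYTES_MOVED = 1e9                    # hypothetical: 1 GB in and out combined
KERNEL_SECONDS = 0.050               # hypothetical: 50 ms of GPU compute
HYPOTHETICAL_LINK_GB_PER_S = 8.0     # hypothetical discrete-GPU link

discrete = offload_seconds(BYTES_MOVED, KERNEL_SECONDS, HYPOTHETICAL_LINK_GB_PER_S)
shared = offload_seconds(BYTES_MOVED, KERNEL_SECONDS)

print(f"with copies:   {discrete * 1e3:.0f} ms")
print(f"shared memory: {shared * 1e3:.0f} ms")
```

With these made-up inputs the copy accounts for most of the offload time, which is the situation in which removing copies pays off most. For kernels that run much longer than their transfers, the benefit modeled here shrinks accordingly.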