Power Efficient Processor Design and the Cell Processor


Author: Marjory Ellis

Systems and Technology Group

Power Efficient Processor Design and the Cell Processor

H. Peter Hofstee, Ph.D.
Architect, Cell Synergistic Processor Element
IBM Systems and Technology Group
Austin, Texas

© 2005 IBM Corporation

Agenda

Power Efficient Processor Architecture
System Trends
Cell Processor Overview


Power Efficient Architecture


Limiters to Processor Performance

Power wall
Memory wall
Frequency wall

Power Wall (Voltage Wall)

Power components:
– Active power
– Passive power
  • Gate leakage
  • Sub-threshold leakage (source-drain leakage)

[Figure: power density (W/cm²) versus gate length (microns), 1994 to 2004, with curves for active power and passive power; annotation: "10S Tox=11A Gate Stack"]

NET: INCREASING PERFORMANCE REQUIRES INCREASING EFFICIENCY
Gate dielectric approaching a fundamental limit (a few atomic layers)
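The coupling between power and voltage can be illustrated with the standard first-order model of CMOS switching power, P = a·C·V²·f. The slide does not state this formula; the sketch below assumes it, and every numeric value (activity factor, capacitance, voltage, frequency) is a hypothetical placeholder chosen only to show the scaling. It also assumes, as an idealization, that supply voltage can be lowered in proportion to frequency.

```python
def active_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS switching power in watts: a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical baseline values, not taken from the slides.
ACTIVITY = 0.1
CAPACITANCE_F = 1e-7
VOLTAGE_V = 1.2
FREQUENCY_HZ = 4e9

baseline = active_power(ACTIVITY, CAPACITANCE_F, VOLTAGE_V, FREQUENCY_HZ)

# Idealized assumption: voltage scales down together with frequency.
scale = 0.8
scaled = active_power(ACTIVITY, CAPACITANCE_F,
                      VOLTAGE_V * scale, FREQUENCY_HZ * scale)

ratio = scaled / baseline
print(f"baseline: {baseline:.1f} W")
print(f"scaled:   {scaled:.1f} W ({ratio:.3f} of baseline)")
```

Under these assumptions power falls with the cube of the scale factor, so a 20% slower clock needs roughly half the active power. This is one way to read the slide's conclusion that further performance has to come from efficiency.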

Memory wall

Main memory now nearly 1000 cycles from the processor
– Situation worse with (on-chip) SMP
Memory latency penalties drive inefficiency in the design
– Expensive and sophisticated hardware to try and deal with it
– Programmers that try to gain control of cache content, but are hindered by the hardware mechanisms
Latency induced bandwidth limitations
– Much of the bandwidth to memory in systems can only be used speculatively
– Diminishing returns from added bandwidth on traditional systems
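The latency-induced bandwidth limit can be reasoned about with Little's law: sustained bandwidth is at most (requests in flight × bytes per request) / latency. The sketch below is illustrative only. The 1000-cycle latency comes from the slide, while the 128-byte request size, the in-flight counts, and the function names are assumptions made for the example.

```python
def sustained_bytes_per_cycle(in_flight, bytes_per_request, latency_cycles):
    """Upper bound on memory bandwidth from Little's law."""
    return in_flight * bytes_per_request / latency_cycles


def in_flight_needed(target_bytes_per_cycle, bytes_per_request, latency_cycles):
    """Concurrent requests needed to sustain a target bandwidth."""
    return target_bytes_per_cycle * latency_cycles / bytes_per_request


LATENCY_CYCLES = 1000   # "nearly 1000 cycles" per the slide
REQUEST_BYTES = 128     # assumed transfer size, not from the slide

for in_flight in (1, 8, 128):
    bw = sustained_bytes_per_cycle(in_flight, REQUEST_BYTES, LATENCY_CYCLES)
    print(f"{in_flight:4d} requests in flight -> {bw:.3f} bytes/cycle")

needed = in_flight_needed(8, REQUEST_BYTES, LATENCY_CYCLES)
print(f"{needed} requests in flight to sustain 8 bytes/cycle")
```

With a single outstanding request the processor can use only a small fraction of a byte per cycle no matter how wide the memory interface is. That is why adding raw bandwidth alone gives diminishing returns unless many requests can be kept in flight.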

Frequency wall

Increasing frequencies and deeper pipelines have reached diminishing returns on performance
Returns negative if power is taken into account
Results of studies depend on issue width of processor
– The wider the processor the slower it wants to be
– Simultaneous Multithreading helps to use issue slots efficiently
Results depend on number of architected registers and workload
– More registers tolerate deeper pipelines
– Fewer random branches in the application tolerate deeper pipelines

Microprocessor Efficiency

Recent History:
– Gelsinger’s law
  • 1.4x more performance for 2x more transistors
– Hofstee’s corollary
  • 1/1.4x efficiency loss in every generation
  • Examples: Cache size, OoO, Superscalar, etc.
Re-examine microarchitecture with performance per transistor as metric
– Pipelining is last clear win
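The corollary follows from dividing the two factors in Gelsinger's law: 1.4x the performance on 2x the transistors leaves 1.4 / 2 = 0.7x the performance per transistor, which is close to 1/1.4 because 1.4 is roughly the square root of 2. A minimal sketch of how that compounds, using only the two factors from the slide (the constant and function names are illustrative):

```python
PERF_GAIN = 1.4         # performance multiplier per generation (from the slide)
TRANSISTOR_GAIN = 2.0   # transistor multiplier per generation (from the slide)


def relative_efficiency(generations):
    """Performance per transistor relative to generation 0."""
    return (PERF_GAIN / TRANSISTOR_GAIN) ** generations


for gen in range(5):
    perf = PERF_GAIN ** gen
    transistors = TRANSISTOR_GAIN ** gen
    print(f"gen {gen}: perf x{perf:.2f}, transistors x{transistors:.0f}, "
          f"perf/transistor x{relative_efficiency(gen):.3f}")
```

After four such generations the design delivers less than a quarter of its original performance per transistor, which motivates re-examining the microarchitecture with that metric.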

Attacking the Performance Walls

Multi-Core Non-Homogeneous Architecture
– Control Plane vs. Data Plane processors
– Attacks Power Wall
3-level Model of Memory
– Main Memory, Local Store, Registers
– Attacks Memory Wall
Large Shared Register File & SW Controlled Branching
– Allows deeper pipelines (11FO4 … helps power!)
– Attacks Frequency Wall


System Trends


System Trends toward Integration

[Figure: block diagram with labels Processor, Northbridge, Southbridge, Memory, Accel, IO, and Cell Processor]

Increased integration is driving processors to take on many functions typically associated with systems
– Integration forces processor developers to address offload and acceleration in the design of the processor
– Integration of bridge chip functionality
Virtualization technology is used to support non-homogeneous environments

Next Generation Processors address Programming Complexity and Trend Towards Programmable Offload Engines with a Simpler System Alternative

[Figure: three system alternatives.
Hardwired Function: CPU, GPU, NIC, Security, Media, …
Programmable ASIC: CPU, Streaming Graphics Processor, Network Processor, Security Processor, Media Processor, …
Cell: 64b Power Processor, Synergistic Processor (repeated), Mem. Contr., Config. IO]

“Outward Facing” Aspects of Cell

Cell is designed to be responsive
.. to human user
– Real-time response
– Supports rich visual interfaces
.. to network
– Flexible, can support new standards
– High-bandwidth
– Content protection, privacy & security
Contrast to traditional processors which evolved from “batch processing” mentality (inward focused).


Cell Overview


Key Attributes of Cell

Cell is Multi-Core
– Contains 64-bit Power Architecture™
– Contains 8 Synergistic Processor Elements (SPE)
Cell is a Flexible Architecture
– Multi-OS support (including Linux) with Virtualization technology
– Path for OS, legacy apps, and software development
Cell is a Broadband Architecture
– SPE is RISC architecture with SIMD organization and Local Store
– 128+ concurrent transactions to memory per processor
Cell is a Real-Time Architecture
– Resource allocation (for Bandwidth Measurement)
– Locking Caches (via Replacement Management Tables)
Cell is a Security Enabled Architecture
– SPE dynamically reconfigurable as secure processors

Cell Chip Block Diagram

[Figure: eight SPEs, each containing an SPU (SXU and LS) and an SMF, attached to the EIB (up to 96 Bytes/cycle); also attached to the EIB are the PPE (PXU, L1, L2), the MIC (Dual XDR™), and the BIC (FlexIO™)]

Cell Prototype Die (Pham et al, ISSCC 2005)

[Figure: die photo with labeled regions: eight SPUs, PPU, MIC, MIB, BIC, and RRAC]

Cell Highlights

Observed clock speed: > 4 GHz
Peak performance (single precision): > 256 GFlops
Peak performance (double precision): > 26 GFlops
Area: 221 mm²
Technology: 90nm SOI
Total # of transistors: 234M
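The single precision figure is consistent with a simple product of core count, SIMD width, and clock. The sketch below reproduces it under one assumption that is not on the slide: each SPE completes one fused multiply-add per 32-bit lane per cycle, counted as 2 flops. The SPE count and 128-bit register width come from other slides in this deck; the function and constant names are illustrative.

```python
def peak_gflops(cores, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak throughput in GFlops for identical SIMD cores."""
    return cores * simd_lanes * flops_per_lane_per_cycle * clock_ghz


SPE_COUNT = 8                   # "Contains 8 Synergistic Processor Elements"
SP_LANES = 128 // 32            # 32-bit floats per 128-bit register
FLOPS_PER_LANE_PER_CYCLE = 2    # assumption: one fused multiply-add per cycle
CLOCK_GHZ = 4.0                 # "Observed clock speed > 4 GHz"

sp_peak = peak_gflops(SPE_COUNT, SP_LANES, FLOPS_PER_LANE_PER_CYCLE, CLOCK_GHZ)
print(f"SPE single precision peak at {CLOCK_GHZ} GHz: {sp_peak:.0f} GFlops")
```

The same product does not reproduce the double precision figure, because double precision throughput on this design is not one full-width operation per cycle; the slides do not give enough detail to model it here.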

Element Interconnect Bus

EIB data ring for internal communication
– Four 16 byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests

Power Processor Element

PPE handles operating system and control tasks
– 64-bit Power Architecture™ with VMX
– In-order, 2-way hardware Multi-threading
– Coherent Load/Store with 32KB I & D L1 and 512KB L2

Synergistic Processor Element

SPE provides computational performance
– Dual issue, up to 16-way 128-bit SIMD
– Dedicated resources: 128 128-bit RF, 256KB Local Store
– Each can be dynamically configured to protect resources
– Dedicated DMA engine: Up to 16 outstanding requests
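The "up to 16-way" figure and the storage sizes follow directly from the 128-bit register width. A small sketch of that arithmetic, using only numbers from the slide (the helper name `simd_lanes` and the choice of element widths are illustrative):

```python
REGISTER_BITS = 128              # width of each SIMD register
REGISTER_COUNT = 128             # entries in the register file
LOCAL_STORE_BYTES = 256 * 1024   # 256KB Local Store


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return REGISTER_BITS // element_bits


for bits in (8, 16, 32, 64):
    print(f"{bits:2d}-bit elements: {simd_lanes(bits):2d}-way SIMD")

register_file_bytes = REGISTER_COUNT * REGISTER_BITS // 8
local_store_quadwords = LOCAL_STORE_BYTES // (REGISTER_BITS // 8)
print(f"register file: {register_file_bytes} bytes")
print(f"local store:   {local_store_quadwords} 128-bit quadwords")
```

The 16-way case corresponds to 8-bit elements; 32-bit floats give 4-way SIMD. The register file holds 2 KB, small next to the 256KB Local Store that backs it.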

SPE Highlights

User-mode architecture
– No translation/protection within SPU
– DMA is full Power Arch protect/x-late
Direct programmer control
– DMA/DMA-list
– Branch hint
VMX-like SIMD dataflow
– Broad set of operations
– Graphics SP-Float
– IEEE DP-Float (BlueGene-like)
Unified register file
– 128 entry x 128 bit
256kB Local Store
– Combined I & D
– 16B/cycle L/S bandwidth
– 128B/cycle DMA bandwidth

[Figure: SPE floorplan, 14.5mm² (90nm SOI), showing the SPU and SMF regions with labeled blocks LS, SFP, DP, FXU EVN, FXU ODD, FWD, GPR, CONTROL, CHANNEL, DMA, SMM, SBI, BEB, ATO, RTB]


SPE Organization (Flachs et al, ISSCC 2005)


SPE PIPELINE (Flachs et al, ISSCC 2005)


I/O and Memory Interfaces

I/O Provides wide bandwidth
– Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
– Two configurable interfaces (76.8GB/s @ 6.4Gbps)
– Flexible Bandwidth between interfaces
– Allows for multiple system configurations
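Both bandwidth figures pair an aggregate rate in GB/s with a per-pin signaling rate in Gbps, so they can be checked as (data pins × Gbps per pin) / 8. The slide does not give the pin counts. The sketch below assumes 64 data pins for the dual XDR controller and 96 data pins in total for the two configurable interfaces, values chosen to be consistent with the slide's totals rather than taken from it.

```python
def bandwidth_gbytes_per_s(data_pins, gbits_per_pin):
    """Aggregate bandwidth in GB/s for a given pin count and per-pin rate."""
    return data_pins * gbits_per_pin / 8


# Assumed pin counts (not stated on the slide).
XDR_DATA_PINS = 64
IO_DATA_PINS = 96

xdr_bw = bandwidth_gbytes_per_s(XDR_DATA_PINS, 3.2)
io_bw = bandwidth_gbytes_per_s(IO_DATA_PINS, 6.4)

print(f"memory interface: {xdr_bw:.1f} GB/s")
print(f"I/O interfaces:   {io_bw:.1f} GB/s")
```

"Flexible bandwidth between interfaces" would then correspond to dividing the I/O pins between the two interfaces in different proportions, while the total stays the same.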

Cell Processor Can Support Many Systems

Game console systems
Workstations (CPBW)
HDTV
Home media servers
Supercomputers

[Figure: example system configurations with one or more CELL Processors, each with XDR™ memory, connected through IOIF (IOIF0, IOIF1) and BIF interfaces and a switch (SW)]

Cell Processor Based Workstation (CPBW) (Sony Group and IBM)

First Prototype “Powered On”
16 Tera-flops in a rack (est.)
– (equals 1 Peta-flop in 64 racks)
Optimized for Digital Content Creation, including
  • Computer entertainment
  • Movies
  • Real-time rendering
  • Physics simulation

[Figure: CELL Processor Board (CELL Processor (2-Way SMP), Memory, I/O Bridge) and a 16 TFlop rack with High BW Sys Storage, Networks, and Mgmt]
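The rack-level estimate is straightforward arithmetic. The sketch below checks the 64-rack figure and, assuming the 16 Tera-flops refers to single precision peak at the 256 GFlops per processor quoted on the Cell Highlights slide, estimates how many processors one rack would need. That assumption and the variable names are illustrative; the slides do not state the processor count per rack.

```python
RACK_TFLOPS = 16       # "16 Tera-flops in a rack (est.)"
RACKS = 64             # "1 Peta-flop in 64 racks"
CHIP_GFLOPS = 256      # assumed: single precision peak per processor

total_pflops = RACK_TFLOPS * RACKS / 1000
chips_per_rack = RACK_TFLOPS * 1000 / CHIP_GFLOPS

print(f"{RACKS} racks: {total_pflops:.3f} Peta-flops")
print(f"processors per rack at {CHIP_GFLOPS} GFlops each: {chips_per_rack:.1f}")
```

Under that assumption a rack would hold on the order of 64 processors, or about 32 of the 2-way SMP boards shown in the figure.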

Cell Processor Example Application Areas

Cell is a processor that excels at processing of rich media content in the context of broad connectivity
– Digital content creation (games and movies)
– Game playing and game serving
– Distribution of (dynamic, media rich) content
– Imaging and image processing
– Image analysis (e.g. video surveillance)
– Next-generation physics-based visualization
– Video conferencing (3D?)
– Streaming applications (codecs etc.)
– Physical simulation & science

Summary

Cell ushers in a new era of leading edge processors optimized for digital media and entertainment
Desire for realism is driving a convergence between supercomputing and entertainment
New levels of performance and power efficiency beyond what is achieved by PC processors
Responsiveness to the human user and the network are key drivers for Cell
Cell will enable entirely new classes of applications, even beyond those we contemplate today

Acknowledgements

Cell is the result of a deep partnership between SCEI/Sony, Toshiba, and IBM
Cell represents the work of more than 400 people starting in 2001
More detailed papers on the Cell implementation and the SPE micro-architecture can be found in the ISSCC 2005 proceedings