Homogeneous multiprocessing for the masses
Paul Stravers, Philips Research
Agenda

PART 1  Embedded multiprocessors
• Pollack’s observation
• Power efficient processors
• Heterogeneous vs. homogeneous multiprocessors

PART 2  The shifting balance
• Technology scaling
• Efficiency gap is narrowing
• Platform success depends on flexibility
• Predictable system design
Pollack’s observation

Pollack observed that the next processor architecture generation has:
• Area: 2…3x the equivalent gates
• But only 1.4…1.7x more performance

The situation is getting worse over time… or is it?
[Chart: normalized area increase vs. performance increase per Intel process generation (1.5 µm down to 90 nm), with Moore’s Law shown for reference; an arrow marks where Pollack made his observation. Source: Intel keynote PACT’00]
Power efficiency (1)

How to achieve high processor clock frequencies?
• Use strong buffers to quickly charge and discharge the wire capacitance
• But the buffers bring their own parasitic capacitance to the circuit, often much more than the wires!
• Typical: 2x faster clock ⇒ 4x more Joules per clock!
Power efficiency (2)

The solution is parallelism: do multiple independent computations per clock cycle
• Pipelining (function parallelism)
• Instruction level parallelism (ILP) or data parallelism (SIMD)

[Diagrams: pipeline stages separated by registers; parallel issue slots]
Power efficiency (3)

Compare two industrial examples:
• Intel Pentium-4 (Northwood) in 0.13 micron technology
  – 3.0 GHz
  – 20 pipeline stages
  – Aggressive buffering to boost clock frequency
  – 13 nano Joule / instruction
• Philips Trimedia “Lite” in 0.13 micron technology
  – 250 MHz
  – 8 pipeline stages
  – Relaxed buffering, focus on instruction parallelism
  – 0.2 nano Joule / instruction
• Trimedia is doing 65x better than Pentium
Power efficiency (4)

Who cares about power efficiency, anyway? We do! …Because:
• Cost of electricity: the electricity bill of a modern PC is about 50 euro per year. Really!
• A chip package that can handle a hot processor chip costs up to 75 euro. Just the package!
• A fan (for air cooling) is expensive and noisy.
• The consumer appliance box gets ugly and expensive.

Ultimately, heat dissipation limits what we can compute.
Parallelism

• Processor innovation has been driven for 40 years by fine grain parallelism:
  – Pipelining
  – Instruction level parallelism (ILP)
  – Single instruction, multiple data (SIMD)
• We are now confronted with diminishing returns (Pollack)
• The next big step is coarse grain parallelism
  – Multiprocessing
  – Hyperthreading (not discussed here)
• The key to parallelism is finding independent operations
• Coarse grain parallelism requires user awareness, i.e. the programmer must find the independent functions!
Multiprocessors (1)

Replace a single big, hot processor with multiple small, efficient processors. Each computes part of the job:

Task parallelism:
  Processor-1: decode video → Processor-2: improve picture

Data parallelism:
  Processor-1: decode video → improve picture (upper half of picture)
  Processor-2: decode video → improve picture (bottom half of picture)
Multiprocessors (2)

Two of the most important architecture characteristics are:
• Synchronization: tasks must wait on each other. For example:
  – only read new data after it has actually been produced by the upstream task (task parallelism)
  – only proceed with the next picture when all tasks are done with the current picture (data parallelism)
• Communication: move data between tasks. Two basic models:
  – Message passing: explicit action by sender & receiver
  – Shared memory: implicit access by all tasks to all data
Heterogeneous vs. homogeneous multiprocessors (1)

Another important characteristic is the mix of processors.

Homogeneous: very flexible, moderate design effort, moderate computational efficiency
[Diagram: an array of identical TM cores]

Heterogeneous: less flexible, large design effort, high computational efficiency
[Diagram: a TM core next to MIPS, audio DSP, mpeg dec, and picture improve blocks]
Heterogeneous vs. homogeneous multiprocessors (2)

The following characteristics play well together:
• Homogeneous ⇔ data parallelism ⇔ shared memory ⇔ dynamic task mapping
• Heterogeneous ⇔ task parallelism ⇔ message passing ⇔ static task mapping

Historically, embedded multiprocessors favor the heterogeneous model; server multiprocessors favor the homogeneous model.
Example: Viper2

• Heterogeneous
• Platform based
• >60 different cores
• Task parallelism
• Sync with interrupts
• Streaming communication
• Semi-static application graph
• 50 M transistors
• 120nm technology
• Powerful, efficient

[Block diagram: MIPS PR4450 and TM3260 cores with MBS, VMPG, TDCS, VIP, MSP, QVCP5L, MDCS, and QVCP2L accelerators]
[Detailed Viper2 block diagram: MIPS PR4450 and two TM32 cores, memory controller, DCS/PMA interconnect and bridges, and dozens of peripherals and streaming accelerators (PCI/XIO, USB, UARTs, IIC, SMC, EJTAG, GPIO, VMPG, VLD, DVDD, EDMA, QVCP, MBS, MSP, QTNR, VIP, VPK, TSDMA, SPDIO, AIO, DENC, etc.)]
Example: Philips Wasabi

• Homogeneous multiprocessor for media applications
• First 65 nm silicon expected 1st half 2006
• Two-level communication hierarchy
  – Top: scalable message passing network plus tiles
  – Tile: shared memory plus processors, accelerators
• Fully cache coherent to support data parallelism

[Tile diagram: TM cores and an ARM around a shared memory, plus pixel simd, video scale, and picture improve accelerators]
Static task graph partitioning

[Diagram: task graph P1…P6 split into {P1, P2, P3} on Tile-1 and {P4, P5, P6} on Tile-2; each tile stacks Kahn API → run-time scheduler → Wasabi tile hardware, and the two tiles are connected by a link]
Limited speedup with task parallel mpeg2 decoder

[Chart: speedup (0–7) as a function of # tiles (1, 2, 4) and # CPUs per tile (1, 2, 4)]
Data parallelism (1)

• We would like to do better than 5x speedup
• MPEG2: a 1920x1088 picture contains at least 68 slices
• Each slice can be decoded independently from the other slices
• Original code in function decode_picture():

    while (next_slice_start_code())
        decode_slice();

• New code:

    while (next_slice_start_code()) {
        context[i] = copy_context();
        new_task(decode_slice, &context[i]);
        i = i + 1;
    }
    wait_for_all_unfinished_tasks();
Data parallelism (2)

• Create as many tasks as there are slices
• Run-time scheduler maps tasks to available processors
• Task surplus is good for load balancing!

[Schedule diagram: the main task running the while loop spawns slice tasks; once ready, the main task uses the 8th processor]
Data parallelism (3)

[Chart: speedup vs. number of trimedias; levels off beyond 10 trimedias]
Data parallelism (4)

Analysis:
• The main task (executing the while loop) cannot find new slices fast enough to keep >10 trimedias busy
• The underlying problem is a very inefficient implementation of next_slice_start_code()
• With little effort we fixed the problem, resulting in near-linear speedup for 20 trimedias and up
• Surprisingly, only 1 man-week was invested in modifying the original sequential code for data parallelism…
Multiprocessors summary

• Pollack’s Rule calls for coarse grain parallelism
• Programmer must be aware of coarse grain parallelism!
• Multiprocessors can have ~2 orders better power efficiency
• Heat dissipation limits what we can compute
• Data parallelism scales better than task parallelism
  – Requires a homogeneous multiprocessor
  – Easier to program when a sequential program is available (e.g. 1 man-week vs. 1 man-year for a Kahn task graph)
• Heterogeneous architectures are even more power efficient but are less flexible and offer no data parallelism
Agenda

PART 2  The shifting balance
• Technology scaling
• Efficiency gap is narrowing
• Platform success depends on flexibility
• Predictable system design
Effect of technology scaling on architecture

In the next few slides we show that technology scaling shifts the balance in favor of homogeneous, software-centered architectures:
• Memory increasingly dominates silicon cost (on-chip memory for bandwidth, Straverius’ First Law for capacity)
  – Smaller weight to the area penalty of a homogeneous processor array
• A platform with a large user community is needed to recover non-recurrent silicon costs and reduce application development costs
  – A flexible architecture increases the application range
Memory speed trend [Hennessy & Patterson, 1996]

[Chart, log-scale speed, 1980–2000: CPU performance grows 60%/yr while DRAM grows 7%/yr; the processor-memory performance gap grows ~50% / year]
Implications (1)

CPI ≈ cache_miss_rate × mem_latency
cache_miss_rate ≈ 1 / sqrt(cache_size)

Keeping CPI constant:
⇒ cache_size ∝ (CPU_memory_gap)²
⇒ cache_size grows (1.5)² ≈ 2.2x / yr

Viper-1: 72 kB → Viper-2: 192 kB
Embedded DRAM speed

• Embedded DRAM frequency scales much better
• Straverius’ Law takes over:
  – a new technology node computes S times faster
  – needs an S times bigger cache and memory
  – results in a 1.1x / yr cache size increase

[Diagram: with each new process generation, the faster CPU needs more memory to stay balanced, so memory takes a growing share of the die]
Memory bandwidth trend

[Chart, ’01–’06: DDR bandwidth grows 20% / year; e.g. DDR366 at 2.9 GB/s, then 4 GB/s, then 4.8 GB/s. Source: JEDEC/AMI2, Intel, Samsung]
Memory bandwidth trend

• Compute speed and bandwidth demand: 60% / year
• DDR, DDR2 bandwidth: 20% / year

⇒ Bandwidth mismatch increasing 35% / yr
Efficiency gap is narrowing

• Memory footprint is predominantly determined by the application, much less by the architecture
• Technology scaling gradually replaces the intrinsic computation component with on-chip buffers needed for bandwidth

[Charts: log area_efficiency over time for ASIC, DSP, and CPU. Intrinsic efficiency shows a constant gap, closed only by architecture innovation; system efficiency shows a narrowing gap.]
What is a platform (1)

A platform is a collection of assets that are shared by a set of products. These assets can be divided into four categories:
• Components
• Processes
• Knowledge
• People and relationships

A successful platform needs all four ingredients!

Source: Robertson and Ulrich
What is a platform (2)

• The platform approach contrasts with product-specific development:
  – define and study the product requirements
  – then develop optimal technology for it
• Such dedicated products are potentially of higher quality than their platform counterparts
• But in practice, one-time development endeavors often result in product flaws due to lack of experience
• Platforms, in contrast, can be improved through feedback from earlier products
Platform benefits

• Greater ability to tailor products
• Reduced development cost and time
• Reduced manufacturing cost
• Reduced production investment
• Reduced system complexity
• Lower risk
• Improved service

Source: Robertson and Ulrich
Platform risks (1)

[Diagram comparing risk: with application-specific development, each product (A, B) carries its own risk; with platform-based development, the platform itself carries risk and each derived product (A, B, …, X) carries risk as well]
Platform risks (2)

• Platform benefits only materialize if all ingredients are present
  – components, processes, knowledge, community
• A lot of effort (= money) goes into platform building
  – components, processes, knowledge
• Pay-back only occurs when enough products are derived
• This requires a large enough developer community
• Typically this means that company-internal platforms do not provide the expected benefits
⇒ Must share the platform with customers, even competitors!
A stack of platforms in consumer media products

[Diagram, top to bottom:
  Java · Universal Home API · UPnP
  Linux
  Nexperia API · TTL streaming
  System-on-Chip architecture platform
  Digital design reuse platform
  Semiconductor process platform]
Putting it together

• Homogeneous systems are inherently more flexible than heterogeneous systems
  – address a broader application range
  – attract a larger user community
• Strong need for a successful SoC platform
  – Non-recurrent costs keep rising
  – Product innovation rate keeps rising
• Efficiency gap is narrowing

⇒ Future platforms must focus on homogeneous architecture
Predictable system design

• Heterogeneous systems have more predictable behavior than homogeneous systems
  – Static task graph mapping
  – Dedicated streaming hardware functions
  – In general: less resource sharing
• Homogeneous embedded systems call for increased focus on shared resource virtualisation
  – Shared L2 caches: footprint and bandwidth
  – Bursty, latency critical memory traffic
Concluding remarks and hints for future research

• Homogeneous multiprocessors provide:
  – A strong base for a successful media platform
  – Increasingly efficient processing power
  – Performance scaling, from low cost to high performance
  – Technology scaling (tiling), complementing Moore’s Law
  – A high degree of software reuse (stable SoC architecture)
  – Very high silicon yield (redundant SoC architecture)
• More research needed on composable resource sharing
• More research needed on a structured approach to shared memory media processing