Homogeneous multiprocessing for the masses
Paul Stravers, Philips Research
Agenda

PART 1  Embedded multiprocessors
• Pollack’s observation
• Power efficient processors
• Heterogeneous vs. homogeneous multiprocessors

PART 2  The shifting balance
• Technology scaling
• Efficiency gap is narrowing
• Platform success depends on flexibility
• Predictable system design
Pollack’s observation

Pollack observed that the next processor architecture generation has:
• Area: 2…3x the equivalent gates
• But only 1.4…1.7x more performance

The situation is getting worse over time… or is it?
[Chart: normalized area increase vs. performance increase per Intel process generation (1.5 µm down to 90 nm), with Moore’s Law shown for reference; an arrow marks where Pollack made his observation. Source: Intel keynote PACT’00]
Power efficiency (1)

How to achieve high processor clock frequencies?
• Use strong buffers to quickly charge and discharge the wire capacitance
• But the buffers bring their own parasitic capacitance to the circuit, often much more than the wires!
• Typical: 2x faster clock ⇒ 4x more Joules per clock!
Power efficiency (2)

The solution is parallelism: do multiple independent computations per clock cycle
• Pipelining (function parallelism)
• Instruction level parallelism (ILP) or data parallelism (SIMD)

[Diagrams: pipeline stages separated by registers; parallel issue slots]
Power efficiency (3)

Compare two industrial examples:
• Intel Pentium-4 (Northwood) in 0.13 micron technology
  – 3.0 GHz
  – 20 pipeline stages
  – Aggressive buffering to boost clock frequency
  – 13 nano Joule / instruction
• Philips Trimedia “Lite” in 0.13 micron technology
  – 250 MHz
  – 8 pipeline stages
  – Relaxed buffering, focus on instruction parallelism
  – 0.2 nano Joule / instruction
• Trimedia is doing 65x better than Pentium
Power efficiency (4)

Who cares about power efficiency, anyway? We do! …Because:
• Cost of electricity: the electricity bill of a modern PC is about 50 euro per year. Really!
• A chip package that can handle a hot processor chip costs up to 75 euro. Just the package!
• A fan (for air cooling) is expensive and noisy.
• The consumer appliance box gets ugly and expensive.

Ultimately, heat dissipation limits what we can compute.
Parallelism

• Processor innovation has been driven for 40 years by fine grain parallelism:
  – Pipelining
  – Instruction level parallelism (ILP)
  – Single instruction, multiple data (SIMD)
• We are now confronted with diminishing returns (Pollack)
• The next big step is coarse grain parallelism
  – Multiprocessing
  – Hyperthreading (not discussed here)
• The key to parallelism is finding independent operations
• Coarse grain parallelism requires user awareness, i.e. the programmer must find the independent functions!
Multiprocessors (1)

Replace a single big, hot processor with multiple small, efficient processors. Each computes part of the job:

Task parallelism:
  Processor-1: decode video → Processor-2: improve picture

Data parallelism:
  Processor-1: decode video → improve picture (upper half of picture)
  Processor-2: decode video → improve picture (bottom half of picture)
Multiprocessors (2)

Two of the most important architecture characteristics are:
• Synchronization: tasks must wait on each other. For example:
  – only read new data after it has actually been produced by the upstream task (task parallelism)
  – only proceed with the next picture when all tasks are done with the current picture (data parallelism)
• Communication: move data between tasks. Two basic models:
  – Message passing: explicit action by sender & receiver
  – Shared memory: implicit access by all tasks to all data
Heterogeneous vs. homogeneous multiprocessors (1)

Another important characteristic is the mix of processors.

Homogeneous: very flexible, moderate design effort, moderate computational efficiency
[Diagram: an array of identical TM cores]

Heterogeneous: less flexible, large design effort, high computational efficiency
[Diagram: a TM core next to MIPS, audio DSP, mpeg dec, and picture improve blocks]
Heterogeneous vs. homogeneous multiprocessors (2)

The following characteristics play well together:
• Homogeneous ⇔ data parallelism ⇔ shared memory ⇔ dynamic task mapping
• Heterogeneous ⇔ task parallelism ⇔ message passing ⇔ static task mapping

Historically, embedded multiprocessors favor the heterogeneous model; server multiprocessors favor the homogeneous model.
Example: Viper2

• Heterogeneous
• Platform based
• >60 different cores
• Task parallelism
• Sync with interrupts
• Streaming communication
• Semi-static application graph
• 50 M transistors
• 120nm technology
• Powerful, efficient

[Block diagram: MIPS PR4450 and TM3260 cores with MBS, VMPG, TDCS, VIP, MSP, QVCP5L, MDCS, and QVCP2L accelerators]
[Detailed Viper2 block diagram: MIPS PR4450 and two TM32 cores, memory controller, DCS/PMA interconnect and bridges, and dozens of peripherals and streaming accelerators (PCI/XIO, USB, UARTs, IIC, SMC, EJTAG, GPIO, VMPG, VLD, DVDD, EDMA, QVCP, MBS, MSP, QTNR, VIP, VPK, TSDMA, SPDIO, AIO, DENC, etc.)]
Example: Philips Wasabi

• Homogeneous multiprocessor for media applications
• First 65 nm silicon expected 1st half 2006
• Two-level communication hierarchy
  – Top: scalable message passing network plus tiles
  – Tile: shared memory plus processors, accelerators
• Fully cache coherent to support data parallelism

[Tile diagram: TM cores and an ARM around a shared memory, plus pixel simd, video scale, and picture improve accelerators]
Static task graph partitioning

[Diagram: task graph P1…P6 split into {P1, P2, P3} on Tile-1 and {P4, P5, P6} on Tile-2; each tile stacks Kahn API → run-time scheduler → Wasabi tile hardware, and the two tiles are connected by a link]
Limited speedup with task parallel mpeg2 decoder

[Chart: speedup (0–7) as a function of # tiles (1, 2, 4) and # CPUs per tile (1, 2, 4)]
Data parallelism (1)

• We would like to do better than 5x speedup
• MPEG2: a 1920x1088 picture contains at least 68 slices
• Each slice can be decoded independently from the other slices
• Original code in function decode_picture():

    while (next_slice_start_code())
        decode_slice();

• New code:

    while (next_slice_start_code()) {
        context[i] = copy_context();
        new_task(decode_slice, &context[i]);
        i = i + 1;
    }
    wait_for_all_unfinished_tasks();
Data parallelism (2)

• Create as many tasks as there are slices
• Run-time scheduler maps tasks to available processors
• Task surplus is good for load balancing!

[Schedule diagram: the main task running the while loop spawns slice tasks; once ready, the main task uses the 8th processor]
Data parallelism (3)

[Chart: speedup vs. number of trimedias; levels off beyond 10 trimedias]
Data parallelism (4)

Analysis:
• The main task (executing the while loop) cannot find new slices fast enough to keep >10 trimedias busy
• The underlying problem is a very inefficient implementation of next_slice_start_code()
• With little effort we fixed the problem, resulting in near-linear speedup for 20 trimedias and up
• Surprisingly, only 1 man-week was invested in modifying the original sequential code for data parallelism…
Multiprocessors summary

• Pollack’s Rule calls for coarse grain parallelism
• Programmer must be aware of coarse grain parallelism!
• Multiprocessors can have ~2 orders better power efficiency
• Heat dissipation limits what we can compute
• Data parallelism scales better than task parallelism
  – Requires a homogeneous multiprocessor
  – Easier to program when a sequential program is available (e.g. 1 man-week vs. 1 man-year for a Kahn task graph)
• Heterogeneous architectures are even more power efficient but are less flexible and offer no data parallelism
Agenda

PART 2  The shifting balance
• Technology scaling
• Efficiency gap is narrowing
• Platform success depends on flexibility
• Predictable system design
Effect of technology scaling on architecture

In the next few slides we show that technology scaling shifts the balance in favor of homogeneous, software-centered architectures:
• Memory increasingly dominates silicon cost (on-chip memory for bandwidth, Straverius’ First Law for capacity)
  – Smaller weight to the area penalty of a homogeneous processor array
• A platform with a large user community is needed to recover non-recurrent silicon costs and reduce application development costs
  – A flexible architecture increases the application range
Memory speed trend [Hennessy & Patterson, 1996]

[Chart, log-scale speed, 1980–2000: CPU performance grows 60%/yr while DRAM grows 7%/yr; the processor-memory performance gap grows ~50% / year]
Implications (1)

CPI ≈ cache_miss_rate × mem_latency
cache_miss_rate ≈ 1 / sqrt(cache_size)

Keeping CPI constant:
⇒ cache_size ∝ (CPU_memory_gap)²
⇒ cache_size grows (1.5)² ≈ 2.2x / yr

Viper-1: 72 kB → Viper-2: 192 kB
Embedded DRAM speed

• Embedded DRAM frequency scales much better
• Straverius’ Law takes over:
  – a new technology node computes S times faster
  – needs an S times bigger cache and memory
  – results in a 1.1x / yr cache size increase

[Diagram: with each new process generation, the faster CPU needs more memory to stay balanced, so memory takes a growing share of the die]
Memory bandwidth trend

[Chart, ’01–’06: DDR bandwidth grows 20% / year; e.g. DDR366 at 2.9 GB/s, then 4 GB/s, then 4.8 GB/s. Source: JEDEC/AMI2, Intel, Samsung]
Memory bandwidth trend

• Compute speed and bandwidth demand: 60% / year
• DDR, DDR2 bandwidth: 20% / year

⇒ Bandwidth mismatch increasing 35% / yr
Efficiency gap is narrowing

• Memory footprint is predominantly determined by the application, much less by the architecture
• Technology scaling gradually replaces the intrinsic computation component with on-chip buffers needed for bandwidth

[Charts: log area_efficiency over time for ASIC, DSP, and CPU. Intrinsic efficiency shows a constant gap, closed only by architecture innovation; system efficiency shows a narrowing gap.]
What is a platform (1)

A platform is a collection of assets that are shared by a set of products. These assets can be divided into four categories:
• Components
• Processes
• Knowledge
• People and relationships

A successful platform needs all four ingredients!

Source: Robertson and Ulrich
What is a platform (2)

• The platform approach contrasts with product-specific development:
  – define and study the product requirements
  – then develop optimal technology for it
• Such dedicated products are potentially of higher quality than their platform counterparts
• But in practice, one-time development endeavors often result in product flaws due to lack of experience
• Platforms, in contrast, can be improved through feedback from earlier products
Platform benefits

• Greater ability to tailor products
• Reduced development cost and time
• Reduced manufacturing cost
• Reduced production investment
• Reduced system complexity
• Lower risk
• Improved service

Source: Robertson and Ulrich
Platform risks (1)

[Diagram comparing risk: with application-specific development, each product (A, B) carries its own risk; with platform-based development, the platform itself carries risk and each derived product (A, B, …, X) carries risk as well]
Platform risks (2)

• Platform benefits only materialize if all ingredients are present
  – components, processes, knowledge, community
• A lot of effort (= money) goes into platform building
  – components, processes, knowledge
• Pay-back only occurs when enough products are derived
• This requires a large enough developer community
• Typically this means that company-internal platforms do not provide the expected benefits
⇒ Must share the platform with customers, even competitors!
A stack of platforms in consumer media products

[Diagram, top to bottom:
  Java · Universal Home API · UPnP
  Linux
  Nexperia API · TTL streaming
  System-on-Chip architecture platform
  Digital design reuse platform
  Semiconductor process platform]
Putting it together

• Homogeneous systems are inherently more flexible than heterogeneous systems
  – address a broader application range
  – attract a larger user community
• Strong need for a successful SoC platform
  – Non-recurrent costs keep rising
  – Product innovation rate keeps rising
• Efficiency gap is narrowing

⇒ Future platforms must focus on homogeneous architecture
Predictable system design

• Heterogeneous systems have more predictable behavior than homogeneous systems
  – Static task graph mapping
  – Dedicated streaming hardware functions
  – In general: less resource sharing
• Homogeneous embedded systems call for increased focus on shared resource virtualisation
  – Shared L2 caches: footprint and bandwidth
  – Bursty, latency critical memory traffic
Concluding remarks and hints for future research

• Homogeneous multiprocessors provide:
  – A strong base for a successful media platform
  – Increasingly efficient processing power
  – Performance scaling, from low cost to high performance
  – Technology scaling (tiling), complementing Moore’s Law
  – A high degree of software reuse (stable SoC architecture)
  – Very high silicon yield (redundant SoC architecture)
• More research needed on composable resource sharing
• More research needed on a structured approach to shared memory media processing