Processor Architecture Design for Smart Cameras

Hamed Fatemi

PROEFSCHRIFT

to obtain the degree of doctor at the Technische Universiteit Eindhoven, on the authority of the Rector Magnificus, prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the Doctorate Board, on Wednesday 21 March 2007 at 16.00, by

Hamed Fatemi, born in Tehran, Iran

This dissertation has been approved by the promotor: prof.dr. H. Corporaal

Copromotoren: dr.ir. T. Basten and dr.ir. B. Mesman

CIP-DATA LIBRARY TECHNISCHE UNIVERSITEIT EINDHOVEN

Fatemi, Hamed
Processor architecture design for smart cameras / by Hamed Fatemi. - Eindhoven : Technische Universiteit Eindhoven, 2007.
Proefschrift. - ISBN 978-90-386-1983-5
NUR 959
Subject headings: embedded systems / parallel architectures / computer vision / data communication.

Kerncommissie:
prof.dr. H. Corporaal (promotor, TU Eindhoven)
dr.ir. T. Basten (copromotor, TU Eindhoven)
dr.ir. B. Mesman (copromotor, TU Eindhoven)
prof.dr.ir. W. Philips (Universiteit Gent, Belgium)
dr.ir. P.P. Jonker (TU Delft)
prof.dr.ir. J. van Meerbergen (Royal Philips Electronics, TU Eindhoven)

The work in this thesis was supported by the Dutch government through its PROGRESS/STW research program, under project EES.5411.

Advanced School for Computing and Imaging

This work was carried out in the ASCI graduate school. ASCI dissertation series number 139.

© Hamed Fatemi 2007. All rights reserved. Reproduction in whole or in part is prohibited without the written consent of the copyright owner.

Printing: Eindhoven University Press
Cover design: S.E. Baha
Back cover: INCA+ camera by Philips Applied Technologies

Contents

List of Figures
Preface

1 Introduction
  1.1 Embedded smart cameras
  1.2 Parallelism in applications
  1.3 Problem statement
  1.4 Contributions and outline of the thesis

2 Image processing algorithms
  2.1 SmartCam applications
  2.2 Classification and skeletonization of image operations
    2.2.1 Low-level image operations
    2.2.2 Intermediate-level image operations
    2.2.3 High-level image operations
    2.2.4 Skeletons for image operations
  2.3 Case study
    2.3.1 Face detection
    2.3.2 Face recognition
  2.4 Skeletonization of face detection/recognition
  2.5 Evaluation
  2.6 Conclusions

3 Architecture developments in image processing
  3.1 Exploiting parallelism in architectures
  3.2 Examples of image processing architectures
    3.2.1 Xetal
    3.2.2 IMAP-board
    3.2.3 TriMedia
    3.2.4 Imagine
    3.2.5 Heterogeneous platform (INCA+)
    3.2.6 Conclusion
  3.3 Case study
    3.3.1 Mapping
    3.3.2 Face recognition
    3.3.3 Algorithmic complexity
    3.3.4 Overall practical performance
  3.4 Conclusions

4 Balancing DLP and OLP
  4.1 Architecture
  4.2 Area model
  4.3 Performance model
  4.4 Design space exploration (DSE) and evaluation
    4.4.1 Multi-objective optimization
    4.4.2 Measurements
  4.5 Conclusions

5 RC-SIMD: a new reconfigurable SIMD architecture
  5.1 Related work
  5.2 Solving the communication bottleneck
    5.2.1 Basic architecture
    5.2.2 Clock frequency of RC-SIMD
    5.2.3 Updated architecture
  5.3 Automatic scheduling
    5.3.1 Conflict model
    5.3.2 Facts tools
    5.3.3 Multi-casting
    5.3.4 Updating the conflict model
  5.4 Evaluation
    5.4.1 Cycle count comparison
    5.4.2 Dependency kernels
    5.4.3 Area estimation
  5.5 Conclusions

6 Run-time reconfigurability of RC-SIMD
  6.1 A reconfigurable architecture
    6.1.1 Reconfigurability
    6.1.2 Flexible clock frequency
  6.2 Programming
    6.2.1 Configuration
    6.2.2 Timing
  6.3 Experiments
    6.3.1 Performance
    6.3.2 Area overhead
  6.4 FPGA implementation
    6.4.1 Celoxica board
    6.4.2 Implementation
  6.5 Conclusions

7 DC-SIMD: Dynamic Communication SIMD
  7.1 Architecture
    7.1.1 Multiple-read
    7.1.2 Multiple-write
  7.2 Area
    7.2.1 Area model
    7.2.2 Area overhead
  7.3 Lens distortion correction
  7.4 Evaluation
    7.4.1 Priority influence
    7.4.2 Different numbers of busses and buffer size
  7.5 Conclusions

8 Automatic design space exploration for SmartCam
  8.1 SmartCam template
  8.2 Area and energy model
  8.3 Framework
    8.3.1 Simulation
    8.3.2 Exploration
  8.4 Case study
    8.4.1 Skeletonization
    8.4.2 Baseline
    8.4.3 Evaluation
  8.5 Conclusions

9 Summary and conclusions
  9.1 Summary
  9.2 Future work

References
Summary
Samenvatting
Curriculum Vitae
Reader's Notes

List of Figures

1.1 Smart sensor which contains a CMOS sensor, and a 1D array of processing elements.
1.2 SmartCam components.

2.1 Source code transformation by using a skeleton library (the gray boxes are defined by the user).
2.2 H-Box for identification. It contains a CMOS sensor and ILP processor.
2.3 Robocup competition, played by 2*4 robots.
2.4 Distorted image.
2.5 Skin region in UV spectrum.
2.6 Skin-tone result.
2.7 Architecture of the RBF Neural Network.
2.8 Region of interest for neural network (before normalizing to 64 × 72).
2.9 IMAP-board.

3.1 General-purpose processor.
3.2 Architecture which exploits TLP.
3.3 Example architecture featuring ILP.
3.4 Generalized template for architectures featuring OLP.
3.5 Architecture which exploits DLP.
3.6 Xetal block diagram.
3.7 LPA processor of Xetal.
3.8 IMAP-chip and external memory block diagram.
3.9 TriMedia chip block diagram.
3.10 TriMedia's VLIW instruction handling.
3.11 Imagine block diagram.
3.12 INCA+ camera.
3.13 INCA+ blocks.
3.14 Computational efficiency of silicon.
3.15 Run-time mapping and pipelined scheduling of face detection/recognition on INCA+.

4.1 SIMD architecture template (each PE can be VLIW).
4.2 PE with 2 local register files (LRF) per ALU.
4.3 PE with shared register file.
4.4 DSE for rgb2yuv kernel. The vertical axis shows the number of cycles.
4.5 DSE for convolution kernel.
4.6 DSE for binarization kernel.
4.7 DSE for merging all kernels.
4.8 Interesting part of Figure 4.7, DSE for merging all kernels. It highlights some non-Pareto points.
4.9 Pie chart of the total area distribution when N_PE = 128, N_ALU = 1, using a local register file, and RF size = 8.

5.1 Locally connected SIMD (LC-SIMD) architecture: each PE can only communicate with direct neighbors, and has only access to its (private) memory slice.
5.2 Fully connected network SIMD (FC-SIMD) architecture: each PE is connected to all other PEs.
5.3 SIMD architecture with the communication bottleneck removed; successive PEs do not access the bus simultaneously because of delay registers in the instruction bus.
5.4 Schedule of a 4-tap filter on the SIMD architecture of Figure 5.3, (a) without and (b) with delay line in the instruction distribution. The schedule for (a) is invalid because of the communication bottleneck. Note that an FC-SIMD also solves this problem.
5.5 Updated architecture (the maximum neighborhood communication is 3).
5.6 Instruction distribution when the maximum neighborhood communication is 3.
5.7 The resource model for a LD+2 operation.
5.8 The extended resource model for a LD+2 operation.
5.9 (a) Schedule example for Facts and (b) incorrect schedule.
5.10 (a) DFG transform because of conflict LD+2 vs. LD-3, (b) DFG transform because of conflict LD-3 vs. LD+1, and (c) correct schedule.
5.11 There is no conflict at clock cycle 5 because PE0 and PE1 fetch the same value.
5.12 Invalid schedule of a 4-tap filter because of the resource conflict at clock cycle 7.
5.13 The extended resource model for the border.
5.14 A valid schedule for a 4-tap filter.
5.15 (a) Example for dependency loop kernel, (b) schedule for the LC-SIMD, and (c) schedule for RC-SIMD with initiation interval 6.
5.16 The multiplexor area overhead for each PE in (a) LC-SIMD, (b) FC-SIMD, (c) RC-SIMD.
5.17 Area overhead (compared to an LC-SIMD).

6.1 Reconfigurable Communication SIMD (RC-SIMD) architecture.
6.2 PE structure.
6.3 Control processor controls the PEs and clock generator.
6.4 Performance for different maximum communication distance (k).
6.5 Area overhead of the reconfigurable part in comparison with the non-run-time-reconfigurable architecture.
6.6 Components of the Celoxica board.
6.7 Assembly and binary code of the threshold algorithm for PEs in the RC-SIMD architecture.
6.8 (a) Source image, (b) edge detection image, and (c) threshold image.
6.9 Instruction format of RC-SIMD.
6.10 Example assembly code for doing communication.
6.11 Run-time schedule of the program in Figure 6.10.
6.12 Schematic view of RC-SIMD which includes 3 PEs.
6.13 Single-cycle implementation of a processing element (without control signals).

7.1 DC-SIMD architecture in which each PE can read from all busses when Nbus = 3 (only left communication is drawn).
7.2 Matching PE-ID (for reading data).
7.3 Writing data to a register (in the multiple-read architecture when Nbus = 3).
7.4 DC-SIMD architecture in which each PE can write to all busses when Nbus = 3 (only left communication is shown).
7.5 Writing data to a register (in the multiple-write architecture when Nbus = 3).
7.6 Another architecture, in which the number of busses is less than the number of segmented PEs.
7.7 Area overhead of DC-SIMD architectures (when Nbus = 3) compared to an LC-SIMD.
7.8 Conversion parameters from pixel domain to distance domain.
7.9 (a) Lens-distorted image, (b) corrected image.
7.10 Performance improvement with different Nbus.
7.11 Cycle count for different instruction buffer sizes.
7.12 Combination of RC-SIMD and DC-SIMD.

8.1 SmartCam architecture template, containing SIMD, ILP, and special-purpose processors.
8.2 Design space exploration. An architecture template is iteratively instantiated based on the performance of an application.
8.3 Simulating a SmartCam program. Simulation (1) runs on a workstation to generate a trace, (2) benchmarks each operation on each processor independently, using processor simulators, and (3) uses these benchmarks and the trace to simulate the total multiprocessor architecture without actually executing the operations.
8.4 Robocup application detail tasks.
8.5 Color segmented image of Robocup application.
8.6 Baseline Pareto-dominated volume for the Robocup application.
8.7 Pareto-dominated volume of a typical SPEA2 run (boxes), compared to the baseline (lines).
8.8 Convergence of the dominated volume for the SPEA algorithm (10 runs). The Y-axis is the fraction of dominated space of a (0.10 s, 100.0 mm², 0.5 J) box.

9.1 Coarse-grain reconfigurable architecture which can support SIMD, VLIW, ... architectures.

Preface

Back in late 2001, I received an email from Twan Basten regarding the SmartCam project. At first I was not sure whether accepting it was a good decision, but today I am really pleased with my decision. Starting, going through, and finishing this new path would have been impossible without the support and help of many experienced and kind people. So I devote this preface to all those who supported me throughout these years.

First of all, I would like to express my thanks to Prof. Henk Corporaal, who gave me the opportunity of this PhD position. Henk is one of the most well-informed specialists I have seen in this field. I will never forget his help and guidance during these four years, which helped me to accomplish my project. Henk was not only a good supervisor but also good company and a great help with all my and my family's problems in our daily life; help that made us feel comfortable here in the Netherlands. Such a nice and amicable supervisor is certainly never forgotten. He is gratefully acknowledged.

I should also give my deep thanks to Twan Basten. He was the first person who offered me this project. He provided me with many helpful tips, and his eager and careful review and correction of my scientific papers was a great asset. I was really lucky to have such a diligent and clever supervisor.

My project really got on the fast track when I started working with Bart Mesman. He is one of the cleverest people I have ever met. His new ideas were always important for the progress of my project. I highly appreciate his wise help and guidance.

I also thank all the SmartCam members: Pieter Jonker, Richard Kleihorst, Harry Boers, and Wouter Caarls. Their good comments helped me greatly in progressing my project. My special thanks go to Richard, who gave me the opportunity of working at Philips for nine months; I learned a great deal from him in this unforgettable period. I also thank Anteneh Abbo, who helped me in learning Xetal. Many thanks go to Wouter for jointly performing the research within SmartCam, and especially for his help in translating the Dutch summaries and for his helpful comments.

The members of the reading committee are specially appreciated for reading my thesis, giving good comments, and participating in my defense session.

I am highly grateful to Prof. Ralph Otten, the head of the ES group, who pays the utmost attention to this group and manages it in the best possible way. I would like to thank my colleagues in the ES group, first Marja and Rian for all their kind and helpful support, which started even before I arrived here in the Netherlands and continues until now. Also my special thanks to dear Rian for arranging the Dutch class for us on Wednesdays; her efforts in teaching us the Dutch language are highly appreciated. I thank my other colleagues Akash, Bart (Theelen), Calin, Dominik, Hao, Jinfeng, Lech, Marc, Mathias, Oana, Patrick, Phillip, Qin, Sander, Szymon, and Valentin for being good friends and sharing good times during these years. I had a very good time with them and really enjoyed their company.

I thank Mohammad Mussavi, because he was the person who initiated this enjoyable path of my life. He is a very kind friend who helped me and kept me company from the early days of my arrival in the Netherlands, so that I would not feel lonely and homesick, and he helped me so much in adjusting to my new environment. Thanks also for his help in writing some of my articles. I thank Abdol Saib (Mr. CHETORI) for being such a good companion; we spent really the best times together. Also the group of Iranian friends: Kamyar, Abam, Amir, Siamak, Ehsan, Arash, Payam, Elham, and also the football team members. Their valuable friendship means very much to me. Spending my free time in their warm company made me feel happy and less homesick, so I had the energy to work on my project. My special appreciation goes to Ehsan Baha, who with his great talent and creativity designed the cover of this thesis. He also worked hard to arrange a football team for us and to get everybody together to enjoy it every week. Also Amir, Abam, Pouyan, and Hamid: we spent our breaks together every day, I really enjoyed their interesting discussions, and it will never be forgotten.

Last but not least, my wholehearted thanks go to my kind, patient, and devoted family. First of all I am grateful to my mother- and father-in-law. They have always been supporting and encouraging me on this difficult path; thanks also for their good advice. My sisters- and brothers-in-law as well; they were always good company in all my problems. Then come my mother, father, and brother. I cannot express in one sentence my thanks for all the support and devotion I received from them throughout my whole life; I owe this achievement to them. Finally, I give my thanks to my caring wife, Negar, who has always been the best companion for me. I certainly could not have gotten to this position without her unconditional help, support, and encouragement. Her love and care always encouraged me to go ahead. Thanks also go to her for translating and correcting my papers during these years. With love and gratitude, I dedicate this thesis to Negar and my sweet daughter Saba.

Eindhoven, March 2007

Hamed Fatemi

Chapter 1

Introduction

In 1965, Gordon Moore observed that the number of transistors on a single silicon chip doubles at a regular pace, popularly quoted as every 18 months (Moore's law) [66]. Silicon technology thus enables massively parallel architectures, by increasing the number of transistors per die. On the application side, it is possible to increase functionality, and therefore computational requirements (e.g., real-time execution), by exploiting parallelism inside applications, as in media applications: surveillance, object recognition, and 3D graphics [86].

Computing devices and applications have recently emerged to interface with, operate on, and process data (e.g., pixels) sampled from the real world and classified as media. As media applications operating on these data have come to the forefront, the design of processors optimized for these applications has emerged as an important research area. Traditional microprocessors have been optimized to execute desktop computing workloads. Media applications are a workload with significantly different characteristics, meaning that large improvements in performance, cost, and power efficiency can be achieved by adapting processors to exploit these media characteristics.

Media applications include workloads from the areas of signal processing, image processing, video encoding and decoding, and computer graphics. These workloads require a large and growing amount of arithmetic performance. For example, many current computer graphics and image processing applications in desktop systems require billions of arithmetic operations per second for real-time performance [81]. As a result, media processors must be designed to provide large amounts of absolute performance. While high performance is necessary to meet the computational requirements of media applications, many media processors will need to be deployed in embedded systems, where cost and power consumption are additional key concerns.


Fixed-function processors have been able to provide both high performance and good energy efficiency compared to programmable image processors (e.g., the MPEG2 [68] decoder chip [20]). In comparison, programmable digital signal processors and microprocessors are several orders of magnitude worse, both in absolute performance and in energy efficiency. However, programmability is a key requirement in many systems where algorithms are too complex or change too rapidly to be built into fixed-function hardware. Using programmable rather than fixed-function processors also enables fast time-to-market. Finally, the cost of building chips is growing significantly in deep sub-micron technologies, meaning that programmable solutions also have an inherent cost advantage, since a single programmable chip can be used in many different systems. For these reasons, a programmable processor which can provide the performance and energy efficiency of fixed-function processors is desirable.

This chapter is organized as follows. Section 1.1 introduces the concept of an embedded smart camera, which is the main subject of study in this thesis. The exploitation of parallelism inside applications is studied in Section 1.2. We conclude with the thesis problem statement and the main contributions and thesis outline in Sections 1.3 and 1.4.

1.1 Embedded smart cameras

An embedded system is an information processing system that determines or controls, to a large extent, the behavior of a larger system. Embedded systems play a major role in daily life; e.g., the number of embedded processors with embedded software used by each person per day is about 50, and this number increases rapidly. The number of embedded processors sold annually surpasses the number of stand-alone processors by about a factor of 20 [15]. In many networked embedded systems, sensing with cameras is combined with processing to achieve certain communication, measurement, or control goals. Video camcorders, web cameras, and video-phones are examples of products where the combination of image sensing, intelligent processing, digital storage, and transmission is penetrating the mass electronics market. Other applications can be found in inspection, surveillance, (mobile) communication, and robotics; more recently, embedded cameras are also found in wireless intelligent agents such as AIBOs [89]. In (wireless) embedded applications, the vision sensor can, e.g., be used to directly drive motion control electronics or wireless links. The advent and subsequent popularity of low-cost, low-power CMOS vision sensors [77] enables us to integrate processing logic (in a single package or board) on the camera, thereby creating so-called smart sensors (see Figure 1.1). Examples of this integration have been realized by Linköping University and its spin-off company IVP [44], and by Philips Natlab in its Xetal project [50].

Figure 1.1 Smart sensor, which contains a CMOS sensor and a 1D array of processing elements.

In these designs, a smart sensor was developed containing a 2D pixel array, combined with a 1D array of ADCs (analog-to-digital converters) and a 1D array of processing elements (PEs) operating in SIMD (Single Instruction, Multiple Data) mode. By using a control processor, simple image processing routines can be executed on the array as soon as one or more lines of the image have been converted. This is a great advantage over classical approaches, where in the worst case the entire image is converted to digital, copied a few times, and only then processed. Smart sensors enable low power consumption and integrated intelligence. However, a control processor is still required, feeding the SIMD processor array with instructions and supporting basic program control structures. On top of this, a separate, powerful instruction-level parallelism (ILP) processor or general-purpose processor (GPP) is usually needed in embedded applications for feature and object processing and for control tasks. Integrating all this functionality (possibly in a single package) will have a positive effect on cost, power consumption, latency, and inter-processor bandwidth (see Figure 1.2). The result is a low-cost smart camera (a so-called SmartCam) solution [8].

1.2 Parallelism in applications

Many of the arithmetic operations in embedded media applications can be executed in parallel. This available parallelism in applications can be classified into three categories: instruction or operation level parallelism (ILP or OLP), data-level

parallelism (DLP), and task-level parallelism (TLP).

Figure 1.2 SmartCam components.

Program 1.1 Pseudo code demonstrating DLP (each pixel is incremented by a constant value).

for (i = 0; i < height; i++) {
  for (j = 0; j < width; j++) {
    output[i][j] = input[i][j] + c;
  }
}

Most of the available parallelism in vision applications is at the data level. DLP refers to performing the same computation on different data elements in parallel; e.g., adding a constant value to all pixels in the image, as shown in Program 1.1. Furthermore, DLP in applications can often be exploited with SIMD (Single Instruction, Multiple Data) execution, since the same operation (instruction) is typically applied to all data elements at the same time.


Program 1.2 Pseudo code demonstrating ILP (it also contains DLP). The three statements computing temp1, temp2, and temp3 can be executed in parallel.

for (i = 1; i < height-1; i++) {
  for (j = 1; j < width-1; j++) {
    temp1 = input[i-1][j-1] * -1 + input[i-1][j] * -2 + input[i-1][j+1] * -1;
    temp2 = input[i-0][j-1] * -2 + input[i-0][j] * 12 + input[i-0][j+1] * -2;
    temp3 = input[i+1][j-1] * -1 + input[i+1][j] * -2 + input[i+1][j+1] * -1;
    output[i][j] = temp1 + temp2 + temp3;
  }
}

Some parallelism is also available at the instruction/operation level. For example, a convolution filter computes the product of a coefficient matrix with a sequence of pixels. This matrix-vector product includes a number of multiplications and additions that could be performed in parallel (Program 1.2). Such fine-grained parallelism between individual arithmetic operations operating on one data element (pixel) is classified as ILP/OLP and can be exploited in many applications.

Finally, some applications also contain task-level, or thread-level, parallelism. For example, in Robocup (soccer for robots), detecting the ball and the goal are separate tasks and can be executed in parallel. Note that ILP/OLP, DLP, and TLP are orthogonal types of parallelism, meaning that all three can be supported simultaneously.

This thesis mostly concentrates on the exploitation of DLP and the SIMD processor concept, because in the image processing domain there is a huge amount of DLP inherent in pixel-type operations. The SIMD processor offers undeniable advantages in control efficiency (e.g., 1 instruction word for 320 parallel PEs [50]), and its repetitive structure simplifies floorplanning and layout design. SIMD architectures can achieve high computational performance (e.g., 320 PEs working at the same time) with very modest power consumption; they are therefore strong in computational efficiency (MOPS/W). The sketch below illustrates this lockstep execution model.
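To make the lockstep execution concrete, here is a hypothetical plain-C sketch (not code from the thesis) of how the DLP loop of Program 1.1 maps onto an SIMD array: each PE owns one image column, and the inner loop models a single SIMD instruction issued to all PEs at once. The image size and the one-PE-per-column assignment are assumptions for this example.

#define HEIGHT 240            /* hypothetical image size */
#define WIDTH  320
#define NUM_PE 320            /* assumption: one PE per image column */

unsigned char input[HEIGHT][WIDTH], output[HEIGHT][WIDTH];

void simd_add_constant(unsigned char c) {
  for (int i = 0; i < HEIGHT; i++) {
    /* Conceptually a single SIMD instruction: all NUM_PE processing
       elements update their own pixel of row i simultaneously. */
    for (int pe = 0; pe < NUM_PE; pe++)
      output[i][pe] = input[i][pe] + c;
  }
}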

1.3 Problem statement

One of the most debated aspects of an SIMD architecture is its communication infrastructure between PEs. The reason is that in many SIMDs, each PE can access only its direct neighbors, while for most kernels, such as NxN filters, PEs also need to get data/pixels from non-neighboring PEs. In the past, one way to execute NxN filters on these SIMD processors was to decompose them into multiple 3x3 filters (similar to loop unrolling). For calculating the output of a 3x3 filter, each output pixel needs only the pixel values of its direct neighbors in the input image. For example, a 5x5 filter can be decomposed into nine 3x3 filters. A disadvantage


of this decomposition is that it increases the code size and the execution time. Another solution is to move a pixel over an arbitrary distance among PEs by performing multiple single-step shifts. However, these shifts lead to cycle overhead (the sketch at the end of this section illustrates why). Some other SIMD architectures use a fully connected communication network between PEs, but this may cause extreme area overhead.

Another problem for the communication between PEs is that the SIMD concept does not match variable-distance communication: if a particular PE needs to communicate with another PE at a certain distance, all PEs need to communicate over that same distance (due to the SIMD concept). Therefore, a standard SIMD cannot efficiently execute certain applications like lens distortion compensation, in which pixels of the distorted image have to be moved to the right place. One way to execute lens distortion compensation on current SIMDs is to communicate data over the maximum distance (per line) needed. However, this again causes severe cycle overhead.

Given the above observations, and the fact that SIMD processors are a crucial component in SmartCam architectures, a major topic in this thesis is the study of novel solutions for organizing the inter-PE interconnect of the SIMD architecture. Furthermore, we investigate new opportunities and contribute to a better and more quantitatively guided design trajectory for an efficient SmartCam template, considering constraints such as power, performance, and cost. It is unclear what the right architectural parameters are for the SmartCam application domain. Configurable parameters for various parts of the SmartCam template include:

• The number of PEs in the SIMD processor.
• The number of registers per PE.
• The number of function units in each PE or ILP processor.
• The number of SIMD and ILP processors.
• The interconnect between PEs.
• The interconnect between processors in the SmartCam template (e.g., a bus, a ring, a fully connected network).
• The bandwidth for transferring data between processors.

Finally, given the observation that an architecture with the above components is hard to program, we look into the programmability of SmartCam architectures.
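The shift overhead mentioned above can be illustrated with a small hypothetical sketch (plain C modeling one register per PE; not code from the thesis): on a locally connected SIMD, reading from a PE at distance d costs d one-step shift instructions.

#define NUM_PE 320

int reg[NUM_PE], tmp[NUM_PE];

/* One SIMD "shift left" instruction: every PE reads the register of
   its right-hand neighbor in the same cycle. */
void shift_left_once(void) {
  for (int pe = 0; pe < NUM_PE - 1; pe++)
    tmp[pe] = reg[pe + 1];
  tmp[NUM_PE - 1] = 0;                  /* border PE receives a default */
  for (int pe = 0; pe < NUM_PE; pe++)
    reg[pe] = tmp[pe];
}

/* Fetching a value from a PE at distance d therefore costs d cycles. */
void fetch_from_distance(int d) {
  for (int step = 0; step < d; step++)
    shift_left_once();
}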

1.4 Contributions and outline of the thesis

This work makes several contributions to answer the questions raised in the previous section.

• Characterizing SmartCam applications and proposing a proper programming model: To find an efficient solution for a SmartCam, it is necessary to study the applications and find their characteristics. This study determines a set of core algorithms needed for low-, mid-, and high-level image processing as applied within SmartCam, and their requirements (Chapter 2). We propose a programming model based on algorithmic skeletons, aiming at ease of programming and code portability. A summarized version of this part has been published in [22, 25].

• We elaborate on some example processor architectures for image processing applications to obtain insights into system design. These examples indicate technology trends. We discuss the merits and shortcomings of the systems (Chapter 3).

• Determining an efficient parallel architecture: We explore the limitations and bottlenecks of increasing support for parallelism along the DLP and ILP/OLP axes, in isolation and in combination. To scrutinize the effect of DLP and ILP/OLP in the SmartCam architecture/template, an area model (based on the number of ALUs for ILP/OLP and the number of PEs for DLP) as well as a performance model are defined. Based on these models and the template, a set of kernels of image processing applications is studied to find Pareto-optimal architectures in terms of area and number of cycles via multi-objective optimization (Chapter 4). A summarized version of this contribution has appeared in [24, 23].

• Proposal for a new type of SIMD architecture for non-local communication: Our studies led to the design of a new SIMD architecture, called RC-SIMD, with a reconfigurable communication network. This architecture requires only a very cheap communication network while performing almost the same as expensive fully connected SIMD architectures (Chapters 5 and 6). A summary of the RC-SIMD architecture has been published in [28, 27], and it has also been patented (PH005781).

• In addition, we propose a new type of SIMD processor which can support dynamic variable-distance communication, as needed in lens distortion compensation (Chapter 7). A summarized version of this architecture has been published in [62].

• Finally, we study the efficient mapping of applications onto the SmartCam template based on cost, energy, and performance by using design space exploration (DSE). This DSE works by a guided iteration over all SmartCam


architectures within a certain template for finding an efficient architecture (Chapter 8).

Chapter 2

Image processing algorithms

A SmartCam is a programmable smart camera for image processing applications (e.g., Robocup, face detection/recognition). Current solutions use off-the-shelf digital signal processors, which are not optimally tuned to the image processing domain. Section 1.1 suggested that a combination of SIMD processors and ILP processors should be efficient: an SIMD processor is suitable for simple image processing algorithms, while an ILP processor can execute more irregular algorithms such as neural-network-based algorithms, Hough transformation, and position estimation.

Before starting to design the SmartCam template, the range of applications that should be implemented on it has to be determined (to establish the available instruction-, data-, and task-level parallelism). The next step is to find and extract the algorithmic (loop) kernels inside these applications, in order to identify the operations inside these kernels and to characterize and classify the algorithms.

Programming such a system (containing a combination of SIMD and ILP processors) is laborious, since it involves the non-trivial mapping of applications/tasks, synchronization (e.g., the start time of a task), and communication (between processors) onto the architecture. To reduce this effort, a programming model based on algorithmic skeletons [13] is used to bring parallelism into the sequential code of image processing applications (kernels) [45]. Skeletons are algorithmic abstractions that encapsulate different forms of parallelism common to a range of applications. The aim is to obtain environments or languages that allow easy parallel programming, in which the user does not have to handle problems related to communication, synchronization, deadlocks, or non-deterministic program runs [13]. Usually, skeletons are embedded in a sequential host language and are used for coding and hiding the parallelism from the application programmer. In [87], Serot presents a parallel image processing environment, using skeletons on

top of the CAML functional language [16]. In [70], a parallel image processing environment has been presented for low-level image operations; its skeletons have been implemented in C, using MPI [69] as the communication library.

In this chapter, a skeleton library for image processing operations (low, intermediate, and high level) is defined. The skeleton library is embedded in the C programming language. By using this library, the user is completely shielded from the parallel implementation of the algorithm, providing only the sequential code to process a single datum. Figure 2.1 shows how to generate skeleton code by using the skeleton library: the user only specifies the skeleton instantiation function and its operation (gray boxes; see Program 2.6). The advantage, apart from providing the developer with a sequential interface, is that this abstraction allows the program to be executed on different processor architectures without changes to the user code: once a skeleton implementation has been provided for an architecture, any instantiation of it can be run. If the skeleton does not exist for a specific architecture, the user can easily add it to the library [9].

Figure 2.1 Source code transformation by using a skeleton library (the gray boxes are defined by the user).

This chapter is organized as follows: Section 2.1 treats the selected applications for SmartCam. The algorithms and the classification of skeletons for image processing applications are described in Section 2.2. For evaluating the skeleton library, face detection and recognition algorithms are selected as a case study, presented in Section 2.3. The skeletonization of face detection and recognition is studied in Section 2.4. The evaluation of the skeletons with the case studies is performed in Section 2.5, followed by conclusions in Section 2.6.

Figure 2.2 H-Box for identification. It contains a CMOS sensor and an ILP processor.

2.1 SmartCam applications

The application areas of smart cameras range from battery-run toys to high-speed industrial applications. To restrict this range, this dissertation only considers smart cameras in which all low-level (pixel-level) and at least some intermediate-level (object-level) vision processing is done inside the camera, with the option of also integrating high-level (complex object) processing tasks. The outputs of these applications are control decisions, symbolic data, and a small region of interest (ROI). The majority of these applications, however, are extremely energy/power constrained (they need to map onto a SmartCam, which is cheap and low-power), require real-time guarantees (e.g., 30 frames/s), and, increasingly often, both. This work focuses on the following applications:

• Security camera (face detection/recognition):

Recently, face/object detection and recognition have become important applications for intelligent cameras and security systems. Figure 2.2 shows the H-Box [57] from Philips, which is used for identification (it contains a CMOS sensor and one ILP processor). Such a system can be extended to use stereo setups for depth information [52], or multiple viewpoints in order to cover large areas. The output of a camera would consist of identity and location information of the detected person/object. Face/object detection and recognition require considerable processing performance if real-time constraints are taken into account, and also low energy dissipation if the camera works

on a battery [26]. Face detection/recognition algorithms are explained in Section 2.3.

Figure 2.3 Robocup competition, played by 2*4 robots.

• Robocup: Another application where high-speed image processing is important is robot soccer, in which two teams (four robots in each team) play a soccer game (Figure 2.3). In the robot soccer competition [92], the robots must autonomously navigate in an increasingly unstructured domain and coordinate their soccer game. As soccer is a dynamic game, fast input is necessary for the robots to react quickly to the changing environment. The amount of processing in this application that is unrelated to vision is quite high, requiring an interface to a separate PC board (each robot is equipped with a computer and a camera). Problems solved in robot soccer may also be useful for more serious applications, for example, autonomous fire-fighting robots. All vision processing could be done on the SmartCam. More details of the algorithms will be studied in Section 8.4.

• Lens distortion correction: The optical lens systems used in imaging equipment suffer from distortion artifacts, which decrease the quality of the images produced (Figure 2.4). In applications such as computer vision, security, and medical applications, the determination of and compensation for distortion is required to enable accurate location and measurement of features in the image. While distortion correction may be applied off-line in many cases, a real-time capability is desirable for systems that must interact with the environment or with a user in real time. Lens distortion correction is studied in Section 7.3; a small sketch of the idea follows below. It can also be part of any other application that uses sensors to capture images.
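As a rough illustration of what such a correction computes, here is a hypothetical sketch using the common one-parameter radial distortion model; the algorithm actually used in Section 7.3 may differ, and the constant k, the image size, and the inverse-mapping formulation are assumptions for this example.

#define HEIGHT 240                 /* hypothetical image size */
#define WIDTH  320

unsigned char distorted[HEIGHT][WIDTH], corrected[HEIGHT][WIDTH];

void correct_distortion(double k) {
  double cx = WIDTH / 2.0, cy = HEIGHT / 2.0;   /* assumed optical center */
  for (int y = 0; y < HEIGHT; y++)
    for (int x = 0; x < WIDTH; x++) {
      /* Inverse mapping: for each output pixel, look up the input
         pixel it came from under the radial model (1 + k * r^2). */
      double dx = x - cx, dy = y - cy;
      double r2 = dx * dx + dy * dy;
      int sx = (int)(cx + dx * (1.0 + k * r2));
      int sy = (int)(cy + dy * (1.0 + k * r2));
      if (sx >= 0 && sx < WIDTH && sy >= 0 && sy < HEIGHT)
        corrected[y][x] = distorted[sy][sx];
    }
}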


Figure 2.4 Distorted image.

This set includes devices with different size, power, and processing requirements, as well as a good collection of image processing algorithms/kernels. (There are more applications, such as localization [83], pick-and-place, and car electronics/intelligent transport systems [6], but they are similar to the aforementioned applications.) The following section presents the (loop) kernels of the SmartCam applications used for verifying the skeletons.

2.2 Classification and skeletonization of image operations

To find the characteristics of the above applications such as required processing power, degree of parallelism and bandwidth, these applications should be decomposed into image processing (loop) kernels. More general benchmarks such as EEMBC [19] and MediaBench [56] include lots of compression algorithms, which are very specific and generally not applicable to the SmartCam domain. Therefore, this thesis uses the aforementioned applications. The rest of this section describes a number of algorithmic operations (kernels) which are used in the chosen application set. Image processing operations can be classified as low-level, intermediate-level and high-level [51]; based on this classification, it is possible to define a skeleton library for image operations.

2.2.1 Low-level image operations

Low-level image processing operations use the values of image pixels to modify individual pixels/data in the output (the output can be 2D or 1D).


Program 2.1 Pseudo code for color segmentation (simple implementation).

for (y = 0; y < HEIGHT; y++)
  for (x = 0; x < WIDTH; x++)
    if (U_MIN_RED < img[y][x][u] && img[y][x][u] < U_MAX_RED &&
        V_MIN_RED < img[y][x][v] && img[y][x][v] < V_MAX_RED)
      label[y][x] = RED;
    else
      label[y][x] = 0;

These operations can be divided into point-to-point, neighborhood-to-point, and global-to-point categories [71]. Point-to-point operations usually depend on the values of the corresponding pixels from the input image, and parallelizing them is straightforward. Some examples of point-to-point operations are:

• Color space conversion. This is usually the first step for segmentation operations, which work better in HSI [39], YUV [94], or dedicated color spaces rather than in RGB.

• Color segmentation. This is typically used for skin-tone detection, e.g., for face detection in security cameras (Program 2.1).

Program 2.2 Pseudo code for filtering.

for (y = 0; y < HEIGHT; y++)
  for (x = 0; x < WIDTH; x++) {
    out[y][x] = 0;
    for (m = -KERNEL_HEIGHT/2; m < KERNEL_HEIGHT/2; m++)
      for (n = -KERNEL_WIDTH/2; n < KERNEL_WIDTH/2; n++)
        out[y][x] = out[y][x] + kernel[m][n] * img[y+m][x+n];
  }

Neighborhood operations produce an image in which the output pixels depend on a group of neighboring pixels around the corresponding pixel from the input image. Filtering operations like noise reduction, smoothing, sharpening, and edge detection are highly parallelizable. These operations are used in all applications as preprocessing steps. They are local neighborhood operations, meaning that for each pixel, information about its immediate neighborhood is needed for its processing. Program 2.2 shows an example of a filter operation.

Program 2.3 Pseudo code for histogram.

for (y = 0; y < HEIGHT; y++)
  for (x = 0; x < WIDTH; x++)
    hist[img[y][x]]++;


Global operations depend on all the pixels of the input image, like histogram equalization. Histogram modeling techniques provide a sophisticated method for modifying the dynamic range and contrast of an image, by altering the image such that its intensity histogram has a desired shape. Histogram equalization is often used to preprocess images before feeding them to a pattern recognition system, such as in face detection for security cameras. The histogram (Program 2.3) is a global operation because its outputs (histogram levels) depend on all pixel values of the input image. The histogram can nevertheless be parallelized, because the histogram levels are independent and the reduction is associative [54]; the sketch below illustrates this.
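A hypothetical plain-C illustration (not code from the thesis) of this parallelization: each of P workers builds a private partial histogram over its own block of rows, and the partial counts are then combined with an associative sum. The sizes and worker count are assumptions.

#define HEIGHT 240                /* hypothetical sizes */
#define WIDTH  320
#define BINS   256
#define P      4                  /* number of parallel workers */

unsigned char img[HEIGHT][WIDTH];
int hist[BINS];

void parallel_histogram(void) {
  static int partial[P][BINS];    /* one private histogram per worker */

  /* Each worker p scans its own block of rows (assume HEIGHT divides
     evenly by P); no two workers touch the same partial histogram,
     so this outer loop can run fully in parallel. */
  for (int p = 0; p < P; p++)
    for (int y = p * (HEIGHT / P); y < (p + 1) * (HEIGHT / P); y++)
      for (int x = 0; x < WIDTH; x++)
        partial[p][img[y][x]]++;

  /* Associative reduction: the partial counts may be summed in any
     order (e.g., pairwise in a tree) to obtain the final histogram. */
  for (int b = 0; b < BINS; b++) {
    hist[b] = 0;
    for (int p = 0; p < P; p++)
      hist[b] += partial[p][b];
  }
}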

2.2.2 Intermediate-level image operations

Intermediate-level image processing extracts objects, features, or other information from images and outputs other data structures, such as detected objects (e.g., faces) or statistics, thereby reducing the amount of information. The labeling operation, for example, is in this category: region labeling is used to separate objects from each other, e.g., by assigning the same label to connected pixels that have the same color (a small sketch follows below). Intermediate-level operations can be defined as image-to-object operations. They offer less data parallelism than low-level operations.
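A hypothetical sketch of such a region labeling step, using a simple recursive flood fill over 4-connected pixels. This is not the thesis's algorithm; production implementations typically use an iterative two-pass union-find scheme instead, and this version assumes images small enough for the recursion depth.

#define HEIGHT 240
#define WIDTH  320

unsigned char color[HEIGHT][WIDTH];
int label[HEIGHT][WIDTH];            /* 0 = not yet labeled */

void flood(int y, int x, unsigned char c, int lab) {
  if (y < 0 || y >= HEIGHT || x < 0 || x >= WIDTH) return;
  if (label[y][x] != 0 || color[y][x] != c) return;
  label[y][x] = lab;                 /* claim this pixel, then spread  */
  flood(y - 1, x, c, lab);           /* to the four direct neighbors   */
  flood(y + 1, x, c, lab);           /* that carry the same color      */
  flood(y, x - 1, c, lab);
  flood(y, x + 1, c, lab);
}

void label_regions(void) {
  int next = 1;
  for (int y = 0; y < HEIGHT; y++)
    for (int x = 0; x < WIDTH; x++)
      if (label[y][x] == 0)
        flood(y, x, color[y][x], next++);
}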

2.2.3 High-level image operations

High-level image processing operations work on vector data or objects in the image and return similarly complex structures or decision values. They usually have irregular access patterns and, as such, are difficult to process in a data-parallel way. They can be divided into object-to-object and object-to-value operations. Some examples are:

• Feed-forward neural network. Neural networks are used in face recognition for security cameras, and in fault detection for industrial inspection. Feeding a large input vector (5000 inputs is not unheard of in face recognition tasks [40]) through the network requires a vast amount of floating-point multiply-and-accumulate operations and activation function evaluations (see Program 2.4; more detail is given in Section 2.3).

• Position estimation. This algorithm, used in the Robocup environment, uses the lines found by the Hough transform to estimate first the orientation and then the position.


Program 2.4 Pseudo code for the RBF neural network.

for (h = 0; h < HIDDEN_NODE; h++) {
  out_hidden[h] = 0;
  for (i = 0; i < INPUT_NODE; i++)
    out_hidden[h] += input[i] * i2h_weight[h][i];
  out_hidden[h] = ActiveFunc(out_hidden[h]);
}

for (o = 0; o < OUTPUT_NODE; o++) {
  out_rbf[o] = 0;
  for (h = 0; h < HIDDEN_NODE; h++)
    out_rbf[o] += out_hidden[h] * h2o_weight[o][h];
}

person = 0;
max = 0;
for (o = 0; o < OUTPUT_NODE; o++)
  if (out_rbf[o] > max) {
    max = out_rbf[o];
    person = o;
  }

2.2.4 Skeletons for image operations

It is possible to use the data-parallelism paradigm with the master-slave approach for low-level, intermediate-level, and high-level image processing operations [72]. A master processor is selected for splitting and distributing the data to the slaves; the master can also process part of the image (data). Each slave processes its assigned share of the image (data), and then the master gathers and assembles the image (data) back.

Based on the above observation, it is possible to identify a number of skeletons for the parallel processing of low-level, intermediate-level, and high-level image processing operations. They are named according to the type of the operator. The headers of the skeletons are shown in Program 2.5. Each skeleton can be executed on a set of processors. From this set of processors, a host processor is selected to split and distribute the image to the other processors. The other processors from the set receive their part of the image and the image operation which should be applied to it. Then, the computation takes place and the result is sent back to the host processor. The programmer of the image processing application should only select the skeleton from the library and give the appropriate operation as a parameter. Program 2.6 shows a sequential program, the skeleton library, a user instantiation function, and user skeleton instantiation code for a simple algorithm (binarization).

Skeleton             Function                              Run-time (%)   Static (%)
PixelToPixelOp       Segmentation                          44.8           …
NeighborToPixelOp    …                                     0              …
GlobalToDataOp       Run length encoding                   27.4           …
ImageToObjectOp      Labeling                              12.1           …
ObjectToObjectOp     Ball, Field, Robot, Goal detection    12.3           …
ObjectToValueOp      Position of object                    …              …
Total                -                                     …              …

Program 2.6 Sequential code, skeleton library, and user instantiation for a simple algorithm (binarization).

// Sequential code:
for (y = 0; y < HEIGHT; y++)
  for (x = 0; x < WIDTH; x++)
    out[y][x] = (in[y][x] > 128);

// Skeleton library (for PC):
void PixelToPixelOp(E_IMG in[HEIGHT][WIDTH], E_IMG out[HEIGHT][WIDTH], int (*op)()) {
  for (y = 0; y < HEIGHT; y++)
    for (x = 0; x < WIDTH; x++)
      out[y][x] = op(in[y][x]);
}

// Skeleton library (for IMAP in 1DC language):
separate unsigned char in[HEIGHT][WIDTH/NUM_PE], out[HEIGHT][WIDTH/NUM_PE];
void PixelToPixelOp(E_IMG in[HEIGHT][WIDTH], E_IMG out[HEIGHT][WIDTH], int (*op)()) {
  for (y = 0; y < HEIGHT; y++)
    for (s = 0; s < WIDTH/NUM_PE; s++)
      out[y][s] = op(in[y][s]);
}

// User-defined operation:
int binarization(int data) { return (data > 128); }

// User skeleton instantiation function:
PixelToPixelOp(in, out, &binarization);
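To make the master-slave distribution behind such a skeleton concrete, here is a hypothetical MPI sketch of a PixelToPixelOp-style implementation. This is not the thesis's implementation; MPI is used only because the related environment [70] uses it, and the even row split is an assumption for brevity. Rank 0 acts as the master/host.

#include <mpi.h>

#define HEIGHT 240
#define WIDTH  320

static int binarization(int data) { return data > 128; }

int main(int argc, char **argv) {
  static unsigned char in[HEIGHT][WIDTH], out[HEIGHT][WIDTH];
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int rows = HEIGHT / size;              /* assume HEIGHT % size == 0 */
  unsigned char my_in[rows][WIDTH], my_out[rows][WIDTH];

  /* Master scatters row blocks; every process receives its share. */
  MPI_Scatter(in, rows * WIDTH, MPI_UNSIGNED_CHAR,
              my_in, rows * WIDTH, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

  /* Each slave applies the user operation to its own block. */
  for (int y = 0; y < rows; y++)
    for (int x = 0; x < WIDTH; x++)
      my_out[y][x] = (unsigned char)binarization(my_in[y][x]);

  /* Master gathers and reassembles the result image. */
  MPI_Gather(my_out, rows * WIDTH, MPI_UNSIGNED_CHAR,
             out, rows * WIDTH, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}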

2.3 Case study

Face recognition is one of the visual tasks which humans can do almost effortlessly, while for computers it is a difficult and challenging task [36]. The applications of face recognition are increasing in a number of domains. Recently, a rapidly growing demand has evolved for face recognition as part of surveillance systems or for user identification. Therefore, face recognition has been selected as the case study for skeletonizing an application. It also includes all types of image processing operations


Figure 2.5 Skin region in the UV spectrum. (The U and V axes each run from -127 to 127; the skin-tone region is a box in this plane.)

(low, intermediate, and high level). To recognize a face from an image, it is first necessary to separate the face from the image; then it should be recognized against a database of known faces. Therefore, the face recognition process can be divided into two parts: face detection and face recognition [25].

2.3.1 Face detection

Face detection means detecting and localizing an unknown number (if any) of faces in a given image (from a video sequence). The main part of the procedure entails segmentation, i.e., selecting the regions of possible faces in the image. This is done by color-specific selection; afterwards, the results are made more precise by removing regions that are too small and by enforcing a certain aspect ratio of the selected regions of interest (ROI).

Detecting faces in the image is done by searching for the presence of skin-tone colored pixels or groups of pixels. The pixels as delivered by the color interpolation routines from the CMOS sensor image are represented in RGB form. This is not very suitable for characterizing skin color: the components in RGB space represent not only color but also luminance, that is, the brightness of the color image as it would be displayed on a black-and-white monitor. Luminance varies from situation to situation; by going to a normalized color domain, this effect is minimized. The effect on skin detection is that when the lighting conditions change, the apparent skin-tone color also changes, which decreases the precision of the detection. To overcome this obstacle, a color domain is exploited that separates the luminance from the color. The YUV color domain [94] is suitable to this end because it separates the luminance (Y) from the true colors (UV). Y represents luminance, and U/V are the components of the color signal from which the color image can be reconstructed. The Y value can vary from 0 to 255, whereas U and V can have values from -127 to 127.
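For reference, a conversion along these lines can be sketched as follows, using the common BT.601-style coefficients; the thesis does not spell out the exact constants and scaling it uses, so treat these as illustrative.

/* Illustrative RGB -> YUV conversion with common BT.601-style
   coefficients; the exact constants and scaling used in the thesis
   may differ. Inputs r, g, b are in 0..255. */
void rgb2yuv(int r, int g, int b, int *y, int *u, int *v) {
  *y = (int)( 0.299 * r + 0.587 * g + 0.114 * b);  /* luminance, 0..255 */
  *u = (int)(-0.147 * r - 0.289 * g + 0.436 * b);  /* color difference  */
  *v = (int)( 0.615 * r - 0.515 * g - 0.100 * b);  /* color difference  */
}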


Figure 2.6 Skin-tone result.

By using the YUV color domain, not only does the detection become more reliable, but indicating the skin-tone also becomes easier, because skin-tone can now be indicated within a 2-dimensional space. To simplify the detection part, we assume the skin-tone region is a square in the UV spectrum (Figure 2.5), and every color outside this "skin box" is marked as non-face (Figure 2.6 shows skin-tone in white and non-skin-tone in black). The face (skin) detection part separates the skin-tone from the image and sends only the luminance and the coordinates of the skin to the recognition part. If an image is coded in the RGB color space, it is first converted to YUV.
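As an illustration, the per-pixel skin test can be sketched in a few lines of C. The RGB-to-YUV coefficients below are the standard ones (one common scaling, with U and V clamped to [-127, 127] to match the representation above); the skin-box bounds are hypothetical placeholders, not the thresholds used in our experiments.

typedef struct { unsigned char r, g, b; } RGB;

static int clamp(int v) { return v < -127 ? -127 : (v > 127 ? 127 : v); }

/* Standard RGB -> YUV conversion (integer approximation for Y). */
static void rgb2yuv(RGB p, int *y, int *u, int *v) {
    *y = (299 * p.r + 587 * p.g + 114 * p.b) / 1000;
    *u = clamp((int)(0.492 * (p.b - *y)));
    *v = clamp((int)(0.877 * (p.r - *y)));
}

/* Hypothetical skin box in the UV plane (illustrative bounds only). */
#define U_MIN (-40)
#define U_MAX 10
#define V_MIN 10
#define V_MAX 60

/* Returns 1 for skin-tone pixels, 0 otherwise; usable as a
   PixelToPixelOp-style operation. */
static int is_skin(RGB p) {
    int y, u, v;
    rgb2yuv(p, &y, &u, &v);
    return u >= U_MIN && u <= U_MAX && v >= V_MIN && v <= V_MAX;
}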

2.3.2 Face recognition

The next step of the process is the recognition part. In this step, an area of skin detected in the previous step is identified with respect to a face database. For this purpose, a Radial Basis Function (RBF) neural network is used [34]. The reason for using an RBF neural network is its ability to cluster similar images before classifying them (e.g., to separate a face from hands or feet) [40]. RBF-based clustering has received wide attention in the neural network community. Apart from good clustering capabilities, RBF networks have a fast learning speed and a very compact topology.


Figure 2.7 Architecture of the RBF neural network (n input nodes, m hidden nodes, o output nodes).

Architecture of the RBF neural network

An RBF neural network structure is shown in Figure 2.7. Its architecture is similar to that of a traditional three-layer feed-forward neural network. Classification using a neural network revolves around the weighting of inputs: for each non-input node, all inputs are weighted and summed, and the output is a function of this value. The input layer of this network is a set of n units, which accepts the elements of an n-dimensional input feature vector. Here, the input of the RBF neural network is the face region obtained from the face detection part. Since it is normalized to 64x72 pixels, it follows that n = 4608. The input units are fully connected to the hidden layer with m hidden nodes. Connections between the input and the hidden layers have fixed unit weights and, consequently, need not be trained. The purpose of the hidden layer is to cluster the data and decrease its dimensionality. The RBF hidden nodes are also fully connected to the output layer. The number of outputs depends on the number of people to be recognized (for example, for 100 persons, o = 100). The output layer provides the response to the activation pattern applied to the input layer. The RBF neural network is in fact a class of neural networks, depending on several parameters. The activation function (basis function) of the hidden units is defined by the distance between the input vector and a prototype vector. The activation function of RBF hidden node i is given by [40]:

$F_i(x) = G_i\left(\|x - c_i\|_2 / \sigma_i\right), \quad i = 1, 2, \ldots, m$   (2.1)


where $x$ is an $n$-dimensional input feature vector (the normalized 64x72 face), $c_i$ is an $n$-dimensional vector called the center of the RBF hidden node, $\sigma_i$ is also an $n$-dimensional vector called the width (also called radius) of the RBF hidden node, and $m$ is the number of hidden nodes. Normally, the activation function $G$ of the hidden nodes is selected as a Gaussian function with mean vector $c_i$ and variance vector $\sigma_i$ as follows:

$F_i(x) = e^{-\|x - c_i\|^2 / \sigma_i^2}, \quad i = 1, 2, \ldots, m$   (2.2)

Because the output units are linear, the response of the $k$-th output unit (among the $o$ outputs) for input $x$ is given as:

$Out_k(x) = B_k + \sum_{i=1}^{m} F_i(x) \cdot W(i,k), \quad k = 1, 2, \ldots, o$   (2.3)

where $W(i,k)$ is the connection weight of the $i$-th RBF hidden node to the $k$-th output node, and $B_k$ is the bias of the $k$-th output.

Training the RBF neural network

Training an RBF neural network consists of determining the unknown parameters of a particular RBF neural network. Generally speaking, this means determining:

1. The number of hidden nodes ($m$).
2. The centers ($c_i$) and widths ($\sigma_i$) of each basis function.
3. The output layer weights ($W(i,k)$) and biases ($B_k$).

For some algorithms, these steps are carried out separately, while for others, all parameters are found simultaneously. In addition, different techniques can be mixed and matched for training the different parameters. For our purpose, several normalized faces (64x72 pixels) are needed from the people we want to recognize, for example 30 pictures per person. To obtain a robust neural network, we include faces with different gestures (smiling, sad, frowning, etc.) and various environmental conditions (low light, high light, etc.). When this database is ready, we start training the RBF neural network.

Determining the centers $c_i$: To find the $c_i$, we use the K-means clustering algorithm [65]. The algorithm consists of the following steps (for $m$ hidden nodes):


Step 1: Choose an initial set of cluster centers ($c_1, c_2, \ldots, c_m$). They represent the centers of the hidden nodes.
Step 2: Assign each of the input images to its nearest cluster by computing the minimum Euclidean distance between each input vector and all cluster centers.
Step 3: Calculate the new cluster centers. The new $c_i$ is the average of all input vectors assigned to cluster $i$.
Step 4: If the position of any cluster center changed in Step 3, return to Step 2. Otherwise, stop.

Determining the width $\sigma_i$ of the hidden nodes: We define the adjacency relation between hidden nodes and inputs as follows:

Definition 1 (Adjacency) A hidden node is adjacent to an input if it has the minimum Euclidean distance (among all hidden nodes) to that particular input.

Using this definition, we determine the width $\sigma_i$ of hidden node $i$ as follows:

Step 1: Construct vector $v_{1i}$ for every hidden node $i$, by taking the vector corresponding to the farthest adjacent input.
Step 2: Construct vector $v_{2i}$ for every hidden node $i$, by taking the vector corresponding to the nearest non-adjacent input.
Step 3: Take as $\sigma_i$ the vector that results from taking the pair-wise maximum of the coordinates of $|v_{1i} - c_i|$ and $|v_{2i} - c_i|$. Thus:

$\sigma_i = \max(|v_{1i} - c_i|, |v_{2i} - c_i|), \quad i = 1, 2, \ldots, m$   (2.4)
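The clustering loop above can be illustrated with a minimal C sketch of one K-means pass; the array sizes (number of training faces, number of clusters) are assumed values for illustration, not those of the thesis.

#include <float.h>
#include <string.h>

#define N    3000   /* number of training faces (assumed) */
#define M    100    /* number of hidden nodes / clusters (assumed) */
#define DIM  4608   /* 64 x 72 pixels */

static double dist2(const double *a, const double *b) {
    double s = 0.0;
    for (int d = 0; d < DIM; d++) { double t = a[d] - b[d]; s += t * t; }
    return s;
}

/* One K-means pass: assign each input to its nearest center (Step 2),
   then recompute each center as the mean of its members (Step 3).
   Returns how many inputs changed cluster; iterate until 0 (Step 4). */
static int kmeans_pass(double x[N][DIM], double c[M][DIM], int assign[N]) {
    static double sum[M][DIM];
    static int    cnt[M];
    int changed = 0;
    memset(sum, 0, sizeof sum);
    memset(cnt, 0, sizeof cnt);
    for (int i = 0; i < N; i++) {
        int best = 0; double bd = DBL_MAX;
        for (int j = 0; j < M; j++) {
            double dj = dist2(x[i], c[j]);
            if (dj < bd) { bd = dj; best = j; }
        }
        if (assign[i] != best) { assign[i] = best; changed++; }
        for (int d = 0; d < DIM; d++) sum[best][d] += x[i][d];
        cnt[best]++;
    }
    for (int j = 0; j < M; j++)
        if (cnt[j] > 0)
            for (int d = 0; d < DIM; d++) c[j][d] = sum[j][d] / cnt[j];
    return changed;
}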

Determining the output layer weights: Determining the output layer weights is rather simple in comparison to the other parameters in the network. In general, the approaches to finding the output layer weights can be divided into two main categories: off-line and on-line methods. Off-line methods exploit the well-known Least Mean Squares (LMS) method, while on-line methods use either LMS or the Recursive Least Squares (RLS) algorithm [40].

Determining the number of hidden nodes: Since determining the number of hidden nodes is performed off-line, we chose a simple approach, namely full search. This means that for each number of hidden nodes between 1 and n (n being the number of input nodes) we calculate the number of errors, i.e., the number of wrongly recognized faces, and then select the network with the minimum error. Note that going beyond n does not make sense; it would increase the dimension of the input data instead of reducing it.


Figure 2.8 Region of interest for the neural network (before normalizing to 64x72).

Using the RBF neural network

For the recognition part, a skin area (ROI, Figure 2.8) is fed to the neural network input. Subsequently, the output is calculated for each person in the database. The network has one output node for each person in the database, and the output node with the maximum value indicates the recognized person. To distinguish a face from other parts of the body and from noise, we have reserved one of the outputs of the neural network [26].
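A minimal C sketch of this recognition step, evaluating Equations (2.2) and (2.3) and taking the arg-max, is given below; for brevity the width vector $\sigma_i$ is simplified to a scalar per hidden node, and all array names and sizes are illustrative assumptions.

#include <math.h>

#define DIM 4608  /* 64 x 72 input pixels */
#define M   100   /* hidden nodes (assumed) */
#define O   101   /* outputs: one per person plus one reserved for non-faces */

/* Equation (2.2): Gaussian activation of hidden node i (scalar width). */
static double hidden(const double *x, const double c[DIM], double sigma2) {
    double d2 = 0.0;
    for (int k = 0; k < DIM; k++) { double t = x[k] - c[k]; d2 += t * t; }
    return exp(-d2 / sigma2);
}

/* Equation (2.3) plus arg-max: returns the index of the strongest output. */
static int recognize(const double x[DIM], const double c[M][DIM],
                     const double sigma2[M], const double W[M][O],
                     const double B[O]) {
    double F[M];
    for (int i = 0; i < M; i++) F[i] = hidden(x, c[i], sigma2[i]);
    int best = 0; double bv = -1e300;
    for (int k = 0; k < O; k++) {
        double out = B[k];
        for (int i = 0; i < M; i++) out += F[i] * W[i][k];
        if (out > bv) { bv = out; best = k; }
    }
    return best;
}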

2.4 Skeletonization of face detection/recognition

This section shows how image processing applications can be skeletonized via a skeleton library. According to Section 2.3, face recognition can be divided into two main tasks:

• Detecting skin in the image, which can be further divided into two parts:
  - Finding the skin-tone in the image; we can map this part of the program onto low-level image processing operations, because both the input and the output of this part are images.
  - Separating the skin-tones from the image as objects, and determining the coordinates of each of these skin-tones. We map this part onto intermediate-level image processing operations, because the input is an image and the output is a set of objects (faces).


• Sending each of the skin-tones to the neural network for identification, according to the faces in the database. We map this part onto high-level image processing operations, because the input is an object and the output is the number of the recognized person.

Program 2.7 shows the C-code of face recognition. The main parts of the program (the ones which take most time) are those inside the loops; they perform the same operations for each pixel in an image or for each object (face). To exploit the data-level parallelism of this program, we use skeletons as mentioned in Section 2.2 (Program 2.5). The code can be divided into the following tasks:

• Convert color: Since in our setup the input is in RGB, the values of U and V should be calculated for each pixel in order to detect the skin-tone in the UV domain.
• Binarization: For each pixel, it should be checked whether it is within the skin-tone box or not (see Figure 2.5).
• Labeling: To separate the faces from the image, nearby skin-tone pixels should be assigned the same label.
• Neural network: The neural network recognizes the objects which are detected during labeling.

The main function of the skeletonized code is shown in Program 2.8. The first three tasks are mapped onto the first three skeletons; the neural network is mapped onto the second three skeletons.

2.5 Evaluation

The IMAP-board [31] is selected for testing the skeleton library (this board supports indirect addressing for memory). On the IMAP-board (see Figure 2.9), image processing is done in parallel on 256 processing elements (PEs), which are controlled by a control processor (more details about the IMAP-board will be given in Section 3.2). Each implemented skeleton follows a standard template (sketched after the list):

• The control processor reads the image data from the external memory.
• The control processor distributes the data between the PEs.
• After that, the PEs execute the operations specified in the skeleton.
• Finally, the control processor gathers the results from the PEs and writes them to the external memory.
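A rough C sketch of this template follows; the cp_/pe_ functions are hypothetical stand-ins for the board's control-processor and PE interfaces, not actual 1DC library calls.

typedef unsigned char E_IMG;
#define NUM_PE 256

extern void cp_read(E_IMG *img);              /* hypothetical board API */
extern void cp_scatter(E_IMG *img, int pes);
extern void pe_execute(int (*op)(int));
extern void cp_gather(E_IMG *img, int pes);
extern void cp_write(E_IMG *img);

void run_skeleton(E_IMG *in, E_IMG *out, int (*op)(int)) {
    cp_read(in);               /* 1. control processor reads external memory */
    cp_scatter(in, NUM_PE);    /* 2. distribute the data over the 256 PEs    */
    pe_execute(op);            /* 3. PEs apply the skeleton's operation      */
    cp_gather(out, NUM_PE);    /* 4. gather the results from the PEs         */
    cp_write(out);             /*    ...and write them to external memory    */
}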


Program 2.7 C-code for face recognition.

[...]

More busses ($N_{bus} > 3$) make the architecture expensive (see Equation 7.2) without much gain in performance. Figure 7.11 shows the number of cycles for different buffer sizes ($N_{I-buffer}$) when $N_{bus} = 3$. It illustrates that by increasing the buffer size to 32 instructions, we can gain almost 10% improvement. Larger buffer sizes (> 32) make the architecture expensive without much gain in performance. Other applications for future SmartCam architectures (with dynamic communication) are considered future work, to verify the efficiency of our current architecture configurations.

7.5 Conclusions

In this chapter, we have proposed two alternative extensions to the communication architecture of SIMD processors that allow dynamic indirect addressing of PEs. This is a necessary property for compensating lens distortion and other future smart camera functions that display non-linear behavior. The required architectural components, including a bus access arbiter, address comparators, a local instruction buffer, and the PEs, are designed for simplicity to constrain the area cost, as verified by detailed cost models. Architecture parameters like arbitration protocols, the number of busses, and the buffer sizes have been explored to yield an efficient configuration that supports real-time lens distortion compensation (LDC) in parallel with many other functions (a 67.8% performance improvement compared to the implementation on the IMAP-board) with an increase of less than 30% in area. This is an attractive alternative to allocating dedicated processors or using an FPGA for LDC.


Figure 7.12 Combination of RC-SIMD and DC-SIMD.

It is also possible to put delay-registers in the instruction bus of DC-SIMD. This architecture then has the same additional properties as RC-SIMD, as long as the maximum neighbor-communication distance does not exceed the number of busses in the DC-SIMD (to communicate over larger distances, the data must pass through the registers). Figure 7.12 shows a combination of Figure 5.5 and Figure 7.4.

Chapter 8

Automatic design space exploration for SmartCam

As mentioned in Chapter 1, smart cameras are surveillance-camera-sized devices with built-in intelligence. They include programmable on-board processors. Such cameras are used for image processing tasks where volume and power consumption constraints prohibit the use of a general-purpose processing platform such as a PC. Currently, most commercial smart cameras only include one processing device, such as a DSP or an FPGA. Chapter 2 shows that image processing tasks at different levels have different processing requirements, and this solution is therefore not efficient. In this chapter¹, we propose a SmartCam architecture that includes multiple types of processors, to more efficiently exploit the variation in image processing operations. Because such a heterogeneous multiprocessor system raises questions of design costs and ease of programming, we provide a design space exploration framework in which an application-specific multiprocessor smart camera architecture is automatically determined from a single program. Design space exploration (DSE) works by a guided iteration over all smart camera architectures within a certain template. This requires an appropriate processor architecture template for our application domain, an architecture-independent application program, and a fast simulation of this program on different architectures. This chapter is organized as follows: Section 8.1 proposes the SmartCam template. Section 8.2 gives the area and energy models. The SmartCam DSE framework is introduced in Section 8.3. Evaluation of this framework is performed in Section 8.4 by using the Robocup application as a case study, followed by conclusions in Section 8.5.

¹ This chapter is based on joint work with Wouter Caarls from TU Delft within the SmartCam project.


Figure 8.1 SmartCam architecture template, containing SIMD, ILP, and special-purpose processors.

8.1 SmartCam template

In the SmartCam design space exploration (DSE) environment, an application designer will be able to generate an efficient smart camera hardware configuration for his specific domain, based on his application code and various constraints such as size, cost and power consumption. However, for this approach to be feasible, it is necessary to restrict the search space by imposing an architecture template. Based on the SmartCam applications and operations (Chapter 2), the SmartCam architecture template consists of single instruction multiple data (SIMD) processors, instruction-level parallel (ILP) processors, memories, general-purpose processors, I/O to the external world, and communication peripherals (Figure 8.1). The SmartCam template can exploit parallelism along three axes: data-level parallelism (DLP), instruction/operation-level parallelism (ILP/OLP), and task-level parallelism (TLP). SIMD processors are perfectly suited to the data parallelism inherent in low-level image processing operations. ILP processors, such as very long instruction word (VLIW) and superscalar processors, can execute multiple independent instructions/operations per cycle, exploiting a more irregular level of parallelism than an SIMD. This is necessary because higher-level vision processing tasks are too irregular to execute on an SIMD. Finally, using a network of processors allows us to take advantage of the independence between different image processing tasks, or between different stages of a pipeline. The SmartCam template considers three different interconnects:


• Bus: The bus interconnect attaches all processors to a bus with a certain capacity.
• Ring: In the ring interconnect structure, all processors are attached to routing nodes, which are in turn connected in a ring-like fashion.
• Fully connected: Full interconnect means that the output of every processor is connected to the input of every other processor.

Complexity

It is obvious that the number of cycles needed for the execution of a program can be decreased by increasing the number of processors. However, increasing the number of processors also increases the area and energy consumption of the template. Therefore, reducing the number of cycles leads to an increase in area/energy and vice versa. To investigate this trade-off, we have used multi-objective optimization [30]. The set of solutions of a multi-objective optimization problem consists of all decision vectors for which the corresponding objective vectors cannot be improved in any dimension without degradation in another. These vectors are known as Pareto-optimal (a sketch of the underlying dominance test is given below).

The possible components in our design template are a 166 MHz TriMedia and a 600 MHz SimpleScalar (64-bit, 4-issue, 16 registers) for mid- and high-level operations, and two Imagine instantiations (at 150 MHz and 66 MHz), a 24 MHz Xetal, and two RC-SIMDs (at 24 MHz, with 2 or 4 ALUs) for low-level operations. The RC-SIMDs were instantiated with the same number of PEs, registers, and data-path width as the Xetal. All values were scaled to CMOS18. Table 8.1 shows the chosen template parameters and their possible values.

Description                     Values
Number of TriMedia              0, 1, 2
Number of SimpleScalar          0, 1, 2
Number of Imagine (150 MHz)     0, 1, 2
Number of Imagine (66 MHz)      0, 1, 2
Number of RC-SIMD (4 ALUs)      0, 1, 2
Number of RC-SIMD (2 ALUs)      0, 1, 2
Number of Xetal (2 ALUs)        0, 1, 2
Interconnect network            ring, bus, fully connected
Bandwidth (MB/s)                1, 2.5, 5, 10, 25, 50, 100, 250

Table 8.1 SmartCam template parameters used in the DSE.

Note that in order to run the main program and the scheduler, an instance of this design template will always contain one general-purpose processor, in our case either a TriMedia or a SimpleScalar.
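For concreteness, the dominance relation underlying Pareto optimality can be written as a small C predicate; the objective set (cycles, area, energy) matches the trade-offs discussed in this chapter, while the Design struct itself is illustrative.

/* Design point a dominates b if a is no worse in every objective
   (all minimized) and strictly better in at least one. */
typedef struct { double cycles, area, energy; } Design;

static int dominates(const Design *a, const Design *b) {
    int no_worse = a->cycles <= b->cycles &&
                   a->area   <= b->area   &&
                   a->energy <= b->energy;
    int better   = a->cycles <  b->cycles ||
                   a->area   <  b->area   ||
                   a->energy <  b->energy;
    return no_worse && better;
}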

Parameter       Description                                               Value
N_ALU           Number of ALUs per PE                                     2 and 4
N_micro-size    Micro-controller size                                     1024
N_PE            Total number of PEs                                       320
N_op            Number of operations (depends on kernel)                  -
N_instruction   Number of instructions (depends on kernel)                -
N_read-RF       Number of reads from the RF (depends on kernel)           -
N_write-RF      Number of writes to the RF (depends on kernel)            -
N_comm-op       Number of communication operations (depends on kernel)    -
I_ALU           Width of a VLIW instruction per ALU (bits)                24
b               Data width of the architecture (bits)                     16
E_w             Normalized wire propagation energy per wire track         1
E_SRAM          SRAM access energy per bit (normalized to E_w) [47]       9
E_ALU           Energy of an ALU operation (normalized to E_w)            500
E_mux2x1        Energy of a one-bit 2x1 MUX                               13
A_SRAM          Area of 1 bit of SRAM                                     2

Table 8.2 Parameters of RC-SIMD.

8.2 Area and energy model

The SmartCam DSE needs to know the area and energy of each processor in the template in order to calculate the total area and energy when searching for an efficient architecture. In the following, the area and energy of RC-SIMD are explained (similar models are used for the other SIMD architectures in the template). We derive formulas for area and energy based on a number of architecture parameters, shown in Table 8.2. This model is meant to give an area and energy estimate for the region containing the PEs, the inter-PE communication, and the micro-controller storage (program memory) of the architecture. The total energy and area of the RC-SIMD are:

$E_{total_{RC-SIMD}} = E_{micro} + N_{PE} \cdot E_{PE} + E_{inter-comm}$   (8.1)

$A_{total_{RC-SIMD}} = A_{micro} + N_{PE} \cdot A_{PE} + A_{inter-comm}$   (8.2)

The inter-PE communication unit contains the multiplexors that transfer data for right/left communication. The area of the inter-PE communication is the same as in Chapter 6, and its energy is:

$E_{inter-comm} = N_{comm-op} \cdot N_{PE} \cdot b \cdot E_{mux2\times1}$   (8.3)

$N_{comm-op}$ accounts for the number of communication operations and the distance of the communication. We assume wraparound communication. The PE is the same as


the PE in Chapter 6. The PE energy depends on the ALUs and the register file (RF):

$E_{PE} = E_{total-RF-access} + E_{execution}$
$E_{total-RF-access} = (N_{read-RF} + N_{write-RF}) \cdot E_{RF-access}$
$E_{execution} = N_{op} \cdot E_{ALU}$   (8.4)

The micro-controller provides storage for the kernels’ instructions, and sequences and issues these instructions during kernel execution. Since every PE receives the same instruction, the micro-controller size is constant as the DLP degree is increased. Even when the number of ALUs per PE increases, the code size does not change dramatically. The total number of operations remains roughly constant (assuming not too much speculative code). Therefore, the memory storage part of the micro-controller can remain constant (we do not consider the instruction decoder part).

$E_{micro} = E_{reading-instruction} + E_{sending-instruction}$
$E_{reading-instruction} = N_{instruction} \cdot N_{micro-size} \cdot N_{ALU} \cdot I_{ALU} \cdot E_{SRAM}$
$E_{sending-instruction} = N_{instruction} \cdot N_{ALU} \cdot I_{ALU} \cdot [(P_h + P_v) \cdot \sqrt{A_{PE}}] \cdot E_w$   (8.5)

$A_{micro} = A_{storage-instruction} + A_{wiring}$
$A_{storage-instruction} = N_{micro-size} \cdot I_{ALU} \cdot A_{SRAM}$
$A_{wiring} = I_{ALU} \cdot N_{ALU} \cdot [(P_h + P_v) \cdot \sqrt{A_{PE}}]$   (8.6)

We assume a 2D placement of the PEs. $P_h$ and $P_v$ are the numbers of PEs in the horizontal and vertical directions, respectively ($P_h = 20$ and $P_v = 16$). Note: the area of the interconnect network between processors is modeled using MOSIS SCMOS layout rules for minimum-width METAL1 wires in CMOS18, at 100 MHz [7].
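As an illustration, the model of Equations (8.1)-(8.6) can be coded directly; the sketch below uses the Table 8.2 values, while E_RF_ACCESS and A_PE are assumed placeholders (the thesis derives them from the Chapter 6 PE model).

#include <math.h>

#define N_PE        320
#define P_H         20
#define P_V         16
#define N_MICRO     1024
#define I_ALU       24
#define B_WIDTH     16
#define E_W         1.0
#define E_SRAM      9.0
#define E_ALU       500.0
#define E_MUX       13.0
#define A_SRAM      2.0
#define E_RF_ACCESS 50.0   /* assumed RF access energy */
#define A_PE        1.0e4  /* assumed PE area, in wire-track-squared units */

typedef struct {           /* kernel-dependent counts from Table 8.2 */
    long n_op, n_instruction, n_read_rf, n_write_rf, n_comm_op;
} Kernel;

static double rc_simd_energy(const Kernel *k, int n_alu) {
    double e_pe = (k->n_read_rf + k->n_write_rf) * E_RF_ACCESS   /* Eq. 8.4 */
                + k->n_op * E_ALU;
    double e_comm = k->n_comm_op * N_PE * B_WIDTH * E_MUX;       /* Eq. 8.3 */
    double wire = (P_H + P_V) * sqrt(A_PE);
    double e_micro = k->n_instruction * N_MICRO * n_alu * I_ALU * E_SRAM
                   + k->n_instruction * n_alu * I_ALU * wire * E_W; /* Eq. 8.5 */
    return e_micro + N_PE * e_pe + e_comm;                       /* Eq. 8.1 */
}

static double rc_simd_micro_area(int n_alu) {
    return N_MICRO * I_ALU * A_SRAM                              /* Eq. 8.6 */
         + I_ALU * n_alu * (P_H + P_V) * sqrt(A_PE);
}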

8.3 Framework

Design space exploration is the guided iteration over an architectural design space in order to optimize an objective function such as performance, power consumption or area. Our design space consists of all architectures that conform to our architectural template as described in Section 8.1. The design space exploration finds the most suitable architecture by systematically simulating and analyzing the application for different combinations of processors (different instantiations of the template). Finally, the developer can access the results (performance, energy, cost) and use them to tune his architecture. Figure 8.2 shows the DSE strategy for finding an efficient instantiation of the SmartCam template. In the following subsections, the simulation and exploration are discussed.


Figure 8.2 Design space exploration: an architecture template is iteratively instantiated based on the performance of an application.

8.3.1 Simulation

Any automated design space exploration requires a fast evaluation of many architectures. For modeling the data flows in our system, we use the Kahn Process Network (KPN) model of computation [32]. A KPN preserves the functionality of an application regardless of scheduling schemes; in other words, the output of a KPN does not depend on the execution schedule. We exploit this fact for an efficient design space exploration. Figure 8.3 shows the simulation flow. We simulate a single trace of an application. Because the functional behavior is independent of the schedule, we can create this trace on a normal workstation, and save all intermediate results. These intermediate results are used to simulate each operation individually for each processor in an architecture. Such simulations can then be cached if the processor’s micro-architecture does not change. Next, the benchmarked values are used to simulate the trace using a multiprocessor discrete event simulator. Interconnect is modeled as processors which can only copy data. The results are performance and energy figures that are used to guide the exploration.
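To make the trace-driven step concrete, the toy C sketch below replays a benchmarked trace on per-processor clocks; it deliberately ignores data dependencies and interconnect delays, and all structures and sizes are illustrative rather than taken from our simulator.

typedef struct { int op_id; int proc; } TraceEntry;

typedef struct {
    double busy_until[8];    /* per-processor clock (at most 8 procs assumed) */
    double bench_ms[64][8];  /* benchmarked time of operation o on processor p */
} Sim;

/* Replays the trace and returns the makespan in milliseconds. */
static double simulate(Sim *s, const TraceEntry *trace, int n) {
    double end = 0.0;
    for (int i = 0; i < n; i++) {
        const TraceEntry *e = &trace[i];
        double start = s->busy_until[e->proc];           /* earliest free slot */
        double fin   = start + s->bench_ms[e->op_id][e->proc];
        s->busy_until[e->proc] = fin;
        if (fin > end) end = fin;
    }
    return end;
}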

8.3.2 Exploration

Because design spaces can be huge (see Section 8.1), it is not enough simply to iterate over all possibilities; a multi-objective optimization strategy is necessary. While it is possible to use multiple runs of a single-objective optimization technique with different tradeoffs, the fact that we need a set of outputs indicates that techniques which maintain such a set during the search are more appropriate.

Figure 8.3 Simulating a SmartCam program. Simulation (1) runs on a workstation to generate a trace; (2) benchmarks each operation on each processor independently, using processor simulators; (3) uses these benchmarks and the trace to simulate the total multiprocessor architecture without actually executing the operations.

We have chosen the SPEA2 [96] evolutionary algorithm to guide the optimization. It is connected to our discrete-event simulator using the PISA framework [3]. During fitness assessment, SPEA2 (Strength-Pareto Evolutionary Algorithm 2) prefers non-dominated points and, lacking those, points which have the fewest dominators (which, in turn, dominate the fewest points). In order to preserve diversity, points which score equally on these criteria are selected based on the proximity of other points.

8.4 Case study

For testing the SmartCam framework, we have selected the Robocup application (see Chapter 2). It contains low-, mid-, and high-level image processing operations. In the next two subsections, the skeletonization of the Robocup application and the DSE used to find an efficient architecture for it are explained.

8.4.1 Skeletonization

The Robocup application can be divided into the following tasks (Figure 8.4):

• Color conversion: The first step in the Robocup application is to convert the color domain from RGB to YUV. As the classification based on the segmentation should be robust against variations in brightness, a ratio of red, green and blue would be most useful in the RGB space. However, using a ratio in RGB space would imply thresholding in a conical volume, which makes thresholding difficult to execute at high speed. In YUV,


Figure 8.4 Detailed tasks of the Robocup application.

chrominance is coded in only two of the dimensions, while the third dimension contains the intensity. This makes this color space more convenient for the application. Color conversion is a pixel-to-pixel operation (low-level), and the PixelToPixelOp skeleton can be used for it.

• Color segmentation: Each object in the Robocup application has a different color (e.g., the ball is red, the field is green, etc.). Figure 8.5 shows the color segmentation of one image. This is also a pixel-to-pixel operation.

• Run-length encoding (RLE): Because labeling objects is faster on a coded data structure, first a run-length code is calculated. This is essentially an ImageToObjectOp, which only keeps those pixels which differ from their left neighbor (see the sketch after this list).

• Labeling: To separate the objects from each other, connected pixels of the same color should be assigned the same label. To do so, the RLE is used to connect groups of pixels of the same color. To make the segmentation and labeling more robust, the center pixel of every 3x3 neighborhood of pixels gets the label that is most frequent among its neighbors. This part can use the ObjectToObjectOp skeleton.

• Object recognition: After all pixels have a label, they can be processed further into objects. E.g., as the ball is a round object, the pixels with label 'ball' are checked for forming a circle (the goal should be rectangular, etc.). Object detection is an ObjectToValueOp operation.
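The run-length encoding step mentioned above can be sketched in a few lines of C; the Run structure and the per-row interface are illustrative.

/* Keeps only pixels whose (segmented) color differs from their left
   neighbor, recording the run start and its color. */
typedef struct { int x, color; } Run;

/* Encodes one image row of 'width' segmented pixels; returns the run count. */
static int rle_row(const unsigned char *row, int width, Run *runs) {
    int n = 0;
    for (int x = 0; x < width; x++) {
        if (x == 0 || row[x] != row[x - 1]) {   /* differs from left neighbor */
            runs[n].x = x;
            runs[n].color = row[x];
            n++;
        }
    }
    return n;
}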


Figure 8.5 Color-segmented image of the Robocup application.

The skeletonized code is shown in Program 8.1. As mentioned in Section 8.3, each of these skeletons needs to be benchmarked on each (appropriate) processor in the template for the DSE. Table 8.3 shows the execution time and energy of these skeletons when running on each processor; these values are used in the DSE (Figure 8.3). The mid- and high-level operations are only implemented on ILP processors, because it is not efficient to implement them on an SIMD. Table 8.4 shows the area of each processor. Note: DC-SIMD is not part of the template because the Robocup application contains no algorithm requiring indirect communication.

Program 8.1 Skeleton code for Robocup.

while (1) {
  capture(rgb);
  PixelToPixelOp(rgb, yuv, rgb2yuv);
  PixelToPixelOp(yuv, seg, segment);
  ImageToObjectOp(seg, rle, encode);
  ObjectToObjectOp(rle, lbl, label);
  ObjectToValueOp(lbl, ball, detectball);
  ObjectToValueOp(lbl, field, detectfield);
  ObjectToValueOp(lbl, ygoal, detectygoal);
  ObjectToValueOp(lbl, bgoal, detectbgoal);
  ObjectToValueOp(lbl, robot, detectrobot);
}

Skeleton          Metric       TriMedia  SimpleScalar  Imagine   Imagine  RC-SIMD  RC-SIMD  Xetal
                                                       150 MHz   66 MHz   2 ALUs   4 ALUs
Color conversion  Time (ms)    8.00      10.30         2.59      5.89     0.40     0.26     0.72
                  Energy (mJ)  1.08      130.13        0.17      0.23     0.023    0.024    0.020
Segmentation      Time (ms)    13.79     11.10         10.46     23.78    2.92     1.44     5.60
                  Energy (mJ)  1.87      111.111       1.46      1.89     1.32     1.36     1.12
RLE               Time (ms)    6.91      6.00          -         -        -        -        -
                  Energy (mJ)  0.94      60.06         -         -        -        -        -
Labeling          Time (ms)    4.54      3.30          -         -        -        -        -
                  Energy (mJ)  0.62      33.03         -         -        -        -        -
Detectball        Time (ms)    0.91      0.80          -         -        -        -        -
                  Energy (mJ)  0.12      8.00          -         -        -        -        -
Detectfield       Time (ms)    0.68      0.60          -         -        -        -        -
                  Energy (mJ)  0.09      6.00          -         -        -        -        -
Detectygoal       Time (ms)    1.70      1.60          -         -        -        -        -
                  Energy (mJ)  0.23      16.01         -         -        -        -        -
Detectbgoal       Time (ms)    0.43      0.30          -         -        -        -        -
                  Energy (mJ)  0.06      3.03          -         -        -        -        -
Detectrobot       Time (ms)    0.62      0.20          -         -        -        -        -
                  Energy (mJ)  0.08      2.00          -         -        -        -        -

Table 8.3 Profiling each skeleton on each processor of the SmartCam template.

Processor             Area (mm2)
Xetal                 20.0
RC-SIMD (2 ALUs)      25.0
RC-SIMD (4 ALUs)      29.0
Imagine               64.0
TriMedia              31.0
SimpleScalar          18.0

Table 8.4 Area of each processor (CMOS 0.18).

8.4.2 Baseline

Because our case study is quite a small application, we can compare the results of the heuristic search to a brute-force iteration over the design space. Figure 8.6 shows the Pareto-dominated volume of the total design space. From such a graph, the user can interactively evaluate and select the most appropriate architecture.

Figure 8.6 Baseline Pareto-dominated volume for the Robocup application.

Looking at the right side of the figure, the cheapest architecture in terms of area is a single SimpleScalar processor connected to the camera by a 10 MB/s bus (point A). The architecture gets faster if the interconnect is upgraded (point B). It also becomes more power-efficient, because the SimpleScalar still consumes power while it is idle waiting for data. The next step replaces the SimpleScalar by a TriMedia processor, which is vastly more power-efficient, although a little slower (point C). Point D is the cheapest architecture to feature an SIMD processor, combining a SimpleScalar processor with a Xetal. Point E again replaces the SimpleScalar with a TriMedia, again being slower but more power-efficient. Finally, the staircase constituted by points F1, F2 and F3 contains three processors: apart from a Xetal, first two SimpleScalars, then a SimpleScalar and a TriMedia, and finally two TriMedias. Table 8.5 gives an overview of these points. Note that the differences in interconnect structure (bus, ring, fully connected) are negligible with these few devices.

8.4.3 Evaluation

We configured the SPEA2 algorithm with a population size of 50 design points and ran it for 50 generations. Figure 8.7 shows the resulting Pareto-dominated volume of a typical run, compared to the baseline. Only some minor enhancements at high area and low power are missed.

Point  # SimpleScalar  # TriMedia  # SIMD      Interconnect
A      1               0           0           10 MB/s
B      1               0           0           100 MB/s
C      0               1           0           100 MB/s
D      1               0           1 (Xetal)   100 MB/s
E      0               1           1 (Xetal)   100 MB/s
F1     2               0           1 (Xetal)   100 MB/s
F2     1               1           1 (Xetal)   100 MB/s
F3     0               2           1 (Xetal)   100 MB/s

Table 8.5 Details of some Pareto points.

Figure 8.7 Pareto-dominated volume of a typical SPEA2 run (boxes), compared to the baseline (lines).

Figure 8.8 shows the convergence rate of the SPEA2 algorithm in terms of the total dominated volume. The algorithm needs the entire 50 generations to converge to a reasonable value; generally, configurations at the far end of the area axis are discovered last. Performing a simulation of a single design point takes on the order of a quarter of a second on an AMD Opteron 242. A full run therefore takes 0.25 · 502 = 625 seconds. In this application, however, the average error between the performance prediction made by the mapper [7] and the simulation is 0.14 ms and 0.65 mJ.


Figure 8.8 Convergence of the dominated volume for the SPEA2 algorithm (10 runs). The Y-axis (dominated volume quotient, roughly 0.58-0.65) is the fraction of dominated space of a (0.10 s, 100.0 mm², 0.5 J) box; the X-axis is the generation (0-50). The plot shows the SPEA2 average and standard deviation against the baseline.

These errors are small enough to permit using prediction for the exploration, which is orders of magnitude faster than simulation. Recall, though, that the prediction works on partial process networks if the application contains data dependencies. In that case, it is not possible to base the exploration on prediction.

8.5 Conclusions

Based on the SmartCam applications and operations, we have derived the SmartCam template. The template contains different SIMD and ILP processors, and the programming of this template is based on algorithmic skeletons. We have presented the DSE framework, which finds an efficient instantiation of the template for a particular application. An example (Robocup) has shown the iterative process in which the user transforms his source code to allow parallelization, and the DSE finds an efficient configuration of the template based on performance, cost, and energy. For example, the cheapest solution (in area) contains only one SimpleScalar, the fastest solution contains one Xetal and one SimpleScalar, and a more power-efficient solution contains one TriMedia and one Xetal.


Chapter 9

Summary and conclusions

To finalize this thesis, Section 9.1 recapitulates the most important conclusions of the individual chapters. Section 9.2 then outlines further research related to this thesis.

9.1 Summary

The last few years have seen the advent of smart cameras: surveillance-camera-sized devices with on-board programmable logic. The size of the images is often very large, the processing time has to be very small, and usually real-time constraints have to be met. Therefore, there has been increasing interest in the development and use of parallel algorithms and architectures. As mentioned in Chapter 1, it is possible to exploit parallelism along five axes: data-level parallelism (DLP), operation-level parallelism (OLP), instruction-level parallelism (ILP), task-level parallelism (TLP), and parallelism in time (pipelining). In the image processing area, DLP is the most important, but OLP can also be exploited well. Chapter 2 presented and evaluated a method for introducing parallelism into SmartCam applications. The method is based on algorithmic skeletons for low-, medium- and high-level image processing operations. They provide an easy-to-use parallel programming interface. To evaluate this approach, face recognition was implemented twice on a highly parallel processing platform, once via skeletons, once directly and highly optimized. It was demonstrated that the skeleton approach is extremely convenient from a programmer's point of view, while the performance penalty of using skeletons is below 10% in our case study. Chapter 3 presented a concise overview of relevant computing architectures, which


provides the most important input for the rest of this thesis. The overview of various processor architectures briefly outlines ways to improve parallelism in SmartCam architectures. The mapping of face recognition onto a heterogeneous smart camera (INCA+) was also shown. The results showed that by tuning the application algorithms and using a proper multi-processor architecture (SIMD + VLIW), face recognition can be performed in real time, at up to 230 faces per second. Chapter 4 explored the limitations and bottlenecks of increasing support for parallelism along the OLP and DLP axes, in isolation and in combination. To scrutinize the effect of DLP and OLP in the architecture, an area model based on the number of ALUs (OLP) and the number of processing elements (DLP) was defined, as well as a performance model. Based on these models, a set of kernels of SmartCam applications has been studied to find Pareto-optimal architectures in terms of area and number of cycles via multi-objective optimization. By looking at the Pareto points in the design space, it is observed that the most interesting architecture points have 2-64 PEs, with one or two ALUs per PE and local register files. When increasing the number of PEs beyond 64, the area of the inter-PE communication unit (which is a fully connected crossbar in this study) dominates the total area. It turns out that local inter-PE communication is cheaper, but too restrictive. To overcome this communication problem, a new type of SIMD architecture was introduced in Chapters 5 and 6, called RC-SIMD, with a reconfigurable communication network. It uses a delay-line in the instruction bus, causing the accesses to the communication network to be distributed over time. This architecture requires only a very cheap communication network while performing almost the same as expensive fully connected SIMD architectures. RC-SIMD causes irregular resource conflicts; therefore, a conflict model was introduced, which existing schedulers are able to cope with. Experimental results show that RC-SIMD requires on average 21% fewer cycles than a locally connected architecture without the delay-line, while the area overhead is at most 10% compared to such an architecture. A second inter-PE communication problem is the fact that the SIMD concept does not match variable-distance communication between PEs. If a particular PE_n needs to communicate with, e.g., PE_{n+3}, all PEs need to communicate over the same distance. The lack of support for communicating pixel data over variable distances has forced designers to allocate dedicated hardware or FPGAs for this type of communication, as in compensating lens distortion and other non-linear functions. Chapter 7 proposed two alternative hardware extensions to SIMD processors that enable dynamic communication. Different numbers of busses, arbitration policies, and instruction buffer sizes were explored to yield a configuration that supports real-time lens distortion compensation in parallel with many other functions, with a 67.8% improvement in performance and an increase


of less than 30% in area compared to a straightforward locally connected SIMD, as verified by detailed area cost models of the architectural components. Finally, in Chapter 8, the SmartCam template was presented, which contains various types of SIMD and ILP processors to exploit DLP, OLP/ILP, and TLP parallelism inside SmartCam applications. It also presented a design space exploration (DSE) methodology to find an efficient architecture (instantiated from the template) for a specific application with respect to performance, energy, and area. The Robocup application was used as a case study for evaluating this DSE methodology. For example, it was demonstrated that the cheapest template contains only one SimpleScalar, the fastest template contains two SimpleScalars and one Xetal, and the most energy-efficient template contains only one TriMedia.

9.2 Future work

DC-SIMD has the potential to have an impact on the platform choice for image processing hardware. However, in order to realize this impact, an FPGA implementation will have to prove the correctness, and a feasibility study of a VLSI implementation of this architecture is needed. Furthermore, a development environment needs to be created, including compiler, mapping, and simulation tools, in order to make the processor architecture available to industry. Several research issues can be studied to extend DC-SIMD:

• Studying different numbers of busses and of PEs per segment register (see Figure 7.6).
• Exploiting out-of-order execution of instructions in the instruction buffer.
• Testing more applications, like bucket processing [73], shadowing, etc.
• Exploiting different instruction buffer sizes per PE.
• Adding more communication modes, like one-to-many (multi-casting) or many-to-one.
• Handling incoming messages (interrupts, separating communication and computation, etc.).

There are also several research issues related to this work:

• Adding more skeletons to the skeleton library.
• Building a tiny operating system for running multiple tasks.


Figure 9.1 Coarse-grain reconfigurable architecture which can support SIMD, VLIW, ... architectures.

• Testing more applications with our DSE methodology, and adding more processors to the template.
• Integrating a number of different processors is commonly the bulk of the design effort, due to differences in interfaces, designing the communication infrastructure, and validating the design. A solution is to use an FPGA, but currently available FPGAs are not sufficiently powerful and certainly not cost-effective, due to the small grain of reconfigurability. Therefore, we propose a coarse-grain reconfigurable architecture, in which a group of PEs can be configured to support SIMD, VLIW, or pipelined architectures (Figure 9.1).

References

[1] XAPP130: Using the Virtex Block SelectRAM+ Features, v1.4. http://www.xilinx.com/bvdocs/appnotes/xapp130.pdf.

[2] Anteneh Abbo and Richard Kleihorst. Smart Cameras: Architectural Challenges. In Proceedings of Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 6–13, Ghent, Belgium, September 2002. [3] S. Bleuler, M. Laumanns, L. Thiele, and E. Zitzler. PISA — a platform and programming language independent interface for search algorithms. In Carlos M. Fonseca, Peter J. Fleming, Eckart Zitzler, Kalyanmoy Deb, and Lothar Thiele, editors, Evolutionary Multi-Criterion Optimization (EMO 2003), Lecture Notes in Computer Science, pages 494–508, Berlin, 2003. Springer. [4] Matthew Bowen. Handel-C Language Reference Manual. Technical report. [5] E. Oran Brigham. The fast Fourier transform and its applications. Prentice Hall International, 1988. [6] Wouter Caarls. Testbench algorithms for SmartCam. Technical report, Delft University of Technology, The Netherlands, 2003. [7] Wouter Caarls. Automated Design of Application-Specific Smart Camera Architectures. PhD thesis, University of Delft, Delft, The Netherlands, 2007. [8] Wouter Caarls, Pieter Jonker, and Henk Corporaal. SmartCam: Devices for Embedded Intelligent Cameras. In Proceedings of PROGRESS 2002, 3rd seminar on embedded systems, pages 1–4 (CD–ROM), Utrecht, The Netherlands, 24 October 2002. [9] Wouter Caarls, Pieter Jonker, and Henk Corporaal. SmartCam Design Framework. In Proceedings of PROGRESS 2003, 4th seminar on embedded systems, pages 1–8 (CD–ROM), Nieuwegein, The Netherlands, October 2003. [10] Wouter Caarls, Pieter Jonker, and Henk Corporaal. Benchmarks for SmartCam Development. In Proceedings of Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 81–86, Ghent, Belgium, September 2003.



[11] Celoxica homepage. http://www.celoxica.com/. [12] P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. IMPACT: An architectural framework for multiple-instruction-issue processors. ACM Computer Architecture News, SIGARCH, 19(3):266–275, 1991. [13] Murray Cole. Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge, MA, USA, 1991. [14] Henk Corporaal. Transport Triggered Architectures. PhD thesis, University of Delft, Delft, The Netherlands, 1995. [15] Henk Corporaal and Pieter Jonker. SmartCam: Devices for Embedded Intelligent Cameras, STW /PROGESS project: EES.5411, May 2000. [16] Guy Cousineau and Michel Mauny. The Functional Approach to Programming with Caml. Cambridge University Press, UK, 1998. [17] W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, , N. Jayasena, U. J. Kapasi, A. Das, J. Gummaraju, and I. Buck. Merrimac: Supercomputing with streams. In SC’03, Phoenix, Arizona, USA, November 2003. [18] Koen Van Eijk, Bart Mesman, Carlos A. Alba Pinto, Qin Zhao, Marco Bekooij, Jef Van Meerbergen, and Jochen Jess. Constraint analysis for code generation: basic techniques and applications in facts. ACM Trans. Des. Autom. Electron. Syst., 5(4):774–793, 2000. [19] Embedded Microprocessor http://http://www.eembc.org/.

Benchmark Consortium.

[20] J. Fandrianto. Single chip mpeg2 decoder with integrated transport decoder for set-top box. page 469, 1996. [21] H. Farid and A.C. Popescu. Blind removal of lens distortions. Journal of the Optical Society of America, 18(9):2072–2078, 2001. [22] Hamed Fatemi, Henk Corporaal, Twan Basten, Pieter Jonker, and Richard Kleihorst. Implementing face recognition using a parallel image processing environment based on algorithmic skeletons. In Proceedings of the 10th Annual Conference of the Advanced School for Computing and Imaging (ASCI), pages 351–357, Port Zelande, The Netherlands, June 2004. ASCI, Delft, The Netherlands. [23] Hamed Fatemi, Henk Corporaal, Twan Basten, Richard Kleihorst, and Pieter Jonker. Parallelism Support in SIMD/VLIW Image Processing Architectures. In Proceedings of the 11th Annual Conference of the Advanced School for Computing and Imaging (ASCI), pages 291–296, Heijen, The Netherlands, June 2005. ASCI, Delft, The Netherlands.

References

133

[24] Hamed Fatemi, Henk Corporaal, Twan Basten, Richard Kleihorst, and Pieter Jonker. Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures. In Proceedings of Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 689–696, Antwerp, Belgium, September 2005. Springer-Verlag, Berlin, Germany, 2005. [25] Hamed Fatemi, Richard Kleihorst, Henk Corporaal, and Pieter Jonker. Real time face recognition on a smart camera. In Proceedings of Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 222–227, Ghent, Belgium, September 2003. [26] Hamed Fatemi, Hammed Ebrahim Malek, Richard Kleihorst, Henk Corporaal, and Pieter Jonker. Real-Time Face Recognition on a Mixed SIMD VLIW Architecture. In Proceedings of PROGRESS 2003, 4th seminar on embedded systems, pages 1–6 (CD–ROM), Nieuwegein, The Netherlands, October 2003. [27] Hamed Fatemi, Bart Mesman, Henk Corporaal, Twan Basten, and Pieter Jonker. Run-Time Reconfiguration of Communication in SIMD Architectures. In Proceedings of 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), pages 1–4 (CD–ROM), Rhodes Island, Greece, April 2006. IEEE Computer Society. [28] Hamed Fatemi, Bart Mesman, Henk Corporaal, Twan Basten, and Richard Kleihorst. RC-SIMD: Reconfigurable Communication SIMD Architecture for Image Processing Applications. Journal of Embedded Computing, 2(2):167– 179, 2006. [29] J.R. Fischer and J.E. Dorband. Applications of the MasPar MP-1 at NASA/Goddard. In Proceedings of COMPCON, pages 278–282, San Francisco, CA, February 1991. IEEE Computer Society. [30] V. Fonseca and P. J. Fleming. An overview of evolutionary algorithms in multiobjective optimization. Evolutionary Computation, 3(1):1–16, 1995. [31] Yoshihiro Fujita, Sholin Kyo, Nobuyuki Yamashita, and Shin’ichiro Okazaki. A 10 GIPS SIMD Processor for PC-based Real-Time Vision Applications — Architecture, Algorithm Implementation and Language Support. In In Proceedings of the 4th International Workshop of the Computer Architecture for Machine Perception, (CAMP), pages 22–32, Washington, DC, USA, October 1997. IEEE Computer Society. [32] Marc Geilen and Twan Basten. Requirements on the execution of kahn process networks. In Proceedings of Programming Languages and Systems, 12th European Symposium on Programming, ESOP, pages 319–334, Warsaw, Poland, April 2003.

134

References

[33] Patrick Gelsinger. Microprocessors for the New Millennium: Challenges, Opportunities and New Frontiers. In Proceedings of International Solid-State Circuits Conference (ISSCC), pages 22–25, San Francisco, CA, February 2001. IEEE Computer Society. [34] Javad Haddadnia, Karim Faez, and Majid Ahmadi. A Neural Based Human Face Recognition System Using an Efficient Feature Extraction Method with Pseudo Zernike Moment. Journal of Circuits, Systems, and Computers, 11(3):283–304, 2002. [35] Steve Haga, Yi Zhang, Andrew Webber, and Rajeev Barua. Reducing Code Size in VLIW Instruction Scheduling. Journal of Embedded Computing, 1(3):415–433, 2005. [36] E. Hjelmas and B.K. Loo. Face detection: a survey. Computer Vision and Image Understanding, 83:236–274, 2001. [37] Jan Hoogerbrugge. Code Generation for Transport Triggered Architectures. PhD thesis, University of Delft, Delft, The Netherlands, 1996. [38] R.M Michael Hord. The ILLIAC IV, the first supercomputer. Computer Science Press, 1982. [39] HSI Color Space Color http://www.blackice.com/colorspaceHSI.htm.

Space

Conversion.

[40] Y.H. Hu and J.N. Hwang. Handbook of neural network signal processing. CRC Press, 2002. [41] Imagine project, Stanford university. http://cva.stanford.edu/projects/imagine. [42] Imagine tools. http://cva.stanford.edu/projects/imagine/project/im arch.html. [43] Intel 8080. http://en.wikipedia.org/wiki/Intel 8080. [44] IVP. http://www.sickivp.se/sickivp/en.html. [45] Pieter Jonker and Wouter Caarls. Application driven design of embedded realtime image processing. In Proceedings of Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 1–8, Ghent, Belgium, September 2003. [46] Brucek Khailany. The VLSI Implementation and Evaluation of Area- and Energy-Efficient Streaming Media Processors. PhD thesis, Stanford University, June 2003. [47] Brucek Khailany, William Dally, Scott Rixner, Ujval Kapasi, John Owens, and Brian Towles. Exploring the vlsi scalability of stream processors. In Proceedings of the Ninth Symposium on High Performance Computer Architecture (HPCA), pages 153–164, Anaheim, California, USA, February 2003. IEEE Computer Society.

References

135

[48] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, and Andrew Chang. Imagine: Media Processing with Streams. IEEE Micro, 21(2):35–46, April 2001. [49] Richard Kleihorst, Harry Broers, Hammed Ebrahim Malek, Hamed Fatemi, Henk Corporaal, and Pieter Jonker. An SIMD-VLIW Smart Camera Architecture for Real-Time Face Recognition. In Proceedings of ProRISC 2003, pages 1–7 (CD–ROM), Veldhoven, The Netherlands, November 2003. [50] R.P. Kleihorst, A.A. Abbo, A. van der Avoird, M.J.R. Op de Beeck, and L. Sevat. Xetal: A Low-Power High-Performance Smart Camera Processor. In IEEE Int. Symposium on Circuits and Systems (ISCAS), pages 215–218, Sydney, NSW, Australia, May, 2001. IEEE Computer Society. [51] E. Komen. Low-level Image Processing Architectures. PhD thesis, University of Delft, Delft, The Netherlands, 1990. [52] Z. Koutsogianni. The use of xetal for depth estimation from scenes observed with a stereo camera. Technical report, Philips Research Labs, Eindhoven, The Netherlands, 2002. [53] David J. Kuck. A Survey of Parallel Machine Organization and Programming. ACM Computing Surveys, 9(1):29–59, 1977. [54] S. Kyo and K. Sato. Efficient Implementation of Image Processing Algorithms on Linear Processor Arrays using the Data Parallel Language 1DC. In IAPR Workshop on Machine Vision and Applications (MVA), pages 160–165, Tokyo, Japan, November 2006. [55] Sholin Kyo. A 51.2GOPS Programmable Video Recognition Processor for Vision based Intelligent Cruise Control Applications. In In Proceedings of the 2002 IAPR Workshop on Machine Vision Applications), pages 632–635. International Association for Pattern Recognition,, December 2002. [56] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. Mediabench: A tool for evaluating and synthesizing multimedia and communicatons systems. In International Symposium on Microarchitecture, pages 330–335, 1997. [57] Hammed Ebrahim Malek. H-box. Technical report, Philips research, Eindhoven, The Netherlands. [58] Sanu Mathew, Ram K. Krishnamurthy, Mark A. Anders, Rafael Rios, and K. Soumyanath. Sub-500-ps 64-b ALUs in 0.18 SOI/Bulk CMOS: Design and Scaling Trends. IEEE Journal of Solid-State Circuits, 36(11):1636–1646, November 2001.

136

References

[59] Peter Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, USA, 2001. [60] Sreejith Menon and Priti Shankar. Space/time tradeoffs in code compression for the TMS320C62x processor . Technical report, Indian Institute of Science, India, 2004. [61] Bart Mesman. Constraint Analysis for DSP Code Generation. PhD thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, 2001. [62] Bart Mesman, Hamed Fatemi, Henk Corporaal, and Twan Basten. DynamicSIMD for lens distortion compensation. In Proceedings of the 17th IEEE Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 261–264, Steamboat Springs, Colorado, USA, September 2006. IEEE Computer Society. [63] Subhasish Mitra, LaNae J. Avra, and Edward J. McCluskey. Efficient multiplexer synthesis techniques. IEEE Design and Test of Computers, 17(4):90– 97, 2000. [64] Matthijs Molen and Sholin Kyo. Documentation for the IMAP-VISION image processing card and the 1DC language. Technical report, NEC Incubation Center, Kawasaki, Japan, 1999. [65] J. Moody and C. Darken. Fast Learning in Networks of Locally-Tuned Processing Units. Technical Report YALEU/DCS/RR-654, Dept. of Computer Science, Yale University, New Haven, CT, 1989. [66] Gordon E. Moore. The microprocessor: Engine of the technology revolution. Communication of the ACM, 40(2):112–114, 1997. [67] Sebastien Mouy. XTC: Language for programming Xetal. Technical report, Philips Research Labs, Eindhoven, The Netherlands, 2004. [68] Moving Picture Experts Group. http://en.wikipedia.org/wiki/MPEG. [69] MPI. http://www.mpi-forum.org/docs/docs.html. [70] Cristina Nicolescu. Embedding data and task parallelism in image processing applications. PhD thesis, University of Delft, Delft, The Netherlands, 2003. [71] Cristina Nicolescu and Pieter Jonker. EASY PIPE - An EASY to use Parallel Image Processing Environment based on algorithmic skeletons. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS), page 114, San Francisco, U.S.A., April 2001. IEEE Computer Society.


137

[72] Cristina Nicolescu and Pieter Jonker. A Data and Task Parallel Image Processing Environment. In Proceeding of the 8th European Parallel Virtual Machine and Message Passing Interface (PVM/MPI), volume 2131 of Lecture Notes in Computer Science, pages 393–408, Greece, September 2001. Springer. [73] Eddy Olk. Distributed Bucket Processing. PhD thesis, University of Delft, Delft, The Netherlands, 2001. [74] PA-RISC. http://en.wikipedia.org/wiki/PA-RISC. [75] Veena Parashuram. Low-level Algorithm Mapping on SmartCam Architectures. Technical report, Philips Centre for Industrial Technology, Eindhoven, The Netherlands, 2005. [76] Pentium. http://en.wikipedia.org/wiki/Pentium. [77] Peter Clarke. CCD advocate Philips turns to CMOS image sensors. http://www.eetimes.com/story/OEG20000228S0037, 2000. [78] PowerPC. http://en.wikipedia.org/wiki/PowerPC. [79] S. Purcell. The impact of Mpact 2. 15(2):102–107, 1998.

IEEE Signal Processing Magazine,

[80] S. Rathnam and G. Slavenburg. An architectural overview of the programmable multimedia processor, tm-1. In Compcon ’96. ’Technologies for the Information Superhighway’ Digest of Papers, pages 319–326, February 1996. [81] Scott Rixner. Stream Processor Architecture. Kluwer Academic Publishers, Boston, MA, 2001. [82] Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Ujval J. Kapasi, and John D. Owens. Register Organization for Media Processing. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), pages 375–386, Toulouse, France, January 2000. IEEE Computer Society. [83] R.L. Lagendijk and P.J. van Vliet,. http://www.cactus.tudelft.nl, 2002.

CACTUS impulse research project,

[84] E. Roza. Systems-on-chip: what are the limits? Electronics & Communication Engineering Journal, 13(6):249–255, 2001. [85] T. Sakurai and A.R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverterdelay and other formulas. IEEE Journal of Solid-State Circuits, 25(2):584–594, 1990.

138

References

[86] R. Sasanka, M. Li, S. V. Adve, Y.-K. Chen, and E. Debes. ALP: Efficient Support for All Levels of Parallelism for Complex Media Applications. Technical report, University of Illinois at Urbana-Champaign, UIUCDCS-R-2005-2605, July 2005. [87] F. Serot, D. Ginhac, and J. Derutin. Skipper: A skeleton-based programming environment for image processing applications. In Proceeding of the 5th International Conference on Parallel Computing Technologies, pages 296–305, St. Petersburg, Russia, September 1999. Springer-Verlag, London, UK. [88] Silicon Hive. http://www.siliconhive.com. [89] Sony. http://support.sony-europe.com/aibo/index.asp. [90] Y. Sumi, S. Obote, N. Kitai, R. Furuhashi, Y. Matsuda, and Y. Fukui. Pll frequency synthesizer with an auxiliary programmable divider. In Proceedings of the International Conference on ISCAS (2), pages 532–536, Orlando, Florida, USA, May 1999. IEEE. [91] T.H Szymanski, Honglin Wu, and A. Gourgy. Power complexity of multiplexer-based optoelectronic crossbar switches. Very Large Scale Integration (VLSI) Systems, IEEE Transactions, 13:604–617, May 2005. [92] The RoboCup Federation. Official website. http://www.robocup.org/. [93] TriMedia Technologies. http://www.semiconductors.philips.com. [94] YUV Color. http://softpixel.com/˜cwright/programming/colorspace/yuv. [95] S. A. Zenios and R. A. Lasken. The connection machines CM-1 and CM-2: solving nonlinear network problems. In Proceedings of the 2nd International Conference on Supercomputing (ICS), pages 648–658, Saint Malo, France, 1988. [96] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Technical Report 103, Gloriastrasse 35, CH-8092 Zurich, Switzerland, 2001.

Summary

Processor Architecture Design for Smart Cameras

Many networked embedded systems combine sensing through cameras with processing to achieve certain communication, measurement or control goals. Video camcorders, web cameras and video phones are examples of products in which the combination of image sensing, digital storage and transmission is penetrating the mass electronics market. Other applications can be found in inspection, surveillance and robotics. Many of these applications easily require tens of billions of arithmetic operations per second of sustained performance, while at the same time facing tight power constraints. These requirements make the design very challenging. Often, digital signal processors or general-purpose microprocessors are used for these applications, but the field of image processing allows for many architectural optimizations, such as the use of single-instruction multiple-data (SIMD) processors for pixel-level operations, and instruction-level parallelism (ILP) processors for feature extraction and object-based operations. In this dissertation, we foresee a further integration, resulting in a combination of one or more sensors, SIMD processors and ILP processors: a low-cost smart camera, the so-called SmartCam.

Constraints such as processing speed, power consumption and cost vary widely between applications, so there is no single solution that fits all needs. We are interested in quantifying the design flow of application-specific smart cameras through simulation and analysis in a design space exploration (DSE) environment, and in the development of an intuitive programming model. It is far from obvious what the right architectural parameters are for a given application domain, and there are many of them: the number of processing elements (PEs) per SIMD processor, the number of SIMD processors, the number of ILP processors, the inter-PE communication organization, the number of arithmetic logic units (ALUs) in each PE, and so on. To find appropriate values for these parameters, we propose a DSE framework that searches for an efficient SmartCam architecture with respect to constraints such as area, performance and energy.
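As an illustration of the pruning step in such an exploration, the following C++ sketch keeps only the Pareto-optimal configurations among a set of candidates. The configuration fields and all cost numbers are hypothetical; the actual framework obtains its cost estimates from simulation and analysis.

#include <iostream>
#include <vector>

// One candidate SmartCam configuration with its estimated costs.
// The fields and all numbers in this sketch are hypothetical.
struct DesignPoint {
    int numPEs;     // processing elements per SIMD processor
    int numILPs;    // number of ILP processors
    double area;    // estimated area (mm^2)
    double energy;  // estimated energy per frame (mJ)
    double cycles;  // estimated cycles per frame
};

// p dominates q if p is no worse in every objective and better in at least one.
bool dominates(const DesignPoint& p, const DesignPoint& q) {
    bool noWorse = p.area <= q.area && p.energy <= q.energy && p.cycles <= q.cycles;
    bool better  = p.area <  q.area || p.energy <  q.energy || p.cycles <  q.cycles;
    return noWorse && better;
}

// Keep only the Pareto-optimal design points.
std::vector<DesignPoint> paretoFront(const std::vector<DesignPoint>& pts) {
    std::vector<DesignPoint> front;
    for (const DesignPoint& p : pts) {
        bool dominated = false;
        for (const DesignPoint& q : pts)
            if (dominates(q, p)) { dominated = true; break; }
        if (!dominated) front.push_back(p);
    }
    return front;
}

int main() {
    std::vector<DesignPoint> candidates = {
        { 64, 1, 10.0, 2.0, 4.0e6},
        {128, 1, 18.0, 2.6, 2.4e6},  // larger but faster: also Pareto-optimal
        { 64, 2, 14.0, 2.3, 4.1e6},  // dominated by the first candidate
    };
    for (const DesignPoint& d : paretoFront(candidates))
        std::cout << d.numPEs << " PEs, " << d.numILPs
                  << " ILP processor(s), area " << d.area << " mm^2\n";
    return 0;
}

In a real exploration, the candidate set would contain thousands of simulated configurations, and a multi-objective search heuristic would generate candidates rather than enumerating them exhaustively.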

As a programming model for SmartCam solutions, we propose a framework based on algorithmic skeletons. An algorithmic skeleton implements an image processing operation for a specific SmartCam architecture, hiding the parallelism from the programmer. Algorithmic skeletons provide ease of programming and code portability at the cost of only a small performance loss; a minimal sketch of such a skeleton is given below, following the discussion of SIMD communication.

As mentioned, SIMD architectures can be very efficient for image processing applications. However, one of the problems in current SIMD processors is efficient inter-PE communication. Often the PEs of a SIMD processor are only locally connected (LC-SIMD), which may result in a communication bottleneck when many communication operations are needed. One way to solve this is to use a fully connected communication network between the PEs (FC-SIMD). However, this solution leads to excessive communication area cost, low communication network utilization, and scalability problems. In this thesis, we introduce a new type of SIMD architecture, called RC-SIMD, with a run-time reconfigurable communication network. It uses a delay line in the instruction bus, causing accesses to the communication network to be distributed over time. This architecture requires only a very cheap communication network (the area overhead is about 10-12% compared to LC-SIMD), while performing much better than LC-SIMD and often as well as expensive FC-SIMD architectures.

An additional communication problem is that the SIMD concept does not match variable-distance communication between PEs: if a particular PE needs to communicate with another PE at a certain distance, all PEs have to communicate over that same distance. Therefore, traditional SIMD processors cannot efficiently implement certain applications, such as lens distortion compensation. In this thesis, we consider two variants of the SIMD communication infrastructure that enable dynamic-distance communication of pixel data (called DC-SIMD). The results show that variable-distance communication can be achieved at a reasonable area cost of about 30%, with a substantial performance improvement (67.8% for lens distortion compensation). For certain algorithms, DC-SIMD processors thus provide a good alternative to ILP or general-purpose processors.
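The algorithmic-skeleton idea announced above can be made concrete with a small sketch. The fragment below shows a hypothetical pixel-to-pixel skeleton written as an ordinary C++ function template; the names and the sequential loop are illustrative only, not the actual SmartCam programming interface, whose backends would distribute the loop over the SIMD PEs.

#include <cstddef>
#include <vector>

// A toy grey-scale image: width x height pixels, row-major.
struct Image {
    int width, height;
    std::vector<unsigned char> pix;
    Image(int w, int h) : width(w), height(h),
        pix(static_cast<std::size_t>(w) * static_cast<std::size_t>(h)) {}
};

// Pixel-to-pixel skeleton: applies 'op' independently to every pixel.
// On a SmartCam backend this loop would be distributed over the SIMD
// PEs; the application programmer supplies only the per-pixel operation.
template <typename Op>
Image pixelToPixel(const Image& in, Op op) {
    Image out(in.width, in.height);
    for (std::size_t i = 0; i < in.pix.size(); ++i)  // data-parallel on SIMD hardware
        out.pix[i] = op(in.pix[i]);
    return out;
}

int main() {
    Image frame(640, 480);
    // Binarization written against the skeleton: the same source could be
    // retargeted, unchanged, to any platform that implements the skeleton.
    Image binary = pixelToPixel(frame, [](unsigned char p) -> unsigned char {
        return p > 128 ? 255 : 0;
    });
    (void)binary;  // suppress unused-variable warning in this sketch
    return 0;
}

Because only the skeleton implementation differs per architecture, the application code stays portable across SmartCam instances, which is the source of the portability claimed above.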

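The variable-distance communication that DC-SIMD enables can likewise be illustrated by a sequential emulation of one scanline gather, in which every PE reads from a source position that depends on its own coordinate. The cubic distortion model and the constant k below are illustrative assumptions, not the lens model used in the thesis.

#include <cmath>
#include <cstdio>
#include <vector>

// Sequential emulation of one DC-SIMD gather over a scanline: every PE i
// reads its pixel from a source position that differs per PE. The cubic
// distortion model and the constant k are illustrative assumptions.
std::vector<float> correctScanline(const std::vector<float>& in, double k) {
    const int n = static_cast<int>(in.size());
    const double c = (n - 1) / 2.0;           // scanline centre
    std::vector<float> out(n, 0.0f);
    for (int i = 0; i < n; ++i) {             // conceptually: all PEs at once
        double r = (i - c) / c;               // normalised distance from the centre
        int src = static_cast<int>(std::lround(i + k * r * r * r * c));
        if (src >= 0 && src < n)              // clip reads that fall off the line
            out[i] = in[src];
    }
    return out;
}

int main() {
    std::vector<float> line(320);
    for (int i = 0; i < 320; ++i)
        line[i] = static_cast<float>(i);      // pixel value = its position
    std::vector<float> corrected = correctScanline(line, 0.05);
    std::printf("pixel 10 now reads from position %.0f\n", corrected[10]);
    return 0;
}

An LC-SIMD implementation of the same step would need as many single-step shift instructions as the largest occurring offset; avoiding that cost is exactly what the dynamic-distance network provides.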
Samenvatting

Ontwerp van Processorarchitecturen voor Intelligente Camera's

Veel netwerken van embedded systemen combineren het waarnemen van de omgeving via camera's met beeldverwerking om bepaalde communicatie-, meet- of aansturingsdoelen te halen. Draagbare videorecorders, webcams en videotelefoons zijn voorbeelden van producten waarbij de combinatie van beeldopname, digitale opslag en communicatie de massaproductiemarkt voor elektronica betreedt. Andere toepassingen kunnen worden gevonden in industriële inspectie, bewaking en robotica. Veel van deze toepassingen hebben meer dan tientallen miljarden berekeningen per seconde nodig, terwijl ze ook strikte beperkingen hebben op het energieverbruik. Deze eisen zorgen ervoor dat het ontwerp zeer uitdagend is. Vaak worden digitale signaalprocessoren of gewone microprocessoren gebruikt voor deze toepassingen. Echter, voor het specifieke doel van beeldverwerking zijn veel optimalisaties mogelijk, zoals het gebruik van enkele-instructie-meervoudige-data (single-instruction multiple-data, SIMD) processoren voor pixeloperaties, en instructieparallelle (instruction-level parallel, ILP) processoren voor kenmerkherkenning en objectgebaseerde operaties. In dit proefschrift voorzien we een verdere integratie, resulterend in een combinatie van sensor, SIMD-processoren en ILP-processoren. Het resultaat is een goedkope intelligente camera (SmartCam).

Beperkingen zoals snelheid, energieverbruik en kosten zijn erg afhankelijk van de toepassing, en daarom is er geen oplossing die alle behoeften tegelijkertijd kan vervullen. We zijn geïnteresseerd in het kwantificeren van het ontwerptraject van intelligente camera's die gericht zijn op één bepaalde toepassing, daarbij gebruikmakend van ontwerpruimteverkenning (design space exploration, DSE) en een intuïtieve programmeeromgeving. Het is onduidelijk wat de juiste architectuurparameters zijn voor een bepaald toepassingsdomein. Er zijn veel van dergelijke parameters, zoals het aantal verwerkingseenheden (processing elements, PE's) in SIMD-processoren, het aantal SIMD-processoren, het aantal ILP-processoren, de structuur van het communicatienetwerk, de functies van elke verwerkingseenheid, etc.


Om de juiste waarden voor deze parameters te vinden, presenteren we een DSE-aanpak, zodat we een efficiënte SmartCam-architectuur kunnen vinden die voldoet aan alle eisen wat betreft grootte, snelheid en energieverbruik.

Voor het schrijven van SmartCam-toepassingen presenteren we een programmeermodel gebaseerd op algoritmische skeletten. Een algoritmisch skelet implementeert een bepaald type beeldverwerkingsoperatie voor een SmartCam-architectuur, waarbij het parallellisme wordt verborgen voor de programmeur. Door het gebruik van algoritmische skeletten is het programmeren makkelijk en is het programma eenvoudig om te zetten naar verschillende architecturen, terwijl er maar weinig snelheid ingeleverd wordt.

Zoals genoemd zijn SIMD-processoren zeer efficiënt voor beeldverwerkingsapplicaties. Echter, een van de problemen met huidige SIMD-processoren is efficiënte communicatie tussen de PE's. Vaak zijn de PE's alleen lokaal verbonden (LC-SIMD). Dit kan een communicatieknelpunt tot gevolg hebben als er veel communicatie nodig is. Eén oplossing is om gebruik te maken van volledig verbonden PE's (FC-SIMD), maar dit leidt tot overmatig gebruik van chipoppervlak, een lage benuttingsgraad en schalingsproblemen. In dit proefschrift introduceren we een nieuw type SIMD-architectuur, RC-SIMD genoemd, met een run-time herconfigureerbaar communicatienetwerk. RC-SIMD maakt gebruik van een vertragingsregister in de instructiebus, zodat het gebruik van het communicatienetwerk over de tijd verspreid kan worden. Deze architectuur gebruikt slechts een zeer goedkoop communicatienetwerk (het oppervlak is ongeveer 10-12% meer dan dat van een LC-SIMD), terwijl het veel beter presteert dan een LC-SIMD, vaak zelfs even goed als een FC-SIMD.

Een ander probleem voor de communicatie tussen PE's is dat SIMD niet goed overweg kan met variabele communicatieafstanden tussen PE's. Als een bepaalde PE met een andere PE op een bepaalde afstand informatie uit moet wisselen, moeten alle PE's over die afstand communiceren (vanwege het SIMD-model). Daarom kan een traditionele SIMD-processor bepaalde toepassingen, zoals lenscorrectie, niet efficiënt uitvoeren. In dit proefschrift beschouwen we twee varianten van de communicatie-infrastructuur van SIMD-processoren die wél communicatie over variabele afstand mogelijk maken (DC-SIMD genoemd). De resultaten laten zien dat communicatie over variabele afstand mogelijk is met een acceptabele hoeveelheid extra oppervlak, rond de 30%, en een substantiële verbetering van prestaties (67.8% voor lenscorrectie). DC-SIMD-processoren kunnen voor bepaalde algoritmen dus een aantrekkelijk alternatief zijn in vergelijking met ILP- of standaardmicroprocessoren.

Curriculum Vitae

Hamed Fatemi was born on January 2, 1977 in Tehran, Iran. In 1994, he graduated from Nikan High School in Mathematics and Physics. In September 1998, he received his Bachelor's degree in Electronics Engineering from the University of Tehran, Iran. In September 2001, he received his Master's degree in Electrical Engineering (Telecommunication Systems) from Khajeh Nasiredin University of Technology. In January 2000, he was awarded first place in the Kharazmi Festival for Iranian Research Innovations for the design of a Base Transceiver Station (BTS). Since August 2002, he has been a Ph.D. student in the Electrical Engineering Department of the Eindhoven University of Technology. His research was funded by STW within the SmartCam project and has led, among other results, to several publications, two patents, and this thesis. Hamed is currently a postdoctoral researcher, continuing his research in the Electronic Systems (ES) group at the Electrical Engineering Department of the Eindhoven University of Technology.

