MORA: A New Coarse-Grain Reconfigurable Array for High Throughput Multimedia Processing

Marco Lanuzza, Stefania Perri, and Pasquale Corsonello
Department of Electronics, Computer Science and Systems, University of Calabria, Arcavacata di Rende, 87036 Rende (CS), Italy
{lanuzza, perri}@deis.unical.it, [email protected]

Abstract. This paper presents a new coarse-grain reconfigurable array optimized for multimedia processing. The system has been designed to provide dense support for arithmetic operations, wide internal data bandwidth, and efficiently distributed memory resources. All these characteristics are combined into a cohesive structure that efficiently supports a block-level pipelined dataflow, which is particularly suitable for stream-oriented applications. Moreover, the new reconfigurable architecture is highly flexible and easily scalable. Thanks to these features, the proposed architecture can be drastically more speed- and area-efficient than a state-of-the-art FPGA in executing multimedia-oriented applications.

Keywords: Reconfigurable systems, coarse-grain array, multimedia applications.

1 Introduction

Modern multimedia applications, including image processing, digital signal processing, video stream operations and others, demand high-performance computation alongside the capability of matching the rapid evolution of the algorithms. The simultaneous demand for high computational speed and flexibility makes reconfigurable architectures attractive solutions. In fact, they provide performance similar to Application Specific Integrated Circuits (ASICs), while maintaining a level of flexibility not available with more traditional custom circuitry. In reconfigurable computing, a key role is played by fine-grained Field-Programmable Gate Arrays (FPGAs). Commercially available FPGAs consist of a matrix of reconfigurable logic cells, with bit-level granularity, interacting through a very flexible programmable routing network. Thanks to this structure, FPGAs offer a high degree of on-chip parallelism; user control over low-level resource definition and allocation; and user-defined data formats represented efficiently in hardware. As a drawback, owing to the bit-level granularity, many resources have to be used to support multi-bit operations. This leads to a large routing overhead and to low silicon area efficiency of FPGA-based computing solutions. Another disadvantage of FPGAs is the large amount of configuration data needed for configuring logic cells and routing switches. This is particularly limiting in terms of required reconfiguration
S. Vassiliadis et al. (Eds.): SAMOS 2007, LNCS 4599, pp. 159–168, 2007. © Springer-Verlag Berlin Heidelberg 2007

time and power dissipation, especially when multiple hardware reconfigurations are needed during an application process [1]. Such characteristics make FPGAs too expensive or not efficient enough for supporting multimedia applications.

In order to overcome the above drawbacks, Coarse-Grain Reconfigurable Architectures (CGRAs) use multiple-bit (typically 8/16-bit) wide arithmetic-oriented processing elements (PEs) in conjunction with faster and more area- and power-efficient routing structures [1-2]. As a consequence, greater efficiency is achieved in executing arithmetic-dominant applications (such as multimedia applications) at lower power, area, and configuration time with respect to FPGAs [3]. From an architectural point of view, CGRAs can be classified as systems based on a linear array or on 2D mesh-based architectures [1]. Linear-array-based architectures, such as PipeRench [4] and RaPiD [5], aim to speed up highly regular computation-intensive applications through deep data-level pipelines. Their 1D architectural organization is particularly efficient for computations that can easily be pipelined linearly. On the contrary, it appears inappropriate for block-based applications [6], which are very common in multimedia processing. Because of their greater flexibility, 2D mesh-based architectures [6-13] have achieved greater success at both commercial and academic levels. All these systems are based on a 2D array of arithmetic-oriented functional units, but they often differ greatly in the special features provided to enhance the execution of computation-intensive applications.

In this paper a novel 2D coarse-grain reconfigurable array, called MORA (Multimedia Oriented Reconfigurable Array), is proposed. The new architecture merges some promising characteristics of previously proposed reconfigurable systems with a block-level pipelined computational data flow, resulting in a very efficient platform to support the target applications.
The remainder of the paper is organized as follows: in Section 2, an architectural overview of the new CGRA is presented; afterwards, the supported computational models are described in Section 3; examples of applications mapping are presented and compared to FPGA implementations in Section 4; finally, conclusions are given in Section 5.

2 Overview of the Proposed Architecture

As shown in Figure 1, the workhorse of the proposed architecture is a scalable 2D array of identical Reconfigurable Cells (RCs) organized in 4x4 quadrants and connected through a hierarchical reconfigurable network. In order to simplify the diagram, the interconnection scheme is not drawn; the interconnection topology will be detailed in Section 2.2. Unlike many competitors, such as [6] and [8], the proposed architecture does not use a centralized RAM system. Storage for data is partitioned among the RCs by providing each of them with an internal data memory. This solution supplies a high memory access bandwidth to efficiently support massively parallel computations, while maintaining both generality and scalability. External data exchange is managed in a centralized way by an I/O Data Controller, which can access the memory space of the RCs using standard memory interface functions (i.e. performing read and write operations), whereas the internal data

flow is controlled in a distributed manner through a handshaking mechanism which drives the interactions between the RCs. Finally, integration with external and/or embedded microprocessors and systems is guaranteed by a general I/O system interface including a Host Interface and an External Memory Interface. The former is used to manage the device configuration and I/O data transfers, whereas the latter is provided to supplement the on-chip memory when needed.

Fig. 1. The top-level architecture

2.1 The Reconfigurable Cell

The block diagram of the RC is depicted in Figure 2.

Fig. 2. The Reconfigurable Cell

The I/O interface includes two pairs of data/address input ports, two output data ports, a single output address port, a configuration port, and additional interface signals needed to synchronize communication between the RCs. As visible in Figure 2, the

162

M. Lanuzza, S. Perri, and P. Corsonello

main building elements of the circuit are: a 256x8-bit SRAM acting as an internal data buffer; an 8-bit Processing Element (PE); and a Control Unit incorporating the Configuration Memory. As explained above, MORA is intended to efficiently support high-throughput multimedia applications, in which the most frequently required operations are addition, subtraction, accumulation, multiplication and multiply-accumulation. Possible low-area and low-power PE architectures able to support some of these arithmetic operations are those presented in [14] and [15]. The latter do not appear to be the most appropriate for use in MORA, because they require two clock cycles to perform an 8x8 multiplication (thus limiting the achievable throughput) and do not support multiply-accumulation. For these reasons, the novel PE depicted in Figure 2 has been purpose-designed for MORA. The proposed PE consists of I/O registers, two 8x4-bit multipliers, an addition stage and some auxiliary logic needed for data exchange between the arithmetic blocks. This simple structure allows single-clock-cycle 8-bit operations by exploiting hardware reuse.

The Control Unit is responsible for all the RC operations. It includes a Configuration Memory containing the program that controls the RC elaboration (up to 16 vector/block instructions, loaded during the configuration phase of the system), an Instruction Counter, an Instruction Decoder, and Handshake Control Logic. The Instruction Counter is used to sequentially step through the configured instructions. The Instruction Decoder generates the configuration signals for the PE at run-time. The Addresses Generator produces input and output addressing patterns, whereas the Handshake Control Logic manages the communication between the RCs of the array. Each configured instruction defines the execution of vector/block operations on a large data stream.
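The paper does not detail how the two 8x4-bit multipliers and the 16-bit addition stage cooperate; one plausible composition, sketched below, splits the second operand into 4-bit nibbles and recombines the partial products through the addition stage (the function names are illustrative, not taken from the paper):

```python
def mul_8x4(a: int, b: int) -> int:
    """Model of one 8x4-bit multiplier (a: 8-bit operand, b: 4-bit operand)."""
    assert 0 <= a < 256 and 0 <= b < 16
    return a * b  # 12-bit partial product

def pe_multiply(a: int, b: int) -> int:
    """Hypothetical single-cycle 8x8 unsigned multiply: split b into nibbles,
    run both 8x4 multipliers in parallel, combine in the addition stage."""
    lo = mul_8x4(a, b & 0xF)          # a * b[3:0]
    hi = mul_8x4(a, (b >> 4) & 0xF)   # a * b[7:4]
    return (hi << 4) + lo             # 16-bit result
```

Such a decomposition would explain how a single-cycle 8x8 multiplication is obtained from the listed building blocks, in contrast with the two-cycle schemes of [14] and [15].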
In order to enable this feature, the instructions consist of different fields: the op_code specifies the operation code; the #ops field specifies the number of operations to be performed by the current instruction; the address descriptors specify the operand organization in memory. The address descriptors are used by the Addresses Generator to establish the appropriate memory addresses for both operands and results during the execution of a given vector/block instruction.

Each RC has two possible operative states: loading and executing. When the RC is in the loading state, packets of data can be input through one or both input ports and then stored in the internal SRAM. The latter is dual-ported, thus enabling two

Fig. 3. Functionality of the reconfigurable cell: a) feed-forward mode; b) feed-back mode; c) route-through mode; d) route-through mode (double throughput)

independent write or read operations per clock cycle. Only when all the required operands are available does the RC switch to the executing state. As illustrated in Figure 3, when the generic RC is in the executing state, it can operate in four different modes. In the feed-forward mode, the packets of data coming from the internal memory are elaborated by the PE and the produced results are dispatched to one or more RCs using one or both output data ports. In the feed-back mode, elaboration results are internally stored to be used by the same cell for future computations. Note that each RC can also be used as a route-through cell. This operation mode is particularly useful to simplify the application mapping process. The designed RC has some aspects in common with the MATRIX Basic Functional Unit [7], mostly concerning the top-level organization; on the contrary, the circuit implementation and the controlling strategy are quite different. Moreover, the proposed system strongly differs from MATRIX [7] from an architectural point of view, especially considering the array organization and the interconnection topology.

2.2 The Interconnection Topology

In order to allow the greatest applicability and expandability of the new reconfigurable array, a custom interconnection network has been designed. As shown later, the interconnection structure is highly flexible and easily scalable. Similarly to commercial FPGAs, all routing resources are static; thus the communications between the RCs are determined during the "configuration phase" of the system and cannot be changed at run-time. This choice avoids the need for a centralized routing controller, with benefits in terms of performance and area. The proposed interconnection scheme consists of a hierarchical reconfigurable network organized on two levels, each routing 8-bit data and address buses plus the needed synchronization signals.
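The two operative states and the handshake rule ("execute only once all operands are available") can be modeled with a toy state machine; this is a behavioral sketch, not the actual control logic, and the class and method names are assumptions:

```python
class ReconfigurableCell:
    """Toy model of an RC's loading/executing states and handshake rule."""

    def __init__(self, expected_operands: int):
        self.expected = expected_operands
        self.sram = []            # stands in for the 256x8-bit dual-port SRAM
        self.state = "loading"

    def load(self, packet):
        """Loading state: accept a data packet from an input port."""
        assert self.state == "loading"
        self.sram.extend(packet)
        if len(self.sram) >= self.expected:
            self.state = "executing"   # handshake: all operands available

    def execute(self, op):
        """Executing state: elaborate the buffered data (feed-forward mode),
        then return to loading for the next data frame."""
        assert self.state == "executing"
        result = [op(x) for x in self.sram]
        self.sram = []
        self.state = "loading"
        return result
```

The feed-back mode would simply write `result` back into `self.sram` instead of dispatching it; a route-through cell would use the identity function as `op`.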

Fig. 4. Interconnection topology: a) the level 1 interconnection scheme; b) the level 2 interconnection scheme

The level 1 interconnections are used within each 4x4 quadrant. As depicted in Figure 4a, these interconnections provide nearest neighbours with horizontal, vertical and diagonal connectivity. Interleaved horizontal and vertical connectivity of length two is also furnished. However, each RC can receive input data from at most two cells (one for each input port) and can send output data to at most four cells (two for each output port). Although bidirectional communication is more flexible, a unidirectional approach (with cyclic continuation at the borders) has been used to reduce area occupancy and power consumption. Data and control exchange between the quadrants is guaranteed by the level 2 interconnection scheme that, as depicted in Figure 4b, is a combination of long unidirectional buses and Programmable Bus Switches. Note that access to the global buses is allowed only to peripheral cells of the quadrants. This greatly simplifies the structure of the communication network inside the quadrants, with consequent advantages in terms of occupied area and power dissipation. It is worth noting that the adopted interconnection strategy makes the reconfigurable array easily scalable by hierarchically extending the level 2 interconnection scheme, while remaining cost- and power-efficient.
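The level 1 connectivity pattern can be sketched as a coordinate computation; the exact direction set below is an assumption based on Figure 4a, while the wrap-around ("cyclic continuation at the borders") follows the text:

```python
QUAD = 4  # quadrant side length

def level1_targets(row: int, col: int):
    """Cells reachable in one hop from (row, col) inside a quadrant:
    nearest neighbours (horizontal, vertical, diagonal) plus interleaved
    length-2 links, with coordinates wrapping modulo the quadrant size."""
    near = [(-1, 0), (1, 0), (0, -1), (0, 1),     # vertical / horizontal
            (-1, -1), (-1, 1), (1, -1), (1, 1)]   # diagonal
    interleaved = [(-2, 0), (2, 0), (0, -2), (0, 2)]  # length-2 links
    return {((row + dr) % QUAD, (col + dc) % QUAD)
            for dr, dc in near + interleaved}
```

Note that in a 4-wide quadrant the two opposite length-2 links wrap to the same cell, so the reachable set is smaller than the raw direction count; the actual per-cell fan-in/fan-out is further limited by the two input and two output ports.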

Fig. 5. Block-level pipelining of the data flow

3 The Computational Model

Applications running on MORA can achieve very high performance by exploiting parallelism at different levels. First of all, the RC's structure enables complex task execution by exploiting block-level pipelining; additionally, many parallel elaboration data flows can be mapped within several portions of the array. As illustrated in Figure 5, block-level pipelining is the natural elaboration model supported by the proposed architecture. The computation is organized in concurrently executing block-level pipeline stages, where each stage is implemented by a single RC. The generic RC(i) elaborates its internal data and produces an output data frame which is transferred at run-time to RC(i+1). Only when all the required input data are internally available can RC(i+1) start its execution phase, producing data for the subsequent processing stage. Note that RC(i)'s elaboration and RC(i+1)'s data loading are always overlapped. As a consequence, the latency LM(i+1) due to the data

loading into the memory space of RC(i+1) is always hidden by the processing latency LP(i) of the previous cell. It is important to point out that block-level pipelining achieves three key objectives. First, it is possible to maintain concurrency of each processing stage while providing correct synchronization in data exchange between the RCs; second, since control signals and elaboration data become local, higher performance can be achieved by minimizing routing; third, thanks to the distribution of data storage, a high memory bandwidth is also guaranteed.

Another important feature of the proposed architecture is the flexibility offered in balancing the computational load of the RCs involved in the elaboration. As illustrated in Figure 6, two strategies can be exploited (also simultaneously) to balance the computational load of a given RC: spatial computational load balancing, achieved via data parallelism; and temporal computational load balancing, achieved by increasing the number of block-level pipeline stages. Note that increasing the number of block-level pipeline stages introduces an additional latency LT due to the transfer of unprocessed data. This technique is always applicable, whereas exploiting spatial computational parallelism is constrained by the available input/output cell ports (only two input/output ports are available per cell).
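The throughput/latency arithmetic behind the two balancing strategies (summarized in the Figure 6 caption) can be sketched as follows; the per-stage latency values are illustrative assumptions chosen so that LP(2)/2 >> LP(1), LP(3):

```python
def pipeline_metrics(stage_latencies):
    """Block-level pipeline: the throughput period is set by the slowest
    stage; the application latency is the sum over all stages."""
    return max(stage_latencies), sum(stage_latencies)

LP1, LP2, LP3, LT = 10, 80, 10, 5  # assumed example latencies (cycles)

# a) straightforward mapping: one RC per logical stage
period_a, lat_a = pipeline_metrics([LP1, LP2, LP3])

# b) temporal balancing: split the heavy stage in two, paying LT per new stage
period_b, lat_b = pipeline_metrics([LP1, LP2 // 2 + LT, LP2 // 2 + LT, LP3])

# c) spatial balancing: two cells process half the data each, in parallel
period_c, lat_c = pipeline_metrics([LP1, LP2 // 2, LP3])
```

With these numbers, the model reproduces the caption's formulas: strategy (b) shortens the period to LP(2)/2 + LT at the cost of 2LT extra latency, while strategy (c) reaches LP(2)/2 with no latency penalty but consumes input/output ports.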

Fig. 6. Examples of computational load balancing (assuming LP(2)/2 >> LP(1), LP(3)): a) straightforward application mapping (throughput = LP(2), app. latency = LP(1) + LP(2) + LP(3)); b) temporal computational load balancing (throughput = LP(2)/2 + LT, app. latency = LP(1) + LP(2) + LP(3) + 2LT); c) spatial computational load balancing (throughput = LP(2)/2, app. latency = LP(1) + LP(2)/2 + LP(3))

4 Application Mapping Results

In order to validate the proposed architecture, a parametric software circuit emulator was designed. As benchmarks, three computationally demanding tasks belonging to our target application domain were considered. The first is the YCrCb to RGB color space conversion necessary in many video applications. The second task is 2D separable filtering, with many applications in medical imaging systems. Finally, the

third application is the 2D-DCT, which is extensively used for image and video compression purposes. For each of the considered applications, two solutions were evaluated: the first leads to a low-area implementation, whereas the second is optimized for high-throughput elaboration. In the following they are labeled LA and HT, respectively. However, owing to the high flexibility offered by the new architecture in balancing the computational load of a given elaboration, other mappings are possible to achieve the targeted resource-performance trade-off. The LA and HT implementations carried out using MORA were compared to core-generated circuits optimized for the XILINX Virtex-4 device family [16]. Implementations within a XILINX XC4VLX200 device with -11 speed grade were analyzed using the Integrated Software Environment (ISE) 7.1. Throughputs and occupied resources are summarized in Table 1. Considering that CORE Generator circuits often offer the best achievable area-speed trade-off, the comparison results demonstrate that MORA is very competitive: for all the evaluated benchmarks, the HT implementations of MORA reach higher per-cycle throughput than their FPGA counterparts.

Table 1. Resource usage/performance trade-off comparisons: MORA vs. Virtex-4 FPGA

                            MORA                                       Virtex-4 FPGA
Algorithm                   RCs (#PEs/Mem.[Kbit])   Throughput         #Slices   #Block RAMs   Throughput
                                                    [Samples/cycle]                            [Samples/cycle]
Color Space Conversion      7/14 (LA)               0.32               436       2 (36 kbit)   0.85
                            16/32 (HT)              0.95
2D separable 4x4 FIR        12/24 (LA)              0.60               440       2 (36 kbit)   0.64
                            20/40 (HT)              0.90
2D-DCT (8x8)                15/30 (LA)              0.57               786       3 (54 kbit)   0.85
                            25/50 (HT)              0.92
Comparisons with FPGAs were also made in terms of silicon area occupancy and computational time. Clock speeds achieved by the FPGA implementations were evaluated through static timing analysis. Their silicon area occupancy was instead evaluated considering that the generic Virtex-4 slice, implemented in a 90nm CMOS technology process, occupies about 3442 µm2 [17]. The silicon area occupied by the generic 18Kb block RAM in the referenced FPGA device was also estimated: an 18Kb memory module was purpose-implemented with the commercial ST 90nm CMOS technology, and it was found that the generic block RAM occupies a silicon area of about 71356 µm2. The generic RC used in MORA has also been implemented using the ST 90nm CMOS technology; a critical path delay of about 1.5 ns and an area occupancy of 37900 µm2 were measured with Synopsys Design Compiler.

Figure 7 demonstrates that MORA always exhibits lower silicon area occupancy than the FPGA. This advantage comes from the use of highly silicon-efficient domain-specific data-paths and small distributed memories, instead of fine-grained logic and relatively large embedded block memories. It is worth underlining that the data reported in Figure 7 do not include the area occupancy due to routing resources. However, as discussed in Section 2.2, MORA uses much less complex interconnection schemes than FPGAs. Therefore, it can be expected that for FPGAs the area overhead owing to routing resources is much higher than for MORA.
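The area comparison follows directly from the unit figures above; a short calculation reproduces it for one Table 1 entry (routing area excluded, as in the text):

```python
# Per-unit 90nm silicon areas reported in the paper, in um^2.
RC_AREA = 37_900      # one MORA Reconfigurable Cell
SLICE_AREA = 3_442    # one Virtex-4 slice [17]
BRAM_AREA = 71_356    # one 18Kb block RAM (ST 90nm implementation)

def mora_area(n_rcs: int) -> int:
    return n_rcs * RC_AREA

def fpga_area(n_slices: int, n_brams: int) -> int:
    return n_slices * SLICE_AREA + n_brams * BRAM_AREA

# Color space conversion (Table 1): MORA HT uses 16 RCs; the FPGA circuit
# uses 436 slices and 2 block RAMs.
print(mora_area(16))      # 606400 um^2
print(fpga_area(436, 2))  # 1643424 um^2
```

Even the larger HT mapping occupies well under half the FPGA circuit's logic-plus-memory area, consistent with Figure 7.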

Figure 8 shows that the proposed HT implementations always outperform the optimized FPGA circuits. In particular, for the color space conversion, 2D separable FIR and 2D-DCT algorithms, the circuits realized with MORA are about 6, 7 and 4.6 times faster, respectively, than their FPGA counterparts. The LA implementations are also up to 4.6 times faster than the FPGAs.

Fig. 7. Normalized Area Comparison


Fig. 8. Normalized Performance Comparison

Figure 9 shows a comparison in terms of performance per area. The HT implementation reaches the best performance-area trade-off for the color space conversion, whereas the LA implementations exhibit the best performance-area trade-off for the other two considered applications. The high-level evaluations discussed above demonstrate potential significant advantages over commercial FPGAs. Even larger benefits are expected once the new architecture is fully implemented.

Fig. 9. Normalized Performance/Area Comparison

5 Conclusions

In this paper a new coarse-grain reconfigurable array for high-throughput multimedia processing has been presented. The architecture has been evaluated in terms of performance and area occupancy for several image processing algorithms. Results demonstrate significant advantages with respect to conventional FPGA implementations.

References

1. Hartenstein, R.: A Decade of Reconfigurable Computing: a Visionary Retrospective. In: Proc. of Design, Automation and Test in Europe (DATE), Munich, Germany, March 13-16, 2001, pp. 642–649 (2001)
2. Ristimaki, T., Nurmi, J.: Reconfigurable IP blocks: a survey. In: Proc. of Int. Symp. on System-on-Chip (SoC), Tampere, Finland, November 16-18, 2004, pp. 117–122 (2004)
3. Marshall, A., Stansfield, T., Kostarnov, I., Vuillemin, J., Hutchings, B.: A reconfigurable arithmetic array for multimedia applications. In: Proc. of Int. Symp. on Field-Programmable Gate Arrays (FPGA), Monterey, California, USA, February 21-23, 1999, pp. 135–143 (1999)
4. Schmit, H., Whelihan, D., Tsai, A., Moe, M., Levine, B., Taylor, R.R.: PipeRench: A virtualized programmable datapath in 0.18 micron technology. In: Proc. of the IEEE Conf. on Custom Integrated Circuits (CICC), Orlando, Florida, USA, May 12-15, 2002, pp. 63–66 (2002)
5. Cronquist, D.C., Fisher, C., Figueroa, M., Franklin, P., Ebeling, C.: Architecture design of reconfigurable pipelined datapaths. In: Proc. of 20th Anniversary Conf. on Advanced Research in VLSI (ARVLSI), Atlanta, Georgia, USA, March 21-24, 1999, pp. 23–40 (1999)
6. Singh, H., Lee, M.-H., Lu, G., Kurdahi, F.J., Bagherzadeh, N., Filho, C.: MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers 49(5), 465–481 (2000)
7. Mirsky, E., DeHon, A.: MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources. In: Proc. of the IEEE Symp. on FPGAs for Custom Computing Machines (FCCM), Napa, California, USA, April 17-19, 1996, pp. 157–166 (1996)
8. Miyamori, T., Olukotun, K.: A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications. In: Proc. of the IEEE Symp. on FPGAs for Custom Computing Machines (FCCM), Napa, California, USA, April 14-17, 1998, pp. 2–11 (1998)
9. Veredas, F.J., Scheppler, M., Moffat, W., Bingfeng, M.: Custom implementation of the coarse-grained reconfigurable ADRES architecture for multimedia purposes. In: Proc. of 15th Int. Conf. on Field Programmable Logic and Applications (FPL), Tampere, Finland, August 24-26, 2005, pp. 24–26 (2005)
10. Elixent Ltd., http://www.elixent.com
11. Motomura, M.: A Dynamically Reconfigurable Processor Architecture. Microprocessor Forum, California, USA, October 10, 2002 (2002)
12. Baumgarte, V., Ehlers, G., May, F., Nuckel, A., Vorbach, M., Weinhardt, M.: PACT XPP - A Self-Reconfigurable Data Processing Architecture. The Journal of Supercomputing 26(2), 167–184 (2003)
13. MathStar, http://www.mathstar.com/products.html
14. Lanuzza, M., Perri, S., Margala, M., Corsonello, P.: Low-cost fully reconfigurable datapath for FPGA-based multimedia processor. In: Proc. of 15th Int. Conf. on Field Programmable Logic and Applications (FPL), Tampere, Finland, August 24-26, 2005, pp. 13–18 (2005)
15. Lanuzza, M., Margala, M., Corsonello, P.: Cost-effective low-power processor-in-memory-based reconfigurable datapath for multimedia applications. In: Proc. of Int. Symp. on Low Power Electronics and Design (ISLPED), San Diego, California, USA, August 8-10, 2005, pp. 161–166 (2005)
16. Virtex-4 User Guide, http://www.xilinx.com
17. Ebeling, C., Fisher, C., Guanbin, X., Manyuan, S., Liu, H.: Implementing an OFDM receiver on the RaPiD reconfigurable architecture. IEEE Transactions on Computers 53(11), 1436–1448 (2004)
