Dynamic Memory Access Management for High-Performance DSP Applications Using High-Level Synthesis

Bertrand Le Gal, Emmanuel Casseau, and Sylvain Huet

Abstract—Multimedia applications such as video and image processing are often characterized by a huge number of data accesses. In many digital signal processing applications, array access patterns are regular and periodic. In these cases, optimized architectures using pipelined memory access controllers can be generated. In this paper, we focus on implementing memory interfacing modules that can be automatically generated from a high-level synthesis tool and which can efficiently handle predictable address patterns as well as random ones (i.e., dynamic address computations). The benefits of balancing dynamic address computations from the datapath to dedicated computation units in the memory controller are also analyzed, as well as operator bitwidth optimization and data locality to save power and reduce latency.

Index Terms—Graph model, high-level synthesis, memory sequencer, multimedia applications.

I. INTRODUCTION

Multimedia applications such as video and image processing are often characterized by a high number of data accesses. In these data transfer and storage intensive applications, memory access is often the limiting factor for the computation speed of digital signal processing (DSP) processors. Performance is closely linked to the memory architecture (hierarchy, number of banks) together with the way data are placed and transferred [1], [2]. Memory design also has a significant impact on power consumption, which is a critical feature in embedded systems. Research based on hardware implementations of DSP algorithms under real-time constraints attempts to exploit operation parallelism by scheduling computations under cadence/latency constraints. Optimized architectures are usually obtained only for regular algorithms with no random operations. This is also true for memory architectures when memory access sequences are predictable. Nevertheless, in most multimedia applications, the entire memory access sequence is not known a priori. In fact, current research in multimedia applications aims to reduce the computation complexity of algorithms using ad hoc
Manuscript received August 7, 2006; revised April 24, 2007. Current version published October 22, 2008. B. Le Gal is with the IMS Laboratory UMR-CNRS 5218, ENSEIRB Computer and Electronic Engineering School, University of Bordeaux 1, Talence, 33405 Cedex, France (e-mail: [email protected]). E. Casseau is with the CAIRN Team, IRISA Lab., ENSSAT Engineering School, Université de Rennes 1, BP 80518, 22305 Lannion Cedex, France (e-mail: [email protected]). S. Huet is with GIPSA-Lab., CNRS, INPG, Grenoble, 38031 Grenoble Cedex, France (e-mail: [email protected]). Digital Object Identifier 10.1109/TVLSI.2008.2000821

solutions composed of conditional computations (for example, transforming the full search algorithm for block matching into a three-step search algorithm [3]). With such incompletely deterministic applications, not all memory accesses are statically known during synthesis. This prevents efficient handling of the application's repetitive memory access sequences.

This paper makes the following contributions towards incorporating memory access management in a high-level synthesis flow:
• new memory address sequencer architectures which can perform pipelined memory accesses for static and dynamic access sequences;
• an extended data-flow graph handling the required semantics, together with the transformation steps to optimize the dynamic address computation locality;
• modifications to common high-level synthesis design flows to take advantage of the sequencer.
The target architecture of the suggested design flow is first outlined. The approach allows computation-dominated applications with not-fully predictable memory access sequences to be optimized by well-known data-flow optimizations. It targets the design of hardware parts of complex systems, such as digital signal processing IP cores with specific performance constraints.

This paper is organized as follows. Section II presents work related to memory access optimizations in usual design flows and their constraints in high-level synthesis design flows. Section III presents the proposed design flow. The global circuit architecture and the new memory sequencer architecture are presented in Section IV. The graph model is detailed in Section V. The modified high-level synthesis process generating a register transfer level architecture from this model under a real-time constraint is detailed in Section VI. Experimental results presented in Section VII show that it is possible to exploit specific application and scalar optimization techniques to generate efficient architectures for not-fully deterministic applications.

Definition 1 (Deterministic Access Sequences): Deterministic access sequences are those access sequences where every data access (for read and/or write operations) is known a priori, before the execution of the application. These kinds of access sequences are also called static memory accesses in the literature, i.e., memory accesses which do not depend on the execution context.

Definition 2 (Indeterminate Access Sequences): Indeterminate access sequences are those access sequences where a part of the accesses is unknown before the execution of the application. The memory access sequences are thus composed of static accesses and dynamic accesses. Dynamic memory accesses are computed during the execution of the application. For example, in an access sequence such as T[0], T[i], T[j], even if the targeted data array T is known, the second and third accesses to this array are indeterminate because the addresses i and j have to be computed during the execution of the application.
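To make the distinction concrete, the following C fragment (illustrative only; the array name, its size, and the index expressions are not taken from the paper) mixes both kinds of accesses:

```c
/* Illustrative sketch: static versus dynamic accesses in one sequence.
 * The array T, its size, and the index expressions are hypothetical. */
#define N 64u

int example(const int T[N], unsigned a)
{
    unsigned i = (2u * a) % N;          /* dynamic address: run-time value */
    int x = T[0];                       /* static access: known a priori   */
    unsigned j = ((unsigned)x + i) % N; /* second run-time address         */
    return T[i] + T[j];                 /* dynamic (indeterminate) accesses */
}
```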


II. RELATED WORK

During the last decade, a lot of research has been done on datapath and memory architecture implementations, showing that memory architecture design and datapath design are interdependent. Each design flow produces constraints on the other, reducing the optimization range that can be applied, according to the hardware implementation order. The following two synthesis flows are usually used.
• Memory architecture is developed before the datapath architecture: memory cost (power consumption, area, etc.) is optimized, but the computation parallelism that can be exploited during datapath synthesis is generally reduced, i.e., application latency is increased.
• Datapath architecture is developed before the memory architecture: the datapath is optimized for a particular performance, but data access requirements restrict the organization and optimization of the memory. This usually leads to high data bandwidth and power-costly implementations.

A. Area and Low-Power Memory Optimization

To tackle the complexity of low-cost (area, power) memory design, researchers have worked around the traditional design flows in different ways. Several memory-related issues have been addressed, such as memory allocation [4], memory packing [5], [6], estimation [7], and selection [8]. The area and the power consumption of the memories are first minimized before datapath synthesis, providing constraints to the datapath synthesis. In the same way, Wuytack [9] presents a technique that limits the data bandwidth between the datapath and memory under timing constraints. The memory architecture is defined before the detailed scheduling, so the selected architecture has to provide sufficient memory bandwidth for the application to be scheduled within the timing budget. The subsequent memory allocation/assignment tasks [4], [10] have to generate a memory architecture that satisfies several parallel accesses without producing data access conflicts.

B. Memory and High-Level Synthesis Flow

In the context of a high-level synthesis (HLS) assisted design flow, scheduling techniques that include memory issues can be used. Some of these schedule the memory accesses [11], [12]. They include precise temporal models of memory accesses; however, they do not consider simultaneous access conflicts. In [13], memory accesses are represented as multi-cycle operations. Memory vertices are scheduled as operative vertices by considering conflicts among data accesses. This technique is used to handle off-chip memory accesses. Seo [14] performed a first scheduling on a data-flow graph; the memory accesses are then rescheduled after the memory selection and allocation to reduce the overall memory cost. This optimization is restricted by the datapath scheduling.

Fig. 1. Datapath and memory units using a sequencer based architecture for static data transfer sequences.

Park [15] and Corre [16] minimize the number of simultaneous memory accesses by considering which data are being accessed simultaneously, in order to optimize the memory access conflict graphs.

C. Datapath and Memory Unit Interfacing

1) Control-Intensive Architecture Approach: Most of the research on control-dominated applications is based on random access memories. Binary-coded addresses are decoded using built-in decoders with row and column select signals. Such memory architectures are necessary for applications where the memory access sequences are not predictable. These architectures are used in general-purpose processor designs. Memory optimizations often consist of exploiting the cache hierarchy to improve performance [17].

2) Sequencer-Based Approaches: In many digital signal processing applications, the array access patterns are predictable, regular, and periodic [18]. The order in which the accesses should occur in memory can thus be determined statically. Address patterns can be efficiently generated directly by a memory address sequencer, as presented in Fig. 1. The sequencer can be implemented using dedicated counters or shift registers, depending on the constraints [19]. In [15], streamed data applications are considered. Pipelined accesses to RAM are improved by creating specialized hardware components to generate addresses and pack and unpack data items (see Fig. 2). Similar work addressing automatic memory sequencer generation for image processing applications is presented in [20] and [21]. Index equation extraction for applications described as nested loops is used to automatically generate a dedicated datapath for all the memory access computations. It easily handles video applications described only as a nested loop, but it does not handle other applications and does not benefit from well-known scalar optimization techniques.
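As an illustration of such a sequencer, the following C model (a sketch; the block size, field names, and geometry are assumptions) mimics what two chained hardware counters would produce for a row-major scan of a block inside a wider frame:

```c
/* Sketch of a counter-based address generator: two chained counters
 * (col, row) produce the row-major addresses of a BLK x BLK block
 * located at 'base' inside a frame of width 'stride'. */
typedef struct {
    unsigned base, stride;   /* block origin and frame width */
    unsigned row, col;       /* counter state                */
} addr_gen;

#define BLK 8u

unsigned addr_gen_next(addr_gen *g)
{
    unsigned addr = g->base + g->row * g->stride + g->col;
    if (++g->col == BLK) {   /* carry from column counter to row counter */
        g->col = 0;
        if (++g->row == BLK)
            g->row = 0;      /* periodic pattern wraps around            */
    }
    return addr;
}
```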


Fig. 2. Park’s sequencer architecture for pipelined memory accesses in streamed applications.

3) Sequencer Optimizations for Low-Power Consumption: Using a sequencer architecture decouples datapath unit and memory unit synthesis and optimization. As shown in Fig. 2, only data are transferred between the memory unit and the datapath unit. Address transfers are thus reduced, as their sequences are static a priori. Custom address generators can be used to obtain optimal memory architectures from access patterns. These generators can be optimized by applying bus power-consumption minimization techniques. For example, the activity on memory address buses can be reduced by analyzing access patterns and organizing the arrays in memory by the following:
• looking at data transfers on buses in order to limit line commutation [22];
• accessing adjacent data in memory to limit the commutations on address buses [23], [24].
Various bus encoding schemes [25]–[29] have been proposed to decrease the number of transitions.

4) Sequencer Optimizations for Performance: As the memory access sequence is deterministic, some techniques for performance optimization can be applied. For example, [19] examines the impact on area and performance of memory-access-related circuitry using simplified architectures for address generation. This technique is applicable only to deterministic and regular address sequences.

D. Datapath/Memory Design Flow Conclusion

The common approaches that create a centralized memory scheduler moving data from a memory unit to the datapath are based on predictable access patterns. Our approach provides a new sequencer architecture handling hazardous memory accesses, i.e., dynamic address computations. In the next sections, we present an address decoder-decoupled memory model that separates the built-in address decoder from the memory and incorporates an address generator. This address generator considers dynamic access patterns as well as static ones. A design methodology is proposed in which the memory architecture is optimized first during the datapath scheduling process and again during access pattern sequencer generation. These techniques enable dynamically calculated addresses to be accessed in a data-flow fashion.

E. High-Level Synthesis

The increasing DSP application complexity, coupled with the time-to-market constraint, encourages designers to raise the abstraction level using automatic design tools. High-level synthesis [30], [31] is analogous to software compilation transposed to the hardware domain. The source specification is written in a high-level language (MATLAB, C, SystemC, etc.) that models the algorithmic behavior of the application to be implemented. An automatic refinement process enables the described behavior to be mapped onto a specific technology target depending on the targeted constraints. Thanks to formally proven automation algorithms, HLS tools generate an RTL architecture which respects both the designer and the system constraints and which

Fig. 3. High-level synthesis design flow.

is reliable (errorless) compared to a hand-coded design. HLS especially claims to speed up design time versus register transfer level (RTL) hand-coding. HLS is a constraint-based synthesis flow (see Fig. 3): hardware resources are selected from technology-specific libraries of components designed and characterized for a specified target. HLS can also be constrained to limit hardware complexity (i.e., the number of allocated resources) and/or to reach a given performance. The HLS refinement process follows a top-down approach: 1) source specification analysis (identifying computations); 2) hardware resource selection and allocation for each kind of operation; 3) operation scheduling; and 4) optimized architecture generation, including a datapath and a control finite-state machine. Thanks to its high level of abstraction, a behavioral description for HLS can be customized through functional parameters. Each set of supported parameters and synthesis constraints enables the HLS tool to generate a different dedicated architecture to fulfill specific functional requirements and achieve a specific performance. Many commercial and academic high-level synthesis tools can be used: Catapult-C [32] (Mentor Graphics), GAUT [33], SPARK [34], PICO-NPA [35], etc. The synthesis tool employed in our experiments is GAUT.1 GAUT is dedicated to the synthesis of signal and image processing applications under real-time execution constraints. The tool performs synthesis under latency constraints, memory mapping, and data communication consumption/production dates. Thus, it allows the designer to accurately stipulate the system interaction and constraints with the algorithm to be synthesized.

III. DESIGN FLOW

In this paper, as a case study, we propose to integrate our methodology, which handles random memory accesses in a data-flow fashion, into a design flow using a common high-level synthesis tool to generate the datapath and the memory units. Our starting point is an algorithmic description that specifies the circuit functionality at the behavioral level, disregarding any potential

1[Online]. Available: http://web.univ-ubs.fr/gaut


implementation solutions and transformation methods. The designer may also provide a memory mapping in order to specify where the data are mapped into the physical memory. The memory architecture can be defined in the first step of the overall design flow. To achieve this task, the designer can use advanced compilers such as the Rice HPF compiler, Illinois Polaris, or Stanford SUIF. Otherwise, the designer may decide to let the HLS tool freely organize a part (or all) of the data in memory under performance constraints. Our methodology can handle the two approaches independently in a unique transformation flow.

The first step of our design flow (see Fig. 4) is generating an extended data-flow graph. This step aims to handle the timing requirements for data and address transfers from one unit to the others. This transformation can be guided using the memory mapping information given by the designer if the data placement has been done before synthesis. The transfer annotations give information on the location of data memorizations (in memory or in a datapath register) and operations.

The second step applies a dynamic address computation balancing algorithm to the previously annotated graph in order to move some dynamic address computations from the datapath to the memory sequencer unit. This optimization attempts to exploit the extended sequencer architecture proposed in this paper to reallocate some address computations near the memory banks. These computation transfers improve the characteristics of the post-synthesis circuit (area and power consumption).

Finally, we use a high-level synthesis tool to generate the datapath and the memory sequencer. The HLS tool takes into account the necessary timing constraints for dynamic data addressing and data transfers using the graph annotations. Selection, allocation, and scheduling are performed on the datapath and the sequencer at the same time, without over-constraining one of the units with respect to the other. During the scheduling and binding processes, a second optimization step can be applied to unallocated data in memory (data without an a priori mapping). Section IV presents the new memory sequencer architectures targeted in this design flow.

Fig. 4. Design flow for automatic datapath and memory sequencer synthesis.

IV. TARGETED ARCHITECTURES

A. Circuit Architecture

Our methodology targets custom digital signal processors dedicated to computation-intensive applications. The targeted architecture is composed of the following three distinct units, as presented in Fig. 5:
1) the processing unit (PU) contains the datapath (registers, operators, etc.) and its controller, which performs the required computations;
2) the memory unit (MemU) manages pipelined accesses to memories, using, if necessary, preventive read accesses;
3) the communication unit (ComU) sends and receives data to and from the rest of the system.

Fig. 5. Targeted architecture.

B. Dynamic Address Sequencer Architecture

Our goal is to generate an efficient memory sequencer architecture supporting dynamic addressing capabilities in a mainly deterministic data-transfer-based architecture. An architecture allowing dynamic addressing is first presented. Then, an extended sequencer architecture able to provide internal address computation capabilities, avoiding repetitive address transfers between the datapath (PU) and the sequencer unit (MemU), is proposed. Computing addresses in the sequencer unit enables latency to be decreased and bitwidth optimizations to be applied, since addresses are generally coded with fewer bits than data in DSP applications.

1) New Memory Sequencer Architecture: With this architecture, it is assumed that all the dynamic addresses are calculated in the datapath unit and then transferred over the common data buses to the memory sequencer. The buses between these units only transmit data; the meaning of these data (addresses or computation results) depends on the time slot in which they are received or emitted. The sequencer proposed in Fig. 6 is composed of four different units: a memory access scheduler (plus the router), a dynamic address controller, an address generator, and an address translation table. The datapath access buses are connected to the memory using a multiplexed crossbar. The crossbar is controlled by the memory access scheduler, which knows the memory access sequence. The scheduler controls the address generator progress in a synchronous manner from the datapath point of view.

A dynamic address access to the memory will go through the address translation table. This table translates the logical


Fig. 9. Dynamic access sequencer with an integrated datapath for address computations.

Fig. 6. Memory unit architecture with the dynamic access sequencer.

Fig. 7. Address translation table usage.

Fig. 8. Translation table example for a multi-bank vector binding allowing parallel memory accesses.

address of the data which should be read or written into a (memory bank, physical address) pair, as shown in Fig. 7. This translation allows free pre-/post-optimizations of the memory by letting the designer (or the design tool) bind noncontiguous pieces of a vector into different memories to better exploit data-access parallelism [16], [36]. A noncontiguous vector binding for a 1-D array is presented in Fig. 8; a minimal software model of such a table is sketched below.

The dynamic access controller steers the correct command signals (read/write) and the physical address to the right memory bank according to the required dynamic access. During a dynamic access, all the memory banks where vector elements are bound are locked for potential accesses. In fact, in order to avoid memory conflicts and random accesses during the execution of the application, these memory banks have been reserved (locked) during the scheduling step of the application synthesis.2

This sequencer architecture allows dynamic memory accesses in a static sequencer-based approach. Moreover, it allows the designer to freely bind the split vectors in different memory banks, in a noncontiguous fashion, thereby better exploiting the potential computation parallelism.

2During the scheduling step, when a dynamic memory access is required, all the memory banks containing scalar data from this array are locked for this clock cycle.
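A minimal C model of such a translation table is sketched below (the entry layout, table size, and names are assumptions for illustration; in the sequencer this is a hardware lookup table, not software):

```c
/* Sketch: logical-to-physical address translation for a vector whose
 * pieces are bound, noncontiguously, into several memory banks. */
#include <stdint.h>

typedef struct {
    uint8_t  bank;   /* memory bank holding this element  */
    uint16_t phys;   /* physical address inside that bank */
} xlat_entry;

#define VEC_LEN 256u

/* One entry per logical element; filled at binding time. */
static xlat_entry xlat_table[VEC_LEN];

xlat_entry translate(uint16_t logical_addr)
{
    /* A dynamic access presents a logical address; the table returns
     * the (bank, physical address) pair the crossbar must route to. */
    return xlat_table[logical_addr % VEC_LEN];
}
```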

The sequencer architecture is limited to one dynamic access per clock cycle, and its routing controllers are centralized, but the architecture can be decomposed using a hierarchical sequencer-based approach to limit decoder complexity. In this case, the architecture becomes one of the several units which compose the overall circuit architecture. The decomposition is based on the existing relations between the memory banks: unlinked memory banks, i.e., banks without array scalar dependencies, may be managed by smaller independent sequencer units. For example, in Fig. 6, a sequencer unit can be associated with each memory bank, so that the global sequencer architecture is composed of small and independent sequencer units. This approach reduces the interconnection cost in the design by allowing better localization of the sequencer unit near the hardware operators using its data. This also helps to reduce latency and increase performance; however, it increases hardware requirements and chip area.

2) Sequencer Architecture Improvement: The architecture previously presented allows dynamic address accesses using a sequencer-based architecture approach, thus making local optimizations between this unit and the memory banks possible. The need to transfer all the memory address sequences from the datapath to the memory sequencer is reduced. These transfers are further reduced by inserting dedicated computation hardware operators in the memory sequencer. This additional hardware is dedicated to performing the dynamic address computations which were previously implemented in the datapath unit. This approach provides gains in high-performance designs: for example, in pipelined architectures, the data transfers between the datapath unit and the sequencer can take more than one clock cycle. In such applications, localizing some address computations in the sequencer provides an important latency gain by avoiding unnecessary address transfers between units. The address traffic reduction between the datapath and the sequencer decreases memory bandwidth requirements and also has an impact on line switching. In data-dominated applications, this can significantly reduce power consumption and system bus loading.

This new sequencer architecture is presented in Fig. 9. Operator and register bitwidths are adapted to the requirements, and consequently the internal datapath is optimized for address computations with respect to the datapath unit constraints. Generally, the number of bits required to code addresses (the address bitwidth) is smaller than that required to code data. Another advantage of this approach comes from the locality (in memory) of the constants or variables generally used in address computations, thus avoiding their transfer to the datapath unit. The sequencer datapath is composed of operators and registers like a conventional datapath, as shown in Fig. 10. The registers store the addresses, variables, and constants needed for dynamic address computations. The results are then transferred to the address translation table.
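The benefit of the narrow internal datapath can be sketched in C as follows (the type widths and the window geometry are illustrative assumptions; the synthesized operators would be sized exactly to the address range):

```c
/* Sketch: computing a window address on narrow operands inside the
 * sequencer, instead of on the wide PU datapath. */
#include <stdint.h>

typedef uint16_t addr_t;   /* address arithmetic: ~10 bits in hardware */
typedef int16_t  pixel_t;  /* data arithmetic: e.g., 12-bit pixels     */

#define WIN 24u            /* assumed search-window width */

addr_t window_addr(addr_t base, uint8_t dy, uint8_t dx)
{
    /* base is the dynamically selected block origin; dy/dx scan it. */
    return (addr_t)(base + dy * WIN + dx);
}
```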


Fig. 11. Read and write dynamic accesses.

Fig. 10. Dynamic address computation datapath.

These dedicated hardware resources are shared among all address computations during the execution of the application.

V. GRAPH MODEL

Because the literature lacks a graph model dealing with dynamic memory accesses, a formal graph model has been defined in order to take advantage of the new sequencer capabilities. This graph handles the behavioral semantics and the constraints to be considered for the synthesis process, i.e., computations, control, and data structures in a data-flow fashion.

Fig. 12. Behavioral code and its EDFG.

A. Computations and Data

Design synthesis is performed using an internal formal graph model based on a signal flow graph, called the extended data-flow graph (EDFG). This graph handles the data and address computations, the data transfers, and the condition statements (for computation, addressing, and transfers). This model allows data and conditional semantics to be managed in the same way to maximize the exploitation of parallelism. Well-known data-flow algorithms for the optimization and synthesis steps can be used, with potential management of the control structures to implement mutually exclusive scheduling methods [37], [38].

Definition 3 (Extended Data-Flow Graph): An EDFG is a finite, directed, weighted graph G = (V, E, δ), where V is the vertex set of computation and memorization nodes, E is the edge set representing precedence relations among the nodes, and δ(v) is a function giving the computation time of node v. A path in G is a connected sequence of nodes and edges.
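One possible (hypothetical) encoding of Definition 3, anticipating the location and condition attributes attached by the later definitions, is sketched below in C:

```c
/* Sketch: one possible in-memory encoding of an EDFG node.
 * Field names and enums are illustrative, not the tool's. */
typedef enum { OP, DATA, STRUCTURE, ADDR_READ, ADDR_WRITE, TRANSFER } node_kind;
typedef enum { PU, MEMU, COMU } unit_loc;

typedef struct edfg_node {
    node_kind          kind;
    unit_loc           loc;        /* location attribute (Section VI-A)   */
    unsigned           delta;      /* computation/transfer time delta(v)  */
    int                cond;       /* condition value (Definition 4)      */
    struct edfg_node **preds;      /* precedence edges E                  */
    struct edfg_node **succs;
    int                n_preds, n_succs;
} edfg_node;
```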

B. Dynamic Memory Accesses

Specialized nodes are used to model the static and dynamic array accesses for read and write operations. In order to handle memory access dependencies, structure nodes that model all the data contained in the concerned array are defined. A structure node is thus defined for each data structure at the beginning, disregarding its number of dimensions: it models all the scalar data of the array, from its first element to its last one, whatever the number of dimensions of the structure. These specific structure nodes are used to take into account write-after-write (WaW) and read-after-write (RaW) dependencies for dynamic accesses. After a write access to the structure, a new structure node is created (the original node is cloned and renamed) to model the structure change for the next accesses. For example, on the right-hand side of Fig. 11, the structure node is renamed after the write operation. This transformation is necessary to remove ambiguous dependency accesses for static scalar store and load operations during synthesis.

In Fig. 11, a dynamic read access to a datum whose address is unknown a priori is modeled using a read addressing node. In the same way, a dynamic write access, together with the new value to be stored, is modeled using a write addressing node. This write operation generates a new structure node representing the structure modified by the write access; the next read or write operations to the array will be performed on this new structure node. Using different access nodes for read and write operations enables different timing constraints to be exploited depending on the access nature. Pipelined access modes supplied by current RAM components may be exploited to optimize access sequences during the scheduling process of the datapath and memory synthesis.
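The renaming rule can be sketched as follows, assuming the edfg_node encoding sketched in Section V-A and hypothetical graph helpers:

```c
/* Sketch: a dynamic write clones and renames the structure node so that
 * subsequent accesses depend on the new node, preserving WaW and RaW
 * orderings. clone_node/add_edge are assumed helpers. */
edfg_node *clone_node(edfg_node *n);                 /* assumed */
void       add_edge(edfg_node *from, edfg_node *to); /* assumed */

edfg_node *model_dynamic_write(edfg_node *structure,
                               edfg_node *address,
                               edfg_node *value)
{
    edfg_node *renamed = clone_node(structure);  /* T becomes T'         */
    add_edge(structure, renamed);                /* dependency on old T  */
    add_edge(address,   renamed);                /* dynamic address node */
    add_edge(value,     renamed);                /* value to be stored   */
    return renamed;    /* later accesses to the array use this node */
}
```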

C. Conditional Structures

In order to take into account conditional memory accesses and transfers to reduce the bus commutations, conditional nodes are defined to handle conditioned operations and memorizations. The conditional node dependencies are modeled like data dependencies. This means that conditioned computations depend on the result of the computation of a condition. An example of such a graph is presented in Fig. 12. These nodes allow all the behavioral semantics of the application to be handled in a data-flow fashion with the view to optimizing the computation parallelism.

Definition 4 (Condition Value): The condition value of a node is the value of the condition under which the node has to be executed. A node is executed if and only if all its predecessor nodes have been completed and the value of its condition is true.
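Definition 4 amounts to the following execution test (a sketch; the completed() helper and the cond field are assumptions):

```c
/* Sketch: a node fires when all predecessors are done and its
 * condition value is true (unconditioned nodes have cond == 1). */
int completed(const edfg_node *n);   /* assumed helper */

int can_execute(const edfg_node *v)
{
    for (int i = 0; i < v->n_preds; ++i)
        if (!completed(v->preds[i]))
            return 0;
    return v->cond;
}
```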


D. Data Transfer Between Different Units

Using this graph model, information about the data transfer timing constraints for data moving from one unit to the others can be inserted. An example showing the data transfer handling method is illustrated in Fig. 14.

Definition 5 (Transfer Node): Transfer nodes are operation nodes that represent data transfers inside the architecture. For these nodes, the function δ models the transfer time from the predecessor data node to its successor. Each transfer node carries a source location and a destination location; more generally, every node of the graph carries a location attribute giving the unit (PU, MemU, or ComU) that hosts it in the final architecture. Condition data need to be transferred to the controllers; their source and destination locations can be refined to the unit controller (FSM). This locality refinement is used to model the necessary state register transfer in conditional computations, i.e., to model the pipeline stages for decision.

VI. SYNTHESIS PROCESS

The extended data-flow graph enables the application to be synthesized, taking into account memory constraints such as data transfers and dynamic accesses in read or write modes. The transformations to be performed in the synthesis process are now presented. Formal transformations of the graph model to optimize the locality of the dynamic address computations and reduce their impact on the final design are also addressed.

A. Node Location Annotations

The first step of the design flow for generating a dynamic address sequencer is to annotate the internal model. This step aims to handle the timing requirements for data and address transfers from one unit to the others (communication unit, memory unit, and datapath unit) by identifying the locality of each node in the architecture. This step can be guided using intrinsic information (an input to the system comes from the communication unit, etc.) or using the memory mapping information given by the designer if the data placement was done before the synthesis. In this first annotation step, all operations, including dynamic address computations, are considered as implemented in the datapath unit (PU). Their location attributes are thus set to the PU value. Using the designer's full or partial memory mapping information, the data nodes are also annotated. This step is illustrated in Fig. 13, where it is considered that one operand comes from the system (ComU), the other operand is a datum stored in memory (MemU), and the result of the computation is stored in a datapath register (PU) waiting for another computation. The addition itself is located in the PU datapath.

B. Modeling Data Transfers

Inter-unit transfer timing constraints are then added to the graph model using the transfer nodes defined previously. Transfer nodes must be inserted to transform the located graph into a coherent one.
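A minimal sketch of this insertion pass, formalized in Definition 6 below, could read as follows (node encoding and helpers assumed from the earlier sketches):

```c
/* Sketch: insert a transfer node on every edge whose endpoints live in
 * different units, making the located graph coherent (Definition 6). */
edfg_node *new_transfer(unit_loc src, unit_loc dst);            /* assumed */
void redirect_edge(edfg_node *u, edfg_node *s, edfg_node *via); /* assumed */

void make_coherent(edfg_node **nodes, int n)
{
    for (int i = 0; i < n; ++i) {
        edfg_node *u = nodes[i];
        for (int k = 0; k < u->n_succs; ++k) {
            edfg_node *s = u->succs[k];
            if (u->loc != s->loc
                && u->kind != TRANSFER && s->kind != TRANSFER) {
                edfg_node *t = new_transfer(u->loc, s->loc);
                redirect_edge(u, s, t);   /* u -> t -> s */
            }
        }
    }
}
```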

Fig. 13. Node location annotations.

Fig. 14. Modeling data transfers between units.

Definition 6 (Coherent Sequence of Nodes): A sequence of nodes is said to be coherent if the data transfers among units are explicitly modeled using transfer nodes, i.e., if every node that is not a transfer node has the same location as all of its predecessor and successor nodes. In the other cases, the sequence is incoherent. The graph is defined as coherent if all its sequences of nodes are coherent.

To transform an annotated graph into a coherent one, the location of each node of the graph is checked. If a node does not have the same location as all its predecessors and successors, then a transfer node is added. Fig. 14 illustrates this transformation. The graph on the left is incoherent because the two inputs of the operation node are located in different units, MemU and ComU, and the result C is in the PU. Transfer nodes have to be added, and the data nodes A and B are duplicated, to transform this graph into a coherent one, as shown on the right-hand side of Fig. 14.

C. Address Computation Balancing

The dynamic address computation balancing algorithm is then applied to the annotated graph. Dynamic address computations are moved from the datapath unit to the memory sequencer unit if the potential performance of the design is increased. The decision metric used to select the address computations which have to be balanced takes into account the following criteria:
1) the number of data transfers needed to calculate the address in the sequencer as opposed to in the datapath unit;
2) the increase/decrease of the critical path length, to optimize performance;
3) the bitwidth of the datapath operators and registers compared to the required address computation bitwidth, for area and switching optimizations;
4) the usage rate and the potential parallelism that can be exploited for computations and data transfers.
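A scoring function over these four criteria might be sketched as follows (the weights, the helper metrics, and the sign convention are assumptions; the paper leaves the weighting to designer preference):

```c
/* Sketch: score an address-computation node for balancing from the PU
 * to the sequencer; positive scores favor moving it. The four helper
 * metrics, one per criterion above, are assumed. */
double transfer_reduction(const edfg_node *v);   /* criterion 1, assumed */
double critical_path_gain(const edfg_node *v);   /* criterion 2, assumed */
double bitwidth_gain(const edfg_node *v);        /* criterion 3, assumed */
double parallelism_gain(const edfg_node *v);     /* criterion 4, assumed */

double balance_score(const edfg_node *v,
                     double w1, double w2, double w3, double w4)
{
    return w1 * transfer_reduction(v)
         + w2 * critical_path_gain(v)
         + w3 * bitwidth_gain(v)
         + w4 * parallelism_gain(v);
}
```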


Fig. 16. Modified high-level synthesis flow for a single synthesis step.

Fig. 15. Dynamic address balancing process. (a) Initial graph model. (b) Graph model annotated by the locality. (c) Graph model including the transfer constraints. (d) Graph model optimized by dynamic address balancing.

The weight associated with a particular criterion depends on the primary goal of the optimization (transfer, latency, unit load average, etc.). In practice, the decision is taken according to designer preferences (reducing the latency, area, or power consumption) while trying to minimize the effect on the other criteria. The dynamic address computation balancing transformation is applied in a static manner to the graph model. Each address computation is evaluated for a balancing decision. If the balancing decision optimizes the system, then the graph is transformed: the location of the dynamic address computation is changed. If the graph becomes incoherent, transfer nodes are added or removed.

Fig. 15 presents a complete example of the annotation and optimizing steps. Fig. 15(a) shows the graph obtained after the compilation step (a computation whose array operand is accessed through a dynamically computed index). Depending on the intrinsic nature of the data, the nodes can be annotated by their locality. For this example, it is assumed that the input datum and the result are, respectively, an input and an output of the application. Therefore, they are located in the ComU, the constant node (value equals 3) is implemented in the datapath (PU), and the accessed datum and its array structure are stored in memory (MemU). After this intrinsic annotation, all the other graph nodes are annotated in the PU (a priori location). The result is shown in Fig. 15(b). The graph is now incoherent because successive nodes are not in the same locality. The graph is made coherent by including transfer nodes between the nodes from different locations. The result of this step is shown in Fig. 15(c). Finally, Fig. 15(d) shows the result of the balancing method, where the dynamic address computation of the index is moved into the dedicated datapath of the memory sequencer.

When analyzing the two graphs shown in Fig. 15(c) and (d): in the first one, 5 data transfers, 12 data memorizations, and a 7-clock-cycle latency (considering 1-cycle operation nodes) occur; in the second one, there are only 3 data transfers, 10 data memorizations, and the latency is 6 clock cycles. This illustrative example shows that extended data-flow graph modeling and transformations can be very useful to implement applications with dynamic address computations.

D. HLS Synthesis Process

The GAUT HLS tool was used to perform the synthesis based on the extended data-flow graph of the application to be implemented. In order to take the proposed methodology into account, the following synthesis steps were modified, as shown in Fig. 16.

1) Operator Selection and Allocation: The first step is to select and allocate the hardware resources. The HLS tool counts the number of each type of operation in the graph and then computes the average number of resources required to execute the application under the latency constraint. This algorithm has been modified to count separately the operations located in the datapath and those located in the sequencer unit (dynamic address computations). The datapath allocation is unchanged. For the memory sequencer allocation, the average number of resources is not used: the maximum possible number of parallel computations is preferred for allocating the operators. This avoids limiting the parallelism of the memory accesses and, indirectly, the parallelism of the computations in the datapath.

2) Scheduling and Binding: The scheduling algorithm used is a static-list scheduling method, chosen for its low complexity, thus providing results in a short runtime. Because the extended data-flow graph uses data-flow-only semantics, the static list-scheduling algorithm can take advantage of the overall parallelism of the application, i.e., without the basic-block boundaries found in control-data-flow implementation methods. Memory unit operations are scheduled at the same time as processing unit operations. Hardware operators are differentiated so that each implements only the nodes corresponding to its own location.
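A location-aware static list-scheduling loop might look as follows (a sketch; the priority function and all helpers are assumptions, and the point is only that each node binds to operators of its own unit):

```c
/* Sketch: static list scheduling where PU and MemU operations are
 * scheduled in one pass, each node binding only to operators located
 * in its own unit. All helpers are assumed. */
typedef struct hw_operator hw_operator;

void sort_by_priority(edfg_node **ready, int n);          /* e.g., mobility */
hw_operator *free_operator(unit_loc u, node_kind k, unsigned cycle);
void bind(edfg_node *v, hw_operator *op, unsigned cycle);
int  refresh_ready_list(edfg_node **ready, unsigned cycle);

void schedule(edfg_node **ready, int n_ready)
{
    unsigned cycle = 0;
    while (n_ready > 0) {
        sort_by_priority(ready, n_ready);
        for (int i = 0; i < n_ready; ++i) {
            /* only operators of the node's own unit are candidates */
            hw_operator *op = free_operator(ready[i]->loc,
                                            ready[i]->kind, cycle);
            if (op != NULL)
                bind(ready[i], op, cycle);
        }
        ++cycle;
        n_ready = refresh_ready_list(ready, cycle);
    }
}
```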


TABLE I
GRAPH TRANSFORMATIONS AND OPTIMIZATIONS FOR THE THREE KINDS OF TARGETED ARCHITECTURES

This design flow differs from those proposed in [14]–[16], [30], and [31] because both the memory sequencer and the processing unit syntheses are considered in a single scheduling and binding step, to manage the data and address transfer constraints more easily.

E. Limitations of the Approach

As far as the synthesis of DSP applications is concerned, a specific performance (latency or throughput) is usually targeted. The goal is to generate a dedicated architecture which respects the constraints and, at the same time, optimizes area and power as far as possible. In this context, data-flow graph-based approaches which implement loop unrolling and scalar-oriented techniques are necessary. However, when complex applications or systems are considered, large graphs are generated. In such cases, control-data-flow graph-based models are to be preferred at the system level, where overall optimizations are considered. The extended data-flow graph-based approach is therefore limited to the design of the hardware parts of complex systems, such as IP cores (filter IPs, video processing blocks, etc.).

VII. EXPERIMENTS

Current research in multimedia applications is trying to reduce the algorithmic computation complexity using ad hoc solutions composed of conditional computations. These algorithmic improvements may make random memory access sequences appear. This kind of technique is used in block matching, where the computation complexity of the exact full search algorithm is reduced by the three-step search algorithm [3], which involves conditional motion vector selections. Our approach was applied to different block-matching algorithms where nondeterministic access sequences occur: the three-step search algorithm, the cross search algorithm [39], and the orthogonal search algorithm [40]. No level of transparency is used, i.e., encoding systems such as MPEG-1 are considered. The following three architectures were developed to test the benefits of the methodology on these algorithms.

1) The first architecture is based on a control-like approach without a memory sequencer (conventional processor). Address transfers for the reference macroblock plus the window accesses are considered.
2) The second architecture is based on the first sequencer architecture, i.e., without address computations in the sequencer. Application knowledge is used to avoid the first block transfers (before the first motion vector choice), which are known a priori as the reference block.
3) The third architecture is based on the second sequencer architecture. Dynamic address computations are directly implemented in the memory sequencer datapath. In this case, only the macroblock base addresses depending on the selected vector are transferred between the processing unit and the memory unit.

Table I shows the graph characteristics for the three applications. The experimental results show that the computation complexity of the memory sequencer-based approaches is 30% smaller than that of the traditional one for such a memory-access-intensive application. Because not all the memory address sequences are transferred between the processing unit and the memory unit when a memory sequencer-based architecture is used, data transfers between units are reduced by 70% up to 80% for the first sequencer architecture. Using the extended architecture, data transfers are reduced by up to 90%. In that case, address computations are performed in the memory unit, which again reduces address traffic. Moreover, using the memory sequencer, and particularly the extended memory sequencer, makes it possible to decrease the required number of memorizations during the execution of the algorithms by 30% up to 46%. The extended sequencer removes the duplicate memorization of the same data (addresses computed in the datapath and then transferred into the memory sequencer) during the dynamic address computation balancing step.

The two architectures based on memory sequencers were synthesized using the GAUT HLS tool for the three-step search algorithm. The timing constraint used was extracted from the HDTV format (768 × 576 pixels with 25 pictures/s). We used a Xilinx Virtex-II Pro XC2VP100 FPGA as the target technology.
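The source of the nondeterminism can be sketched in C (simplified; the block size, window geometry, and names are assumptions): the origin of each candidate block depends on the motion vector selected at the previous search step, so the window addresses cannot be known at synthesis time.

```c
/* Sketch: sum of absolute differences over a candidate block whose
 * origin (vy, vx) was selected conditionally at the previous search
 * step -- a dynamically addressed access sequence. */
#include <stdlib.h>

#define BLK 8
#define WIN 24

unsigned sad(const int ref[BLK][BLK],
             const int win[WIN][WIN], int vy, int vx)
{
    unsigned acc = 0;
    for (int y = 0; y < BLK; ++y)
        for (int x = 0; x < BLK; ++x)            /* window indices depend */
            acc += (unsigned)abs(ref[y][x]       /* on the chosen vector  */
                                 - win[vy + y][vx + x]);
    return acc;
}
```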


TABLE II
HARDWARE ALLOCATION AFTER HIGH-LEVEL SYNTHESIS

For this experiment, all the vector data were bound in the same memory bank, owing to the current development state of the GAUT tool. This constraint limits the possibility of exploiting parallelism during synthesis. The macroblocks considered are 8 × 8 pixels coded using 12 bits. The conventional control-based architecture was not synthesized because the HLS tool cannot handle this kind of architecture.

Table II presents the allocated hardware resources. The number of arithmetic operators and registers is almost the same, but their localization differs. From an area point of view, the major interest of the extended memory sequencer-based architecture comes from the number of bits required for these hardware resources. The datapath of the memory sequencer is optimized for dynamic address computations. When pixels are coded using 12 bits, the memory datapath bitwidth is 9 bits for 8 × 8 macroblocks with 24 × 24 window blocks, whereas 19-bit resources are used when address computations are performed in the processing unit. When 16 × 16 macroblocks are considered, 12-bit resources are used for the memory datapath, compared to 22-bit resources for the processing unit.

The logic synthesis of the two architectures, including the processing unit and the memory unit but excluding the memory bank, was also performed with ISE 7.1i from Xilinx. 8 × 8 macroblocks with 24 × 24 window blocks were considered. The extended sequencer-based architecture is 7% smaller than the first sequencer-based one. Dynamic power consumption was estimated with XPower 7.1. For different test vectors, dynamic power consumption is reduced by 13% up to 19% using the extended sequencer-based architecture. This gain is mainly due to the decrease in transfers and memorizations required by this architecture.

For more complex designs with pipelined data transfers, where address transfers may take more than one clock cycle, this kind of architecture is still very interesting. Predictive and pipelined memory accesses, which are not taken into account in this paper, can also be addressed.

REFERENCES

[1] F. Catthoor, W. Geurts, and H. De Man, “Loop transformation methodology for fixed-rate video, image and telecom processing applications,” in Proc. Int. Conf. Appl. Specific Array Process., San Francisco, CA, Aug. 1994, pp. 427–438.
[2] T. Meng, B. Gordon, E. Tsern, and A. Hung, “Portable video-on-demand in wireless communication,” Proc. IEEE, vol. 83, no. 4, pp. 659–680, Apr. 1995.
[3] R. Li, B. Zeng, and M. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, pp. 438–442, Aug. 1994.
[4] F. Balasa, F. Catthoor, and H. De Man, “Dataflow-driven memory allocation for multi-dimensional signal processing systems,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 1994, pp. 31–35.
[5] D. Karchmer and J. Rose, “Definition and solution of the memory packing problem for field-programmable systems,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), 1994, pp. 20–26.
[6] P. Jha and N. Dutt, “Library mapping for memories,” in Proc. Eur. Conf. Des. Test (EDTC), 1997, p. 288.
[7] F. Balasa, F. Catthoor, and H. De Man, “Background memory area estimation for multidimensional signal processing systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 2, pp. 157–172, Jun. 1995.
[8] S. Bakshi and D. Gajski, “A memory selection algorithm for high-performance pipelines,” in Proc. IEEE Int. Eur. Des. Autom. Conf. (Euro-DAC), 1995, pp. 124–129.
[9] S. Wuytack, F. Catthoor, G. de Jong, B. Lin, and H. de Man, “Flow graph balancing for minimizing the required memory bandwidth,” in Proc. 13th Int. Symp. Syst. Synthesis (ISSS), 1996, p. 127.
[10] S. Ramprasad, N. R. Shanbhag, and I. Hajj, “A coding framework for low-power address and data busses,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 2, pp. 212–221, Feb. 1999.
[11] N. Passos, E. Sha, and L. Chao, “Multi-dimensional interleaving for time-and-memory design optimization,” in Proc. Int. Conf. Comput. Des. (ICCD), 1995, pp. 440–445.
[12] A. Nicolau and S. Novack, “Trailblazing: A hierarchical approach to percolation scheduling,” in Proc. Int. Conf. Parallel Process. (ICPP), Boca Raton, FL, 1993, vol. II, pp. 120–124.
[13] P. Ellervee, “High-level synthesis of control and memory intensive applications,” Ph.D. dissertation, Dept. Electron., Royal Inst. Technol., Stockholm, Sweden, 2000.
[14] J. Seo, T. Kim, and P. Panda, “Memory allocation and mapping in high-level synthesis: An integrated approach,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 5, pp. 928–938, May 2003.
[15] J. Park and P. Diniz, “Synthesis of pipelined memory access controllers for streamed data applications on FPGA-based computing engines,” in Proc. Int. Symp. Syst. Synthesis (ISSS), New York, NY, 2001, pp. 221–226.
[16] G. Corre, E. Senn, N. Julien, and E. Martin, “A memory aware behavioral synthesis tool for real-time VLSI circuits,” in Proc. 14th ACM Great Lakes Symp. VLSI (GLSVLSI), New York, NY, 2004, pp. 82–85.
[17] X. Huang, Z. Wang, and K. McKinley, “Compiling for the Impulse memory controller,” in Proc. Int. Conf. Parallel Architectures Compilation Techniques (PACT), 2001, pp. 141–150.
[18] D. Grant, P. Denyer, and I. Finlay, “Synthesis of address generators,” in Proc. IEEE Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 1989, pp. 116–119.


[19] S. Hettiaratchi, P. Cheung, and T. Clarke, “Performance-area trade-off of address generators for address decoder-decoupled memory,” in Proc. Conf. Des., Autom. Test Eur. (DATE), 2002, p. 902.
[20] H. Norell and M. O'Nils, “A generalized architecture for hardware synthesis of spatio-temporal memory models for image processing systems,” in Proc. 12th Int. Workshop Syst., Signals Image Process. (IWSSIP), Chalkida, Greece, 2005, pp. 361–365.
[21] N. Lawal, B. Thornberg, and M. O'Nils, “Address generation for FPGA RAMs for efficient implementation of real-time video processing systems,” in Proc. Int. Conf. Field Program. Logic Appl. (FPL), Aug. 2005, pp. 136–141.
[22] S.-L. Chu and T.-C. Huang, “SAGE: An automatic analyzing system for a new high-performance SoC architecture-processor-in-memory,” J. Syst. Arch., vol. 50, no. 1, pp. 1–15, 2004.
[23] A. Dasgupta and R. Karri, “Simultaneous scheduling and binding for power minimization during microarchitecture synthesis,” in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), Dana Point, CA, 1995, pp. 69–74.
[24] A. Dasgupta and R. Karri, “High-reliability, low-energy microarchitecture synthesis,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 17, no. 12, pp. 1273–1280, Dec. 1998.
[25] M. Stan and W. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 1, pp. 49–58, Feb. 1995.
[26] H. Mehta, R. Owens, and M. Irwin, “Some issues in gray code addressing,” in Proc. 6th Great Lakes Symp. VLSI (GLSVLSI), 1996, p. 178.
[27] L. Benini et al., “Architectures and synthesis algorithms for power-efficient bus interfaces,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 19, no. 9, pp. 969–980, Sep. 2000.
[28] W.-C. Cheng and M. Pedram, “Power-optimal encoding for a DRAM address bus,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 2, pp. 109–118, Feb. 2002.
[29] T. Lv, J. Henkel, H. Lekatsas, and W. Wolf, “A dictionary-based en/decoding scheme for low-power data buses,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 5, pp. 943–951, May 2003.
[30] D. Gajski et al., High-Level Synthesis: Introduction to Chip and System Design. Norwell, MA: Kluwer, 1992.
[31] J. P. Elliott, Understanding High-Level Synthesis: A Practical Guide to High-Level Design. Norwell, MA: Kluwer, 2000.
[32] Mentor Graphics Corp., Catapult C Synthesis User's and Reference Manual (Release 2004b), Wilsonville, OR, Jun. 2004.
[33] E. Martin, O. Sentieys, and J. Philippe, “GAUT, an architecture synthesis tool for dedicated signal processors,” in Proc. IEEE Int. Eur. Des. Autom. Conf. (Euro-DAC), Hamburg, Germany, Sep. 1993, pp. 14–19.
[34] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, “SPARK: A high-level synthesis framework for applying parallelizing compiler transformations,” in Proc. Int. Conf. VLSI Des., New Delhi, India, Jan. 2003, pp. 461–466.
[35] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. Rau, D. Cronquist, and M. Sivaraman, “PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators,” J. VLSI Signal Process., vol. 31, no. 2, pp. 127–142, 2002.
[36] G. Corre, E. Senn, N. Julien, and E. Martin, “High level ageing vectors management for data intensive applications,” in Proc. Int. Conf. Signals Electron. Syst. (ICSES), Poznan, Poland, 2004.
[37] K. Wakabayashi and H. Tanaka, “Global scheduling independent of control dependencies based on condition vectors,” in Proc. 29th ACM/IEEE Des. Autom. Conf. (DAC), Los Alamitos, CA, 1992, pp. 112–115.

[38] S. Gupta, R. Gupta, N. Dutt, and A. Nicolau, “Dynamically increasing the scope of code motions during the high-level synthesis of digital circuits,” in Proc. IEEE Conf. Comput. Digit. Technique, Sep. 2003, vol. 150, no. 5, pp. 330–337. [39] H. Gharavi, “The cross search algorithm for motion estimation,” IEEE Trans. Commun., vol. 38, no. 7, pp. 950–953, Jul. 1990. [40] A. Puri, H.-M. Hang, and D. Schilling, “An efficient block-matching algorithm for motion compensated coding,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 1987, pp. 25.4.1–25.4.4.

Bertrand Le Gal was born in 1979 in Lorient, France. He received the DEA (M.S.) degree in electronics and the Ph.D. degree in information and engineering sciences and technologies from the Université de Bretagne Sud, Lorient, France, in 2002 and 2005, respectively. He is currently an Associate Professor with the IMS Laboratory, ENSEIRB Engineering School, Talence, France. His research interests include system design, high-level synthesis, SoC design methodologies, and security issues in embedded devices such as virtual component protection (IPP).

Emmanuel Casseau received the DEA (M.S.) degree in electronics and the Ph.D. degree in electronic engineering from UBO University, Brest, France, in 1990 and 1994, respectively. He is currently a Professor with the CAIRN-IRISA Laboratory, ENSSAT Engineering School, Lannion, France. From 1994 to 1996, he was a Research Engineer with the French National Telecom School, ENST Bretagne, France, where he developed high-speed Viterbi decoder architectures for turbo-code VLSI implementations. From 1996 to 2006, he was an Associate Professor with the Electronic Department, Université de Bretagne Sud, Lorient, France, where he led the IP project of the Lester Laboratory. His research interests include system design, high-level synthesis, virtual components, and SoC design methodologies.

Sylvain Huet was born in 1978 in Chatenay-Malabry, France. He received the engineering degree in electronics from Polytech Nantes, Nantes, France, in 2002, and the Ph.D. degree in information and engineering sciences and technologies from the Université de Bretagne Sud, Lorient, France, in 2006. He is currently an Associate Professor with the GIPSA Laboratory, CNRS, INPG, Grenoble, France. His research interests include algorithm-architecture matching for digital signal processing.