Optimizing the FPGA implementation of HRT systems

Marco Di Natale, Enrico Bini
ReTiS Lab, Scuola Superiore Sant’Anna, Pisa, Italy
[email protected], [email protected]

Abstract

The availability of programmable hardware devices with high density of logic elements and the possibility of implementing CPUs (called softcores) using a fraction of the FPGA area offers additional flexibility for the implementation of embedded applications with real-time constraints. When implementing functions on such devices, designers can choose between hardware and software. Also, the designer can select the number of CPUs that must be created to best support the execution of the real-time software. In this paper, we define a design optimization procedure for hard real-time systems, in which each functional block can be implemented in HW, using the logic elements available on the FPGA, or in SW, by means of a real-time task executed by a softcore. The optimizer allocates the functions and the softcores such that the HW implemented part is mapped within the area constraints and the software part is allocated so that schedulability can be guaranteed. When feasible solutions exist, the minimum utilization solution is computed.

1. Introduction

The increasing cost of custom hardware designs and the increasing density of logic in modern hardware are leading towards the adoption of programmable HW devices, such as Field Programmable Gate Arrays (FPGAs), where a lattice of reconfigurable HW elements can be programmed to implement a desired set of functions. The FPGA market is currently dominated by a few major vendors, including Xilinx and Altera. The devices of these two vendors differ in a number of features, but a few common traits exist. The atomic programmable unit, called logic element (LE), consists of an internal look-up table (LUT) implementing combinatorial logic, one or more registers and additional internal circuitry. An interconnection mesh provides communication and synchronization among the elements of the FPGA. The number of LEs available in medium to high end FPGAs is typically in the range of tens of thousands.

Recently, vendors have started to allow the implementation of CPUs (called softcores) using the LEs on board the FPGA. For example, the current Altera softcore (called NIOS II) is available in three versions, ranging from approximately 700 to 1800 required LEs, depending on the desired CPU features, including caching, pipeline stages, and custom instructions. The NIOS II processor is a true 32-bit CPU with a customizable instruction set, capable of executing instructions at up to 400MHz.

The design of applications for these devices is supported by tools that allow for both HW and SW programming. HW programming is performed using front-end graphical tools, with Register Transfer Languages (RTLs) or VHDL as backends. SW is developed using programming languages, such as C, and tools including compilers, linkers, and locators. Recently, a few companies and research centers have started to challenge the use of HDL-centric flows, and more innovative design flows that provide an automatic, error-free hardware implementation of higher-level models are needed. Early tools for mapping a Simulink model into an FPGA implementation are starting to appear from commercial vendors and from research laboratories. For example, Celoxica, Accelchip and Altera provide fast design flows for MATLAB/Simulink users to integrate DSP algorithms and implement designs in FPGA hardware [9].

When the implementation is in SW, commercial tools are available for the automatic coding of models into single-task or multitask implementations. For example, in the multitask code generation option of the Real-Time Workshop (RTW) code generator by Mathworks, the model is executed at run time by code running in the context of a set of threads under the control of a priority-based Real-Time Operating System (RTOS). The RTW code generator assigns each block to a task based on its sample rate. The blocks with the fastest sample rate are executed by the task with the highest priority, the next slowest blocks are executed by a task with the next lower priority, and so on, matching the Rate Monotonic priority assignment.
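The rate-based grouping described above can be sketched in a few lines. This is only an illustration of the Rate Monotonic rule (shorter period, higher priority); it is not the code produced by RTW, and the block names and periods are hypothetical.

```python
def rate_monotonic_tasks(sample_periods):
    """Group blocks into tasks by sample period and rank the tasks.

    sample_periods maps a (hypothetical) block name to its period in seconds.
    Returns {priority: [blocks]}, where priority 0 is the highest.
    """
    # One task per distinct period; the shortest period gets priority 0.
    distinct = sorted(set(sample_periods.values()))
    priority_of = {period: rank for rank, period in enumerate(distinct)}
    tasks = {}
    for block, period in sample_periods.items():
        tasks.setdefault(priority_of[period], []).append(block)
    return tasks


# Illustrative block set, not taken from the paper.
blocks = {"gain": 0.001, "filter": 0.001, "controller": 0.01, "logger": 0.1}
tasks = rate_monotonic_tasks(blocks)
```

Blocks that share a sample period end up in the same task, so the number of tasks equals the number of distinct rates in the model.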
The framework for the research work presented in this paper is the following. For each block in the network of functions (for example, Simulink blocks) representing the system model, either a hardware or a software implementation is possible. In case of a software implementation, the corresponding section of code is executed in the context of a thread, scheduled on one of the softcores implemented inside the FPGA. In case of a hardware implementation, a rectangular area must be reserved on the FPGA and the computations are then performed in a much shorter time, so that the only load left on the softcores is the communication and synchronization overhead. The design is feasible if the FPGA area is sufficient for mapping all the hardware blocks and the softcores, and if the software blocks can be run in real-time threads scheduled for execution before their deadlines. Whenever multiple feasible solutions exist, we explore the design space seeking the solution that provides maximum future extensibility on the software side, formally defined as the solution with the smallest average utilization on the softcores.

1.1. Related work

The static placement problem, or floorplanning, has attracted a significant number of research works, presenting solutions for the positioning of a set of circuit modules on a VLSI chip or on an FPGA. Floorplanning is in essence a 2D bin packing problem, which typically also includes preplacement constraints, range constraints, alignment and/or boundary constraints. The goal, in most cases, is finding the placement of the modules in a rectangle with minimum area. Solutions can be classified in the two general categories of slicing and nonslicing. A slicing floorplan is an area decomposition using horizontal and vertical cuts. Research efforts have been dedicated to the selection of the best data structures to represent the two classes of problems. For example, a binary tree representation [22] can be used for slicing floorplans, while for the more general category of nonslicing floorplans, several representations for structure and constraints have been proposed, such as sequence pairs [20], bounded slicing grids (BSG) [21], O-trees [14], B*-trees [10] and Q-sequences [25]. In general, one possible advantage of slicing floorplans is that the search space is much smaller, which in turn leads to a faster runtime. Furthermore, it has been shown [13] that a tight packing is achievable for slicing floorplans. For nonslicing floorplans, different optimization methods have been proposed for constructing a layout that satisfies the alignment and performance constraints while minimizing the required area. Most of the nonslicing solutions make use of simulated annealing to search the solution space for the optimal configuration, but Lagrangian-based solutions have also been proposed [31].

The objective of our work is different from traditional minimum area floorplanning. The formulation of the placement problem leverages a technique called level packing, which only allows slicing in one direction, resulting in a less sophisticated model. In this paper, however, we explore the tradeoffs between HW and SW implementation, we consider the case in which application functions have explicit time constraints (deadlines) that must be guaranteed at design time and, finally, we consider the case in which the software part is executed by a variable number of softcores to be placed into the FPGA itself. In summary, our problem formulation addresses the following issues:
• the algorithm must find the optimal number of softcores capable of executing the software-implemented functions within the deadlines, trading off the area used by the HW-implemented functions for softcore area;
• the modules to be placed into HW are obtained as the result of the optimization procedure, and they depend on the number of softcores.

A related research field is the scheduling of tasks in partially reconfigurable devices [8], where the runtime system chooses between hardware and software implementations of the functions. Dynamically reconfigurable systems of this type are typically required to exploit the FPGA area at its best at any time instant. Conceptually, in a dynamic planner of this type, the time dimension is added to the two physical dimensions of the FPGA, and the result is a 3D packing problem, where the tasks to be placed on the FPGA can be constrained by deadlines [27]. In [27] and in later works [28, 29], tasks that cannot be placed on the FPGA for execution before the deadline are simply rejected. The problem of finding a template placement algorithm that is suitable for dynamically reconfigurable systems is discussed in [5], where extensions of the basic heuristics First-Fit and Next-Fit are discussed for a two-dimensional bin packing problem and an algorithm based on the partitioning of rectangular subsets of the FPGA area is presented. The algorithm has been refined in [27], where it is used by a reconfigurable OS that dynamically allocates hard real-time tasks on an FPGA.
The allocation cannot be preempted (preemption of hardware tasks is discussed in [26]) and the system does not allow for the software implementation and the execution of tasks on an external CPU. Walder and Platzner [28] present two scheduling algorithms for hardware tasks based on an adaptation of the EDF policy. In [28] the FPGA is pre-partitioned in slots; tasks are ordered by deadline and are enqueued waiting for the availability of an FPGA block of sufficient size. The FPGA area is allocated according to the deadline of the waiting tasks. A feasibility test for guaranteeing task deadlines is provided for both policies. However, contrary to previous work, the authors do not assume a 2D area model, but perform the allocation of the LEs using a slotted single-dimension constraint. Other authors discussed operating system services for the dynamic reconfiguration of the FPGA, including device partitioning, allocation, placement and routing [30]. Finally, another research work related to ours is described in [23], where the availability of an external CPU allows for a SW implementation of those tasks that cannot be placed on the FPGA. The authors present a new dynamic scheduling methodology for FPGAs where tasks are mapped into the reconfigurable HW with the objective of minimizing at any time the load of the processor that must execute the remaining software-implemented tasks.

A common characteristic of most of these algorithms is the need to simplify the allocation problem so that the best possible schedule can be recomputed dynamically against the area constraints. This need results in simple approximations for the fitting problem based on one-dimensional constraint bounds. It is worth noting that, while dynamic allocation and scheduling of tasks into the FPGA is a quite attractive solution, available commercial devices from Xilinx and Altera do not allow dynamic reconfiguration of an arbitrary subset of the FPGA area (the flexible 2D model suited the now retired Xilinx XC6200 series, reconfigurable at the level of the LEs), but only dynamic reconfiguration of the entire device or (see, for example, the Xilinx Virtex devices) partial reconfiguration by column, matching the one-dimension stripe-allocation model.

With respect to dynamic scheduling techniques, the previously cited research works consider only the allocation of the FPGA area, with the exception of [23], where the availability of an external CPU for executing SW implementations of RT tasks is considered. To the authors’ knowledge, there is no previous work in the context of dynamic FPGA area allocation to real-time tasks that leverages the availability of softcores inside the FPGA and the possibility of executing software tasks on multiple softcores.

Figure 1. Unidimensional models: (a) slotted model, (b) linear model.

2. The HW and SW model

The model of the HW requires knowledge of the most common FPGA architectures. For this reason, we review some models that have been adopted in the literature and we compare them with real device designs. The slotted model assumes that FPGA resources are available in slots. All slots are identical and the HW requirements are expressed in terms of them. For example, in Figure 1(a) the functionality 1 requires 5 slots. In the linear model, the FPGA is considered as a uniform resource, which is allocated to the functions by rectangular shares spanning the entire device height, as shown in Figure 1(b). In both cases, the FPGA is assumed to be a homogeneous array of logic elements and the allocation constraint is expressed by the simple linear bound

\[ \sum_{i} a_i \le A. \qquad (1) \]

In the slotted model, $a_i$ denotes the number of slots that are required by the $i$th function/task, and $A$ is the total number of slots. In the linear model, $a_i$ is the width of the allocated stripe and $A$ is the total width of the FPGA. These two models present the benefit of a very simple condition representing the allocation constraint. However, when considering actual FPGA structures, it is clear that these two models are only rough approximations. In fact, they result in a loss of area due to internal fragmentation and they are unrealistic because of routing constraints, especially in the case of the slotted allocation, and because of the internal structure of these devices. Such a degree of simplification is desirable in the dynamic scheduling of functions, where at most a few milliseconds can be spent in the on-line allocation procedure. The cost of the simplistic one-dimensional model is often too high for off-line design synthesis, which is the goal of our research.

In reality, all FPGAs, from the simplest to the most complex, are not homogeneous, but they are composed of blocks of memories, logic elements, I/O ports, floating point ALUs, and other dedicated circuitry. Figure 2 shows two Altera FPGAs: on the left we report the simple structure of the Cyclone model and on the right we show the more complex layout of the larger Stratix II models. In the Stratix II devices (right side of Figure 2), LEs exist in between columns of floating point multipliers, columns of 512 bits memory blocks and columns of 4KB memory blocks. Furthermore, four bigger internal memory areas (M-memory blocks of 512KB each) characterize the layout of the Stratix II FPGA. Except for the M-blocks, the structure is not homogeneous, but shows some regularity. Resources (memory, floating point multipliers) are placed in stripes.

Figure 2. The Altera FPGAs layout: the Cyclone FPGA (left) and the Stratix FPGA (right). Figure labels: logic array, embedded multipliers, m4k blocks, m512 blocks, m512k blocks.
FPGA manufacturers provide place and route tools for mapping the functions into the FPGA and connecting them. In the case of unconstrained mapping (see Figure 3) the place and route tool (Quartus II for Altera devices) maps functions into the FPGA, allocating for each of them an area that is constrained by the route map, and varies radically according to the required connections with other functional elements. Figure 3 clearly shows how two instances of the same functional element (a NIOS II softcore) result in very different footprints on the FPGA, even though the number of allocated resources is roughly the same.

Figure 3. Allocation of two NIOS II softcores. Legend, FPGA resources: Logic Elements (LEs), Memory (512 KB banks), Memory (small blocks), Multipliers, Input/Output Elements (IOEs). Legend, mapped functionalities: NIOS II (softcore 1), NIOS II (softcore 2).

However, FPGA programming tools allow for another placement option, recommended in the case of modular designs, when design modules are possibly developed by different teams and eventually composed or reused among different projects. This placement option is called LogicLock mode. In this case, a functionality can be wrapped in a LogicLock component where the FPGA implementation is constrained by a bounding box with a fixed height and width (see Figure 4). Wrapping all functions into rectangles allows for modularity and efficient packing at the price of some small internal fragmentation.

In this paper, we assume that the HW model of a functionality is a rectangle, encapsulating the resources required for its implementation inside the FPGA. Thanks to this assumption, the allocation of functions onto the FPGA can be approximated by a bi-dimensional bin packing problem and we can adopt placement techniques from operations research. In reality, the placement should consider the striped distribution of resources on the FPGA. Although the current solution does not manage these constraints, we believe that the proposed formalization, based on level packing, can be extended to deal with placement constraints derived from the use of resources placed in stripes. Hence we model the FPGA by a homogeneous rectangle of resources of width $W$ and height $H$. These two values, together with all the other physical dimensions, are expressed in terms of logic elements.
We assume the availability of an external quantity $M$ of memory words for storing the code and the data of all the functions implemented in SW. The FPGA implements a variable number $m$ of softcores, which run the SW tasks. Each softcore is implemented by a rectangle whose height and width are $h_{CPU}$ and $w_{CPU}$, respectively.

The application is a collection of $n$ functional blocks $A = \{F_1, \ldots, F_n\}$. Each functional block $F_i$ represents a basic operation of the system. A block $F_i$ has one or more input ports and one output port. Input ports carry a set of signals with a uniform sampling period $t_i$. The signals are processed by the block and the result of the computation is a signal with the same rate, produced on the output port. The sampling period $t_i$ of the input signal is also the activation period of the block, and each execution instance must complete before the arrival of the next one.

$F_i$ is modeled by the tuple $(u_i, m_i, h_i, w_i, v_i)$. When a functionality is implemented in software it is executed in the context of a real-time (RT) task executing at the same period $t_i$, with a worst-case computation time (WCET) $c_i$. From the point of view of the CPU requirements, all blocks and, correspondingly, all RT tasks are modeled by the ratio between their WCET and their period. This attribute, called utilization $u_i = c_i / t_i$, represents the fraction of the processor that is required for the execution of the task. Several multiprocessor schedulability tests [16, 1, 4] are based on task utilizations. The words of external memory that are required by the SW implementation are denoted by $m_i$. When the block is implemented in HW, $h_i$ and $w_i$ are respectively the height and the width of the rectangle representing the FPGA area that is required, and $v_i$ is the fraction of processor time that is required for the execution on the softcores of the communication and synchronization routines setting up the input data and collecting the results from the HW-implemented functions.

Our solution is a first attempt at solving a very complex design optimization problem. We start from an idealized view of the actual placement problem, where routing and resource constraints are not included. In practice, this means that there is no guarantee of optimality, and that the solution provided by the optimization procedure needs to be processed by place-and-route tools to confirm its feasibility.
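As a reading aid, the tuple that models a functional block can be written down as a small record. The sketch below is an illustration only: the class name, the helper functions and the numeric values are assumptions, while the five fields follow the definitions given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalBlock:
    u: float  # SW utilization c_i / t_i
    m: int    # words of external memory needed by the SW implementation
    h: int    # height of the HW rectangle
    w: int    # width of the HW rectangle
    v: float  # softcore utilization of the communication routines when in HW


def utilization(wcet, period):
    """u_i = c_i / t_i, the fraction of a processor used by the task."""
    return wcet / period


def softcore_load(blocks, in_sw):
    """Total load on the softcores: u_i for SW blocks, v_i for HW blocks."""
    return sum(b.u if sw else b.v for b, sw in zip(blocks, in_sw))


# Hypothetical values chosen only to exercise the helpers.
f1 = FunctionalBlock(u=utilization(2.0, 10.0), m=512, h=4, w=6, v=0.01)
f2 = FunctionalBlock(u=utilization(3.0, 10.0), m=256, h=3, w=5, v=0.02)
load = softcore_load([f1, f2], in_sw=[True, False])
```

With the first block in SW and the second in HW, the softcores see the utilization of the first block plus the communication overhead of the second.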

3. The Allocation Problem

Given the formalization of the domain entities introduced in the last section, the codesign problem of deciding upon the implementation of the functions either in HW or in SW can be solved in the framework of a 2D packing problem. The goal of the placing algorithm is to minimize the average softcore utilization of the tasks mapped in SW. By minimizing the average utilization of the software tasks we expect to achieve an efficient utilization of the FPGA area and to facilitate the scheduling of SW functionalities onto the softcores. Once the SW tasks and the number of available softcores have been determined, there exist efficient heuristics to solve the partitioning problem [24, 15], and some of these have already been effectively applied to the scheduling of real-time tasks [3]. In fairness, it is worth noting that minimizing the total utilization does not always guarantee finding a feasible SW allocation (if one exists). This phenomenon is observed in global multiprocessor scheduling [4], where task sets may not be schedulable despite a very low utilization (Dhall’s effect [12]). In most cases, however, the minimization of the total SW utilization leads to effective solutions [3].

The FPGA must accommodate the $m$ softcores and all the functions $F_i$ mapped into the HW. Both the softcores and the HW-implemented functions are modeled as rectangles. Considering that the placing algorithms work poorly when the rectangles are very different in size, and that a softcore implementation is generally larger than the FPGA footprint of any function $F_i$, we introduce a preprocessing stage, dedicated to the placement of the softcores, and then we place the HW functions on the remaining FPGA resources. Our optimization method can still be applied when the previous condition does not hold. In this case, however, there is the possibility that the preprocessing stage overconstrains the solution space and prevents the algorithm from finding the optimum or even a feasible solution when one exists.

Figure 4. The Altera FPGAs layout. Figure labels: definition of a LogicLock region with a NIOS II processor; 12 rows (x10) of LEs; 15 columns of LEs; one block of multipliers (out of 8 available).

3.1. Allocation of the softcores

When m softcores are implemented in the FPGA, their packing can be performed in many ways, leaving multiple options for the partitioning of the remaining available FPGA area. In our study, we assume that the softcores are placed contiguously starting from one of the corners of the FPGA, arranged by columns, as shown in Figure 5. There are several advantages in assuming such a placement for the softcores: (1) the communication among the processors is easier because they are close to each other, (2) softcores are placed together, thus reducing the fragmentation, and (3) the remaining FPGA resources can be viewed as the union of two rectangles, thus allowing the use of bidimensional packing algorithms.
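The column-wise placement can be turned into a small function that returns the two free rectangles for a given number of softcores. This is a sketch under assumptions: the function name is hypothetical, and the rule for choosing the second rectangle (the space above a partially filled column, or the strip above the full columns when every used column is full) is inferred from the rectangle sizes reported in Table 1 for the 46 × 26 device with 14 × 8 softcores.

```python
def free_rectangles(m, W, H, w_cpu, h_cpu):
    """Return ((W_A, H_A), (W_B, H_B)) left free by m softcores in columns."""
    per_col = H // h_cpu                     # softcores stacked in one column
    full_cols, rem = divmod(m, per_col)
    used_cols = full_cols + (1 if rem else 0)
    if used_cols * w_cpu > W:
        raise ValueError("too many softcores for this device")
    # Rectangle A: everything beside the columns that host softcores.
    rect_a = (W - used_cols * w_cpu, H)
    if m == 0:
        rect_b = (0, 0)
    elif rem:
        # Rectangle B: space above the partially filled column.
        rect_b = (w_cpu, H - rem * h_cpu)
    else:
        # Rectangle B: thin strip above the completely filled columns.
        rect_b = (full_cols * w_cpu, H - per_col * h_cpu)
    return rect_a, rect_b


# Device and softcore sizes taken from the experimental section.
table = [free_rectangles(m, 46, 26, 14, 8) for m in range(10)]
```

When a column is only partially filled, the thin strip above the completely filled columns is ignored, which matches the remark in the experimental section that some LEs are left unused for 4, 5, 7 and 8 processors.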




i define the packing of item $j$ when it does not initialize a level. We have $x_{i,j} = 1$ if item $j$ is packed in level $i$, and $x_{i,j} = 0$ otherwise. For example, in Figure 6 the variables are $y_1 = y_3 = 1$, because only items 1 and 3 initialize a level, and all other $y_i = 0$. The allocation of the other blocks to the levels is encoded in $x_{1,2} = x_{1,4} = x_{1,6} = 1$ and $x_{3,5} = 1$, with all the other values $x_{i,j} = 0$. Finally, $n$ additional variables $s_i$ encode the mapping of functionality $F_i$: if $F_i$ is mapped in SW, then $s_i = 1$, otherwise $s_i = 0$. Given this formalization, let us now consider the constraints the problem is subject to.
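The encoding of a level packing into the binary variables can be illustrated with a short helper. The function name is hypothetical, and the example assumes that the packing of Figure 6 has exactly n = 6 items, all mapped in HW, which is consistent with the indices quoted above but is not stated explicitly.

```python
def encode_levels(n, levels):
    """Build the y and x variables from {initializing item: [other items]}.

    Items are numbered 1..n and assumed sorted by decreasing height, so an
    item j can only join a level opened by an item i < j.
    """
    y = {i: 0 for i in range(1, n + 1)}
    x = {(i, j): 0 for i in range(1, n) for j in range(i + 1, n + 1)}
    for first, others in levels.items():
        y[first] = 1
        for j in others:
            if j <= first:
                raise ValueError("item j must follow the item opening its level")
            x[(first, j)] = 1
    return y, x


# Packing quoted in the text: level 1 holds items 2, 4, 6; level 3 holds item 5.
y, x = encode_levels(6, {1: [2, 4, 6], 3: [5]})
```

The dictionary `x` only contains pairs with i < j, which is why its size is n(n − 1)/2.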

Notice that, thanks to the ordering of the rectangles by decreasing height, the item $j$ can be allocated as one of the non-initializing items only in the levels from 1 to $j - 1$. A second constraint is given by the width of the resource. Each level cannot exceed the width $W$ of the rectangle. Thus we have

\[ \sum_{j=i+1}^{n} w_j \, x_{i,j} \le (W - w_i) \, y_i \qquad \forall i = 1, \ldots, n-1 \qquad (3) \]

In this constraint, notice that when the level $i$ does not exist ($y_i = 0$) all the $x_{i,j}$ are forced to 0 as well. Finally, we impose that the total height $H$ of the FPGA is not exceeded, as follows

\[ \sum_{i=1}^{n} h_i \, y_i \le H \qquad (4) \]

and the constraint on the availability of external memory is

\[ \sum_{i=1}^{n} s_i \, m_i \le M. \qquad (5) \]

The goal of the design can change, depending on the specific needs, but a reasonable goal is to provide for maximum future extensibility on the SW side by minimizing the total utilization of the SW tasks. In fact, a low total utilization has many benefits:
• it makes the allocation of the SW tasks onto the $m$ processors and the schedulability problem easier;
• by keeping the load of the $m$ softcores as low as possible, the system can react promptly to additional runtime requests by allocating other software tasks in the spare processor time.

For a fixed number $m$ of softcores, minimizing the average utilization is the same as minimizing the total utilization, which is expressed as follows

\[ \text{minimize} \quad \sum_{i=1}^{n} \left( u_i s_i + v_i (1 - s_i) \right) \qquad (6) \]
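To make the formulation concrete, the sketch below checks a candidate assignment against constraints (3), (4) and (5) and evaluates objective (6) for a single rectangle. It is a checker, not the ILP solver used in the paper. Since the first constraint, Eq. (2), is not reproduced in this text, the code assumes that it requires each item to be exactly one of: mapped in SW, initializing a level, or packed in an earlier level. Indices are 0-based, and all numeric values are hypothetical.

```python
def check_and_score(y, x, s, w, h, mem, u, v, W, H, M):
    """Return the objective (6) if the assignment is feasible, else None.

    y[i], s[i] are 0/1 lists; x maps (i, j) with i < j to 0/1.
    Items are assumed sorted by decreasing height.
    """
    n = len(y)
    # Assumed form of Eq. (2): SW, level initializer, or packed exactly once.
    for j in range(n):
        if s[j] + y[j] + sum(x.get((i, j), 0) for i in range(j)) != 1:
            return None
    # Eq. (3): width of each level, which also forces x to 0 when y[i] = 0.
    for i in range(n - 1):
        used = sum(w[j] * x.get((i, j), 0) for j in range(i + 1, n))
        if used > (W - w[i]) * y[i]:
            return None
    # Eq. (4): the stacked level heights must fit in the FPGA height.
    if sum(h[i] * y[i] for i in range(n)) > H:
        return None
    # Eq. (5): external memory used by the SW implementations.
    if sum(s[i] * mem[i] for i in range(n)) > M:
        return None
    # Eq. (6): total softcore utilization.
    return sum(u[i] * s[i] + v[i] * (1 - s[i]) for i in range(n))


# Three hypothetical items: items 0 and 1 share one level, item 2 runs in SW.
w, h = [4, 3, 2], [5, 4, 3]
mem, u, v = [100, 100, 100], [0.4, 0.3, 0.3], [0.01, 0.02, 0.03]
good = check_and_score([1, 0, 0], {(0, 1): 1}, [0, 0, 1], w, h, mem, u, v, 8, 6, 256)
bad = check_and_score([1, 0, 0], {(0, 1): 1, (0, 2): 1}, [0, 0, 0], w, h, mem, u, v, 8, 6, 256)
```

The second call fails the width constraint, because items 1 and 2 together need 5 columns while only 4 remain beside item 0.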

Let us now evaluate the number of variables and constraints. The $y_i$ are $n$, because all the items may initialize one level. The $x_{i,j}$ are $n(n-1)/2$ and the $s_i$ are $n$. Hence the total number of variables is $n(n+3)/2$. Also, by counting the number of inequalities in Equations (2), (3), (4), and (5), we find that the number of constraints is $2n + 1$.

Another possible metric could be to provide for maximum future extensibility on the HW side. In this case, the area constraint provided by Eq. (4) is replaced by the following constraint on the SW tasks

\[ \sum_{i=1}^{n} \left( u_i s_i + v_i (1 - s_i) \right) \le m \qquad (7) \]

and the metric to be minimized is no longer the total SW utilization but the height of the required FPGA area

\[ \text{minimize} \quad \sum_{i=1}^{n} y_i \, h_i \qquad (8) \]

We will not further explore this option, focusing our analysis on the extensibility of the SW implementation as a design goal. Please note that the minimization of the total utilization may even allow for the complete deallocation of some processors if the number of processors is too high. In this case, the additional FPGA LEs provide the desired flexibility on the HW side. In general, however, the ILP optimizer is run for each possible value of $m$ and computes the solution with minimum utilization $U(m)$. The selection of $m$ can be performed by choosing the solution with the minimum value of the average utilization $U(m)/m$. This solution represents the design which is probably easiest to schedule and provides for the maximum future extensibility.

Extending to the two rectangles case

As explained in Section 3.1, after the allocation of the softcores, the remaining FPGA resources can be modeled as two rectangles. The HW allocation problem has already been described for the case in which only one rectangle is available. If the items are allocated into two rectangles, labeled A and B, the problem can be solved by duplicating the variables into the pairs $y_i^A$, $y_i^B$ and $x_{i,j}^A$, $x_{i,j}^B$, meaning that an item can be allocated in either of the two rectangles. In this case, we have $2n$ variables $y_i$ and $n(n-1)$ variables $x_{i,j}$, but still $n$ variables $s_i$. Hence, the total number of variables grows to $n^2 + 2n$. Moreover, the constraints of Equations (3) and (4) are required for both rectangles. As a result, the total number of constraints is $3n + 3$.

We remark that the optimization routine assumes that the sizes of the two rectangles are known in advance, meaning that the number of available softcores must be fixed in advance. The best number of processors for our design problem is not known in advance. However, a low number of processors forces a HW implementation of most functions, including those for which a SW implementation would be more efficient. Conversely, a large number of softcores reduces the number of LEs that are available for the HW implementation of functions. Hence, we expect that, in most cases, the best solutions are found for an intermediate number of processors allowing for the best implementation of each functionality. This intuition is confirmed by the results of the experimental section.
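The outer loop over the number of softcores reduces to picking the value of m with the smallest U(m)/m once the per-m optima are available. The sketch below assumes those optima are already computed and stored in a dictionary; the function name and the numeric values are made up for illustration and are not results from the paper.

```python
import math


def select_softcore_count(total_util):
    """Pick m minimizing U(m)/m; entries with value None are infeasible."""
    best_m, best_avg = None, math.inf
    for m, u_total in sorted(total_util.items()):
        if m <= 0 or u_total is None:
            continue  # no softcore, or no solution found for this m
        avg = u_total / m
        if avg < best_avg:
            best_m, best_avg = m, avg
    return best_m, best_avg


# Hypothetical minimum total utilizations U(m) for a few values of m.
example_U = {0: None, 2: 2.2, 4: 3.4, 5: 4.2, 6: 5.1}
best_m, best_avg = select_softcore_count(example_U)
```

A stricter variant could also discard values of m for which U(m) exceeds m, since such a design cannot be scheduled on m processors.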

3.3. Allocation of the SW functions

After partitioning the functions between HW and SW implementation, we consider the problem of scheduling the tasks onto the $m$ processors. The literature on multiprocessor scheduling provides solutions divided in two main classes: partitioning and global algorithms.

In the first case, finding the optimal partition of the tasks onto the $m$ processors is an NP-complete problem, and efficient heuristics have been proposed [24, 15, 3] to solve it. The problem can be approached in two ways. The first is to find the minimum number of processors that can feasibly schedule the tasks. In this case the problem becomes a classical bin packing, where the items are the task utilizations $u_i$. If the minimum found is smaller than the number of available processors $m$, then the unused softcores may be deallocated to provide additional HW resources, or used to execute dynamic tasks that may be requested at run-time. As an alternative, it is possible to balance the load among all processors. This solution is well suited for energy-critical applications. In fact, a low average utilization on all the softcores allows the reduction of the FPGA clock rate and, consequently, of the energy consumed by the FPGA.

If a global strategy is chosen, the SW tasks can run on any of the $m$ processors. In this case, pFair scheduling algorithms [1] can use all the processor capacity and allow balancing the load at runtime among the processors. Simpler but less effective multiprocessor scheduling policies are Earliest Deadline First (EDF) [4, 6] and RM-US [2].
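The bin packing view of the partitioned approach can be sketched with the First-Fit Decreasing heuristic. This is one common heuristic, not necessarily the one used in the cited works. The per-processor bound of 1.0 is an assumption that corresponds to EDF scheduling on each softcore; a fixed-priority scheduler would need a lower bound. The function name and the utilizations are hypothetical.

```python
def first_fit_decreasing(utilizations, m, bound=1.0):
    """Assign tasks to at most m processors; return None if it does not fit.

    Returns a list with one list of task indices per processor.
    """
    loads = [0.0] * m
    assignment = [[] for _ in range(m)]
    # Consider the heaviest tasks first.
    order = sorted(range(len(utilizations)),
                   key=lambda k: utilizations[k], reverse=True)
    for idx in order:
        for p in range(m):
            # Small tolerance to absorb floating point rounding.
            if loads[p] + utilizations[idx] <= bound + 1e-12:
                loads[p] += utilizations[idx]
                assignment[p].append(idx)
                break
        else:
            return None  # the task fits on no processor
    return assignment


utils = [0.6, 0.5, 0.4, 0.3, 0.2]  # total utilization 2.0
two_cores = first_fit_decreasing(utils, 2)
one_core = first_fit_decreasing(utils, 1)
```

Processors that end up with an empty list are the ones that could be deallocated or kept for dynamic tasks, as discussed above.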

4. Experiments

We performed a set of experiments to demonstrate the effectiveness of the packing algorithm and to highlight the tradeoffs in the HW/SW allocation of functionality in FPGAs with softcores. The simulation platform is a personal computer with an Intel Xeon processor running at 3.4 GHz. The optimization problem has been implemented using the GNU Linear Programming Kit (GLPK) [19]. The C code used for the experiments is available on-line [7].

Simulation settings

The simulation parameters have been defined in accordance with a model of the Altera Cyclone FPGA and the NIOS II softcore. In particular, we considered the EP2C20 model with 18752 Logic Elements, divided in 46 × 26 cells or logic array blocks of 16 LEs each, and a NIOS II softcore with 1800 LEs, in a rectangle of 14 × 8 cells. According to these values, the number of softcores that can be implemented into the FPGA ranges between 0 and 9. The softcores are allocated in columns, as shown in Figure 5. After their allocation, the LEs that remain for the placement of the HW blocks are placed inside two rectangles having size (width × height) $W^A × H^A$ and $W^B × H^B$, as shown in Table 1 (notice that the table also reports the experiment results, which will be explained later).

Table 1. Experiment results as function of the number of softcores.

Num. procs. (m)         0         1         2         3         4         5         6         7         8         9
(W^A, H^A)              (46, 26)  (32, 26)  (32, 26)  (32, 26)  (18, 26)  (18, 26)  (18, 26)  (4, 26)   (4, 26)   (4, 26)
(W^B, H^B)              (0, 0)    (14, 18)  (14, 10)  (14, 2)   (14, 18)  (14, 10)  (28, 2)   (14, 18)  (14, 10)  (42, 2)
Avg. time (sec)         9.2       1042.4    640.4     83.7      1495.9    1120.3    860.7     4.8       1.4       0.2
Optimality (%)          100       63.1      84.5      100       29.8      63.1      72.6      98.8      100       100
Feasibility (%)         0.0       0.0       25.0      46.4      95.2      97.6      98.8      85.7      48.8      5.9
Avg. proc. util.        ∞         1.6548    1.1226    1.0062    0.8531    0.8314    0.8539    0.9574    1.0018    1.0969
ami33: Run time (sec)   754.0     14400     5530.0    3603.0    14400     9619.0    14400     36.0      23.0      1.0
ami33: Optimality       Y         N         Y         Y         N         Y         N         Y         Y         Y
ami33: Proc. util.      ∞         1.2300    0.7626    0.7993    0.9074    0.8576    0.8615    0.9874    1.0085    1.0718

When the number of processors is 4, 5, 7 or 8, the LEs on top of the first column are left unused. Accounting for these LEs would require the addition of a third rectangle and a much larger number of optimization variables. If a fractional allocation of softcores were possible, the FPGA would allow for 1196/112 = 10.6786 CPUs and the bound on the maximum possible software utilization would be 10.6786. Although this value is only a theoretical upper bound, it can be used to evaluate the effectiveness of the packing solution with respect to the amount of used resources.

For each experimental run, random sets of functions have been generated. The number of functions is always 25. The number of variables is 675 and the constraints are 78. We performed the optimization assuming a number of processors ranging from 0 to 9. The random set of functions is generated in such a way that the total area requirement for the HW implementation, $\sum_i w_i h_i$, always equals the FPGA area.
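The figures quoted above can be checked by hand. The worked arithmetic below uses only the counting formulas of the two-rectangle formulation and the cell dimensions given in this section, with n = 25 functions.

```latex
\begin{align*}
\text{variables}
  &= \underbrace{2n}_{y_i^A,\; y_i^B}
   + \underbrace{n(n-1)}_{x_{i,j}^A,\; x_{i,j}^B}
   + \underbrace{n}_{s_i}
   = n^2 + 2n = 625 + 50 = 675, \\
\text{constraints}
  &= 3n + 3 = 75 + 3 = 78, \\
\text{fractional softcore bound}
  &= \frac{46 \cdot 26}{14 \cdot 8} = \frac{1196}{112} \approx 10.6786 .
\end{align*}
```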
These settings for the random function sets have been purposely selected so that a feasible HW/SW partitioning is very hard to find, and the problem of finding a feasible solution with minimum SW utilization is not trivial. The HW footprint of each function has been selected with a random form factor (height/width) such that $0.5 \le h_i / w_i \le 2$.

In real applications, we expect some degree of correlation between HW and SW implementation. This correlation is taken into account by calculating the ideal utilization of each functionality, $u_i^{id} = \frac{w_i h_i}{A_{CPU}}$, and extracting the SW utilization $u_i$ from an interval centered on $u_i^{id}$, namely $[u_i^{id} - \delta, u_i^{id} + \delta]$. The uncertainty on the actual value of $u_i$ accounts for the fact that there exist functions for which a HW implementation is more efficient than a SW implementation and vice versa. Finally, all the task utilizations are scaled so that $\sum_i u_i = \frac{A_{FPGA}}{A_{CPU}}$. Similarly, the software utilization $v_i$ of the hardware implementations is centered around $u_i^{id}/25$, assuming that the hardware implementation is on average 25 times faster than the software.

Time complexity
The packing problem defined in this paper is NP-complete and the time required by the BILP-based solution increases exponentially with the number of functions. The total running time depends on the function set, but can be very large for sets of 25 or more functions. The BILP solver, however, keeps track of the best solution that it finds at any time during the search. In the experiments, each run was stopped after 30 minutes (1800 seconds), and the best solution available at that time was considered, even if not optimal. The average time spent by the algorithm for a given number of softcores is reported in the “Avg. time” row of Table 1. The row “Optimality” is the percentage of runs where the optimum was found before stopping the algorithm. Finally, the row “Feasibility” shows the percentage of solutions possibly admitting a feasible placement and schedule, i.e. all the solutions for which $\sum_i u_i s_i \le m$.

When there is no processor (first column), the allocation is performed on one rectangle only and the number of optimization variables is halved. Hence, the run time is considerably lower and the ILP solver always finds the optimum. When the number of softcores is 9, the FPGA rectangles that are left for the HW functions are so small that the number of possible placement options is very limited and the solver is always able to finish within the time limit. The optimization algorithm requires a much longer time in the other cases, when the number of possible solutions is very high and the branch and bound search must evaluate all of them. In conclusion, in its current formulation, the optimization problem can be effectively solved for fewer than 30 functions.

FPGA occupancy
Another outcome of the experiments is the evaluation of the area that is used in the FPGA as the number of processors varies. This gives a measure of the effectiveness of the level packing approximation to the 2D bin packing problem. Figure 7 shows the results.
The number of softcores is shown on the x axis, whereas the y axis shows the number of logic elements. The area used by the softcores is clearly proportional to the number of processors. For 4, 5, 7 and 8 softcores there is a small amount of area loss because of internal fragmentation inside the column of the CPUs. This loss consists of 28 logic elements (CPU fragm. in the figure) when m ∈ {4, 5} and 56 LEs when m ∈ {7, 8}. The LEs that have been allocated to HW functions make up most of the remaining space. Level packing results in internal fragmentation inside each level. The number of LEs that are lost because of fragmentation inside each horizontal level is labelled Horiz. fragm. The horizontal fragmentation amounts to roughly 17% of the LEs that are used for the HW implementation of functions. Finally, the figure shows the number of unused LEs that remain on top of the last level (denoted by Vert. fragm.).

Figure 7. FPGA occupancy (x axis: Number of Softcores; y axis: Number of Logic Elements (x16); areas: Softcore area, CPU fragm., HW implem. functions, Horiz. fragm., Vert. fragm.).

Resource utilization
Another important metric combines the utilizations of both the HW and the SW implemented functions. Figure 8 shows the results. A thick black line shows the overall FPGA resources (expressed as HW and SW utilization) that are available. This bound is always equal to $\frac{A_{FPGA}}{A_{CPU}} = 10.6786$ and cannot be exceeded.
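As a worked check of why this bound does not depend on the number of softcores, the arithmetic can be written out as follows. It is a sketch under one assumption, taken from the definition of the ideal utilization: the area left outside the softcores is converted to utilization by dividing it by $A_{CPU}$. Areas are counted in cells, and the split into "SW capacity" and "residual area" is illustrative labelling, not notation from the paper.

```latex
\begin{align*}
A_{FPGA} &= 46 \times 26 = 1196, \qquad A_{CPU} = 14 \times 8 = 112, \\
\frac{A_{FPGA}}{A_{CPU}} &= \frac{1196}{112} \approx 10.6786, \\
\underbrace{m}_{\text{SW capacity}}
  + \underbrace{\frac{A_{FPGA} - m\,A_{CPU}}{A_{CPU}}}_{\text{residual area as utilization}}
  &= \frac{A_{FPGA}}{A_{CPU}} \quad \text{for every } m, \\
\text{e.g. } m = 4:\quad 4 + \frac{1196 - 448}{112} &= 4 + 6.6786 = 10.6786 .
\end{align*}
```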

Figure 8. Resource utilization.

Figure 8 also shows the minimum utilization of the SW tasks (label only SW tasks) and the utilization that also includes the impact on the softcores of the HW implemented functionalities (label SW + HW util.). From the figure and from the “Avg. proc. util.” row in Table 1 we see that, on average, the problem is feasible only when m ∈ {4, 5, 6, 7}. In fact, for an intermediate number of cores, the optimization routine can best exploit the flexibility of HW/SW placement. The figure also shows the resources that are required by the functions implemented in HW (label HW requirement), which decrease with the number of softcores, and finally the sum of the SW and HW requirements (Tot. requirement in the figure). When there are 4, 5 or 6 processors the FPGA is used at its best: after allocating all the functions and adding the utilization of the LEs that are lost because of horizontal fragmentation, a share of the available HW/SW resource utilization still remains. This is shown in the graph by the line associated with the total requirement plus the horizontal fragmentation (label Tot. req+Horiz. frag.), which is below the limit value 10.6786. Finally, Table 1 shows that with 4, 5 or 6 processors the solution found is possibly feasible in more than 95% of the cases, even if the total requested area equals the available FPGA area. This confirms our initial intuition that using 4, 5 or 6 processors often leads to the best solution.

The ami33 benchmark
In our last set of experiments we tried the set of blocks of the ami33 benchmark [11], which not only provides a realistic definition of the geometry of the blocks, but is also not well suited to level packing because of the large variation in the relative sizes of the blocks. The ami33 benchmark is used for comparing the quality of floorplanning algorithms, but provides no indication of the complexity of a software implementation of its blocks. Hence, we applied the same formula used to associate a utilization parameter to the previous random sets. The results are summarized in the lower part of Table 1. The level packing procedure is less efficient than state-of-the-art floorplanning algorithms. The ami33 set of blocks can be placed on the FPGA with 4% unused area [11], while in our case the solution without softcores is not feasible and leaves 21% unused area. However, the ILP optimization is capable of solving the problem and leveraging the tradeoffs between HW and SW implementation, similarly to what happened with the randomly generated sets. In the case of the ami33 example, the best solution with respect to the average core utilization is obtained for m = 2 softcores, when the average load on the processors is 0.7626.

5. Conclusions

In this paper we investigate the HW/SW codesign of FPGAs with softcores supporting the execution of real-time tasks. The problem has been formulated through an efficient model of the 2D packing problem, called level packing. Our approach tackles the problem of the softcore placement together with the HW/SW partitioning of functions. Even if many “real-world” aspects of the problem are not considered (such as the heterogeneity of the FPGA and the routing constraints for the communication among the HW implemented functions and the softcores), we believe that the model of the FPGA, the softcores and the functions is sufficient for practical applicability in codesign problems. As future work, we plan to further refine the model to capture the details of the internal architecture of the FPGA, at least when a striped resource placement (see Altera’s Cyclone) is applicable.

References

[1] J. H. Anderson and A. Srinivasan. Early-release fair scheduling. In Proceedings of the 12th Euromicro Conference on Real-Time Systems, pages 35–43, Stockholm, Sweden, June 2000.
[2] B. Andersson, S. K. Baruah, and J. Jonsson. Static-priority scheduling on multiprocessors. In Proceedings of the 22nd IEEE Real-Time Systems Symposium, pages 193–202, London, U.K., Dec. 2001.
[3] S. K. Baruah. Partitioning real-time tasks among heterogeneous multiprocessors. In Proceedings of the 33rd Annual International Conference on Parallel Processing, pages 467–474, Montreal, Canada, Aug. 2004.
[4] S. K. Baruah, S. Funk, and J. Goossens. Robustness results concerning EDF scheduling upon uniform multiprocessors. IEEE Transactions on Computers, 52(9):1185–1195, Sept. 2003.
[5] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for reconfigurable computing systems. IEEE Design and Test of Computers, 17(1):68–83, Jan. 2000.
[6] M. Bertogna, M. Cirinei, and G. Lipari. Improved schedulability analysis of EDF on multiprocessor platforms. In Proceedings of the 17th Euromicro Conference on Real-Time Systems, pages 209–218, Catania, Italy, July 2005.
[7] E. Bini and M. Di Natale. C code for experiments on FPGA allocation. Available at http://feanor.sssup.it/~bini/resources/c/FPGAalloc.c, 2006.
[8] G. J. Brebner. The swappable logic unit: A paradigm for virtual hardware. In Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines, pages 77–85, Napa Valley (CA), U.S.A., Apr. 1997.
[9] Celoxica. DK design suite: Matlab-Simulink integration. Available at http://www.celoxica.com/products/dk/matlab.asp.
[10] Y. Chang, Y. Chang, G. Wu, and S. W. Wu. B*-trees: a new representation for non-slicing floorplans. Design Automation Conference DAC, pages 458–463, 2000.
[11] W. Dai, L. Wu, and S. Zhang. MCNC floorplan benchmark: Circuit ami33. Available at http://www.cse.ucsc.edu/research/surf/GSRC/MCNC/ami33/ami33.html.
[12] S. K. Dhall and C. L. Liu. On a real-time scheduling problem. Operations Research, 26(1):127–140, Jan. 1978.
[13] F. Y. Young and D. Wong. How good are slicing floorplans. Integration, the VLSI Journal, pages 23–61, 1997.
[14] P. Guo, C. Cheng, and T. Yoshimura. An O-tree representation of non-slicing floorplans and its applications. Design Automation Conference DAC, pages 268–273, 1999.
[15] J. K. Lenstra, D. B. Shmoys, and E. Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46:259–271, 1990.
[16] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the Association for Computing Machinery, 20(1):46–61, Jan. 1973.
[17] A. Lodi, S. Martello, and M. Monaci. Two-dimensional packing problems: A survey. European Journal of Operational Research, 141:241–252, 2002.
[18] A. Lodi, S. Martello, and D. Vigo. Models and bounds for two-dimensional level packing problems. Journal of Combinatorial Optimization, 8(3):363–379, 2004.
[19] A. Makhorin. GNU Linear Programming Kit. Available at http://www.gnu.org/software/glpk/.
[20] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani. VLSI module placement based on rectangle-packing by the sequence pair. IEEE Transactions on CAD, 15(12):1518–1524, 1996.
[21] S. Nakatake, H. Murata, K. Fujiyoshi, and Y. Kajitani. Module placement on BSG-structure and IC layout applications. ICCAD Conference, pages 484–491, 1996.
[22] R. Otten. Automatic floorplan design. Design Automation Conference DAC, pages 261–267, 1982.
[23] R. Pellizzoni and M. Caccamo. Adaptive allocation of software and hardware real-time tasks for FPGA-based embedded systems. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, pages 208–220, San José (CA), U.S.A., Apr. 2006.
[24] C. N. Potts. Analysis of a linear programming heuristic for scheduling unrelated parallel machines. Discrete Applied Mathematics, 10:155–164, 1985.
[25] K. Sakanushi and Y. Kajitani. The quarter-state sequence (Q-sequence) to represent the floorplan and applications to layout optimization. IEEE APCCAS, pages 829–832, 2000.
[26] H. Simmler, L. Levinson, and R. Männer. Multitasking on FPGA coprocessors. In Proceedings of the 10th International Workshop on Field-Programmable Logic and Applications, pages 121–130, Villach, Austria, Aug. 2000.
[27] C. Steiger, H. Walder, M. Platzner, and L. Thiele. Online scheduling and placement of real-time tasks to partially reconfigurable devices. In Proceedings of the 24th IEEE Real-Time Systems Symposium, pages 224–235, Cancun, Mexico, Dec. 2003.
[28] H. Walder and M. Platzner. Online scheduling for block-partitioned reconfigurable devices. In Design, Automation and Test in Europe Conference and Exhibition, pages 290–295, Munich, Germany, Mar. 2003.
[29] H. Walder and M. Platzner. An EDF schedulability test for periodic tasks on FPGAs. In Design Automation and Test in Europe, Munich, Germany, Mar. 2006.
[30] G. Wigley and D. Kearney. The development of an operating system for reconfigurable computing. In Proceedings of the 9th IEEE Symposium on Field-Programmable Custom Computing Machines, pages 249–250, 2001.
[31] F. Y. Young, C. C. N. Chu, W. S. Luk, and Y. C. Wong. Floorplan area minimization using Lagrangian relaxation. International Symposium on Physical Design, pages 174–179, 2000.
