IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 31, NO. 10, OCTOBER 2012

Algorithms for Gate Sizing and Device Parameter Selection for High-Performance Designs Muhammet Mustafa Ozdal, Steven Burns, and Jiang Hu

Abstract—It is becoming increasingly important to design high-performance circuits with as low power as possible. In this paper, we study the gate sizing and device parameter selection problem for today's industrial designs. We first outline the typical practical problems that make it difficult to use traditional algorithms on high-performance industrial designs. Then, we propose a Lagrangian relaxation-based formulation that decouples timing analysis from optimization without a resulting loss in accuracy. We also propose a graph model that accurately captures discrete cell-type characteristics based on library data. We model the relaxed Lagrangian subproblem as a graph problem and propose algorithms to solve it. In our experiments, we demonstrate the importance of using the signoff timing engine to guide the optimization. We also show the benefit of the graph model we propose to solve the discrete optimization problem. Compared to a state-of-the-art industrial optimization flow, we show that our algorithms can obtain up to 38% leakage power reductions and better overall timing for real high-performance microprocessor blocks.

Index Terms—Circuit optimization, dynamic programming, gate sizing, Lagrangian relaxation.

I. Introduction

IT IS BECOMING increasingly important to design high-performance circuits with as low power as possible. In this paper, we focus on the gate sizing and device parameter selection problem for designs with high-performance requirements. The gate-sizing problem has been studied extensively in the literature, and many techniques have been proposed to solve it. However, the new challenges in modern technologies make it hard to apply many of these techniques without incurring significant overheads. One main problem with existing approaches is the oversimplification of the timing models, which can lead to suboptimal sizing decisions and can require large guardbands to avoid timing violations. For high-performance designs especially, significant power savings are possible by taking accurate timing information into account during gate sizing.

Manuscript received September 23, 2011; revised December 15, 2011 and February 13, 2012; accepted March 20, 2012. Date of current version September 19, 2012. The preliminary version of this paper was presented at the IEEE/ACM ICCAD Conference in November 2011 [18]. This paper was recommended by Associate Editor S. Vrudhula. M. M. Ozdal and S. Burns are with the Strategic CAD Labs, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: [email protected]; [email protected]). J. Hu is with Texas A&M University, College Station, TX 77843 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCAD.2012.2196279

We can list the main optimization challenges in modern industrial designs as follows.
1) Discrete cell sizes: Continuous optimization techniques [such as Lagrangian relaxation (LR) [4]] inherently assume that cell sizes are continuous and that the delay-power tradeoff can be modeled as a convex curve. However, industrial cell libraries contain a limited number of cell sizes. Furthermore, different technology parameters introduce discrete cell families with very few elements each (e.g., only a few levels of threshold voltage exist in a typical library). For example, for Intel's 32-nm technology, it was shown that varying Idsat (corresponding to drive strength) between 1.4 mA/μm and 1.6 mA/μm leads to about a 10× difference in Ioff (corresponding to leakage power) for an NMOS transistor [8]. A typical cell library contains a few cell families, each corresponding to a different set of technology parameters and hence a different power-performance tradeoff. In the presence of discrete technology parameters, the circuit optimization problem is no longer only the well-known gate-sizing problem, but also a discrete cell-type selection problem. Using traditional continuous optimization techniques for this purpose would require defining continuous models over different device technologies, which may not accurately capture the characteristics of the discrete cells in the library. Furthermore, these continuous sizes would need to be snapped to discrete sizes at the end, which can be very sparse, especially across different technology parameters.
2) Cell timing models: Modern cell libraries utilize complex timing models to characterize cell delays. Furthermore, the different technology parameters used for different cells (such as gate length, doping, and others) lead to different threshold voltages and different tradeoffs between drive strength and leakage power.
It is hard to represent all the different cell types with a simple formula such as delay = driver_resistance × load. Even higher order convex models are not accurate enough for modern cell libraries because, in reality, delay is not a convex function of cell size and output load (due to transistor folding in the layout, etc.) [5]. As an example, consider Fig. 1 for a timing arc of a cell from an industrial 32-nm library. Here, normalized delay is plotted with respect to normalized size and output load. Note that the ratio of output load to size is constant in this graph. The simple delay models would assume

0278-0070/$31.00 © 2012 IEEE

constant delay when the cell size and the output load both change by the same factor. However, in reality, this is not the case, as demonstrated in this figure. Furthermore, different technology parameters lead to different cell families, and hence different curves, two of which are shown in this figure. Note that the curve at the top corresponds to a cell family with a higher threshold voltage (lower leakage) than the other.
3) Complex timing constraints: Various timing constraints can be imposed on high-performance designs, such as timing overrides, multicycle paths, transparent paths, multiple clock events, false paths, and others. An algorithm that relies on simple timing models to compute slacks will not capture these constraints accurately.
4) Interconnect timing models: The Elmore model may be reasonable for early estimation, but it is not accurate enough for final optimization. As interconnect delays become more and more dominant with device scaling, slack computation using simple interconnect delay models (as done by many sizing algorithms in the literature) is no longer reliable.
5) Slew effects: Cell delays are typically defined with respect to input slews and output loads. The input slews can have a significant impact on gate delays. For example, downsizing a gate can lead to a worse slew at its output pin, which also worsens the delays through the gates after it. So, slew changes and their impact on other cells should be considered during optimization.
6) Many near-critical paths: In a high-performance design, a large portion of the cells may be on critical or near-critical paths. Algorithms that optimize only critical paths iteratively can have convergence problems, leading to suboptimal results. A near-critical path can become critical repeatedly depending on the size changes. Ideally, an optimization algorithm should consider all paths in a high-performance design, not only the most critical ones.
7) Large design sizes: Blocks in modern designs can easily contain millions of cells, and the optimization algorithm should be scalable enough to handle such large designs. Compute-intensive algorithms may therefore be impractical for modern designs. Also, algorithms that iteratively upsize a few gates and perform incremental timing updates can be too expensive, especially if an accurate timing engine is utilized.

As listed above, a gate-sizing algorithm applicable to modern high-performance designs should have an accurate view of timing. Modeling all the complicated timing constraints in the optimizer is not practical given the sophistication of modern timing engines. Ideally, the optimizer should be guided by the slacks computed by the signoff timing engine. However, calling the signoff timer after every cell size change is not practical due to long runtimes, even in incremental mode.

In this paper, we make the following contributions.
1) We propose an LR formulation (Section IV) that decouples timing analysis from the optimization engine without a resulting loss of accuracy. Existing LR-based sizing algorithms incorporate static timing analysis modeling into the optimization formulation [4]. Although this is adequate for simple timing models, the complexity will significantly increase in the presence of realistic timing constraints (multiple clock domains, multicycle overrides, false paths, transparent latches, etc.). Our proposed formulation allows using the slack values computed by the signoff timing engine, rather than having to model all timing constraints in the optimizer. In other words, an accurate signoff timer is used as a black box to compute the slack values, which are then used to guide the optimization engine.
2) We propose a graph model (Section V) that captures the delay costs of discrete cells accurately based on the timing tables in the cell library. For multi-fanout nets, each edge cost can be modeled independently, while still capturing the fact that the load of one branch can affect the delay of another branch. The slew effects on delay are also taken into account as defined in the library. We show that minimizing the subnode and edge costs in this graph corresponds to optimizing the Lagrangian-relaxed subproblem (LRS).
3) We propose a delta-delay cost metric that alleviates the suboptimalities due to double counting in directed acyclic graph (DAG) optimization by combining fanin and fanout costs in the subnode costs.
4) We propose a dynamic programming (DP) algorithm based on critical tree extraction to solve the LRS optimization problem for discrete cells (Section VI).
5) In our experimental study (Section VII), we first use real microprocessor blocks to show the practical benefits of using the signoff timer to guide the LR iterations and the advantages of our DP algorithm based on our graph model. Then, we compare our results with a state-of-the-art industrial tool and show that up to 38% power savings are possible, while reducing the timing violations.

Fig. 1. Delay as a function of cell size and output load for two cell families from a 32-nm industrial library. On the x-axis, the size and output load values are changed such that load/size = k, where k is kept constant. All values are normalized.

II. Related Work

Many previous works on gate sizing are based on continuous optimization, such as linear programming [2], [5], [13], quadratic programming [17], convex programming [6], [23], [22], greedy search [4], [20], [25], and network flow [21], [26]. The prevalence of cell library-based designs requires

rounding the continuous solutions to discrete options in a library. The rounding is not trivial at all, as it can easily result in infeasible solutions. As such, the rounding can be as complicated as a full-fledged combinatorial optimization such as DP [11] and branch-and-bound [20]. Moreover, the continuous optimizations are often restricted to simple delay models such as the piecewise linear model [21] and the RC switch gate model [4], [21], [25], [26]. Discrete gate sizing is also tackled directly by combinatorial optimization. Not surprisingly, the discrete sizing problem has been proved NP-hard [14]. An early work on discrete gate sizing [3] is an exhaustive search, which is hardly affordable even on medium-sized circuits. A fast gate-sizing heuristic is reported in [10], but it optimizes only delay without considering power or cost. For simultaneous gate sizing and threshold voltage assignment, sensitivity-based heuristics are reported in [24] and [27]. The greedy nature of these heuristics often leads to ad hoc solutions. A recent advance is [16], which allows a DP-like systematic solution search despite the notorious fanout reconvergence problem. Although combinatorial approaches are capable of addressing the aforementioned practical issues, such a capability has not been demonstrated in these previous works. A multimove search method is developed in [7]. This method is a flexible framework that can accommodate many practical concerns such as accurate delay models and slew rates. However, its runtime is difficult to scale to modern chip designs. Instead of directly solving the gate-sizing problem, LR transforms the problem into a subproblem, which is easier to solve, and a companion dual problem. An elegant and perhaps the most famous LR-based work is [4]. With the Elmore delay model, it provides the optimal continuous solutions. Later on, speedup techniques for LR-based approaches were proposed in [25].
LR has also been integrated with convex programming [6], network flow [26], and a DP-based algorithm [16]. A new technique for solving the Lagrangian dual problem is presented in [12]. However, all these works assume simple timing models, and the more practical issues are ignored.

III. Preliminaries

Let D be the given design that contains a set of standard cells C, a set of pins P on these cells, and a set of nets N that define the connectivity over these pins. For each standard cell c ∈ C, a set of cell types S_c is defined, where S_c can contain cell types with different sizes and device technology parameters.¹ Let power(s) denote the leakage power of cell type s ∈ S_c. For simplicity, we will focus on leakage power optimization, but it is straightforward to extend our models to consider dynamic power optimization, as will be discussed in Section V-F. Furthermore, assume that an arbitrary set of timing constraints is enforced on design D. Let slack(p) be defined as the timing slack of pin p ∈ P, which is computed by a timing engine. Negative slacks indicate timing violations. Let TNS

¹In the remainder of this paper, cell type will refer to both the gate size and the set of technology parameters used.

denote the absolute value of the total negative slack of all the primary outputs in D. Based on these definitions, we can formulate the problem as follows. Given a design D with timing constraints, determine the cell type s ∈ S_c for each cell c ∈ C such that the following objective function is minimized:

    α Σ_{c∈C} power(c) + TNS.

Here, α is a constant that determines the relative importance of power minimization with respect to timing violations. Eventually, the TNS value needs to be reduced to 0 before tapeout. However, during the design process, some timing violations can be allowed, assuming that changes will be made later in the design to fix these violations. Note that this objective function is chosen for practical reasons, as opposed to enforcing TNS to be zero, so that our techniques are also applicable to designs that have not yet converged. For such designs, it is not desirable to increase the clock period(s) to make TNS = 0, because doing so would lead to (artificially) downsizing most cells. Instead, it is more desirable to identify those paths that violate the timing constraints so that designers can specifically target them. Both power and TNS are important metrics to minimize, and the priority of power versus TNS minimization may vary based on the stage of the design project. Hence, we have chosen our objective function as the weighted sum of power and TNS. Selection of the right α value can be done empirically. Obviously, the library units for delay and power need to be taken into account while choosing the α value. For a given sample block from a design project, one can try different α values to obtain different results (e.g., see Fig. 12) and then choose the α value based on the desired power–TNS tradeoff. The same α value can then be utilized for other blocks belonging to the same project. In this paper, we assume a lookup table-based standard cell library, where cell delays and slews are defined using delay tables (DTs) and slew tables (STs).
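As a concrete illustration, the weighted objective α Σ power(c) + TNS defined above can be evaluated in a few lines. The cell powers and endpoint slacks below are hypothetical stand-ins for library data and signoff-timer output; this is only a sketch, not the authors' implementation:

```python
def objective(cell_power, endpoint_slack, alpha):
    """alpha * sum of leakage powers + TNS, where TNS is the absolute
    value of the total negative slack over all timing endpoints.
    cell_power: {cell: leakage}; endpoint_slack: {endpoint: slack}."""
    tns = sum(-s for s in endpoint_slack.values() if s < 0)
    return alpha * sum(cell_power.values()) + tns

# Made-up numbers: two violating endpoints (-3.0 and -1.0), so TNS = 4.0.
cost = objective({"u1": 2.0, "u2": 1.5},
                 {"o1": -3.0, "o2": 0.5, "o3": -1.0},
                 alpha=0.1)
print(cost)  # ≈ 0.1 * 3.5 + 4.0 = 4.35
```

Positive slacks contribute nothing to TNS, which matches the max(0, ·) form of the timing violation terms.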
The timing arcs are defined from the input pins of the cell (rising and falling) to the output pin² (rising and falling). For each timing arc through the cell, a table of the form DT[input_slew, output_load] exists to define the timing arc delay for the given input slew and output load. A similar table ST[input_slew, output_load] exists for output slews. If a given slew or load value does not exactly match the entries in the table, linear interpolation is performed between the nearest table entries. So, in this paper, we assume that two functions delay(input_slew, output_load) and output_slew(input_slew, output_load) are available in the cell library for each timing arc of each cell type. Note that although delay and slew have some linear dependence on load within an interval (due to linear interpolation between table entries), in general the behavior can be nonlinear if the load and input slew values vary significantly. Note that modern timing engines may utilize more advanced current or voltage source models (e.g., [1]) than lookup tables. As will be explained in Section IV, our LR formulation allows decoupling slack computations from the optimization engine. In other words, the signoff timing engine can be used as a black box to compute the accurate slack values, which are then

²For simplicity, let us assume a single output pin for each cell. It is straightforward to generalize our models to multioutput cells.
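The table lookup with linear interpolation described above can be sketched as a bilinear interpolation over the two table axes. The helper name `interp_table` and all grid values are invented for illustration; real libraries also define extrapolation behavior that is omitted here (queries are simply clamped):

```python
import bisect

def interp_table(slews, loads, table, input_slew, output_load):
    """Bilinear interpolation in a DT[input_slew, output_load]-style table.
    slews/loads are sorted axis breakpoints; table[i][j] is the value at
    (slews[i], loads[j]). Out-of-range queries are clamped to the grid."""
    def bracket(axis, x):
        # Find neighboring breakpoints and the interpolation fraction t.
        x = min(max(x, axis[0]), axis[-1])
        i = min(bisect.bisect_right(axis, x), len(axis) - 1)
        lo = i - 1 if i > 0 else 0
        hi = lo + 1 if lo + 1 < len(axis) else lo
        t = 0.0 if axis[hi] == axis[lo] else (x - axis[lo]) / (axis[hi] - axis[lo])
        return lo, hi, t

    i0, i1, ts = bracket(slews, input_slew)
    j0, j1, tl = bracket(loads, output_load)
    # Interpolate along the load axis in both slew rows, then between rows.
    row0 = table[i0][j0] * (1 - tl) + table[i0][j1] * tl
    row1 = table[i1][j0] * (1 - tl) + table[i1][j1] * tl
    return row0 * (1 - ts) + row1 * ts

# Invented 2x2 delay grid: query halfway along both axes.
d = interp_table([10, 20], [1, 2], [[100, 200], [150, 250]], 15, 1.5)
print(d)  # 175.0
```

The same routine would serve both the delay(input_slew, output_load) and output_slew(input_slew, output_load) functions, since both are table-driven.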

used to guide the optimization engine. In this framework, the optimization engine needs the information of relative delay or slew values and delay or slew sensitivities for different cell types (see Section V). In the remainder of this paper, it is assumed that this information is available in the standard cell library, while the signoff timer can use more accurate models to compute the actual slack values to guide the optimization.

IV. LR-Based Optimization

LR is a general technique for solving optimization problems with difficult constraints. The main idea is to replace each complicated constraint with a penalty term and add it to the original objective function. Specifically, each penalty term is the original constraint multiplied by the corresponding Lagrangian multiplier (LM). The Lagrangian problem is defined as the optimization of this new objective function. If the optimization is a minimization problem, then the solution of the Lagrangian problem is guaranteed to be a lower bound for the original problem. So, the goal is to find the best LMs such that the optimal value obtained for the Lagrangian problem is as close to the real optimal value as possible. For this purpose, the LMs are updated iteratively (typically using a subgradient method) in a high-level loop, while the relaxed Lagrangian problem is solved for the fixed LM values in iterations of a low-level loop. Further details about LR can be found in [9].

LR has been utilized for various design automation problems, including gate sizing [4], floorplanning [15], and routing [19]. Existing work on LR-based sizing [4] incorporates static timing analysis modeling directly into the formulation of the optimization problem. Although this is adequate for simple timing models, the complexity will increase significantly in the presence of complex timing constraints (multiple clock domains, multicycle overrides, false paths, transparent latches, etc.).
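The two-level loop described above (subgradient LM updates in the high-level loop, relaxed-subproblem solves in the low-level step) can be sketched structurally as follows. Every callable here is a placeholder for a component described later in the paper or for the black-box signoff timer; this is a skeleton under those assumptions, not the authors' implementation:

```python
def lr_optimize(design, solve_lrs, signoff_slacks, update_multipliers,
                init_multipliers, num_iters=30):
    """Skeleton of a two-level LR loop: the low-level step solves the
    relaxed Lagrangian subproblem for fixed LMs; the high-level step
    updates the LMs from slacks. The timer is called once per iteration."""
    lms = init_multipliers(design)                 # flow-conserving initialization
    for k in range(1, num_iters + 1):
        design = solve_lrs(design, lms)            # minimize LRS for fixed LMs
        slacks = signoff_slacks(design)            # black-box signoff timing
        lms = update_multipliers(lms, slacks, k)   # subgradient step, size t_k
    return design
```

Note that the signoff timer appears exactly once per iteration, after all cell types have been chosen, which is the property the paper relies on for scalability.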
In this section, we propose an LR formulation that decouples slack computation from optimization. This formulation allows the optimizer to use the slack values computed by the signoff timer directly, instead of having to model them explicitly. This way, the complexity of the timing analysis is encapsulated in the signoff timing engine (including timing constraints, interconnect timing models, and others), and the optimization engine can use relatively simpler timing models. Intuitively, the optimization engine chooses the cell sizes and types based on the accurate feedback provided by the black-box signoff timing engine. Here, the signoff timer needs to be called only once per iteration, after all the cell sizes and types are chosen to optimize the LRS, as will be described in Sections V and VI. This allows our approach to scale well to large industrial designs and makes it affordable to process all cells in the design instead of only the most critical ones. The problem of gate sizing is traditionally formulated as follows: minimize either area or power subject to the constraints that 1) the design meets the static timing constraints, 2) the design uses only legal gate sizes, and 3) the design meets additional constraints such as max transition (i.e., slew). In the following sections, we will first derive the relaxations associated with timing constraints for a single

clock domain (Section IV-A). Then, we will show how to extend the formulation to handle multiple clock domains (Section IV-B). After that, we will summarize the overall LR framework (Section IV-C). The slew constraints will be discussed later in the context of the proposed graph model (Section V-E).

A. Basic LR Formulation

Let us formulate the objective function as a tradeoff between power and TNS as follows:

    α power + Σ_po max(0, a_po − r_po)    (1)

where a_po and r_po are the arrival and required times of transition po, and po ranges over all transitions (up and down) of the timing endpoints (primary outputs or sequential inputs) in the design. Next, let us define the dummy variable m̂_po = max(0, a_po − r_po), and introduce the following timing constraints:

    0 ≤ m̂_po
    a_po − r_po = m_po ≤ m̂_po
    a_u + d_{u→v} ≤ a_v    (2)

where d_{u→v} is the delay of the timing arc from u to v. Putting the constraints into a form conducive to relaxation into the objective function and introducing an LM for each constraint, we add these constraints times their respective multipliers to the original objective function, obtaining the LRS as follows:

    α power + Σ_po m̂_po + Σ_po μ′_po (−m̂_po) + Σ_po μ_po (a_po − r_po − m̂_po) + Σ_{u→v} μ_{u→v} (a_u + d_{u→v} − a_v).    (3)

Note that (3) contains the terms associated with margins and arrival times. These quantities are expensive to compute accurately during sizing, and we want to rely on the signoff timer to compute them. Based on the Kuhn–Tucker conditions, it is derived in [4] that certain flow conditions must be maintained among the introduced multipliers. Here, e.g., the terms including m̂_po cancel out if μ′_po + μ_po = 1, yielding

    α power + Σ_po μ_po (a_po − r_po) + Σ_{u→v} μ_{u→v} (a_u + d_{u→v} − a_v).    (4)

Here, each timing arc is associated with an LM μ_{u→v}. Also, based on the derivations in [4], the sum of the multipliers on the incoming arcs of a node must be equal to the sum of the multipliers on its outgoing arcs. If the multipliers are treated like a flow, this conclusion is equivalent to flow conservation in a network flow model. These constraints can be summarized as follows:

    ∀u:  Σ_{w→u} μ_{w→u} = Σ_{u→v} μ_{u→v}
    ∀pi: μ_pi = Σ_{pi→v} μ_{pi→v}
    ∀po: Σ_{u→po} μ_{u→po} = μ_po.    (5)
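Constraint (5) treats the multipliers like flow on the timing graph. A small sanity check of this conservation property, using a made-up graph and multiplier values purely for illustration, can be sketched as:

```python
from collections import defaultdict

def flow_conserved(arcs, mu, eps=1e-9):
    """arcs: list of (u, v) timing arcs; mu: {(u, v): multiplier value}.
    Returns the set of internal nodes (nodes with both fanin and fanout)
    that violate sum(incoming mu) == sum(outgoing mu)."""
    inflow, outflow = defaultdict(float), defaultdict(float)
    for (u, v) in arcs:
        outflow[u] += mu[(u, v)]
        inflow[v] += mu[(u, v)]
    internal = set(inflow) & set(outflow)   # skip primary inputs/outputs
    return {n for n in internal if abs(inflow[n] - outflow[n]) > eps}

# Example: pi -> u -> {po1, po2}; node u must split its incoming 0.5.
arcs = [("pi", "u"), ("u", "po1"), ("u", "po2")]
mu = {("pi", "u"): 0.5, ("u", "po1"): 0.3, ("u", "po2"): 0.2}
print(flow_conserved(arcs, mu))  # set() -- conservation holds at u
```

The primary-input and primary-output cases of (5) correspond to the single incoming μ_pi and outgoing μ_po described in the text; here they are simply excluded from the internal-node check.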

Fig. 2. High-level algorithm to optimize cell sizes and cell types.

Here, the first flow constraint states that for any node u, the sum of the multiplier values of the incoming arcs must be equal to the sum of the multiplier values of the outgoing arcs. The second and third flow constraints above are special cases for primary inputs and primary outputs. For these cases, one can consider a single incoming μ_pi and outgoing μ_po for each primary input pi and primary output po, respectively. Substituting these equations into (4), we obtain the following functional form as our LRS formula:

    α power + Σ_{u→v} μ_{u→v} d_{u→v} + Σ_po μ_po (−r_po) + Σ_pi μ_pi a_pi.    (6)

The only variables in this LRS formula are the cell power and the timing arc delay through each cell; the quantities r_po and a_pi are fixed. This equation can be optimized (Sections V and VI) using the data from the cell library only, without having to compute the arrival times and slacks explicitly. It is known that for any fixed set of LMs, the optimal result to the LRS problem will be no greater than the optimal result to the original problem [4]. So, the Lagrangian dual problem is to maximize the minimum value obtained for the LRS problem by updating the LMs accordingly. If arrival times were available, then (4) would be the appropriate form for updating the multipliers. However, in the presence of complex timing constraints, we want to use the slack values computed by the signoff timer directly. Using the definitions m_v = a_v − r_v and m_{u→v} = a_u + d_{u→v} − r_v, we can rewrite (4) as

    α power + Σ_po μ_po m_po + Σ_{u→v} μ_{u→v} (m_{u→v} − m_v).    (7)

Now, we can base the multiplier update on two types of margin values: the margin value across a timing arc, m_{u→v}, and the margin value at a node, m_v. These values are readily calculated by most static timing engines and can be used directly in our LR framework. In other words, we can use the margin values computed by a black-box timing engine to update the LMs. This allows decoupling timing analysis from the optimization engine, while the optimization is still being guided by accurate timing results.

B. Multiple Clock Domains

In industrial high-performance designs, it is common to find timing constraints specified for different paths with respect to different clock domains or different phases of the same clock domain. These constraints become more complex especially for duty cycles different than 50%. Furthermore, certain paths can have multicycle timing overrides to allow longer timing paths. In this section, we explain how to capture these constraints using a signoff timing engine as a black box. For simplicity of presentation, we will use the term clock domain to represent any unique pair of clock events used in a timing constraint. In the presence of multiple clock domains, we can rewrite the objective function in (1) as

    α power + Σ_po max(0, max_T (a^T_po − r^T_po))    (8)

where a^T_po and r^T_po are the arrival and required times of primary output po for clock domain T. Note that the expression max_T (a^T_po − r^T_po) is the worst negative slack of po across all clock domains T, which is typically what is reported by industrial timing engines. The timing constraints of (2) can be rewritten as

    0 ≤ m̂_po
    ∀T: a^T_po − r^T_po = m^T_po ≤ m̂_po
    ∀T: a^T_u + d^T_{u→v} ≤ a^T_v.    (9)

Similar to the formulation in Section IV-A, we can introduce an LM for each constraint and add these constraints times the respective multipliers to the objective function. The detailed derivations are not provided due to page limitations. However, it is possible to show that the variable part of the LRS objective function becomes

    α power + Σ_{u→v} μ_{u→v} d_{u→v}.    (10)

Observe that the LRS optimization problem for multiple clock domains is now equivalent to the problem for a single clock domain. Similarly, the objective function for the dual problem can be derived as

    α power + Σ_po μ_po m_po + Σ_{u→v} μ_{u→v} (m_{u→v} − m_v)    (11)

where m_{u→v} = max_T (m^T_{u→v}) and m_v = max_T (m^T_v). Note that the Lagrangian dual problem for multiple clock domains is also similar to the single clock domain case. The only difference is that the objective function is based on the worst margin values across all clock domains.

C. LR Framework Summary

A typical LR framework involves iteratively solving two problems. In every iteration: 1) the LRS is minimized for fixed LMs, and 2) the LMs are updated to maximize the Lagrangian dual formulation. Our LR-based framework is summarized in Fig. 2, where the first step is to initialize the LMs. Our methods for LM initialization and updates are derived from [4]. In the very beginning, the LM values at the timing endpoints can be set to an arbitrary value between 0 and 1.0. In our implementation, we chose a value in the middle (e.g., 0.5) to start the optimization. Although it is possible to explore more sophisticated methods

Fig. 3. Example illustrating a typical step-size function for multiplier updates.

for LM initialization, we have used a simple initialization scheme without fine tuning. Note that the flow conservation constraints between the different LMs, as stated in (5), must be maintained while setting the LMs. After setting the LM values at the timing endpoints, we can compute the LMs at the internal nodes in a way that all flow conservation constraints are satisfied. Specifically, starting from the timing endpoints, the LM values can be propagated backward such that the constraints in (5) are satisfied. When an LM value needs to be distributed to multiple incoming arcs, this distribution is done based on the slack values of the respective arcs. Since the LMs are initialized without considering the initial cell sizes, our framework cannot be used directly for the purpose of incremental optimization. However, it is well suited for the cases where substantial cell size changes are allowed.

After initialization, we perform a fixed number of LR iterations, as shown in Fig. 2. In each iteration, we first optimize the cell sizes and cell types for fixed LMs to minimize (10). Further details of our LRS models and algorithms will be given in Sections V and VI. After cell optimization, we use the signoff timing engine to compute the new slack values. Based on these slack values, we update the LMs to maximize (11) using the subgradient-based algorithm proposed in [4]. The basic idea of the subgradient-based multiplier update method (such as in [4]) is to choose a subgradient direction based on the timing violations at the respective nodes. Basically, the LMs corresponding to the nodes with negative slacks are increased and the ones with positive slacks are decreased. The rate of change in the LM values is determined by the step size t_k in iteration k. It is well known that if t_k is updated to satisfy the conditions lim_{k→∞} t_k = 0 and Σ_{k=1}^{∞} t_k = ∞, then convergence can be achieved for a convex and continuous problem [4].
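As a minimal illustration of these two conditions (and not the schedule used in this paper), the classical harmonic schedule t_k = t_max / k tends to zero while its partial sums diverge:

```python
def harmonic_step(k, t_max=1.0):
    """Classical diminishing step size t_k = t_max / k: t_k -> 0 as
    k -> infinity, while the series sum diverges, which is the standard
    sufficient condition for subgradient convergence on convex problems."""
    return t_max / k

steps = [harmonic_step(k) for k in range(1, 5)]
print(steps)  # [1.0, 0.5, 0.3333333333333333, 0.25]
```

Many other schedules satisfy the same conditions; the plateau-then-decay heuristic of Fig. 3 trades this theoretical guarantee for a fixed iteration budget.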
However, since the cell types are discrete and the timing models are not necessarily convex, optimality is not theoretically guaranteed in our formulation. Specifically, we use a heuristic method in our implementation to update the step size t_k, as shown in the example of Fig. 3. Intuitively, in the first several iterations, the step size is kept at its maximum value to allow large changes in the LM values. The step size is gradually decreased in the later iterations. In the last several iterations, the step size is kept at its minimum value so that the LMs converge to their final values. Observe that the purpose here is to finish the whole execution in a fixed number of iterations (30 in our experiments). Although convergence

is not guaranteed theoretically, we have observed that this approach works well in practice. Note that determining the exact step-size update method is typically a practical problem that requires finding the right tradeoff between larger solution-space exploration (e.g., the earlier iterations, where t_k is large), convergence (e.g., the later iterations, where t_k is small), and the total number of iterations. In practice, for a particular design project, one can experiment with a small sample design to determine the right parameters for updating the step sizes. Then, the same update mechanism can be utilized for all other designs. In our experiments (Table I), we have used the same update schedule shown in Fig. 3 with the same t_max value for all blocks. We have not performed fine-tuning per benchmark. In our overall LR framework shown in Fig. 2, we have observed that the main runtime bottleneck is the timing update step, because of the complex timing constraints and parasitics models (see Section VII for runtime statistics). In practice, we have observed empirically that ∼30 iterations are typically sufficient to obtain good results without significant runtime overheads.

V. Graph Model

In this section, we propose a graph model that captures the LRS formulation in (6). Here, our objective is to represent the delay costs as accurately as possible using the delay functions in the cell library. In the remainder of this section, we will make use of the example shown in Fig. 4(a). In this example, we have three cells A, B, and C, where the input pins of B and C are both connected to the output pin of A. Let a1 and a2 denote the timing arcs through cell A. Similarly, let b and c denote the single timing arcs through cells B and C. Finally, let R_{a,b} and R_{a,c} denote the interconnect resistances of the net from A to B and from A to C, respectively. For ease of presentation, we assume that the routes from A to B and from A to C do not have common segments.
However, it is straightforward to extend our models to handle that case. In our graph model, we define a subnode for each possible cell type of each cell. Let s_i^j denote the jth cell type corresponding to cell i. In Fig. 4(b), we have subnodes {s_a^1, s_a^2}, {s_b^1, s_b^2}, and {s_c^1, s_c^2} corresponding to cells A, B, and C, respectively. To keep the example small, we assume that there are only two cell types available for each cell, although in reality one can expect many more available cell types, corresponding to different sizes and technology parameters. Once the graph is defined this way, the problem is how to define the node and edge weights such that the original LRS objective function is captured accurately. One difficulty here is how to define the edge costs of multi-fanout nets independently of each other. In the example of Fig. 4, the delays of timing arcs a1 and a2 depend on the types of both cells B and C, but we want to model the edge costs independently of each other. Furthermore, the delays through timing arcs b and c depend on the slews at the inputs of B and C, which in turn depend on both the type of A and its output load. Ideally, when exactly one subnode is chosen for each cell, the sum of the selected subnode weights and the weight



Fig. 4. (a) Sample cell-type selection problem with cells A, B, and C. (b) Corresponding graph model, where a subnode exists for each available cell type.

of the edges between them should directly correspond to our LRS cost metric defined in Section IV. For example, in Fig. 4, if cell types s_a^1, s_b^2, and s_c^1 are chosen, then the LRS cost should be equal to the total weights of the highlighted subnodes and edges in Fig. 4(b). In the remainder of this section, we discuss how to model the subnode and edge weights in our graph model. For simplicity of presentation, we ignore interconnect delays and slew dependency in the next two sections. Then, in Sections V-C, V-D, and V-F, we describe how to incorporate interconnect delays, slew effects, and dynamic power into our cost metrics.

A. Simple Delay Cost Model

We first show a rather straightforward way to model edge weights based on absolute delay costs. Then, we discuss the potential problems with this approach and propose a better model in the next section to avoid these issues. The delay cost of an edge (e.g., the edge from s_a^1 to s_b^1 in Fig. 4) can be modeled independently of the other edges (e.g., the edge from s_a^1 to s_c^1) if delay is assumed to be linear with respect to the output load, i.e., if delay(cap_B + cap_C) = delay(cap_B) + delay(cap_C). With this assumption, we can model our graph as follows.
1) The weight of subnode s_i^j is set as weight(s_i^j) = α·power(s_i^j), where α is the constant in the original LRS formulation and power(s_i^j) is the leakage power of the jth type of cell i.
2) The weight of the edge from s_i^j to s_m^n is set as weight(s_i^j → s_m^n) = Σ_k μ_k · delay_k(cap(s_m^n)), where μ_k and delay_k are the LM and the delay function of the kth timing arc of cell i, respectively. Here, delay is represented as a function of cap(s_m^n), which is the input capacitance of the nth type of cell m. The summation is over all timing arcs through cell i because the delay through each arc depends on the input capacitance of s_m^n.
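Under the linearity assumption, the two weight rules above can be sketched as follows (all numeric values are illustrative); the final assertion checks the separability that makes per-receiver edge weights valid:

```python
# Sketch of the two weight rules above under the linear-delay assumption
# (all numeric values and the delay function are illustrative).

def subnode_weight(alpha, leakage_power):
    # weight(s_i^j) = alpha * power(s_i^j)
    return alpha * leakage_power

def edge_weight(mus, delay_fns, cap_receiver):
    # weight(s_i^j -> s_m^n) = sum_k mu_k * delay_k(cap(s_m^n)),
    # summed over the timing arcs k through the driver cell i.
    return sum(mu * d(cap_receiver) for mu, d in zip(mus, delay_fns))

# Separability check: with a linear delay function,
# delay(cap_B + cap_C) == delay(cap_B) + delay(cap_C), so the per-receiver
# edge weights add up to the true arc-delay cost of the multi-fanout net.
lin = lambda cap: 2.0 * cap      # illustrative linear delay model
mu = [0.5]                       # one timing arc with LM 0.5
cap_B, cap_C = 1.0, 3.0
total = edge_weight(mu, [lin], cap_B) + edge_weight(mu, [lin], cap_C)
assert abs(total - mu[0] * lin(cap_B + cap_C)) < 1e-12
```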
Intuitively, if delay is a linear function of output load, it means that it can be separated and represented as the sum of edge weights connected to the receiver cells. The following proposition formally states this concept. Proposition 1: Under the assumption that delay is a linear function of load, and ignoring interconnect delays and slew

Fig. 5. Sample optimization problem with edge weights as marked. The solution of a commonly used heuristic is highlighted, where the solution cost is equal to the total weights of the edges highlighted.

effects, the cost metric above models the LRS cost exactly. In other words, the LRS cost in Section IV will be minimized if we choose exactly one subnode for each cell such that the total cost of all selected subnodes and the edges between them is minimized.
Proof: The variable portion of the LRS cost from (6) can be rewritten as

    Σ_i ( α·power(s_i^j) + Σ_k μ_k · delay_k( Σ_m cap(s_m^n) ) )

where Σ_i is over all cells, Σ_k is over all timing arcs through cell i, and Σ_m is over all cells connected to the output of cell i. Assume cell type s_i^j is selected for cell i and type s_m^n is selected for each cell m connected to the output of i. By definition, the subnode cost of s_i^j is α·power(s_i^j), and the sum of the edge costs connected to s_i^j is Σ_m Σ_k μ_k · delay_k(cap(s_m^n)). Due to the linearity assumption, this expression can be rewritten as Σ_k μ_k · delay_k( Σ_m cap(s_m^n) ). Hence, the lemma follows.
There are two problems with this type of formulation. First, the linear delay model is inaccurate for modern cell libraries, which define delays in lookup tables. The second problem is the difficulty it creates for the optimization algorithms. Although the algorithms will be described in detail in Section VI, we briefly mention the difficulties here. If the proposed graph were a tree (e.g., no nets have more than one fanout and there are no reconvergent paths), then a DP formulation would solve the LRS optimization problem optimally. However, as also mentioned in [16], reconvergent paths in DAGs make optimization more difficult, leading to heuristic algorithms that do not guarantee optimality. We claim that a graph model that uses absolute delay values as its edge costs makes DAG optimization even more difficult and the heuristics more prone to suboptimality. As an example, consider the DAG optimization problem in Fig. 5, where there are four cells A, B, C (with two subnodes each), and D (with a single subnode). The edge costs are as shown in the figure.
For simplicity of the example, let us assume that the subnode costs are zero. The optimization objective is to choose exactly one subnode for each cell such that the sum of the edge costs between the selected subnodes is minimum. The figure also shows how a DP-based heuristic would operate on this problem. A typical heuristic would process the nodes in topological order and


compute the minimum accumulated cost at each subnode. For example, for the first subnode of cell B, the accumulated cost is the minimum of 1 + 8 or 10 + 1, depending on the previous subnode selected. Note that there are edges from cells B and C converging at the input of D. This can happen if the output pins of B and C are connected to different input pins of D. In this case, the minimum accumulated cost of D is chosen to be the sum of the minimum costs at its input pins, i.e., min(9 + 2, 11 + 1) + min(9 + 2, 11 + 1) = 22. Backtracking from D leads to the solution highlighted in Fig. 5. The heuristics proposed in [16] focus on the "historical inconsistency" problems due to reconvergent paths. For example, in Fig. 5, backtracking from D could end up choosing different subnodes of cell A. The algorithms in [16] try to avoid this problem through some heuristics. However, in Fig. 5, the solution obtained is consistent, i.e., the paths from D through B and through C both end at the same subnode of A. Despite this, the solution obtained is suboptimal. The total weight of the subgraph highlighted in Fig. 5 is 1 + 8 + 8 + 2 + 2 = 21. However, if the second subnode of A were chosen instead of the first, the total weight would be 10 + 1 + 1 + 2 + 2 = 16. This example shows that we can obtain suboptimal results using existing heuristics even when the solution does not have the "historical inconsistency" problem. We will show in the next section that we can make DAG optimization (with reconvergent paths) easier by using a different graph model.

B. Delta Delay-Based Modeling

In this section, we make use of the iterative nature of LR optimization. In particular, in the later iterations of LR, we can expect the cell sizes to start to converge and the size changes in each iteration to be small with respect to the previous iteration. Also, the accuracy of the delay model becomes more important as the iterations start to converge.
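Returning to the Fig. 5 walkthrough of the previous section, the DP heuristic and its suboptimality can be reproduced in a few lines. The edge weights below are reconstructed from the arithmetic quoted in the text (the figure itself is not reproduced here), so treat them as an illustrative stand-in rather than the exact drawing:

```python
# Fig. 5-style example: cells A, B, C with two subnodes each, D with one.
# Subnode costs are zero; weights reconstructed from the text's arithmetic.
fanin_A = {"a1": 1, "a2": 10}          # fan-in edges into A's subnodes
w = {("a1", "b1"): 8, ("a2", "b1"): 1, ("a1", "b2"): 10, ("a2", "b2"): 3,
     ("a1", "c1"): 8, ("a2", "c1"): 1, ("a1", "c2"): 10, ("a2", "c2"): 3,
     ("b1", "d"): 2, ("b2", "d"): 1, ("c1", "d"): 2, ("c2", "d"): 1}

# Forward DP: minimum accumulated cost per subnode, in topological order.
acc = dict(fanin_A)
for s in ("b1", "b2", "c1", "c2"):
    acc[s] = min(acc[a] + w[(a, s)] for a in ("a1", "a2"))
# D sums the minimum accumulated costs at its two input pins (B and C),
# double-counting A's fan-in cost: min(9+2, 11+1) + min(9+2, 11+1) = 22.
acc_d = sum(min(acc[x + j] + w[(x + j, "d")] for j in ("1", "2"))
            for x in ("b", "c"))

def true_cost(a, b, c):
    # Actual cost of a consistent selection, each edge counted once.
    return fanin_A[a] + w[(a, b)] + w[(a, c)] + w[(b, "d")] + w[(c, "d")]

print(acc_d)                         # 22 (heuristic's accumulated cost)
print(true_cost("a1", "b1", "c1"))   # 21 (the backtracked, suboptimal pick)
print(true_cost("a2", "b1", "c1"))   # 16 (the optimum)
```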
Based on this, we define our graph model using the sizes from the previous iteration as reference and modeling the delta delay values in the current iteration. This approach both improves the accuracy of the delay model for multi-fanout nets and makes DAG optimization easier. Let us consider a multi-fanout cell i with a particular type s_i^j, and let us define the reference delay as the delay value when all cells at the output of i have their reference cell types (i.e., the same types as they had in the previous iteration). We can compute the reference delay through timing arc k of this cell using the delay function defined in the lookup table of s_i^j as follows:

    delay_ref_k(s_i^j) = delay_k( Σ_{t∈fanout(i)} cap(s_t^ref) )    (12)

Here, s_t^ref is the reference type of cell t at the fanout of cell i. As before, cap(s_t^ref) refers to the input capacitance of s_t^ref connected to the output of cell i. Now, assume that the cell type of one of the fanout cells (e.g., cell m) is changed, leading to a change in its input capacitance by Δcap(s_m^n) = cap(s_m^n) − cap(s_m^ref), where s_m^n is the new cell type of m. We can use a first-order


Fig. 6. Graph model based on delta delay model. The reference cell types (from the previous iteration) are shown as shaded circles. The result of a DP-based heuristic is highlighted.

approximation to compute the new delay value through timing arc k of driver cell i as

    delay_k(s_i^j) = delay_ref_k(s_i^j) + Δcap(m) · (∂T_k/∂cap)|_ref    (13)

Here, (∂T_k/∂cap)|_ref is the derivative of the delay through arc k with respect to output load, computed at the reference load. In other words, we linearize the delay function around the point (cap_ref, delay_ref). Since the cell delay models typically use linear interpolation between lookup table entries, we expect the accuracy of this model to be reasonable if Δcap(m) is not too large. As mentioned above, as the LR iterations converge, we expect the accuracy of this model to improve further. Based on this delay model, we can redefine the costs in our graph model as follows.
1) The weight of a subnode s_i^j is computed as

    weight(s_i^j) = α·power(s_i^j) + Σ_{k∈arcs(s_i^j)} μ_k · delay_ref_k(s_i^j)    (14)

2) The weight of an edge from s_i^j to s_m^n is computed as

    weight(s_i^j → s_m^n) = Δcap(s_m^n) · Σ_{k∈arcs(s_i^j)} μ_k · (∂T_k/∂cap)|_ref    (15)

Compared to the cost model of Section V-A, the main difference here is that we do not assume delay is linear in capacitance; instead, we linearize around the point (cap_ref, delay_ref), which is computed based on the delay tables of the cell from the library. Also, we include the reference delay cost in the subnode cost and let the edge costs reflect delay differences only. This model also simplifies the DAG optimization, as described in the following example. Let us consider the example of Fig. 5 again and assume that the reference sizes of cells B and C are as marked in Fig. 6. The reference size of A is irrelevant for this example. Using (14) and (15), we obtain the values shown in Fig. 6. For example, the cost of the first subnode of A is set to 18, which is the sum of the old edge weights (in Fig. 5) from this subnode to the reference subnodes of B and C (i.e., 10 and



8). Similarly, the edge weights from the second subnode of A to the reference sizes of B and C were 3 and 1 (in Fig. 5); hence, the subnode cost is 4. By definition, the edge weights to the reference subnodes are always zero because the subnode weights are defined with respect to these reference cell types. The other edge costs are defined based on the differences in their cost values with respect to the reference costs. For example, in Fig. 5, the edge cost from the first subnode of A to the first subnode of B was 8, and to the second subnode of B was 10, which is the reference cost now. So, the corresponding edge costs in Fig. 6 are −2 and 0, respectively, both with respect to the reference cost 10. If the same DP-based heuristic (as in Section V-A) is used for this model, the minimum accumulated cost found at node D becomes 30, and backtracking leads to the selection of the highlighted edges in Fig. 6. If we compare the highlighted subgraphs of Figs. 5 and 6, we can see that the total cost is less in the latter figure. Intuitively, by collecting the reference costs of all edges originating from a subnode in the actual subnode cost, we can capture the real cost of the subnode in a better way in the DAG optimization. For example, in Fig. 5, the first subnode of A has high fanout cost but small fanin cost, and the second subnode has the opposite. If a typical DP-based heuristic is used, the computed accumulated cost at node D double-counts the fanin cost of A, but does not double-count its fanout costs. Since the first subnode of A has lower fanin cost, it is chosen over the second subnode in Fig. 5, leading to a suboptimal solution. However, in Fig. 6, the reference edge costs are collected in the subnode costs and the subnode costs reflect the actual costs more accurately. For example, one can see that the first subnode of A is more expensive overall, because most of its fanout costs are captured in its subnode cost. 
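The re-weighting just described can be reproduced concretely. The Fig. 5 weights and the reference subnodes (b2 for B, c1 for C) below are reconstructed from the arithmetic in the text, so this is an illustrative sketch rather than the exact figure:

```python
# Delta-delay re-weighting of the Fig. 5-style example (weights and reference
# subnodes reconstructed from the text's arithmetic: ref(B)=b2, ref(C)=c1).
fanin_A = {"a1": 1, "a2": 10}
w5 = {("a1", "b1"): 8, ("a2", "b1"): 1, ("a1", "b2"): 10, ("a2", "b2"): 3,
      ("a1", "c1"): 8, ("a2", "c1"): 1, ("a1", "c2"): 10, ("a2", "c2"): 3,
      ("b1", "d"): 2, ("b2", "d"): 1, ("c1", "d"): 2, ("c2", "d"): 1}
ref = {"b": "b2", "c": "c1"}

# Subnode costs collect the fan-out edge weights to the reference receivers
# (A also keeps its fan-in cost); B and C collect their single edge to D.
node = {a: fanin_A[a] + w5[(a, ref["b"])] + w5[(a, ref["c"])]
        for a in ("a1", "a2")}               # a1: 1+18, a2: 10+4
node.update({s: w5[(s, "d")] for s in ("b1", "b2", "c1", "c2")})

# Edge weights become deltas with respect to the reference receiver type.
w6 = {(a, s): w5[(a, s)] - w5[(a, ref[s[0]])]
      for a in ("a1", "a2") for s in ("b1", "b2", "c1", "c2")}

# Same DP heuristic as before, now on the re-weighted graph.
acc, prev = {a: node[a] for a in ("a1", "a2")}, {}
for s in ("b1", "b2", "c1", "c2"):
    cands = {a: acc[a] + w6[(a, s)] for a in ("a1", "a2")}
    prev[s] = min(cands, key=cands.get)
    acc[s] = node[s] + cands[prev[s]]
acc_d = sum(min(acc[x + j] for j in ("1", "2")) for x in ("b", "c"))
sel_b = min(("b1", "b2"), key=lambda s: acc[s])
sel_c = min(("c1", "c2"), key=lambda s: acc[s])
print(acc_d, sel_b, sel_c, prev[sel_b], prev[sel_c])  # 30 b1 c1 a2 a2
```

Backtracking now consistently selects the second subnode of A together with b1 and c1, i.e., the true optimum of the original problem.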
The edge costs are smaller, and they only reflect the delta delays with respect to the reference size. This makes optimization easier in general. This is also true for the critical tree extraction-based algorithm we propose in Section VI. Furthermore, this model allows us to capture the delay values more accurately from the delay lookup tables for the reference load. The only potential inaccuracies are in the delta delay values, which are expected to be small as the LR iterations start to converge.

C. Modeling Interconnect Costs

In the previous two sections, we have ignored the interconnect effects for simplicity of presentation. In this section, we show how to model interconnect effects in our graph model. Basically, two changes are needed to the model described in Section V-B. First, the reference delay cost through timing arc k of cell type s_i^j needs to be computed considering the interconnect capacitance at its output. For this, we can rewrite (12) as

    delay_ref_k(s_i^j) = delay_k( interconnect_cap + Σ_{t∈fanout(i)} cap(s_t^ref) )    (16)
Again, delay_k() is the delay function defined in the library for timing arc k. We simply add the effective capacitance of the interconnect at the output pin of s_i^j. In our current implementation, we use a lumped capacitance model, ignoring effects such as resistive shielding. As mentioned before, we rely on the actual timing engine to perform accurate slack computations, and those slack values guide our LR optimization framework. However, by using a simplified lumped capacitance model here, we take into account how the delay of the driver cell is affected by the interconnect. Our second change in the model is to consider how the input capacitances of different cell types affect the interconnect delays. We can capture this effect by adding the interconnect delay costs to the subnode cost of the corresponding cell type. Let us define the interconnect cost corresponding to cell type s_i^j as follows:

    interconnect_cost(s_i^j) = Σ_{p∈input_pins(s_i^j)} μ_p · R_p · cap(p)    (17)
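A minimal sketch of (17), with illustrative pin data:

```python
# Minimal sketch of (17): the interconnect cost of a candidate cell type,
# summed over its input pins (all pin data illustrative).
def interconnect_cost(pins):
    """pins: iterable of (mu_p, R_p, cap_p) for the input pins of s_i^j."""
    return sum(mu * R * cap for mu, R, cap in pins)

# A larger input capacitance makes the preceding interconnect costlier.
small = interconnect_cost([(0.5, 100.0, 1e-3), (0.2, 80.0, 1e-3)])
large = interconnect_cost([(0.5, 100.0, 2e-3), (0.2, 80.0, 2e-3)])
assert large > small
```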

Here, μ_p is the LM at the input pin p of s_i^j, R_p is the effective resistance of the interconnect connected to pin p, and cap(p) is the input pin capacitance. Obviously, if we choose a cell type with a larger input capacitance, the cost of the preceding interconnect increases. Based on this, we add interconnect_cost(s_i^j) to the subnode cost of s_i^j.

D. Modeling Slew Dependencies

It is important to model slew dependencies, because the cell delays depend not only on the cell types but also on the slew rates at their input pins. As mentioned before, in a typical library, the delay and output slew tables are defined as functions of output load and input slew. In the previous sections, we have treated cell delays as functions of output load only. In this section, we show how to incorporate slew dependencies as well. If the slew value at each input pin were given, we could simply use these values during cell delay computations. However, the input slew rates depend on the cell types of the preceding cells. In the example of Fig. 4, the slew rate at the input pin of cell B depends on all the cell types before B. So, the delay through arc b, and hence the subnode and edge costs of B, all depend on the predecessor cells. This necessitates traversing our graph in forward topological order. Starting from the primary inputs, we can compute the slew rate at the output of a subnode using the slew tables from the library. Then, we can propagate these slew values to the successor nodes. In other words, as long as we process the nodes in topological order, we can compute the input slews at each subnode and can compute the subnode and edge costs defined before using the library delay tables. Note that this means our graph model is not static, because the edge weights depend on the paths preceding them. We also need to take into account the slew effects at multi-fanout nets. Consider the example of Fig. 4 again.
If we upsize cell C, then the load at the output of A increases, and the slew rate at the output of A and at the inputs of B and C all get worse. This increases the delay of not only


arc c (through cell C) but also arc b (through cell B). Although our dynamic graph model described above captures most of the impact on arc c, the impact on arc b is not captured. The reason is that cell C is not before B in topological order, and it can be processed before or after cell B. So, we need to capture this effect for multi-fanout nets explicitly. We have observed in practice that the change in slew rates has the highest impact on the first-level cells only, and significantly less impact on the second- or third-level successors. For the example of Fig. 4, the slew change at the input pin of B (due to the load change of C) will affect the delay through cell B and the output slew of B. This change in the output slew of B can also affect other cells driven by B, but not as much as the arc through B. So, for simplicity, we model the slew impact on the first-level cells only. In other words, we model the impact of the size change of C on the delay through cell B, but not on the successors of B. Note that the purpose here is to capture the relative impact of receiver size changes on the cost function of a multi-fanout net. For the actual slew computations, we still propagate the slew values through the whole timing path as described above. For simplicity of presentation, we derive our formulas using the example in Fig. 4 and then generalize them at the end of this section. Let us now rewrite the delay through timing arc b for cell type s_b^j as

    delay_b(s_b^j) = delay_b(input_slew(b), output_cap(b))    (18)

A first-order approximation around the reference delay (i.e., when the size of C is the same as in the previous iteration) leads to

    delay_b(s_b^j) = delay_ref(s_b^j) + Δinput_slew_b · (∂T_b/∂slw)|_ref    (19)

Here, (∂T_b/∂slw)|_ref denotes the delay sensitivity of timing arc b with respect to input slew, computed around the reference input slew and output load of arc b, which can be obtained directly from the cell delay table. This sensitivity value is multiplied by Δinput_slew_b, the input slew change due to the size change of cell C. We can approximate this as

    Δinput_slew_b = Δcap_C · max( (∂slw_a1/∂cap)|_ref , (∂slw_a2/∂cap)|_ref )    (20)

Here, Δcap_C is the change in the input capacitance of cell C with respect to the reference type of C; (∂slw_a1/∂cap)|_ref and (∂slw_a2/∂cap)|_ref are the output slew sensitivities of arcs a1 and a2 with respect to output load. These sensitivity values are computed around the reference output loads based on the sizes from the previous iteration. Here, we take the maximum slew sensitivity to consider the worst degradation in the slew rate at the output of A. Substituting (20) into (19), we obtain

    delay_b(s_b^j) = delay_ref(s_b^j) + (∂T_b/∂slw)|_ref · Δcap_C · max( (∂slw_a1/∂cap)|_ref , (∂slw_a2/∂cap)|_ref )    (21)

Intuitively, the input capacitance change of C affects the delay through timing arc b by the delta term in (21), so this


delta delay value needs to be accounted for during the selection of the cell type of C. Note that this term depends on the input capacitance of cell C and the output slew sensitivity of cell A with respect to output load. Hence, it depends on the sizes of both A and C. So, we add the following cost term to each edge weight from s_a^j to s_c^n:

    slew_impact(s_a^j → s_c^n) = μ_b · Δcap(s_c^n) · max( (∂slw_a1/∂cap)|_ref , (∂slw_a2/∂cap)|_ref ) · (∂T_b/∂slw)|_ref    (22)

We can generalize this formula for any edge from subnode s_i^j to s_m^n as follows:

    slew_impact(s_i^j → s_m^n) = Σ_{t∈fanout(i)\{m}} μ_t · Δcap(s_m^n) · max_{k∈arcs(s_i^j)}( (∂slw_k/∂cap)|_ref ) · (∂T_t/∂slw)|_ref    (23)

This way, we can capture the slew impact of cell m on the other cells (if any) connected to the output of cell i.

E. Capturing Slew Constraints

Slew constraints are typically enforced as maximum transition time constraints at internal receiver pins and primary outputs. Let tt_max denote the maximum transition time allowed for the design, and let tt_v denote the transition time at node v. To capture the slew constraints, we can add the following term to our original objective function given in (1):

    β · Σ_v max(0, tt_v − tt_max)    (24)

where β is a scaling factor determined empirically based on the importance of slew constraints. As mentioned before, we process cells in topological order from primary inputs to outputs and compute the slew values at each subnode. Hence, we can add the following term to our edge cost weight(u → v):

    slew_cost(u → v) = β · max(0, tt_v − tt_max)    (25)
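A minimal sketch of the forward slew propagation and the slew-constraint cost in (25), with the library output-slew table abstracted as a simple function (all names, coefficients, and the chain of cells are illustrative):

```python
# Sketch of forward slew propagation plus the slew-constraint cost (25).
# The output-slew table is abstracted as a simple linear function; all
# coefficients and the chain of cells are illustrative.
def out_slew(in_slew, load):
    # stand-in for a library output-slew lookup table
    return 0.8 * in_slew + 0.1 * load

def slew_cost(tt_v, tt_max, beta):
    # eq. (25): penalty added to the edge weight(u -> v)
    return beta * max(0.0, tt_v - tt_max)

# Process cells in topological order along a simple chain, propagating the
# slew and accumulating the constraint penalty.
slew, total_penalty = 0.05, 0.0      # slew at the primary input
for load in (2.0, 1.5, 3.0):         # output loads of three chained cells
    slew = out_slew(slew, load)
    total_penalty += slew_cost(slew, tt_max=0.3, beta=10.0)
print(round(total_penalty, 3))       # 3.156
```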

Note that since slew values are available at each subnode during our traversal as mentioned before, it is possible to handle this constraint directly in our optimization by modeling it explicitly as part of the edge cost.

F. Optimizing Dynamic Power

In the previous sections, we have assumed that a single leakage power value exists for each discrete cell choice. However, it is straightforward to extend our formulations to capture the dynamic power optimization objective as well. In a typical standard cell library, the dynamic (switching + short-circuit) power consumption of a cell due to the transition of a timing arc is precharacterized as a function of input slew rate and output load. If the estimated activity factor f_k for each timing arc k is available, then the dynamic power function of k can be computed by multiplying f_k with the precharacterized power values in the library.



Fig. 7. Sample graph model for LRS optimization. The noncritical cells with predetermined types are shown with shaded circles. The two trees extracted are shown with dashed boundaries.

Fig. 8. Algorithm to extract the critical trees in graph G.

Similar to the timing arc delay, dynamic power is a function of input slew rate and output load. Hence, we can model the dynamic power costs similar to the way the delay costs are modeled in the previous sections. Specifically, we can replace "delay" in (13) and (21) with "power" and add the respective terms to the subnode and edge costs exactly as in Sections V-B and V-D. Although it is straightforward to optimize dynamic power in our formulations, the switching activity factors are not available for the industrial blocks in our experiments (Section VII). Hence, we use leakage power as the power metric in our experiments.

VI. Algorithms

In the previous section, we modeled the LRS optimization as a graph G, taking into account cell delay models, interconnect costs, and slew effects. In this section, we describe our algorithms to solve the following problem: choose a subnode for each cell node in G such that the total cost of the selected subnodes and the edges between them is minimized. For this purpose, we use a DP-based heuristic algorithm. As mentioned before, G can have reconvergent paths. Applying DP directly on the whole of G would require keeping track of solution histories at every multi-fanout net, which may lead to an exponential number of partial solution states. For this reason, we use a heuristic approach in which we first extract the critical trees from G (Section VI-A) and optimize each tree independently (Section VI-B) using DP. Note that this approach is different from that of [16], where heuristics are used to expand beyond multi-fanout cells in DP. Such an approach has the disadvantage that the delay values computed can be inaccurate even in the late iterations of LR, because the sizes estimated during DP expansion can be incorrect due to double counting of costs after multi-fanout expansion. In contrast, our method is expected to become more and more accurate as the LR iterations converge, because our estimation error at multi-fanout nets is bounded by the change in the current iteration only. Note that in the earlier iterations of LR, our optimization is relatively coarse grained because of the larger step sizes (see Section IV-C). As the iterations converge, the changes become finer grained and accuracy in optimization becomes more important.

A. Critical Tree Extraction

For a forward traversal of G, if all nets had a single fanout, then G would be a tree, and we could solve the LRS optimization problem using DP directly. For a cell c with multiple fanouts, we have a choice between: 1) making c the root of the current subtree (e.g., cell H of Fig. 7), and 2) predetermining the cell types of all receivers of c except one (e.g., cell G of Fig. 7). We will show how to make this choice based on the relative criticalities of the receivers. As mentioned before, as the LR iterations start to converge, the cell sizes do not change dramatically. So, we can predetermine the type of a noncritical cell based on the sizes of its neighbors in the previous iteration. We expect this approximation to be good enough for noncritical cells. In Fig. 7, if the size of cell G is predetermined before DP optimization, then the driver cell C can assume G has a fixed input capacitance during DP optimization. Note that the larger the extracted trees, the more accurate our LRS optimization will be. At the boundaries between different trees, we rely on approximations for the input capacitances of the receivers. In the example of Fig. 7, the output load of H is computed using the reference sizes of K and L from the previous iteration. The LR framework is expected to compensate for such inaccuracies by choosing the multipliers accordingly [4]. To solve this problem, we first define a static criticality metric for each cell based on the sum of the LMs of its timing arcs. It is known that the LMs are expected to be large on critical paths. Then, we process each multi-fanout net and compare the criticality weights of the receivers based on a parameter we define as the relative criticality threshold β, which is determined empirically. If the criticality weight of one receiver is at least β times larger than those of the other receivers of the same net, then we fix all the receivers of this net except the most critical one. This way, the tree can continue to be expanded through the most critical receiver. In our implementation, we have chosen a single β value empirically and used it for all cells and all iterations without fine tuning. It is also possible to change the β value dynamically across iterations; however, we have not explored such dynamic selection schemes. Fig. 8 summarizes our critical tree extraction algorithm. Fig. 7 shows an example where the two subtrees {A, B, C, D, E, F, H} and {I, J, K, L, M, N, Q} are extracted after the types of the three noncritical cells G, P, and R were predetermined.

B. DP-Based Optimization

We process the trees extracted in Section VI-A in topological order and use DP-based algorithms to choose the best



Fig. 9. Sample extracted tree for DP-based optimization.

Fig. 10. Experimental results on block ind-A, comparing two different timing engines guiding LR optimization.

subnode for each node such that the sum of edge and subnode costs is minimized. Fig. 9 shows a sample DP problem, where the subnodes corresponding to different cell types and the edges between them are not shown for clarity. For the neighboring cells outside the tree (the dashed circles in Fig. 9), we consider a single cell type only. For the receiver cells connected to the root node (e.g., cells connected to H in Fig. 9), the reference sizes from the previous iteration are considered. For all others, the cell types have already been determined in this iteration, so accurate computations are possible. For example, the drivers of A, B, C, E, and G in Fig. 9 are processed before this tree, because they are topologically before the nodes in this tree. The receiver of cell D is predetermined in this iteration because its criticality is below the relative criticality threshold β. The total cost to be minimized is the cost of all solid lines and solid circles shown in Fig. 9. In our DP algorithm, we keep track of the accumulated cost up to each subnode while processing nodes in topological order. Once we reach the root node, we choose the best cell type and backtrack from there. Our overall LRS optimization algorithm has linear time and space complexity with respect to the input netlist size, assuming that the number of alternative sizes for each cell in the library and the maximum net fanout are both O(1). From Fig. 8, it is clear that our critical tree extraction algorithm has linear complexity. Similarly, our DP algorithm has linear complexity, because the number of partial solution states for each cell is equal to the number of alternative cell types in the library, which is constant.
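The tree DP with backtracking described above can be sketched as follows; the tree encoding and the tiny example are illustrative, not the paper's data structures:

```python
# Sketch of the tree DP with backtracking: process subnodes leaf-to-root,
# keep one accumulated cost per subnode, then backtrack from the root.
def optimize_tree(children, subnodes, node_w, edge_w, root):
    """children[c]: child cells inside the tree; subnodes[c]: candidate
    types; node_w[(c,t)]: subnode cost; edge_w[(c,t,ch,ct)]: edge cost."""
    acc, choice = {}, {}

    def visit(c):
        for ch in children[c]:
            visit(ch)
        for t in subnodes[c]:
            best = node_w[(c, t)]
            for ch in children[c]:
                costs = {ct: acc[(ch, ct)] + edge_w[(c, t, ch, ct)]
                         for ct in subnodes[ch]}
                ct = min(costs, key=costs.get)
                best += costs[ct]
                choice[(c, t, ch)] = ct
            acc[(c, t)] = best

    visit(root)
    sel = {root: min(subnodes[root], key=lambda t: acc[(root, t)])}
    stack = [root]
    while stack:                      # backtrack the recorded choices
        c = stack.pop()
        for ch in children[c]:
            sel[ch] = choice[(c, sel[c], ch)]
            stack.append(ch)
    return sel, acc[(root, sel[root])]

# Tiny example: root H drives F and G; a type mismatch on an edge costs 1.
children = {"H": ["F", "G"], "F": [], "G": []}
subnodes = {c: [1, 2] for c in children}
node_w = {(c, t): 0 for c in children for t in (1, 2)}
node_w[("H", 2)] = 5
edge_w = {("H", t, ch, ct): abs(t - ct)
          for t in (1, 2) for ch in ("F", "G") for ct in (1, 2)}
sel, cost = optimize_tree(children, subnodes, node_w, edge_w, "H")
print(sel, cost)  # {'H': 1, 'F': 1, 'G': 1} 0
```

Each subnode keeps a single accumulated cost and each (parent type, child) pair records one best child type, so both time and space are linear in the tree size for a constant number of cell types, matching the complexity argument above.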

VII. Experimental Results

In our experiments, we first show the importance of using an accurate timing engine rather than the approximate timing models commonly used in existing works. For this, we make use of two different timing engines: 1) the signoff timing engine, and 2) an in-house static timing analyzer. Note that the in-house implementation is significantly more sophisticated than the simple timing models used by many academic gate-sizing works. It uses the cell delay and slew tables from the cell library, and it can handle transparent latches, multiple clock domains, and multiple clock events per domain. However, it cannot handle false paths and multicycle timing overrides, and its interconnect model is based on the Elmore delay. In the first set of experiments, we use a small microprocessor block ind-A (27K cells) with high-performance

Fig. 11. Convergence plots for an execution of our algorithm on ind-A. (a) Objective function αPower + TNS versus iteration index. (b) Power versus TNS values for different iterations, where the points corresponding to the first and last iterations are labeled. All values in the chart are normalized.

constraints, and we use one of the two timing engines described above to guide our LR-based optimization. After the LR iterations end, we pick the iteration that has the minimum value of our objective metric. Then, we use the signoff timer to evaluate the final results. Fig. 10 shows the results of our experiments, where we vary the α value in our objective metric (1) to obtain results with different power-performance tradeoffs. Here, leakage power is plotted against TNS. Note that all values in this figure are normalized with respect to the power and delay of the smallest inverter size in the library. As can be seen in Fig. 10, using the signoff timer during optimization leads to significantly better results, because the paths with timing violations are targeted exactly during the LR iterations. If the signoff timer is not used, the timing violations are significantly larger for the same power levels. Moreover, reducing these timing violations by enforcing stricter timing constraints (because the timing constraints cannot be satisfied accurately) would require significantly more power. This is the basic problem with optimization algorithms that rely on simplified timing models without feedback from an accurate timing engine. Fig. 11 illustrates the convergence characteristics of an execution of our LR framework. In Fig. 11(a), the objective function is plotted for each iteration. It can be seen that the results improve monotonically, especially in the earlier iterations, and converge to a final value in the later iterations. In Fig. 11(b), the total leakage power is plotted against the TNS for each iteration. Observe that the improvements are much larger in the earlier iterations and the rate of change decreases in the later iterations.
In the second set of experiments, we compare: 1) the proposed algorithm in this paper; and 2) LR-based sizing without using our graph model, on two real microprocessor blocks, ind-A (27K cells) and ind-B (142K cells), with high-performance constraints. For (2), we still use the slack-based LR model we proposed in Section IV, together with accurate delay and slew models from the library; however, the LRS optimization is done one cell at a time (similar to [4]) rather than with our proposed DP-based algorithm. Fig. 12 shows the results of our experiments for different α values (similar to Fig. 10). As can be seen here, our models lead to better results, especially for high-performance constraints (i.e., smaller α values prioritizing timing violation reductions). For the same power levels with minimum timing violations in these plots, our model achieves 26% and 32% less TNS for ind-A and ind-B, respectively. This demonstrates the benefit of our graph model and DP algorithm, which consider discrete sizes during optimization.

Fig. 12. Experimental results comparing our algorithm versus processing each cell one by one in LR for two industrial designs in (a) and (b).

In the third set of experiments, we evaluate the effectiveness of our model on a set of high-performance industrial microprocessor blocks, which have multiple clock domains and complex timing constraints. Here, we compare: 1) our proposed algorithm; and 2) a state-of-the-art industrial optimization flow. For (2), we obtained blocks from microprocessor designers, who had run the industrial flow on these blocks over a period of time; hence, the runtimes of the industrial flow cannot be reported. Note that for all of the designs, the detailed parasitics information is available and utilized by the timing engine during optimization, and the timing results are computed by the reference timer of the industrial flow. Our experiments are performed on Linux systems with dual 3 GHz quad-core Intel Xeon CPUs and 32 GB memory.

Our results are listed in Table I, where all values are normalized with respect to the smallest inverter power and delay values. Observe that our algorithm leads to a 29% average reduction in leakage power (38% reduction in the best case), while also reducing TNS significantly. Our runtimes are reasonable for such large blocks: about 85% of the runtime is spent by the signoff timer during LR iterations, and our LRS optimization takes about 15% on average. Note that the runtimes in Table I do not depend only on the number of sizeable cells. For example, the runtime of ind-C is larger than that of ind-F despite the smaller number of sizeable cells in ind-C. The difference is mostly due to the signoff timer executions, whose runtime may depend on other design metrics such as the complexity of the interconnect parasitics, the number of clock domains, the number of false paths, and so on. The peak memory usage of our optimization engine on the largest block, ind-F, was 3.5 GB, excluding the memory of the black box timing engine.

TABLE I
Experimental Results on Real Microprocessor Blocks
(Power, TNS, and WNS values are normalized)

                      Industrial Flow              Our Algorithm
Design   #Sizeable    Power    TNS     WNS     Power    TNS    WNS    CPU (h)   Power Impr.
         Cells
ind-C    81K          391K     5.8     0.8     294K     0.0    0.0    2.8       25%
ind-D    60K          271K     4.2     1.7     194K     0.8    0.8    1.6       28%
ind-E    68K          209K     85.8    11.7    191K     27.5   11.7   1.1       9%
ind-F    111K         420K     96.7    5.0     401K     30.8   5.0    1.2       5%
ind-G    361K         1961K    436.7   3.3     1214K    8.3    1.7    13.2      38%
Average               650K     125.8   4.5     459K     13.5   3.8    4.0       29%

VIII. Conclusion

We have proposed models and algorithms for the gate sizing and device parameter selection problem. We have focused on the problems encountered in high-performance industrial designs, which are typically ignored by existing academic gate-sizing works. For this, we have proposed an LR framework to decouple timing analysis from optimization. This way, we can afford to use simpler timing models during optimization, while the accurate slack values computed by a black box timing engine guide the optimization. For the LRS, we have proposed a graph model that captures the delay costs of discrete cells, and a DP algorithm that minimizes the LRS objective on the extracted critical trees. Our experiments show significant improvements with respect to a state-of-the-art industrial flow.

Acknowledgment

The authors would like to thank B. Tse from Intel Corporation, Santa Clara, CA, for helping them with the experimental study.

References

[1] C. S. Amin, C. Kashyap, N. Menezes, K. Killpack, and E. Chiprout, "A multi-port current source model for multiple-input switching effects in CMOS library cells," in Proc. DAC, 2006, pp. 247–252.
[2] M. R. C. M. Berkelaar and J. A. G. Jess, "Gate sizing in MOS digital circuits with linear programming," in Proc. DATE, 1990, pp. 217–221.
[3] P. K. Chan, "Algorithms for library-specific sizing of combinational logic," in Proc. DAC, 1990, pp. 353–356.
[4] C. P. Chen, C. C.-N. Chu, and D. F. Wong, "Fast and exact simultaneous gate and wire sizing by Lagrangian relaxation," IEEE Trans. Comput.-Aided Des., vol. 18, no. 7, pp. 1014–1025, Jul. 1999.
[5] D. Chinnery and K. Keutzer, "Linear programming for sizing, Vth and Vdd assignment," in Proc. ISLPED, 2005, pp. 149–154.
[6] H. Chou, Y.-H. Wang, and C. C.-P. Chen, "Fast and effective gate sizing with multiple-Vt assignment using generalized Lagrangian relaxation," in Proc. ASPDAC, 2005, pp. 381–386.
[7] O. Coudert, "Gate sizing for constrained delay/power/area optimization," IEEE Trans. VLSI Syst., vol. 5, no. 4, pp. 465–472, Dec. 1997.
[8] S. Natarajan, M. Armstrong, M. Bost, R. Brain, M. Brazier, C.-H. Chang, V. Chikarmane, M. Childs, H. Deshpande, K. Dev, G. Ding, T. Ghani, O. Golonzka, W. Han, J. He, R. Heussner, R. James, I. Jin, C. Kenyon, S. Klopcic, S.-H. Lee, M. Liu, S. Lodha, B. McFadden, A. Murthy, L. Neiberg, J. Neirynck, P. Packan, S. Pae, C. Parker, C. Pelto, L. Pipes, J. Sebastian, J. Seiple, B. Sell, S. Sivakumar, B. Song, K. Tone, T. Troeger, C. Weber, M. Yang, A. Yeoh, and K. Zhang, "A 32 nm logic technology featuring 2nd-generation high-k + metal-gate transistors, enhanced channel strain and 0.171 μm² SRAM cell size in a 291 Mb array," in Proc. IEDM, 2008, pp. 1–3.


[9] M. L. Fisher, "An applications oriented guide to Lagrangian relaxation," Interfaces, vol. 15, no. 2, pp. 10–21, 1985.
[10] S. Held, "Gate sizing for large cell-based designs," in Proc. DATE, 2009, pp. 827–832.
[11] S. Hu, M. Ketkar, and J. Hu, "Gate sizing for cell-library-based designs," IEEE Trans. Comput.-Aided Des., vol. 28, no. 6, pp. 818–825, Jun. 2009.
[12] Y.-L. Huang, J. Hu, and W. Shi, "Lagrangian relaxation for gate implementation selection," in Proc. ISPD, 2011, pp. 167–174.
[13] J. Lee and P. Gupta, "Incremental gate sizing for late process changes," in Proc. ICCD, 2010, pp. 215–221.
[14] W.-N. Li, "Strongly NP-hard discrete gate sizing problems," in Proc. ICCD, 1993, pp. 468–471.
[15] C. Ling, H. Zhou, and C. Chu, "A revisit to floorplan optimization by Lagrangian relaxation," in Proc. ICCAD, Nov. 2006, pp. 164–171.
[16] Y. Liu and J. Hu, "A new algorithm for simultaneous gate sizing and threshold voltage assignment," in Proc. ISPD, 2009, pp. 27–34.
[17] N. Menezes, R. Baldick, and L. T. Pileggi, "A sequential quadratic programming approach to concurrent gate and wire sizing," IEEE Trans. Comput.-Aided Des., vol. 16, no. 8, pp. 867–881, Aug. 1997.
[18] M. M. Ozdal, S. Burns, and J. Hu, "Gate sizing and device technology selection algorithms for high-performance industrial designs," in Proc. ICCAD, Nov. 2011, pp. 724–731.
[19] M. M. Ozdal and M. D. F. Wong, "A length-matching routing algorithm for high-performance printed circuit boards," IEEE Trans. Comput.-Aided Des., vol. 25, no. 12, pp. 2784–2794, Dec. 2006.
[20] M. Rahman, H. Tennakoon, and C. Sechen, "Power reduction via near optimal library-based cell-size selection," in Proc. DATE, 2011, pp. 1–4.
[21] H. Ren and S. Dutt, "A network-flow based cell sizing algorithm," in Proc. Int. Workshop Logic Syn., 2008, pp. 7–14.
[22] S. Roy, W. Chen, C. C.-P. Chen, and Y. H. Hu, "Numerically convex forms and their application in gate sizing," IEEE Trans. Comput.-Aided Des., vol. 26, no. 9, pp. 1637–1647, Sep. 2007.
[23] S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S.-M. Kang, "An exact solution to the transistor sizing problem for CMOS circuits using convex programming," IEEE Trans. Comput.-Aided Des., vol. 12, no. 11, pp. 1621–1634, Nov. 1993.
[24] A. Srivastava, D. Sylvester, and D. Blaauw, "Power minimization using simultaneous gate sizing, dual-Vdd and dual-Vth assignment," in Proc. DAC, 2004, pp. 783–787.
[25] H. Tennakoon and C. Sechen, "Gate sizing using Lagrangian relaxation combined with a fast gradient-based pre-processing step," in Proc. ICCAD, 2002, pp. 395–402.
[26] J. Wang, D. Das, and H. Zhou, "Gate sizing by Lagrangian relaxation revisited," IEEE Trans. Comput.-Aided Des., vol. 28, no. 7, pp. 1071–1084, Jul. 2009.
[27] L. Wei, K. Roy, and C.-K. Koh, "Power minimization by simultaneous dual-Vth assignment and gate sizing," in Proc. CICC, 2000, pp. 413–416.

Muhammet Mustafa Ozdal received the B.S. degree in electrical engineering in 1999 and the M.S. degree in computer engineering in 2001 from Bilkent University, Ankara, Turkey, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign, Urbana, in 2005. He is currently a Research Scientist with the Strategic CAD Labs, Intel Corporation, Hillsboro, OR. His current research interests include algorithms for physical design, power-performance optimization, and software-level power modeling and optimization. Dr. Ozdal was a recipient of the IEEE/ACM William J. McCalla ICCAD Best Paper Award in 2011. He was a Technical Program Committee Member of DAC, ICCAD, ISPD, ISLPED, DATE, and SLIP, the Technical Program Chair of SLIP 2012, and the Contest Chair of ISPD 2012.


Steven Burns received the B.A. degree in mathematics from Pomona College, Claremont, CA, and the M.S. and Ph.D. degrees in computer science from the California Institute of Technology (Caltech), Pasadena, CA. He is currently a Senior Principal Engineer with the Strategic CAD Labs (SCL), Intel Corporation, Hillsboro, OR. He has been working with this group for 16 years. He currently leads a team of researchers in the Power, Performance and Technology Group, SCL. He was an Assistant Professor of computer science with the University of Washington, Seattle, WA. While at Caltech and the University of Washington, he was involved in the research of asynchronous circuit designs and computer-aided design (CAD) tools to enable their design. His current research interests include timing and race analysis for pulsed domino circuits, algorithms and methodology for sizing and power optimization of large synthesis and data-path blocks, transformation-based design environments, advanced synthesis algorithms and methods, physical synthesis of standard cells, and CAD for future process technologies.

Jiang Hu received the B.S. degree in optical engineering from Zhejiang University, Zhejiang, China, in 1990, the M.S. degree in physics in 1997, and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, in 2001. He was with IBM Microelectronics, Austin, TX, from January 2001 to June 2002. Currently, he is an Associate Professor with the Department of Electrical and Computer Engineering, Texas A&M University, College Station. His current research interests include computer-aided design for very large-scale integrated circuits and systems, especially on large-scale circuit optimization, clock network synthesis, robust designs, and on-chip communication. Dr. Hu was a recipient of the Best Paper Award at the ACM/IEEE Design Automation Conference in 2001, the IBM Invention Achievement Award in 2003, and the Best Paper Award at the IEEE/ACM International Conference on Computer-Aided Design in 2011. He was a Technical Program Committee Member of DAC, ICCAD, ISPD, ISQED, ICCD, DATE, and ISCAS, the Technical Program Chair of the ACM International Symposium on Physical Design 2011, and an Associate Editor of the IEEE Transactions on Computer-Aided Design and the ACM Transactions on Design Automation of Electronic Systems.
