Transistor Sizing of Custom High-Performance Digital Circuits With Parametric Yield Considerations

Daniel K. Beece, Jinjun Xiong, Chandu Visweswariah, Vladimir Zolotov, IBM Thomas J. Watson Research Center, Yorktown Heights, NY

Abstract— Transistor sizing is a classic Computer-Aided Design problem that has received much attention in the literature. Due to the increasing importance of process variations in deep sub-micron circuits, nominal circuit tuning is not sufficient, and the sizing problem warrants revisiting. This paper addresses the sizing problem statistically in which transistor sizes are automatically adjusted to maximize parametric yield at a given timing performance, or maximize performance at a required parametric yield. Specifically, we describe an implementation of a statistical tuner using interior point nonlinear optimization with an objective function that is directly dependent on statistical process variation. Our results show that for circuits which are sensitive to process variation, a statistically aware tuner can give more robust, higher yield solutions when compared to deterministic circuit tuning and is thus an attractive alternative to the Monte Carlo methods that are typically used to size devices in such circuits. To the best of our knowledge, this is the first publication of a working system to optimize device sizes in custom circuits using a process variation aware tuner.

I. BACKGROUND

Transistor sizing to improve circuit performance is a well-studied problem. TILOS [1] was a well-known early attempt to solve the problem heuristically. By modeling delays as convex functions of transistor size, an exact solution was proposed in [2]. Unfortunately, custom-designed high-performance microprocessor circuits are timed with transistor-level timing [3] in which each pin-to-pin transition of each channel-connected component (CCC) is simulated in the time-domain. Thus it is not possible to either ensure convexity or create convex models for delay as a function of transistor sizes. Sophisticated slew (or rise/fall time) propagation and slew-dependence make convexity even harder to guarantee.

Instead, EinsTuner [4], [5] uses non-convex delay models as dictated by time-domain simulation, and gradients obtained by time-domain adjoint analysis to obtain optimal transistor sizes. Arbitrary custom transistor-level circuitry is supported in EinsTuner by means of "state analysis" which automatically determines all pin-to-pin transitions of CCCs with complementary topologies. In addition, EinsTuner uses a novel optimization formulation inspired by [6]. Thus EinsTuner does transistor sizing for a very wide range of circuits by solving the possibly non-convex exact problem to a stationary point, in contrast to the previous approaches that solved a more approximate problem to a global optimum.

In this work, we describe an extension to EinsTuner that uses statistical timing [7], [8] to model the behavior of a circuit in terms of probability distributions, thereby enabling the device sizing problem to depend upon the statistical correlations exhibited by process variability. Many high-performance dynamic circuits, such as those commonly used in array read and write circuitry, are particularly vulnerable to process variations.
These types of circuits have multiple timing tests that must be satisfied for safe operation; in the space of process variations, each of these timing tests has a unique sensitivity signature, and the joint probability of all timing tests passing simultaneously determines the parametric yield. Device sizing of such circuits is tedious, manual and time-consuming; techniques such as

Yifang Liu Texas A & M University College Station, TX

Monte Carlo simulation followed by manual tweaking of transistor sizes [9] are commonly used. Statistical timing can take all timing tests and all process variations into account simultaneously. Thus, a tuner based on statistical timing can automate the optimization of such circuits, reducing the amount of manual intervention and improving designer productivity, while finding a better quality solution.

The remainder of this paper is organized as follows. In Section II we give the formulation of the statistical tuning optimization problem. An essential part of the tuner is the gradient of the objective function; although a detailed understanding of the gradient computation is not required to understand the other sections, for completeness we describe how it is done in Section III. Section IV discusses the results of the tuner on a set of benchmark circuits. Conclusions and directions for future work are given in Section V.

II. OPTIMIZATION FORMULATION

Our statistical tuner builds upon the formulation described in [4], [5], which we refer to as deterministic EinsTuner. A timing quantity $A$, such as delay, is represented as a canonical form [7], [8]:

$$A = a_0 + \sum_{i=1}^{n} a_i \Delta X_i \qquad (1)$$

$a_0$ is the mean of $A$, and the $\Delta X_i$ are normalized independent Gaussian random variables that model various process parameters. Each $a_i$ is the sensitivity of $A$ to the $i$th process variation. Our formulation of a statistical tuner exploits the relationship between parametric yield and chip slack [10]. The latter is defined as the statistical minimum of the slack of all timing tests, i.e., if the $i$th timing test has a required arrival time $RAT_i$ and an actual arrival time $AT_i$, then chip slack $c$ is given by:

$$c = \min(RAT_i - AT_i) \quad \forall\, i \in \{\text{tests}\} \qquad (2)$$

The parametric yield $y$ is then expressed in terms of the distribution of chip slack. More precisely, chip slack is a random variable $C$ with a probability density function (PDF), and the probability that $C < c$ is the cumulative distribution function (CDF) for that PDF. Yield is then $P(C \ge c)$, the probability that the random variable is greater than some specific value, which is 1 − CDF. This is illustrated in Fig. 1, where the plot on the left is a Gaussian PDF which has a yield curve, derived from the associated Gaussian CDF, on the right. Using 1 − CDF as yield illustrates the intuitive property that as more performance is demanded from the circuit (increasing slack), the percentage of chips which can safely be operated at a higher clock frequency (parametric yield) declines.

Deterministic tuning maximizes the worst slack over all the timing tests. Specifically, since the worst slack is the deterministic minimum of $RAT_i - AT_i$ over all tests, the tuner tries to find a solution with the "best" or largest value of worst slack:

$$\max \left[ \min(RAT_i - AT_i) \quad \forall\, i \in \{\text{tests}\} \right] \qquad (3)$$
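The relation between the canonical form (1), chip slack (2), and yield can be illustrated with a small Monte Carlo sketch. This is only an illustration and not how the tuner computes yield: the two slack canonical forms, the sample count, and all function names below are assumptions made for the example.

```python
import random

def sample_canonical(form, dx):
    """Evaluate a canonical form a0 + sum(ai * dXi) for one sample dx, as in (1)."""
    a0, sens = form
    return a0 + sum(a * x for a, x in zip(sens, dx))

def chip_slack_samples(slack_forms, n_samples=20000, seed=1):
    """Monte Carlo samples of chip slack: the min over all timing-test slacks (2)."""
    rng = random.Random(seed)
    n_params = len(slack_forms[0][1])
    samples = []
    for _ in range(n_samples):
        # One draw of the normalized, independent process parameters.
        dx = [rng.gauss(0.0, 1.0) for _ in range(n_params)]
        samples.append(min(sample_canonical(f, dx) for f in slack_forms))
    return samples

def parametric_yield(samples, c):
    """Yield = P(C >= c) = 1 - CDF(c), estimated as a sample fraction."""
    return sum(1 for s in samples if s >= c) / len(samples)

# Hypothetical slack canonical forms (mean, sensitivities), in ps.
tests = [(10.0, [2.0, 0.5]), (12.0, [0.5, 3.0])]
samples = chip_slack_samples(tests)

# Deterministic worst slack, as used in (3): the min of the nominal slacks.
worst_nominal = min(a0 for a0, _ in tests)

print(worst_nominal)
print(parametric_yield(samples, 5.0), parametric_yield(samples, 10.0))
```

In this made-up example the yield at the nominal worst slack is below 50%, because both tests must pass at once; a deterministic view based only on the nominal values does not see that.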

Fig. 1. Chip slack (left) and yield (right). Left panel: PDF, chip slack probability versus chip slack. Right panel: 1 − CDF, parametric yield versus chip slack, with (a) optimize yield for target slack $c_t$ and (b) optimize slack for target yield $y_t$.

Statistical tuning, on the other hand, uses an objective function that depends on the chip slack distribution, as is shown graphically in Fig. 1. In the yield curve on the right, line (a) shows the optimization of yield for a target slack $c_t$, while line (b) illustrates the optimization of slack for a target yield $y_t$. Regardless of which objective is used, as seen on the left, the tuner shifts the PDF to the right (the chip is faster) and makes it narrower (the design is more robust).

We now pose the statistical tuning problem using the concept of chip slack and its relation to parametric yield. The variables of the tuning problem are a set of device widths $w = \{W_i\}$ and a set of node slews $s = \{S_i\}$.

$$\max_{w,s} \begin{cases} y(w, s; c_t) \\ c(w, s; y_t) \end{cases} \qquad (4)$$

$$\text{s.t.} \quad S_i^{\min} \le S_i \le S_i^{\max} \quad \forall\, i \qquad (5)$$

$$\text{s.t.} \quad S_b \ge s^k_{ab}(w, s) \quad \forall\, k \text{ edges } a \to b \qquad (6)$$

$$\text{s.t.} \quad \text{width, area, etc., constraints} \qquad (7)$$

Each of the above equations is described in more detail below.

Equation (4) is the tuning objective. We consider two possibilities: maximize yield for a given slack, $y(c_t) \equiv y(w, s; c_t)$, or maximize chip slack for a given yield, $c(y_t) \equiv c(w, s; y_t)$, using the formulation [10] of parametric yield in Fig. 1. Specifically, we assume that the chip slack PDF can be approximated as a Gaussian with mean $c_0$ and standard deviation $\sigma_c$, so parametric yield $y$ as a function of chip slack $c$ is:

$$y(c) = 1 - \Phi\left[\frac{c - c_0}{\sigma_c}\right] \qquad (8)$$

$\Phi(x)$ is the CDF of the unit normal PDF $\phi(x)$. For the tuner, the mean and standard deviation are functions of the optimization variables, i.e., $c_0 \equiv c_0(w, s)$ and $\sigma_c \equiv \sigma_c(w, s)$, so a yield objective for target slack $c_t$ is:

$$y(w, s; c_t) = 1 - \Phi\left[\frac{c_t - c_0(w, s)}{\sigma_c(w, s)}\right] \qquad (9)$$

The objective of slack for target yield is obtained by inverting (8) for a specific target value of yield $y_t$:

$$c(w, s; y_t) = c_0(w, s) + \Phi^{-1}(1 - y_t) \cdot \sigma_c(w, s) \qquad (10)$$
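The two objectives (9) and (10) can be evaluated numerically as in the following minimal sketch. It treats $c_0$ and $\sigma_c$ as plain numbers, whereas in the tuner they are functions of the widths and slews; the function names and sample values are assumptions for illustration. It relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

_UNIT = NormalDist()  # unit normal: cdf is Phi, inv_cdf is Phi^-1

def yield_at_slack(c_t, c0, sigma_c):
    """Yield objective (9): y = 1 - Phi((c_t - c0) / sigma_c)."""
    return 1.0 - _UNIT.cdf((c_t - c0) / sigma_c)

def slack_at_yield(y_t, c0, sigma_c):
    """Slack objective (10): c = c0 + Phi^-1(1 - y_t) * sigma_c."""
    return c0 + _UNIT.inv_cdf(1.0 - y_t) * sigma_c

# Hypothetical chip slack distribution: mean 10 ps, sigma 2 ps.
c0, sigma_c = 10.0, 2.0
print(slack_at_yield(0.5, c0, sigma_c))   # at 50% yield the slack is the mean
print(slack_at_yield(0.9, c0, sigma_c))   # a higher yield target costs slack
print(yield_at_slack(8.0, c0, sigma_c))   # yield at one sigma below the mean
```

Because (10) is the inverse of (8), feeding the slack returned for a target yield back into the yield function recovers that target, which is a convenient sanity check.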

$\Phi^{-1}(x)$ is the inverse unit normal CDF; since $y_t$ is a fixed number, $\Phi^{-1}(1 - y_t)$ is constant.

Equation (5) is the slew constraint. Each node slew $S_i$ has a technology-specific minimum $S_i^{\min}$ and maximum $S_i^{\max}$; for noise reasons, slew is typically not allowed to exceed a third of the cycle time.

Equation (6) is a result of treating slews deterministically. Since slews are propagated during static timing using the non-differentiable min or max operators, the slews become optimization variables, which leads to this constraint. Specifically, the slew $S_b$ at each node $b$ is constrained to be larger than the output slew of any edge incident upon that node in the timing graph. The output edge slews $s^k_{ab}(w, s)$ for the $k$th edge from node $a$ to node $b$ are nonlinear functions of a subset of transistor widths and node slews.

Equation (7) refers to a number of possible constraints which are used to keep the resulting device widths reasonable and to reduce problem complexity, i.e., transistor width and β ratio limits for the technology, input loading, total area, etc. These constraints are typically inequalities involving linear combinations of the device widths.

The statistical formulation differs from deterministic EinsTuner in three ways. First, the deterministic formulation adds node arrival times $\{AT_i\}$ as variables to the problem (see Section 3.1 in [4]). Next, in the same way that (6) is needed as a constraint for node slew in terms of edge output slew, there are additional constraints for node arrival times in terms of edge delay:

$$AT_b \ge AT_a + d^k_{ab}(w, s) \quad \forall\, k \text{ edges } a \to b \qquad (11)$$

The nonlinear functions $d^k_{ab}(w, s)$ give the delay for a transition along the $k$th edge from node $a$ to node $b$. Like the output edge slew functions $s^k_{ab}(w, s)$, these edge delays depend upon a subset of transistor widths and node slews. The third difference between the two tuners is that deterministic EinsTuner, in lieu of the objective function (4), defines another variable, $z$, to compute worst slack (3), and a set of additional constraints, one for each timing test:

$$z \ge AT_i - RAT_i, \quad \forall\, i \in \{\text{tests}\} \qquad (12)$$
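To make constraints (11) and (12) concrete, the sketch below evaluates them on a tiny hand-made timing graph with fixed edge delays. It is an assumption-laden illustration: in the tuner the delays are nonlinear functions of widths and slews and the arrival times are optimization variables, while here (11) is simply taken to be tight at every node. The graph, delays, and function names are hypothetical.

```python
def arrival_times(edges, primary_inputs):
    """Latest arrival time per node.

    edges: list of (a, b, delay) given in topological order.
    With (11) tight, AT_b is the max over incoming edges of AT_a + delay.
    """
    at = dict(primary_inputs)
    for a, b, delay in edges:
        at[b] = max(at.get(b, float("-inf")), at[a] + delay)
    return at

def worst_slack(at, required):
    """z is the smallest value satisfying (12); the worst slack is -z."""
    z = max(at[node] - rat for node, rat in required.items())
    return -z

# Hypothetical graph: two paths from "in" to "out", delays in ps.
edges = [
    ("in", "n1", 20.0),
    ("in", "n2", 35.0),
    ("n1", "out", 30.0),
    ("n2", "out", 10.0),
]
at = arrival_times(edges, {"in": 0.0})
print(at["out"])                        # 50.0
print(worst_slack(at, {"out": 60.0}))   # 10.0
```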

The deterministic tuner tries to maximize slack at the primary outputs by minimizing $z$.

EinsTuner [5] uses IPOPT [11], a general-purpose, large-scale, interior point nonlinear optimization package, which has an integrated C++ API that uses the concept of group partially separable functions (for details, see Section 3.3.1 of [12]). Briefly, a function $f(x)$, defined on a set of variables $x = \{X_i\}$, is group partially separable if it can be expressed as a sum, $f(x) = \sum_j g_j(x)$, of group functions $g_j(x)$, each of which depends upon a subset of $x$. Each group function itself is expressed as a sum of terms linear in $x$ combined with nonlinear functions of $x$ called nonlinear element functions (NLEs); each NLE that is defined also requires the specification of its gradient with respect to the optimization variables. EinsTuner poses its problem to IPOPT by defining a number of group functions. For example, an area constraint is specified by defining a group function with only a linear part and then adding it as a constraint.

Deterministic EinsTuner has two different types of NLEs, one for the output edge slew (6) and one for the edge delay (11). The values of these NLEs and gradients involve calls to EinsTLT to do a fast time-domain simulation [13] and adjoint sensitivity [14]. The NLE gradient computation uses a chain rule that combines the sensitivities of the adjoint calculation with the dependency of each function on the optimization variables, as described in [4]. For example, if a single width is used to size two different devices that impact a specific edge, then the sensitivity of the NLE functions with respect to that width needs to consider it twice, once for each device. The statistical tuner has the same slew formulation (6) as deterministic EinsTuner, and so uses the same NLE function and gradient for the output edge slew.
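The shared-width case mentioned above can be sketched as a simple accumulation. This is not the EinsTuner or IPOPT interface; the function name, the device and variable names, and the sensitivity values are all hypothetical, standing in for what an adjoint analysis would provide.

```python
from collections import defaultdict

def nle_gradient(device_sensitivities, device_to_variable):
    """Chain rule from per-device sensitivities to a per-variable gradient.

    device_sensitivities: {device: d(NLE)/d(device width)}.
    device_to_variable: {device: (variable, d(device width)/d(variable))}.
    """
    grad = defaultdict(float)
    for device, dnle_dwidth in device_sensitivities.items():
        var, dwidth_dvar = device_to_variable[device]
        # A variable that sizes several devices collects one term per device.
        grad[var] += dnle_dwidth * dwidth_dvar
    return dict(grad)

# Hypothetical edge: two NFETs share the width variable WN, one PFET uses WP.
sens = {"N1": -1.5, "N2": -0.5, "P1": -2.0}   # ps per um, made up
mapping = {"N1": ("WN", 1.0), "N2": ("WN", 1.0), "P1": ("WP", 1.0)}
print(nle_gradient(sens, mapping))  # {'WN': -2.0, 'WP': -2.0}
```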
The other NLE for the statistical tuner is the objective function, which, via its dependency on chip slack, depends upon the canonical forms for edge delay, as described in the remainder of this section. The details of the gradient computation for the objective function are given in Section III.

The canonical forms for edge delay are constructed by scaling the same delay functions computed for the deterministic tuner's NLE using the technique developed for transistor-level statistical timing [9]. Specifically, for the $k$th edge in the timing graph from node $a$ to node $b$, the canonical form for edge delay $d^k_{ab} \equiv d^k$ is:

$$d^k = d^k_0 + \sum_{i=1}^{n} d^k_i \Delta X_i = d^k_0 \cdot \left(1 + \sum_{i=1}^{n} \delta^k_i \Delta X_i\right) \qquad (13)$$

The term $d^k_0 \equiv d^k_{ab,0}$ is the nonlinear delay function $d^k_{ab}(w, s)$ in (11). The terms $d^k_i \equiv d^k_{ab,i}$ for process sensitivity in the canonical form are obtained by scaling $d^k_0$ by constant coefficients $\delta^k_{ab,i} \equiv \delta^k_i$, which are called asserted sensitivities. For example, if the $k$th edge delay has 10% variability on the $i$th process variable, then the asserted sensitivity coefficient is $\delta^k_i = 0.10$.
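As a small worked example of (13), the sketch below builds an edge-delay canonical form from a nominal delay and a list of asserted sensitivities. The 100 ps delay and the function names are assumptions chosen for illustration.

```python
import math

def edge_delay_canonical(d0, asserted):
    """Canonical form (13): mean d0 and sensitivities d_i = d0 * delta_i."""
    return d0, [d0 * delta for delta in asserted]

def sigma(sens):
    """Standard deviation of a canonical form whose dX_i are independent unit normals."""
    return math.sqrt(sum(a * a for a in sens))

# Hypothetical edge with a 100 ps nominal delay, 10% variability on the
# first process variable and 1% on the second.
d0, sens = edge_delay_canonical(100.0, [0.10, 0.01])
print(sens)          # approximately [10.0, 1.0] ps
print(sigma(sens))   # approximately 10.05 ps
```

Because every sensitivity is the nominal delay times a constant, the whole canonical form scales with $d^k_0$ as the tuner changes widths and slews.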

Fig. 2. Reconstruction of complement chip slack from chip slack and edge slack.

III. GRADIENT COMPUTATION

In this section, we describe how to compute the gradients which are needed by IPOPT to solve the problem described in Section II. The computation is via a chain rule that combines analytic expressions derived from the concept of edge slack [10] with the adjoint sensitivity calculation from EinsTLT. This section is self-contained; although it is essential to statistical EinsTuner, a detailed understanding of the chain-rule computation is not required to read the remaining sections of the paper.

The gradient is the sensitivity of each nonlinear function with respect to each of the optimization variables, i.e., the device widths $w = \{W_i\}$ and node slews $s = \{S_i\}$; these sensitivities are computed via the chain rule. If we let $v_i$ represent one of these width or slew variables, then the gradient of the slack objective function (10) with respect to $v_i$ is:

$$\frac{\partial c(y_t)}{\partial v_i} = \frac{\partial c_0}{\partial v_i} + \Phi^{-1}(1 - y_t) \cdot \frac{\partial \sigma_c}{\partial v_i} \qquad (14)$$

The gradient for the yield objective (9) is slightly more complicated:

$$\frac{\partial y(c_t)}{\partial v_i} = \frac{\partial y(c_t)}{\partial c_0} \cdot \frac{\partial c_0}{\partial v_i} + \frac{\partial y(c_t)}{\partial \sigma_c} \cdot \frac{\partial \sigma_c}{\partial v_i} \qquad (15)$$

The terms $\partial y(c_t)/\partial c_0$ and $\partial y(c_t)/\partial \sigma_c$ are simplified by chain ruling $\Phi(x)$ and using $\partial \Phi(x)/\partial x = \phi(x)$, which gives:

$$\frac{\partial y(c_t)}{\partial c_0} = \left(\frac{1}{\sigma_c}\right) \cdot \phi\left[\frac{c_t - c_0}{\sigma_c}\right] \qquad (16)$$

$$\frac{\partial y(c_t)}{\partial \sigma_c} = \left(\frac{c_t - c_0}{\sigma_c^2}\right) \cdot \phi\left[\frac{c_t - c_0}{\sigma_c}\right] \qquad (17)$$

After (16) and (17) are substituted into (15), the only two unknowns left for either gradient are $\partial c_0/\partial v_i$ and $\partial \sigma_c/\partial v_i$. Since the derivations of these two terms are similar, we only indicate how to compute $\partial c_0/\partial v_i$.

Chip slack depends upon every possible path from test inputs to outputs, so we need to form the chain rule over the set $E$ of all edges in the timing graph. For the $k$th edge, the chain rule combines the partials of $c_0$ with respect to delay $d^k$ (13) and the partials of $d^k$ with respect to $v_i$:

$$\frac{\partial c_0}{\partial v_i} = \sum_{\forall k \in E} \left[ \sum_{j=0}^{n} \frac{\partial c_0}{\partial d^k_j} \cdot \frac{\partial d^k_j}{\partial v_i} \right] \qquad (18)$$

Since the edge delay canonical form (13) has only one variable function, the last term, $\partial d^k_j/\partial v_i$, can be simplified by factoring out $\partial d^k_0/\partial v_i$ (with $\delta^k_0 \equiv 1$ for the mean term):

$$\frac{\partial c_0}{\partial v_i} = \sum_{\forall k \in E} \frac{\partial d^k_0}{\partial v_i} \cdot \sum_{j=0}^{n} \frac{\partial c_0}{\partial d^k_j} \cdot \delta^k_j \qquad (19)$$

The term $\partial d^k_0/\partial v_i$ is the sensitivity of the deterministic formulation's edge delay NLE, so the only remaining unknown in (19) is $\partial c_0/\partial d^k_j$.

To complete the chain rule, we use the notion of edge slack and complement edge slack from [10] (Fig. 2). For the edge $k$ from node $a$ to $b$, we imagine any cut-set partition which contains the edge such that the node $a$ and all the test inputs are in one partition and the node $b$ and all the test outputs are in the other partition. The late mode edge slack $\epsilon^k$ for the edge is the minimum slack of all paths through the edge:

$$\epsilon^k = RAT_b - d^k - AT_a \qquad (20)$$

where $RAT_b$ is the minimum (earliest) of all the required arrival times at node $b$ and $AT_a$ is the maximum (latest) of all the actual arrival times at node $a$. Since $RAT_b$ and $AT_a$ do not depend upon the edge delay $d^k$:

$$\frac{\partial c_0}{\partial d^k_j} = \sum_{i=0}^{n} \frac{\partial c_0}{\partial \epsilon^k_i} \cdot \frac{\partial \epsilon^k_i}{\partial d^k_j} = -\frac{\partial c_0}{\partial \epsilon^k_j} \qquad (21)$$

where we have used the fact that $\partial \epsilon^k_i/\partial d^k_j = -1$ when $i = j$ and is zero otherwise, which is implied by (20).

The complement edge slack $\kappa^k$ is the statistical minimum over the complement edge set, which consists of all the edges in the cut-set except for the edge $k$ itself. The minimum of edge slack and complement edge slack for any edge is, by definition, the chip slack:

$$c = \min(\epsilon^k, \kappa^k) \qquad (22)$$

Since chip slack can be computed by taking the minimum over all test outputs (2) and $\epsilon^k$ is known for any edge via (20), we can use the technique in Section IV of [10], called the "reconstruction method," to invert (22) to compute $\kappa^k$ without a cut-set. Moreover, since the edges in the complement edge set do not depend upon the edge delay $d^k$, there is no sensitivity of the complement edge slack $\kappa^k$ on $d^k$.

Once $\kappa^k$ has been determined, $\partial c_0/\partial \epsilon^k_j$ can be derived from expressions for the gradient of the statistical maximum of edge slack and complement edge slack in [10], specifically equations (14), (28), (29) and (30) in Section IV.B, using the fact that the min operator can be computed using $\min(a, b) = -\max(-a, -b)$:

$$\theta^k = \left\{ \sigma^2_{\epsilon^k} + \sigma^2_{\kappa^k} - 2 \sum_{j=1}^{n-1} \epsilon^k_j \kappa^k_j \right\}^{1/2} \qquad (23)$$

$$\frac{\partial c_0}{\partial \epsilon^k_0} = 1 - \Phi\left[\frac{\epsilon^k_0 - \kappa^k_0}{\theta^k}\right] \qquad (24)$$

$$\frac{\partial c_0}{\partial \epsilon^k_i} = -\left(\frac{\epsilon^k_i - \kappa^k_i}{\theta^k}\right) \cdot \phi\left[\frac{\epsilon^k_0 - \kappa^k_0}{\theta^k}\right] \quad \forall\, i \ne 0, n \qquad (25)$$

$$\frac{\partial c_0}{\partial \epsilon^k_n} = -\left(\frac{\epsilon^k_n}{\theta^k}\right) \cdot \phi\left[\frac{\epsilon^k_0 - \kappa^k_0}{\theta^k}\right] \qquad (26)$$
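The expressions (23) to (26) can be checked numerically against a closed form for the mean of the statistical min. The sketch below assumes Clark's moment-matching formula for that mean, which is standard in block-based statistical timing but is not spelled out in this paper; the canonical-form layout (a list whose last entry is the independent random term), the function names, and the sample values are likewise assumptions.

```python
import math
from statistics import NormalDist

_UNIT = NormalDist()  # unit normal: cdf is Phi, pdf is phi

def theta(eps, kap):
    """Eq. (23). Forms are lists [x0, x1, ..., xn]; indices 1..n-1 are
    shared process terms and index n is the independent random term."""
    var_e = sum(x * x for x in eps[1:])
    var_k = sum(x * x for x in kap[1:])
    cov = sum(e * k for e, k in zip(eps[1:-1], kap[1:-1]))
    return math.sqrt(var_e + var_k - 2.0 * cov)

def min_mean(eps, kap):
    """Mean c0 of min(eps, kap) via Clark's formula (assumed, see lead-in)."""
    th = theta(eps, kap)
    x = (eps[0] - kap[0]) / th
    return eps[0] * (1.0 - _UNIT.cdf(x)) + kap[0] * _UNIT.cdf(x) - th * _UNIT.pdf(x)

def min_mean_gradient(eps, kap):
    """Eqs. (24)-(26): d(c0)/d(eps_i) for i = 0..n."""
    th = theta(eps, kap)
    x = (eps[0] - kap[0]) / th
    n = len(eps) - 1
    grad = [1.0 - _UNIT.cdf(x)]                                  # (24)
    grad += [-(eps[i] - kap[i]) / th * _UNIT.pdf(x)              # (25)
             for i in range(1, n)]
    grad.append(-eps[n] / th * _UNIT.pdf(x))                     # (26)
    return grad

def finite_difference(eps, kap, i, h=1e-6):
    """Central difference of min_mean with respect to eps[i]."""
    up = list(eps); up[i] += h
    dn = list(eps); dn[i] -= h
    return (min_mean(up, kap) - min_mean(dn, kap)) / (2.0 * h)

eps = [5.0, 1.0, 0.5, 0.8]   # hypothetical edge slack, ps
kap = [6.0, 0.4, 1.2, 0.6]   # hypothetical complement edge slack, ps
analytic = min_mean_gradient(eps, kap)
numeric = [finite_difference(eps, kap, i) for i in range(len(eps))]
print(analytic)
print(numeric)
```

By (21), the sensitivities of $c_0$ to the edge delay terms are the negatives of the values computed here.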

Substituting (23) to (26) into (21), and then into (19), and combining the results with $\partial d^k_0/\partial v_i$ from the adjoint calculation completes the chain rule.

IV. NUMERICAL RESULTS

The optimization of statistical chip slack at 50% parametric yield is analogous to a basic mode in deterministic EinsTuner called cycle-time optimization, which improves the worst case slack subject to constraints (limits on the total area $A$, β ratios, input loads, etc.). In both cases, as the sizing progresses, the optimizer focuses its effort on improving the slowest, most critical, path. In this section, we first illustrate the essential difference in the way the statistical and deterministic tuners work to improve critical paths, then compare the results on a set of benchmark circuits.

First, consider the six-inverter circuit in Fig. 3, which has two chains of three inverters driven by a common input A, with two outputs D1 and D2. This circuit has four timing paths from input rise/fall to outputs fall/rise. The timing tests (asserted RATs) are such that all four paths need to be considered during tuning. Variation is

Fig. 3. Six inverters in two chains.

added to this circuit by using canonical forms (13) with two terms using asserted sensitivities ($\delta^k_i$). Since the paths in the upper and lower chains are through different CCCs, we use different values for the asserted sensitivities for the paths in the upper and lower chains. This is physically reasonable, for example, if the FETs come from different threshold voltage families, e.g., the upper chain uses regular threshold voltage cells (i.e., $\delta^R_i$) while the lower chain is made from low threshold voltage cells ($\delta^L_i$). To compare the results of the two tuners, a distribution was created for the pre-tune configuration and the final deterministic solution using the same asserted sensitivities. The circuit was run through both tuners, generating canonical forms using asserted sensitivities $\delta^L_1 = 0.20$ and $\delta^L_2 = 0.01$ for the low threshold voltage devices, i.e., "threshold voltage variation for low voltage devices is a 20% effect within die and 1% globally," and $\delta^R_1 = 0.05$ and $\delta^R_2 = 0.01$ for the regular threshold voltage devices (these values are typical [9]). The PDFs for chip slack and the slack for the rise and fall delays on the outputs are shown in Fig. 4 for the pre-tune configuration, and the deterministic and statistical solutions. In

order to keep the figure readable, the distributions are scaled to the same height, i.e., they are not normalized to unit area. As can be seen by looking at the pre-tune distributions (the top of Fig. 4), initially the path to D1-R has the smallest value of slack. The tuners improve the timing of the circuit starting from this initial setting, which moves the distributions to the right and makes them narrower.

In order to improve the worst case slack, the deterministic tuner has to move any distribution to the right when it becomes critical, which is when its $\mu - 3\sigma$ point impacts the deterministic min. In this case, after only a few iterations, the paths to D2 must be considered. Eventually, the D1-F path becomes critical and the tuner is forced to move all four paths. The tuner is oblivious to the fact that the distributions are unequal and impact the chip slack distribution differently. The final deterministic solution (the middle of Fig. 4) has all four distributions with more or less the same $\mu - 3\sigma$ common point (about 0.757 ps).

The statistical tuner, on the other hand, moves the narrow distributions "inside" the broader ones, which enables it to find the solution that gives more area to the devices in the top chain, making them faster. The resulting solution (the bottom of Fig. 4) has the four paths contributing to the final chip slack PDF. The difference in the two solutions is apparent looking at the resulting transistor widths, which are shown in Table I. Compared to the deterministic tuner, the statistical tuner has slowed down the paths to D2 and made the paths to D1 faster, so we expect the devices I3, I4 and I5 in the lower chain that drive D2 to be smaller and the devices I0, I1 and I2 in the upper chain to be larger in the statistical solution.

TABLE I. SIX-INVERTER CIRCUIT DEVICE WIDTHS (IN µm).

device   pre-tune   deterministic   statistical
I0 N1    2.00       5.15            10.15
I0 P1    2.00       8.20            17.85
I1 N1    2.00       7.80            8.30
I1 P1    2.00       6.50            9.20
I2 N1    3.00       8.25            7.55
I2 P1    4.50       12.80           13.55
I3 N1    2.00       13.15           9.85
I3 P1    2.00       19.90           9.90
I4 N1    2.00       14.90           16.20
I4 P1    2.00       11.65           10.80
I5 N1    3.00       13.20           6.85
I5 P1    4.50       15.70           13.50
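The shift of area between the two chains is easy to see by summing the widths of Table I per chain. The snippet below is plain arithmetic on the table values; the assignment of I0 to I2 to the upper chain and I3 to I5 to the lower chain follows the text, and the names in the code are illustrative.

```python
# Widths in um copied from Table I: (deterministic, statistical) per device.
WIDTHS = {
    "I0 N1": (5.15, 10.15), "I0 P1": (8.20, 17.85),
    "I1 N1": (7.80, 8.30),  "I1 P1": (6.50, 9.20),
    "I2 N1": (8.25, 7.55),  "I2 P1": (12.80, 13.55),
    "I3 N1": (13.15, 9.85), "I3 P1": (19.90, 9.90),
    "I4 N1": (14.90, 16.20), "I4 P1": (11.65, 10.80),
    "I5 N1": (13.20, 6.85), "I5 P1": (15.70, 13.50),
}
UPPER = ("I0", "I1", "I2")  # inverters in the upper chain, driving D1

def chain_totals(column):
    """Total width per chain; column 0 is deterministic, 1 is statistical."""
    upper = sum(w[column] for name, w in WIDTHS.items()
                if name.split()[0] in UPPER)
    lower = sum(w[column] for name, w in WIDTHS.items()
                if name.split()[0] not in UPPER)
    return round(upper, 2), round(lower, 2)

print(chain_totals(0))  # (48.7, 88.5)  deterministic: upper, lower
print(chain_totals(1))  # (66.6, 67.1)  statistical: upper, lower
```

The totals are close overall (137.2 µm versus 133.7 µm), but the statistical solution moves roughly 18 µm of width into the upper chain and takes about 21 µm out of the lower one.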

There are situations where both tuners should find essentially the same solution, for example, if there is only one critical path that can be improved subject to the constraints, or more generally, when there is no difference between the statistical and deterministic timing on all the potential critical paths. It is straightforward to show that for canonical forms based on (13), if the same asserted sensitivities are used for all variables for all paths, deterministic and statistical timing will give essentially the same results. This is due to the fact that when $\delta^k_i$ is the same for two canonical forms $X$ and $Y$, the PDF of the statistical $\min(X, Y)$ is virtually indistinguishable from the PDF of the input which has the smaller mean, independent of the tightness probability. In other words, as long as $x_0 < y_0$, it does not matter if $x_0 \ll y_0$ or $x_0 \approx y_0$; the PDF for the statistical $\min(X, Y)$ is essentially the same as the PDF for $X$. As a consequence, deterministic timing tracks statistical timing, as the mean of the statistical $\min(X, Y)$ will be the same as the deterministic min of the means, $\min(x_0, y_0)$. For such

Fig. 4. Chip slack and the rise/fall slack at the primary outputs for the six-inverter circuit (Fig. 3), for pre-tune configuration (top), the deterministic (middle) and statistical (bottom) solutions. Panels: Pre-Tune, Deterministic Tuner, Statistical Tuner. Horizontal axis: chip slack (0.72 to 0.78); vertical axis: probability. Curves: Chip Slack, D1-R, D1-F, D2-R, D2-F; marker: Worst Deterministic Slack.

problems we expect both tuners to find the same chip slack PDF; differences should be minor, due to the inherent noise in the adjoint sensitivity calculation, final "snap-to-grid" adjustments, etc., and the fact that we are comparing stationary-point solutions to different nonconvex optimization problems. However, for circuits like the one in Fig. 3, where there is more than one critical path and the paths have different sensitivities, we expect the statistical tuner to find a "better" solution.

Table II shows the results of the deterministic and statistical tuners on a set of benchmark circuits using the same values of asserted sensitivities for regular and low threshold voltage devices as was done for the six-inverter circuit. These benchmarks were taken from the deterministic EinsTuner regression suite and each test has at least two competing critical paths with different variability. For each circuit we list the total number of devices, nets, independently tunable devices, the number of CCCs, and the chip slack mean $c_0$ and sigma $\sigma_c$ for the pre-tune, deterministic and statistical solutions, along with the number of IPOPT iterations and total CPU time. For some tests, the pre-tune configuration is not feasible; in these cases, the pre-tune values are not shown.

The data in the table demonstrate the viability of device sizing based on statistical static timing for circuits that are sensitive to process variability. As expected, the statistical tuner is able to find a solution which has more yield for the faster implementations. In some cases, the relative (per iteration) performance of the statistical tuner is comparable to the deterministic tuner, though in others it is somewhat slower, e.g., circuit "ifaa", due to an increase in the number of EinsTLT iterations required for each IPOPT iteration (the time spent in EinsTLT dominates the total CPU time). We believe this is due to the fact that even though the formal problem for the statistical tuner has fewer total variables, NLEs and constraints, the statistical objective function depends upon all optimization variables, which can, in some cases, increase the average number of EinsTLT iterations per IPOPT iteration.

However, since the statistical tuner is finding a solution which is better than the deterministic tuner, it is reasonable to ask at what IPOPT iteration the statistical tuner found a solution comparable to the deterministic tuner. In Table II, the column "equiv itr" shows the equivalent number of iterations it took the statistical tuner to get to a solution that is essentially equivalent to the deterministic tuner; for example, for "iiff", it took the statistical tuner 12 iterations to get to a chip slack distribution that would be considered an improvement over the one derived from the deterministic solution; the remainder of the iterations were spent to improve the performance of the circuit another 7.5 ps.

The significance of these results is not to argue that a statistical tuner is better than a deterministic one. Rather, since a statistical approach to timing allows the simultaneous consideration of different timing tests and process variations, a statistical tuner makes automatic device sizing, an important circuit design productivity aid, available to the design of circuits which are sensitive to process variation and a challenge for techniques which rely only on deterministic timing.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a novel transistor sizing approach which takes yield considerations into account. We described how to add process sensitivity aware optimization to EinsTuner, an existing state-of-the-art tuner. The resulting tuner augments the tool's device sizing capability to exploit the statistical capability of EinsTLT and EinsTimer to implement a statistical objective function which optimizes parametric yield for a given slack, or optimizes chip slack for a given yield. Experiments show that for circuits which are dependent upon process variation, the statistical tuner can find better solutions than deterministic tuning, making it an attractive alternative to other sizing methods, such as Monte Carlo. To the best of our knowledge, this is the first description of a working system that can

optimize device sizes using a process variation aware tuner for fully custom circuits. Deterministic EinsTuner has options to support tuning objectives other than cycle-time optimization; we are investigating how to extend our statistical tuner to cover these modes. We are also looking at how to model slews statistically, and also how to simultaneously optimize both power and timing in the presence of process variations.

TABLE II. BENCHMARK CIRCUITS.

Problem size and pre-tune chip slack ($c_0$ and $\sigma_c$ in ps):

name   devices   nets   tunable   CCCs   pre-tune c0   pre-tune σc
iiff   16        18     16        7      -30.43        3.70
tgmx   84        44     36        24     -79.36        3.46
str1   174       93     30        50     infeas        infeas
str2   234       127    38        22     42.11         6.72
sti2   240       127    45        67     infeas        infeas
mslc   280       153    134       64     infeas        infeas
uss0   422       178    102       74     -169.59       8.55
ifaa   937       510    156       251    -279.49       9.24
lsom   6763      2112   636       1093   30.18         1.55
ibsx   8382      4363   108       2084   37.87         10.09

Deterministic and statistical solutions ($c_0$ and $\sigma_c$ in ps, CPU time in seconds):

       deterministic                        statistical
name   cpu sec   IPOPT itr   c0        σc      cpu sec   IPOPT itr   equiv itr   c0        σc
iiff   5.3       11          -28.02    3.41    51.0      77          12          -20.79    5.85
tgmx   18.9      12          -52.38    2.10    21.9      14          14          -50.37    2.02
str1   9.2       8           -39.91    6.47    11.2      14          14          -34.29    6.41
str2   16.9      10          41.75     8.02    30.1      8           4           45.98     10.84
sti2   18.0      10          -45.31    8.36    16.4      4           4           -36.43    7.90
mslc   21.2      29          -137.15   15.93   104.2     44          30          -129.79   2.10
uss0   464.2     28          -128.87   6.44    298.9     14          14          -117.31   6.25
ifaa   40.4      12          -270.00   9.69    70.7      14          14          -250.26   10.32
lsom   416.9     16          38.19     3.57    601.1     12          12          38.44     4.21
ibsx   401.6     7           39.93     10.34   1521.6    14          10          40.56     9.23

REFERENCES

[1] J. P. Fishburn and A. E. Dunlop. TILOS: A posynomial programming approach to transistor sizing. IEEE International Conference on Computer-Aided Design, pages 326–328, November 1985.
[2] S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S. M. Kang. An exact solution to the transistor sizing problem for CMOS circuits using convex optimization. IEEE Transactions on Computer-Aided Design of ICs and Systems, CAD-12(11):1621–1634, November 1993.
[3] V. B. Rao, J. P. Soreff, T. B. Brodnax, and R. E. Mains. EinsTLT: transistor-level timing with EinsTimer. Proc. TAU (ACM/IEEE workshop on timing issues in the specification and synthesis of digital systems), December 1999. Austin, TX.
[4] A. R. Conn, I. M. Elfadel, W. W. Molzen, Jr., P. R. O'Brien, P. N. Strenski, C. Visweswariah, and C. B. Whan. Gradient-based optimization of custom circuits using a static-timing formulation. Proc. 1999 Design Automation Conference, pages 452–459, June 1999. New Orleans, LA.
[5] K. Bard, B. Dewey, M. Hsu, T. Mitchell, K. Moody, V. Rao, R. Rose, J. Soreff, and S. Washburn. Transistor-level tools for high-end processor custom circuit design at IBM. Proceedings of the IEEE, 95(3):530–554, March 2007.
[6] A. Srinivasan, K. Chaudhary, and E. S. Kuh. RITUAL: A performance driven placement algorithm for small cell ICs. IEEE International Conference on Computer-Aided Design, pages 48–51, November 1991.
[7] C. Visweswariah, K. Ravindran, K. Kalafala, S. G. Walker, and S. Narayan. First-order incremental block-based statistical timing analysis. Proc. 2004 Design Automation Conference, pages 331–336, June 2004. San Diego, CA.
[8] H. Chang and S. S. Sapatnekar. Statistical timing analysis considering spatial correlations using a single PERT-like traversal. IEEE International Conference on Computer-Aided Design, pages 621–625, November 2003. San Jose, CA.
[9] D. Sinha, A. Bhanji, C. Visweswariah, G. Ditlow, K. Kalafala, N. Venkateswaran, and S. Gupta. A hierarchical transistor and gate level statistical timing flow for microprocessor designs. 2009 Design Automation Conference User Track, July 2009. San Francisco, CA.
[10] J. Xiong, V. Zolotov, and C. Visweswariah. Incremental criticality and yield gradients. Design and Test in Europe, pages 1130–1135, March 2008. Messe Munich, Germany.
[11] A. Wächter. An interior point algorithm for large-scale nonlinear optimization with applications in process engineering. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, January 2002.
[12] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT: A Fortran Package for Large-Scale Nonlinear Optimization (Release A). Springer Verlag, 1992.
[13] C. Visweswariah and R. A. Rohrer. Piecewise approximate circuit simulation. IEEE Transactions on Computer-Aided Design of ICs and Systems, 10(7):861–870, July 1991.
[14] P. Feldmann, T. V. Nguyen, S. W. Director, and R. A. Rohrer. Sensitivity computation in piecewise approximate circuit simulation. IEEE Transactions on Computer-Aided Design of ICs and Systems, 10(2):171–183, February 1991.