Probability of State Transition Errors in a Finite State Machine Containing Soft Failures

269 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 3, MARCH 1984 State Transition Errors in a Finite State Machine Containing Soft Failures Probabil...
Author: Clarence Conley

IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 3, MARCH 1984

Probability of State Transition Errors in a Finite State Machine Containing Soft Failures

GUANG XING WANG AND G. ROBERT REDINBO, SENIOR MEMBER, IEEE

Abstract-A Markov model of a finite state machine realization containing gates and memory elements, each subject to internal soft errors, is given and a computational method for determining the probability of state transition errors is presented. The system inputs are taken as stochastically driven and the long-run stationary probability distribution of the states is developed. Similar results are determined for a fault-tolerant realization using error-correcting codes to form cluster states according to a technique of Reed. The state transition error performance of the coded machine is compared to that of the original realization; a dramatic improvement is seen when normally small soft error rates are encountered. The computational aspects of the required probabilities are investigated and a simplified approximation approach is proposed and analyzed. Bounds on the approximation inaccuracies are derived.

Index Terms-Approximation bounds, cluster states, error-correcting codes, fault-tolerant machines, long-run state probabilities, Markov chains, soft errors, soft fails, state transition errors.

I. INTRODUCTION

A NEW type of error, the so-called soft fail, is prevalent in VLSI logic circuits. Soft fails are random nonrecurring errors with which no physical defects are associated. With the development of semiconductor technology, the level of integration is continuously increasing, and consequently the influence of soft fails on the behavior of VLSI systems has become very important [1]-[4]. The analysis of the probability of error caused by soft fails in general combinational (VLSI) logic circuits has been discussed in previous literature [5], [6]. This paper addresses the analysis of the probability of error caused by soft fails in finite state machines (FSM). The errors which arise in the state transitions of an FSM are more serious than ones in other aspects of the system because it is difficult to recover from state errors without a major intervention of some external controller, which could be as large as the original machine and be more susceptible to internal noise. Furthermore, handling state errors with separate hardware can degrade the speed performance significantly. Therefore, we concentrate on analyzing the probability of errors which arise in the state transition of the machine. A general analysis procedure will be provided. Then a new type of fault-tolerant FSM using error-correcting codes and cluster states as described by Reed [7]-[9] will be reviewed. As an example, the probability of error which arises in the state transition of the FSM using error-correcting codes and cluster states will be calculated and compared to the general unprotected realization. The results show that this kind of coded FSM has impressive noise immunity.

Manuscript received November 3, 1982; revised July 27, 1983. G. X. Wang is with the Department of Computer Information and Science, Northeast Institute of Technology, Shenyang Liaoning, People's Republic of China. G. R. Redinbo is with the Center for Integrated Electronics, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12181.

II. PRELIMINARY DEFINITIONS AND NOTATION

We will begin with a formal definition of an FSM over alphabets which are contained within binary vector spaces. A finite state machine $A$ is defined through the specification of three spaces and two mappings.

$$A = (I, S, O; \delta_0, \omega_0) \tag{1}$$

where

$$\delta_0 : S \times I \to S,\ \text{state transition mapping}; \quad I,\ \text{input space (dimension } l\text{)}; \quad S,\ \text{state space (dimension } k\text{)}. \tag{2}$$

For the purposes of our analysis and in the interest of brevity the FSM $A$ will be described using only the input and state spaces along with the state transition mapping $\delta_0$; the output space $O$ and the output mapping $\omega_0$ will not be considered further. A binary error-correcting code may be used for the protection of the state components to form a fault-tolerant FSM [7], [8]. The code can be linear or nonlinear, systematic or nonsystematic; it is only the distance properties that are important to the fault-tolerant level. One example of a state encoding using a linear error-correcting code can be constructed as follows. Let $P$ denote the parity check subspace of an $(n-k)$ dimensional binary vector space. It is determined by the parity check equations of the linear code. Each set of $k$ positions represented by $x \in S$ gives rise to a distinct set of $(n-k)$ components $p \in P$. The direct product $(S \times P) = G$ is a linear subspace of $V$, the $n$-dimensional binary vector space. The state encoding is a one-to-one mapping from $S$ onto $G$.

$$S \xrightarrow{\ 1\text{-}1\ } G \ \ \text{(state encoding)}: \quad x \mapsto (x, p); \quad x \in S,\ k \text{ state positions}; \quad p \in P,\ (n-k) \text{ parity positions} \tag{3}$$

0018-9340/84/0300-0269$01.00 © 1984 IEEE

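where

$$G = (S \times P); \quad G \subset V; \quad G,\ \text{linear code}; \quad V,\ n\text{-dimensional vector space}. \tag{4}$$

The encoded next-state function will be denoted by $\Delta$ and is implicitly defined by

$$z(t+1) = \Delta(i(t), s(t), p(t)); \quad i(t) \in I,\ z(t+1) \in G,\ s(t) \in S,\ p(t) \in P. \tag{5}$$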

Since $G$ is a subset of the larger $n$-dimensional space $V$, it is possible to examine the extension of the encoded next state mapping $\Delta$ to the space $V$. This new extended mapping will be denoted by $\bar{\Delta}$ and involves the concept of "cluster states." Cluster states are determined by collecting all elements of $V$ which are close to an original state, defined as the center of the cluster. In this way the center of each cluster is an element of $G$. The decision function for clustering is based upon the Hamming distance function as defined on $V$.

$$D_H(u, v) = W_H(u \oplus v) \tag{6}$$

where $\oplus$ is a component-wise EXCLUSIVE-OR operation of binary vectors and $u, v \in V$; $W_H(u \oplus v)$ = weight of the binary vector $u \oplus v$.

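The extended next state mapping for an approach called bounded distance clustering may be defined.

$$\bar{\Delta}(i, (\beta, \gamma)) = \Delta(i, (s, p)) \quad \text{whenever } D_H((\beta, \gamma), (s, p)) \le \frac{d_G - 1}{2}, \quad \text{for all } i \in I;\ s, \beta \in S;\ p, \gamma \in P \tag{7}$$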

where $(s, p) \in G$ is the center of the cluster state, and $d_G$ denotes the minimum distance of the code. Thus, the cluster states associated with each $(s, p) \in G$ are defined as all elements within Hamming distance $(d_G - 1)/2$. Since there may be elements of $V$ that are not included as cluster states, we may assign "don't care" values to all values defined on these elements. These "don't care" terms allow simplification of the excitation logic for any realization of the encoded FSM. On the other hand, any "don't care" values appearing in the original state assignment on $S$, and therefore $G$, will carry over to the extended state space $V$. There are other ways to extend $\Delta$. For example, using a slightly broader definition of cluster states, a minimum distance clustering criterion could be employed. The extended next state mapping $\bar{\Delta}$ defines the excitation equations for implementing the FSM.

$$z(t+1) = \bar{\Delta}(i(t), (\beta, \gamma)); \quad z(t+1) \in G,\ i(t) \in I,\ \beta \in S,\ \gamma \in P. \tag{8}$$
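The bounded distance extension in (7) can be illustrated with a minimal Python sketch. The two-word code of length 4 and the toggle-style next-state table below are hypothetical and chosen only for brevity; the function name `extend_next_state` is likewise an assumption, not notation from the paper. Words of $V$ that fall outside every cluster are marked `None` to stand for "don't care" entries.

```python
from itertools import product


def hamming_distance(u, v):
    """Number of positions in which two equal-length binary tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def extend_next_state(next_state, n):
    """Extend a next-state table defined on code words to all of V = {0,1}^n.

    next_state maps each cluster center (a code word) to its next center for
    one fixed input. Words outside every cluster map to None ("don't care").
    """
    centers = list(next_state)
    d_min = min(hamming_distance(u, v) for u in centers for v in centers if u != v)
    radius = (d_min - 1) // 2  # bounded distance clustering radius
    extended = {}
    for word in product((0, 1), repeat=n):
        owner = next((c for c in centers if hamming_distance(word, c) <= radius), None)
        extended[word] = next_state[owner] if owner is not None else None
    return extended


# Hypothetical code {0000, 0111} with minimum distance 3 (not from the paper).
toy_table = {(0, 0, 0, 0): (0, 1, 1, 1), (0, 1, 1, 1): (0, 0, 0, 0)}
extended_table = extend_next_state(toy_table, 4)
```

With this toy code, a word one bit away from a center inherits that center's next state, while a word at distance two or more from both centers is left as a "don't care" entry.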

Note that $G$ is the set of the centers of the "cluster states." Label the $n$ memory elements which correspond to the $n$ components in each vector of $V$ by $M_1, M_2, \ldots, M_n$. Then the $n$ components of $z \in G$ define the required input to each $M_j$ so that it produces the proper state value.

$$h_j[i, (\beta, \gamma)] = \text{excitation equation for memory element } M_j, \quad j = 1, 2, \ldots, n; \quad \beta \in S;\ i \in I;\ \gamma \in P. \tag{9}$$

A sufficient condition for guaranteeing that one element's failure will affect only the value in the corresponding memory element is that the logic realizations of the individual excitation functions should not share any logic elements. It is important to note that each $h_j$ is defined on $V$ in such a way that each cluster maps to the proper next state as defined in $G$, the code. Thus, in the bounded distance case, even if any combination of no more than $(d_G - 1)/2$ excitation functions or memory elements fail, an element in the proper cluster of states will appear at the output of the respective memory elements. By the very construction of the clusters, each cluster of states is treated like the original encoded state in $G$, so the system's next state values will always be in the correct cluster of states under these conditions.

III. MODEL

A general synchronous sequential machine is represented schematically by the system in Fig. 1(a) where the input and output logic will be represented by circuits $C_1$ and $C_2$, respectively. We will assume for simplicity and ease of presentation that these combinational logic circuits are in sum-of-product form [Fig. 1(b)]. The extensions to more complicated realization forms will be evident after the discussion. The memory elements $M_i$ are flip-flops whose transitions are synchronized by means of a clock. It has been shown previously [5], [6] that the occurrence of soft errors$^1$ within a gate may be modeled at its output. We assume in logic segment $C_1$ that there are $p$ gates which drive the final $k$ gates in the sum-of-product realization [Fig. 1(b)]. These soft errors, modeled by $(p + k)$ Boolean noise variables, may complement the outputs of the gates. The variables $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$; $\eta_1, \eta_2, \ldots, \eta_k$ represent the internal noise caused by soft fails in the combinational circuit $C_1$. In addition, the binary random variables $\gamma_1, \gamma_2, \ldots, \gamma_k$ model soft errors in the memory elements [10]. Soft errors have a finite residency time, and it is assumed that their effects last longer than the period of the clock, effectively removing the clock signal from further consideration.
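The noise model just described, in which each soft error complements the output of the gate or memory element where it occurs, can be sketched for a single sum-of-product branch as follows. This is an illustrative sketch only: the function name `noisy_branch` and its argument layout are assumptions, and the implicant values are supplied directly rather than computed from inputs and present state.

```python
def noisy_branch(implicants, eps, eta, gamma):
    """One sum-of-product branch with soft errors modeled at gate outputs.

    implicants: noiseless outputs (0/1) of the gates feeding the final OR gate
    eps:        one noise bit per implicant gate; a 1 complements that output
    eta:        noise bit of the final OR gate
    gamma:      noise bit of the memory element driven by this branch
    """
    gate_outputs = [y ^ e for y, e in zip(implicants, eps)]
    or_output = int(any(gate_outputs)) ^ eta
    return or_output ^ gamma


# Noiseless case versus a single soft error on the first implicant gate.
clean = noisy_branch([1, 0], [0, 0], 0, 0)
hit = noisy_branch([1, 0], [1, 0], 0, 0)
```

Setting all noise bits to zero recovers the noiseless branch value, which is the comparison used in the analysis that follows.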

IV. ANALYSIS OF PROBABILITY OF THE STATE TRANSITION ERROR

In the noise environment the state transition function (2) contains the effects of the internal noise variables.

$$s(t+1) = \delta[(i(t); \varepsilon, \eta, \gamma), s(t)]; \quad \varepsilon, \eta,\ \text{internal noise variables of combinational circuit}; \quad \gamma,\ \text{memory noise variables}. \tag{10}$$

$^1$Soft errors are temporary and momentary errors generally induced by particle emissions or marginal device performance which leave no permanent or detectable change in the electronic structure of semiconductor devices. They may be contrasted with intermittent errors resulting from structural defects which because of random inputs occasionally produce errors. It is possible to electronically test for intermittent errors whereas soft errors leave no permanent trace except possibly in propagating logical errors throughout the system.


WANG AND REDINBO: PROBABILITY OF STATE TRANSITION ERRORS

Fig. 1. General system model; (a) block diagram; (b) state transition part.

The noiseless performance may be determined by inserting zero vectors for the noise variables.

$$s(t+1) = \delta[(i(t); 0, 0, 0), s(t)] = \delta_0[i(t), s(t)]; \quad 0,\ \text{zero vectors}. \tag{11}$$

A state transition error occurs when, due to a previous malfunction, the system arrives in the state $s_i$ but should be in state $s_j$ at time $(t + 1)$. The probability that the digital realization of the state transition function $\delta_0[i(t), s(t)]$ will be in error due to soft fails may be expressed

$$\xi = P\{\delta[(i(t); \varepsilon, \eta, \gamma), s(t)] \ne \delta[(i(t); 0, 0, 0), s(t)]\}. \tag{12}$$

For computational convenience we introduce a new function, the judgment function $J$.

$$J\{(i(t); \varepsilon, \eta, \gamma), s(t)\} = \begin{cases} 1, & \text{if } \delta[(i(t); \varepsilon, \eta, \gamma), s(t)] \oplus \delta[(i(t); 0, 0, 0), s(t)] \ne 0 \\ 0, & \text{if } \delta[(i(t); \varepsilon, \eta, \gamma), s(t)] \oplus \delta[(i(t); 0, 0, 0), s(t)] = 0 \end{cases} \tag{13}$$

Since the judgment function $J$ is binary-valued, the probability $\xi$ may be expressed as the mean of this function.

$$\xi = E\{J[(i(t); \varepsilon, \eta, \gamma), s(t)]\}. \tag{14}$$

As a reasonable beginning assumption, suppose that input signal variables are statistically independent, stationary, and possess an equiprobable distribution. Furthermore, it is realistic to assume that the input signal is independent of the internal noise variables representing the soft fails. Finally, as a good starting point the noise variables $(\varepsilon, \eta, \gamma)$ are assumed to obey a binomial distribution. Based upon these assumptions we will investigate the probabilistic behavior of the state variables. It is common practice to treat the variables as the components of vectors from appropriately defined binary vector spaces. The state transition function will be defined over an enlarged binary vector space of dimension $(l + p + 3k)$, designated by $U$. In addition, it will be convenient to define certain vector subspaces of $U$, starting with $I$ as the $l$-dimensional subspace representing the input variables.

$$I = \{i \in U;\ i = (i_1, i_2, \ldots, i_l, \underbrace{0, \ldots, 0}_{p + 3k}),\ i_j = 0, 1\}. \tag{15}$$

Likewise, identify the subspace $H_1$ with the noise variables $\varepsilon, \eta$ internal to the combinational logic circuit $C_1$.

$$H_1 = \{(\varepsilon, \eta) \in U;\ (\varepsilon, \eta) = (\underbrace{0, \ldots, 0}_{l}, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p, \eta_1, \eta_2, \ldots, \eta_k, \underbrace{0, \ldots, 0}_{2k}),\ \varepsilon_i, \eta_j = 0, 1\}. \tag{16}$$

The subspace $H_2$ is directly related with the noise variables in the memory elements,

$$H_2 = \{\gamma \in U;\ \gamma = (\underbrace{0, \ldots, 0}_{l + p + k}, \gamma_1, \gamma_2, \ldots, \gamma_k, \underbrace{0, \ldots, 0}_{k}),\ \gamma_i = 0, 1\}. \tag{17}$$

Finally, the subspace $S$ corresponds with the present state variables.

$$S = \{s \in U;\ s = (\underbrace{0, \ldots, 0}_{l + p + 2k}, x_1, x_2, \ldots, x_k),\ x_i = 0, 1\}. \tag{18}$$

With $V_i = I \oplus H_1 \oplus H_2$, where $\oplus$ denotes the direct sum of the respective subspaces, it follows that

$$U = V_i \oplus S. \tag{19}$$

The state transition function (10) may be stated

$$s(t+1) = \delta(i; \varepsilon, \eta, \gamma; s(t)) = \delta(v_i; s(t)); \quad s(t+1) \in S;\ s(t) \in S;\ v_i \in V_i. \tag{20}$$

The transition table corresponding to the noisy case has $2^{l+p+2k}$ columns, resulting from the present input $i$ and noise


variables $(\varepsilon, \eta, \gamma)$, and $2^k$ rows representing the distinct state variables. From the state transition table and the assumptions concerning the input and noise variables the one-step state transition matrix of the FSM may be developed. Let $M(i; \varepsilon, \eta, \gamma)$ be a $2^k \times 2^k$ stochastic matrix where the row $i$, column $j$ entry, $m_{ij}$, is the probability of a one-step transition from $s(t) = s_i \in S$ to $s(t+1) = s_j \in S$ when the composite vector $v$ consists of input variable $i$ and when internal noise variables $(\varepsilon, \eta, \gamma)$ are active.

$$m_{ij} = P\{\delta(i; \varepsilon, \eta, \gamma; s_i) = s_j\}; \quad i \in I,\ (\varepsilon, \eta) \in H_1,\ \gamma \in H_2. \tag{21}$$

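Another judgment function, $J_1\{\delta(i; \varepsilon, \eta, \gamma; s_i), s_j\}$, associated with the transition from state $s_i$ to $s_j$, can be defined.

$$J_1\{\delta(i; \varepsilon, \eta, \gamma; s_i), s_j\} = \begin{cases} 1, & \text{if } \delta(i; \varepsilon, \eta, \gamma; s_i) = s_j \\ 0, & \text{if } \delta(i; \varepsilon, \eta, \gamma; s_i) \ne s_j. \end{cases} \tag{22}$$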

Then the entries of $M(i; \varepsilon, \eta, \gamma)$ can be expressed as

$$m_{ij} = \sum_{i \in I} \sum_{(\varepsilon, \eta) \in H_1} \sum_{\gamma \in H_2} J_1\{\delta(i; \varepsilon, \eta, \gamma; s_i), s_j\}\, P(i; \varepsilon, \eta, \gamma). \tag{23}$$

Due to the assumptions outlined earlier, the input, internal noise, and memory noise vectors are statistically independent,

$$P(i, \varepsilon, \eta, \gamma) = P(i) \cdot P(\varepsilon, \eta) \cdot P(\gamma), \tag{24}$$

and the inputs are equiprobable

$$P(i) = 2^{-l}. \tag{25}$$

Let the probability of a single gate failure in the combinational circuit $C_1$ be denoted by $p_c$ whereas the probability of a single memory cell failure be labeled by $p_m$. Then one of the joint probability functions needed in (24) arises from the usual definition of the binomial distribution [11].

$$P(\varepsilon, \eta) = p_c^{W[(\varepsilon, \eta)]} (1 - p_c)^{p + k - W[(\varepsilon, \eta)]}; \quad (\varepsilon, \eta) \in H_1. \tag{26}$$

$W[h_1]$ is the Hamming weight of vector $h_1 \in H_1$. Likewise, the other item in (24) can be expanded.

$$P(\gamma) = p_m^{W(\gamma)} (1 - p_m)^{k - W(\gamma)}; \quad \gamma \in H_2. \tag{27}$$

Combining the appropriate equations (24)-(27) into (23) we arrive at

$$m_{ij} = 2^{-l} \sum_{i \in I} \sum_{(\varepsilon, \eta) \in H_1} \sum_{\gamma \in H_2} J_1\{\delta(i; \varepsilon, \eta, \gamma; s_i), s_j\} \cdot p_c^{W(\varepsilon, \eta)} (1 - p_c)^{p + k - W(\varepsilon, \eta)} \cdot p_m^{W(\gamma)} (1 - p_m)^{k - W(\gamma)}. \tag{28}$$

Under the prevailing conditions, the state transitions follow a Markov chain [11]. Its matrix $M$ is stationary. The Markov state transition process possesses stationary transition probabilities and, generally speaking, $0 < m_{ij} < 1$ for all $i, j$. Thus, the transition matrix $M = [m_{ij}]$ is a regular transition matrix, and for any initial probability distribution vector $\pi$, $\pi \times M^t$ approaches the vector $a$ as $t$ tends to infinity, where $a$ is the unique probability vector such that $a M = a$ [11]. Since the state transition matrix of an FSM is regular, this state probability distribution vector $a$ is unique.

$$a_i = P(s_i) = \frac{1}{m_i}; \quad m_i,\ \text{mean recurrent time of state } s_i. \tag{29}$$

The probability computation of the state transition errors relies upon several simplifications from the easily proved facts that $i(t)$ and $s(t)$ are independent of the noise variables $(\varepsilon, \eta, \gamma)$ and of each other.

$$\xi = \sum_{i \in I} \sum_{(\varepsilon, \eta) \in H_1} \sum_{\gamma \in H_2} \sum_{s \in S} J[(i(t); \varepsilon, \eta, \gamma), s(t)]\, P(i) \cdot P(\varepsilon, \eta) \cdot P(\gamma) \cdot P(s). \tag{30}$$

Substituting (25)-(27) and (29) into this equation yields

$$\xi = 2^{-l} \sum_{i \in I} \sum_{(\varepsilon, \eta) \in H_1} \sum_{\gamma \in H_2} \sum_{s_i \in S} J[(i(t); \varepsilon, \eta, \gamma), s(t)] \left\{ p_c^{W[\varepsilon, \eta]} (1 - p_c)^{p + k - W[\varepsilon, \eta]} \cdot p_m^{W[\gamma]} (1 - p_m)^{k - W[\gamma]} \cdot \frac{1}{m_i} \right\} \tag{31}$$

where $m_i$ is the mean recurrent time for state vector $s_i$ [11].

V. EXAMPLE OF ERROR PROBABILITY ANALYSIS
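As a numerical companion to (28)-(31), the following Python sketch builds the one-step transition matrix for a small machine, finds its stationary distribution by repeated multiplication, and sums the error probability. Everything specific here is an assumption made for illustration: the two-bit counter `delta0`, the sizes `L_BITS` and `K_BITS`, and a simplified noise model that keeps only the OR-gate noise $\eta$ and memory noise $\gamma$ (no $\varepsilon$ terms), so that each next-state bit is simply complemented by its two noise bits.

```python
from itertools import product

L_BITS, K_BITS = 1, 2  # hypothetical sizes: one input bit, two state bits
STATES = list(product((0, 1), repeat=K_BITS))
INPUTS = list(product((0, 1), repeat=L_BITS))
NOISE = list(product((0, 1), repeat=K_BITS))  # values taken by eta and by gamma


def delta0(i, s):
    """Hypothetical noiseless next-state map: a 2-bit counter enabled by the input."""
    value = (2 * s[0] + s[1] + i[0]) % 4
    return (value >> 1, value & 1)


def delta(i, eta, gamma, s):
    """Noisy map: each next-state bit is complemented by its gate and memory noise."""
    return tuple(b ^ e ^ g for b, e, g in zip(delta0(i, s), eta, gamma))


def binomial_weight(vec, p):
    """p^W (1-p)^(len-W) for a binary noise vector of Hamming weight W."""
    w = sum(vec)
    return p ** w * (1 - p) ** (len(vec) - w)


def joint_terms(pc, pm):
    """Yield (input, eta, gamma, probability) under independence and equiprobable inputs."""
    for i in INPUTS:
        for eta in NOISE:
            for gamma in NOISE:
                prob = 2 ** -L_BITS * binomial_weight(eta, pc) * binomial_weight(gamma, pm)
                yield i, eta, gamma, prob


def transition_matrix(pc, pm):
    m = [[0.0] * len(STATES) for _ in STATES]
    for r, s in enumerate(STATES):
        for i, eta, gamma, prob in joint_terms(pc, pm):
            m[r][STATES.index(delta(i, eta, gamma, s))] += prob
    return m


def stationary(m, steps=200):
    """Approximate the vector a with a M = a by iterating a <- a M."""
    a = [1.0 / len(m)] * len(m)
    for _ in range(steps):
        a = [sum(a[r] * m[r][c] for r in range(len(m))) for c in range(len(m))]
    return a


def transition_error_probability(pc, pm):
    a = stationary(transition_matrix(pc, pm))
    xi = 0.0
    for r, s in enumerate(STATES):
        for i, eta, gamma, prob in joint_terms(pc, pm):
            if delta(i, eta, gamma, s) != delta0(i, s):  # judgment function J = 1
                xi += prob * a[r]
    return xi
```

In this simplified model an error occurs exactly when some bit has $\eta \ne \gamma$, so the result can be checked against the closed form $1 - (1 - q)^k$ with $q = p_c(1 - p_m) + p_m(1 - p_c)$. A realization with shared implicant gates would not reduce this way, which is why the general enumeration is kept.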

A simple example will be used to better understand the analysis of state transition errors in a fault-tolerant FSM. An FSM realization and its state transition table are shown in Fig. 2(a) and (b), respectively. One choice of a fault-tolerant version will use a simple (5, 2) code for defining the cluster states. Such a code can correct single errors. The encoding rule used for this code choice gives the encoded state

$$s = (x_1, x_2, x_3, x_4, x_5)$$

where $x_3 = x_1$; $x_4 = x_2$; $x_5 = x_1 \oplus x_2$.
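The single-error-correcting property of this encoding can be checked by enumeration. The Python sketch below uses only the encoding rule stated above; the helper names are assumptions. It computes the minimum distance of the four code words and counts how many of the 32 words of $V$ fall inside some cluster under bounded distance clustering.

```python
from itertools import product


def encode(x1, x2):
    """Encoding rule from the text: x3 = x1, x4 = x2, x5 = x1 XOR x2."""
    return (x1, x2, x1, x2, x1 ^ x2)


def hamming_distance(u, v):
    return sum(a != b for a, b in zip(u, v))


CODEWORDS = [encode(x1, x2) for x1, x2 in product((0, 1), repeat=2)]
D_MIN = min(hamming_distance(u, v) for u in CODEWORDS for v in CODEWORDS if u != v)
RADIUS = (D_MIN - 1) // 2


def cluster_center(word):
    """Return the code word within RADIUS of word, or None for a don't-care word."""
    for center in CODEWORDS:
        if hamming_distance(word, center) <= RADIUS:
            return center
    return None


covered = sum(cluster_center(w) is not None for w in product((0, 1), repeat=5))
```

The enumeration gives a minimum distance of 3, so each of the four centers collects itself and its five single-bit neighbors. That accounts for 24 of the 32 words, leaving 8 available as "don't care" entries.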

A straightforward application of the clustering technique leads immediately to a simple realization of the required excitation equations in sum-of-product form. Each realization of individual excitation variables does not share logic gates with those of other variables. Thus, the elements of a next state variable vector are correct if a single element of the present state vector has become erroneous. Combinational circuits for parts of the coded FSM appear twice according to the encoding rules since a systematic code is being used. The next state for the $i$th state variable $x_i$ has the following generic form:

$$x_i^{t+1}[(i(t); (\varepsilon, \eta), \gamma), s(t)] = \left[ \bigvee_{j=1}^{p_i} \left( Y_j(i_1, i_2, \ldots, i_l; x_1^t, x_2^t, \ldots, x_k^t) \oplus \varepsilon_j \right) + \bigvee_{j=1}^{l} F_j \right] \oplus \eta_i \oplus \gamma_i \tag{32}$$

The term $Y_j(i_1, i_2, \ldots, i_l; x_1^t, x_2^t, \ldots, x_k^t)$ represents the $j$th implicant in the excitation logic while $F_j$, $j = 1, 2, \ldots, l$, denotes any single input variables that may enter directly into the final OR gate of the standard sum-of-product implementation of this logic. (In most cases many of the $F_j$ variables are zero.) The probability of error of the $i$th element of the next state representation is related to the extended input. (Note that the extended input includes the input signal, present state, and noise variables $(\varepsilon, \eta, \gamma)$.) Consider the conditional stationary probability of error of the $i$th branch output, given the input vector $i(t)$ and state vector $s(t)$. It may be expanded in a straightforward manner.

$$P_i\{e \mid i(t), s(t)\} = \sum_{(\varepsilon, \eta)_i \in H_i} \sum_{\gamma_i = 0}^{1} \left[ x_i^{t+1}(i; (\varepsilon, \eta)_i; \gamma_i, s) \oplus x_i^{t+1}(i; 0, 0; 0, s) \right] p_c^{W[(\varepsilon, \eta)_i]} (1 - p_c)^{p_i + 1 - W[(\varepsilon, \eta)_i]}\, p_m^{\gamma_i} (1 - p_m)^{1 - \gamma_i}, \quad i = 1, 2, \ldots, n; \quad n,\ \text{total number of output circuits}. \tag{33}$$

Fig. 2. Original finite state machine; (a) realization; (b) state transition table.

Let $P\{c \mid i(t), s(t)\}$ denote the stationary probability of correct state transition for the whole circuit, given $i(t)$ and $s(t)$, while $P\{e \mid i(t), s(t)\}$ is the label for the conditional stationary probability of error of the state transition. Because the internal noise vectors $(\varepsilon, \eta)_i$ are independent,

$$P\{c \mid i(t), s(t)\} = \prod_{i=1}^{n} \left(1 - P_i[e \mid i(t), s(t)]\right) + \sum_{k=1}^{n} P_k[e \mid i(t), s(t)] \prod_{\substack{j=1 \\ j \ne k}}^{n} \left(1 - P_j[e \mid i(t), s(t)]\right), \tag{34}$$

$$P\{e \mid i(t), s(t)\} = 1 - P\{c \mid i(t), s(t)\}.$$

Finally, incorporating our assumptions $P\{i(t)\} = 2^{-l}$ and $P\{s(t)\} = 1/m(s)$, we have

$$\xi = 2^{-l} \sum_{i(t) \in I} \sum_{s(t) \in S} \frac{1}{m(s)} P\{e \mid i(t), s(t)\} = 2^{-l} \sum_{i(t) \in I} \sum_{s(t) \in S} \frac{1}{m(s)} \left\{ 1 - \prod_{i=1}^{n} \left[1 - P_i(e \mid i(t), s(t))\right] - \sum_{k=1}^{n} P_k(e \mid i(t), s(t)) \prod_{\substack{j=1 \\ j \ne k}}^{n} \left[1 - P_j(e \mid i(t), s(t))\right] \right\} \tag{35}$$

where $P_i[e \mid i(t), s(t)]$ is calculated by (33). A computer program was used to calculate this probability of error for the original FSM, depicted in Fig. 2(a), and the similar probability for the corresponding FSM using error-correcting codes. The results appear in Fig. 3(a) and (b), as $p$, the probability of gate failure, varies over two ranges, $10^{-12}$ to $10^{-9}$ and $10^{-3}$ to $10^{0}$, respectively. Note that both axes employ logarithmic scales. The curves in Fig. 3(a) clearly show that the probability $\xi$ for the FSM using an error-correcting code is much less than in the original FSM when $p$, the probability of gate failure, is very small. On the other hand, from Fig. 3(b) we find that as $p$ increases, the probability of transition error in the FSM using an error-correcting code approaches that of transition errors in the original FSM.

VI. APPROXIMATION TECHNIQUES

Calculating the probability distribution of states for the fault-tolerant FSM using error-correcting codes and cluster states appears complex; the number of states in this kind of fault-tolerant FSM is much larger than in the original realization. Thus, it is desirable to reduce the computational effort, particularly with regard to the probability of state transition error. A simple approximation approach for the distribution of the states for a coded FSM is presented which is very accurate when the probability of gate failure $p$ is small. A first important step involves finding the one-step state transition matrix. It will be assumed for brevity that $p_c = p_m = p$. The notation

$$H = H_1 \oplus H_2$$

allows (28) to be expressed as

$$m_{ij} = 2^{-l} \sum_{a \in I} \sum_{(\varepsilon, \eta, \gamma) \in H} J_1\{\delta(a; (\varepsilon, \eta, \gamma); s_i), s_j\}\, p^{W[(\varepsilon, \eta, \gamma)]} (1 - p)^{n - W[(\varepsilon, \eta, \gamma)]}.$$

It may be divided into two parts

$$m_{ij} = 2^{-l} \left\{ \sum_{a \in I} J_1\{\delta(a; (0, 0, 0); s_i), s_j\} (1 - p)^n + \sum_{a \in I} \sum_{(\varepsilon, \eta, \gamma) \in H - \{0\}} J_1\{\delta(a; (\varepsilon, \eta, \gamma); s_i), s_j\}\, p^{W[(\varepsilon, \eta, \gamma)]} (1 - p)^{n - W[(\varepsilon, \eta, \gamma)]} \right\} \tag{36}$$

where the first part is called the principal term and denoted by $m_{ij}^{P}$ while the remaining part is labeled as $m_{ij}^{S}$. The second part contains the effects of the soft errors.

$$m_{ij} = m_{ij}^{P} + m_{ij}^{S}. \tag{37}$$

$$m_{ij}^{P} = 2^{-l} \sum_{a \in I} J_1\{\delta(a; (0, 0, 0); s_i), s_j\} (1 - p)^n. \tag{38}$$

Fig. 3. Probability of error for the uncoded and coded FSM; panels (a) and (b). Curve 1: probability of error in original FSM. Curve 2: probability of error in the FSM using an error-correcting code.

Recall that for the FSM using an error-correcting code, the next state mapping $\delta$ corresponds to the $\bar{\Delta}$ of (8), where $z(t+1) \in G$, and $G$ labels the centers of the cluster states. Thus, if there were no errors, the next state must be one of the centers of the cluster states no matter what the present state and input.

$$\delta(a; (0, 0, 0); s_i) = s_c; \quad c \in C,\ a \in I,\ s_i \in S$$

where $C$ is the set of indexes labeling the centers of the cluster states, and the judgment function

$$J_1\{\delta(i; (0, 0, 0); s_i), s_j\} = 0 \quad \text{if } j \notin C. \tag{39}$$

Substituting these properties into (38), we have

$$m_{ij}^{P} = \begin{cases} 0, & \text{if } j \notin C \\ 2^{-l} h_{ij} (1 - p)^n, & \text{if } j \in C \end{cases} \tag{40}$$

where $h_{ij}$ is the number of times the entry $s_j$ appears in the next state table in the row for present state $s_i$ under all inputs $a \in I$. By selecting $p$ small enough and defining $m_{ij}^{P0} = 2^{-l} h_{ij}$, it is possible to ensure that

$$m_{ij}^{P0} - m_{ij}^{P} = 2^{-l} h_{ij} \left(1 - (1 - p)^n\right) < n \cdot p, \quad j \in C. \tag{41}$$

This inequality will be useful later. The second part $m_{ij}^{S}$, due to the effects of soft errors, will be considered next.

$$m_{ij}^{S} = 2^{-l} \sum_{i \in I} \sum_{(\varepsilon, \eta, \gamma) \in H - \{0\}} J_1\{\delta(i; (\varepsilon, \eta, \gamma); s_i), s_j\}\, p^{W[(\varepsilon, \eta, \gamma)]} (1 - p)^{n - W[(\varepsilon, \eta, \gamma)]} \tag{42}$$

It is well known that $[m_{ij}]$ is a stochastic matrix and so for every row of the matrix

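$$\sum_{j=1}^{N} m_{ij} = 1; \qquad \sum_{j=1}^{N} m_{ij}^{P} + \sum_{j=1}^{N} m_{ij}^{S} = 1; \qquad (1 - p)^n \sum_{j=1}^{N} 2^{-l} h_{ij} + \sum_{j=1}^{N} m_{ij}^{S} = 1.$$

Since $\sum_{j=1}^{N} 2^{-l} h_{ij} = 1$,

$$\sum_{j=1}^{N} m_{ij}^{S} = 1 - (1 - p)^n$$

and $m_{ij}^{S} \ge 0$ for every $i$ and $j$. So $m_{ij}^{S} < n \cdot p$.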
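The two bounds used in this section, (41) for the principal term and the row-sum bound on the soft error term, both rest on the fact that $1 - (1 - p)^n < n \cdot p$ for small $p$. A short Python check is sketched below; the values of `n`, `p`, `h_ij`, and `l` in the usage lines are hypothetical, and the function names are assumptions.

```python
from math import comb


def soft_row_sum(n, p):
    """Sum over all nonzero noise vectors of p^W (1-p)^(n-W), grouped by weight W."""
    return sum(comb(n, w) * p ** w * (1 - p) ** (n - w) for w in range(1, n + 1))


def principal_gap(h_ij, l, n, p):
    """Gap 2^-l * h_ij * (1 - (1-p)^n) between the noiseless entry and the principal term."""
    return 2 ** -l * h_ij * (1 - (1 - p) ** n)


# Hypothetical values: nine noise variables, gate failure probability 1e-3.
n, p = 9, 1e-3
row_sum = soft_row_sum(n, p)
gap = principal_gap(h_ij=2, l=1, n=n, p=p)
```

Grouping the noise vectors by Hamming weight reproduces $1 - (1 - p)^n$, and both quantities stay below $n \cdot p$, in agreement with the inequalities above.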

-

hij hij =

0

a

and 2a

=

j=l

(Mj

i

n .p

1;

N= 2'.

(45)

hEC h*j

(muj-I)aj

4 C.

I-mijai

(46)

j E C.

-

l)ajji

+ ,

mij,ai

=

-Aj ;

itj,

(50)

Aj < \o

(N- Ke)*l