The Counterflow Pipeline Processor Architecture

ROBERT F. SPROULL
IVAN E. SUTHERLAND
Sun Microsystems Laboratories, Inc.

CHARLES E. MOLNAR
Washington University

THE COUNTERFLOW pipeline processor (CFPP) architecture is our proposal for a simple and regular pipeline structure to underlie a family of RISC processor microarchitectures. The CFPP uses a bidirectional pipeline, in which instructions flow and results counterflow, to move partially executed instructions and results through the processor, subject to a few pipeline rules that guarantee correct operation. It uniformly handles functions that add complexity to conventional designs, including operand forwarding, register renaming, and pipeline flushing after branches and traps. The pipeline's structure favors asynchronous implementations patterned after micropipelines [1]. The CFPP structure has a number of properties that promise advantages:

■ Local control. The CFPP requires only local information to decide whether an item in the pipeline should advance. It does not need to compute global pipeline stall signals and distribute them to all pipe stages. The complexity of a global stall computation and the time required to compute and distribute the signal are major headaches in current processor designs. Local control also facilitates asynchronous implementation.
■ Regularity. The CFPP's very regular structure should allow a regular silicon layout and should also facilitate devising correctness proofs.
■ Local communication. Pipeline stages communicate primarily with their nearest neighbors, allowing short communication paths that can be very fast.
■ Modularity. The uniform communication behavior of pipeline stages allows variations in the detailed design of individual stages and in their ordering. For example, one design may call for a single ALU, another for separate adder-subtracter and multiplier units, yet both will have the same overall pipeline structure.
■ Simplicity. The structure's overall simplicity may make CFPP processors easier to design than their conventional counterparts.


As yet, we do not know how CFPPs will perform. Their mostly local communications and control should permit very fast operation. On the other hand, CFPP designs may introduce performance-limiting delays, such as the time between instruction issue and acquisition of all required operands. We have not yet simulated enough CFPP variants to understand the limits to achievable performance.


Basic structure

In a CFPP, a relatively long pipeline connects an instruction fetch unit at the bottom with a register file at the top (see Figure 1). Instructions flow up through the stages of this instruction pipeline in sequence. Earlier instructions remain above later ones because the design forbids instructions to pass one another. An instruction moves up as far and as fast as it can, stalling only when the pipeline stage immediately above cannot yet accept a new instruction or when it reaches the last pipeline stage equipped to execute it. Above a stalled instruction, a gap of arbitrary size may open in the pipeline; not every stage must contain an instruction.

Each instruction carries bindings for its source operands and its destination(s). Each binding associates a data value with a register name; it consists of a register name, a data value, and a validity bit to indicate whether the association is valid (see Figure 2). Register names include the general registers found in most RISC architectures, as well as other processor state components such as a condition code register. The pipeline in Figure 1 shows three bindings with each instruction; these might be, for example, two sources and one destination. The bindings of a just-launched instruction are all marked invalid but carry the register names given by the instruction. Executing an instruction places new data into the data values of the destination bindings, which are then marked valid. When the instruction reaches the top of the pipeline, the register file copies the data values recorded in the destination bindings into the corresponding registers.

Figure 1. Simplified CFPP. The register file sits at the top, a program counter and instruction fetch stage at the bottom; intermediate stages carry instructions upward and results downward.

Figure 2. Instruction and result bindings, with logic for implementing pipeline rules. The instruction latch holds the opcode and bindings; each binding carries a register name, a value, and a validity bit; comparator logic garners source bindings and updates result bindings.
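To make the binding structure concrete, here is a small model in Python. It is our sketch rather than anything from the original design; the names Binding and launch_bindings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Binding:
    """A CFPP binding: a register name, a data value, and a validity bit."""
    reg: str                  # register name, e.g. "A" or a condition code
    value: Optional[int] = None
    valid: bool = False       # does value really belong to reg?

def launch_bindings(sources, destinations):
    """Bindings of a just-launched instruction: named but marked invalid."""
    return ([Binding(r) for r in sources],
            [Binding(r) for r in destinations])

# A := B + C starts up the pipe with three invalid bindings.
srcs, dsts = launch_bindings(["B", "C"], ["A"])
```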

Until the register file receives this final record, instructions are speculative and a trap or branch can cancel them if necessary.

When an instruction executes, the CFPP uses the outputs in two ways. First, they enter the instruction's destination bindings and eventually retire into the register file as just described. Second, the stage that executes the instruction inserts each newly updated destination binding into the downward-flowing result pipeline so that subsequent instructions can observe it. In Figure 1, each stage of the result pipeline accommodates two bindings. Any later instruction that requires a source binding whose register name matches the register name of a result binding will garner the value, that is, copy the value into the instruction pipe binding. Through these mechanisms, the result pipeline provides forwarding (also known as bypassing) uniformly for all stages [2]. As a result binding flows down, a subsequent instruction may modify it.

Each stage must detect matches between instruction and result bindings, that is, cases in which the register name in an instruction binding matches the register name in a result binding (see Figure 2). An executed instruction that has a destination binding matching a result binding must copy the value from the destination binding into the result binding so that later instructions will garner the most recent binding for that register. A different situation arises when the instruction has yet to execute. In that case, any result binding that matches a destination binding in the instruction is deleted from the result pipeline because the binding will not be valid for later instructions.


Thus, a particular result binding typically passes only a short span of instructions: the section between the instruction that computes the value and the next instruction that overwrites that value. The result modification rules ensure that every result meeting an instruction in the counterflow pipeline holds correct bindings for that instruction. Several different bindings for a register may be in transit in different parts of the pipeline at the same time. These multiple result bindings serve the same function as register renaming in other designs.

Another source of operands for instructions is the register file, which must also insert bindings into the result pipe. Many policies might be appropriate for determining which register values to fetch and send down the pipe. For example, the register file could fetch registers randomly. A preferable policy is to send down bindings known to match sources of instructions in the pipe; the relevant instructions will garner these register values, which will enable the instructions to execute. In one implementation of this policy, the instruction-decoding stage sends source register addresses to the register file, which fetches register values and sends corresponding bindings down the result pipe. The register-fetching policy will affect performance, even leading to deadlock if the policy prevents an instruction from ever receiving a value for one of its sources, but every policy leads to correct computations.

To ensure correct operation, the CFPP must detect matching bindings between each result flowing down and each instruction flowing up. Each result must meet each subsequent instruction in some pipeline stage, where comparators detect matching register names in bindings. Thus, we must prevent adjacent stages from simultaneously advancing an instruction and a result, which would prevent the detection of matches. In an asynchronous implementation, an arbiter between each pair of stages enforces this communication protocol.

Pipeline rules

Correct CFPP operation, described in general terms in the preceding section, depends on the following set of pipeline rules, which each stage must obey. (We give an additional rule in our later discussion of traps and conditional branches.) The first few pipeline rules concern instructions:

P0: No overtaking. Instructions must stay in program order in the instruction pipeline.

P1: Execute. If an instruction's source bindings are all valid, and if the stage holding the instruction contains suitable computing logic, the instruction can execute. When an instruction completes execution, the stage marks its destination bindings valid and fills in their values.

P2: Insert result. When an instruction completes execution, the stage inserts one or more copies of each of its destination bindings into the result pipeline.

P3: Stall for operands. An unexecuted instruction must not pass beyond the last pipeline stage capable of executing it.

P4: Execute before retiring. An instruction cannot retire into the register file until it has executed.
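As a rough sketch of how rules P1 and P3 shape a stage's local decision, consider the following Python fragment. The helper names (Instr, stage_action, the two flags) are hypothetical, and the real decision also involves the pipeline-space and arbitration conditions discussed later.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    sources: list          # Binding records, as sketched earlier
    executed: bool = False

def stage_action(instr, can_execute_here, last_capable_stage):
    """One stage's view of rules P1 and P3 for the instruction it holds."""
    operands_ready = all(b.valid for b in instr.sources)
    if can_execute_here and operands_ready:
        return "execute"   # P1: execute; destinations become valid (P2 then fires)
    if last_capable_stage and not instr.executed:
        return "stall"     # P3: may not pass the last stage able to execute it
    return "advance"       # otherwise move up as soon as there is room above
```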

The following are matching rules, which apply when an instruction and result are present in the same pipe stage and have matching bindings (see Figure 2):

M0: Garner instruction operands. If a valid result binding matches an invalid source binding, the stage must copy the result value to the source value and mark the source valid.

M1: Kill results. If an invalid destination binding matches a valid result binding, the stage must mark the result binding invalid.

M2: Update results. If a valid destination binding matches a result binding, the stage must copy the destination value into the result value and mark the result valid.
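The three matching rules translate almost directly into code. The sketch below is ours, reusing the Binding record from the earlier fragment; instr is assumed to carry sources and destinations lists.

```python
def apply_matching_rules(instr, result_bindings):
    """Apply rules M0-M2 when an instruction and results share a stage."""
    for r in result_bindings:
        # M0: garner operands. A valid result fills a matching invalid source.
        for s in instr.sources:
            if r.valid and not s.valid and s.reg == r.reg:
                s.value, s.valid = r.value, True
        for d in instr.destinations:
            if d.reg != r.reg:
                continue
            if d.valid:
                # M2: update results. An executed writer refreshes the result.
                r.value, r.valid = d.value, True
            else:
                # M1: kill results. A stale value must not pass its
                # unexecuted writer, so it is marked invalid here.
                r.valid = False
```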

An example

A simple example illustrates the operation of the counterflow pipeline under the pipeline rules. We assume a five-stage pipeline as in Figure 1. The first stage, F, contains a program counter, fetches instructions, and sends them up the instruction pipeline. It also sends source register addresses to the register file. Stages 2, 1, and 0 are identical, each containing an ALU capable of executing an integer instruction. The final stage, R, is the register file, containing registers A, B, C, and D. Our example uses the following instruction sequence, in which PC represents the program counter:

PC = 101: A := B+C
PC = 102: B := A+B
PC = 103: D := C-1

Table 1 presents a series of "snapshots" of the pipeline operation. Two columns show what the instruction and result registers of each pipeline stage are holding; we assume that each stage can hold two result bindings at once. The notation A[] denotes a source, destination, or result binding with register name A and an invalid value; A[14] denotes a valid value of 14 filled in. Upward- and downward-pointing arrows indicate that an instruction will move up or a result will move down in the next snapshot. Keep in mind that many other execution histories are possible because of differences in execution timing and in the movement of instructions and results through the pipeline.

Table 1. Snapshots of example pipeline operation.

Snapshot 1
  Stage  Instruction pipe       Result pipe     Remarks
  R                                             Register contents: A[14] B[2] C[3] D[21]
  0
  1
  2
  F      PC=101 ↑                               Fetch, send source names B, C to register file

Snapshot 2
  Stage  Instruction pipe       Result pipe     Remarks
  R                             B[2] C[3] ↓     Register contents: A[14] B[2] C[3] D[21]
  0
  1
  2      A[] := B[]+C[] ↑
  F      PC=102 ↑                               Fetch, send source names A, B to register file

Snapshot 3
  Stage  Instruction pipe       Result pipe     Remarks
  R                             A[14] B[2] ↓    Register contents: A[14] B[2] C[3] D[21]
  0                             B[2] C[3] ↓     Must not swap with instruction below
  1      A[] := B[]+C[] ↑                       Must not swap with result above
  2      B[] := A[]+B[] ↑
  F      PC=103                                 Fetch delayed due to cache miss

Snapshot 4
  Stage  Instruction pipe       Result pipe     Remarks
  R                             A[14] B[2] ↓    Register contents: A[14] B[2] C[3] D[21]
  0      A[] := B[2]+C[3]       B[2] C[3] ↓     Garner B, C; execute
  1      B[] := A[]+B[] ↑
  2
  F      PC=103 ↑                               Fetch, send source name C to register file

Snapshot 5
  Stage  Instruction pipe       Result pipe     Remarks
  R                             A[14] B[2] ↓    Register contents: A[14] B[2] C[3] D[21]
  0      A[5] := B[2]+C[3] ↑    A[5] ↓          Insert result
  1      B[] := A[]+B[2] ↑      B[2] C[3] ↓     Garner B
  2      D[] := C[]-1 ↑                         Literal -1 held in binding value
  F      PC=104 ↑                               Fetch, send source names to register file

Snapshot 6
  Stage  Instruction pipe       Result pipe     Remarks
  R      A[5] := B[2]+C[3]      A[5] B[2] ↓     Register contents: A[5] B[2] C[3] D[21]
  0      B[] := A[5]+B[2] ↑     A[5] ↓          Garner A; execute
  1      D[] := C[3]-1 ↑        B[2] C[3] ↓     Garner C; execute
  2      ...
  F      PC=105 ↑                               Fetch, send source names to register file

The first snapshot, at the top of Table 1, shows the CFPP's initial state. In the second snapshot, stage F has fetched and launched the first instruction into the instruction pipeline with source and destination bindings corresponding to the register names given in the instruction. All bindings are marked invalid. The instruction also contains the opcode and possibly other information, such as the program counter. Stage F has transmitted the two source register names to the register file so that their values can be fetched and inserted into the result pipeline. In the last snapshot, rule M2 has altered the binding for A in result stage R. In this example, instructions execute whenever all source operands are available, not necessarily in the order of instruction issue.

Figure 3. CFPP with memory and arithmetic sidings. Unlabeled stages may contain logic for executing additional instructions such as shifts. Labeled stages include the register file, memory-recover, and add-recover.

Traps and conditional branches

Two of the most troublesome aspects of conventional processor design are traps and conditional branches. A full CFPP structure such as that shown in Figure 3 accommodates these cases easily. The basic idea is that a specially marked result traveling down the result pipeline invalidates instructions following a trap or a wrongly predicted branch.

If execution of an instruction causes a trap, the stage that generates the trap inserts a special trap result binding instead of the normal destination bindings into the result pipeline. To invalidate subsequent instructions, we introduce the following additional pipeline rule, applied when a result and instruction meet in a stage:

M3: Kill instruction. If a result binding is either a trap result or a wrong-branch result, the stage must mark the instruction invalid. The instruction can proceed up the pipeline, but it will have no effect on the result pipeline or the register file.

When the trap result reaches the bottom of the result pipeline, the stage responsible for program counter control (the bottom stage in Figure 3) interprets it, changing the program counter so as to start fetching instructions from a suitable trap handler. As these new instructions enter the pipeline, they will not meet the trap result and so will remain valid. The trap result may carry information from the offending instruction, such as the program counter or trap condition, to be used in handling the trap.

The CFPP handles conditional branches similarly. A conditional-branch instruction, which cites the condition code register as a source, goes up the instruction pipeline like any other instruction. Meanwhile, the program counter control makes a branch prediction and continues to launch instructions into the pipeline from the predicted instruction stream. When the conditional-branch instruction executes, if it determines that the branch was predicted incorrectly, it sends down the result pipeline a wrong-branch result binding, which contains a value for the correct program counter target of the branch. This special result will kill all subsequent instructions it meets in the pipeline. When the branch result reaches the program counter control stage, the program counter changes to reflect the result of the branch, and instructions will then be fetched from the proper path.

We can usefully divide the activities for handling conditional branches among several stages. The arrangement of Figure 3, for example, splits responsibility for program counter control and instruction decoding into separate stages. The instruction decoder can be responsible for branch predictions and can send new program counter values to the program counter stage via the result pipeline.
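A sketch of rule M3 and of the program counter stage's reaction, in the style of the earlier fragments. The kind field, the set_pc callback, and the fixed trap-handler address are all invented for illustration.

```python
TRAP, WRONG_BRANCH = "trap", "wrong-branch"

def trap_handler_address(result):
    return 0x80   # hypothetical fixed handler entry point

def apply_m3(instr, result):
    """M3: a trap or wrong-branch result kills each instruction it meets."""
    if getattr(result, "kind", None) in (TRAP, WRONG_BRANCH):
        instr.valid = False   # still flows upward, but with no further effect

def pc_control(result, set_pc):
    """Bottom stage: interpret a special result when it arrives."""
    kind = getattr(result, "kind", None)
    if kind == WRONG_BRANCH:
        set_pc(result.value)                  # correct target carried in the result
    elif kind == TRAP:
        set_pc(trap_handler_address(result))  # start fetching the trap handler
```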

Function units and sidings

The CFPP architecture allows different stages to perform different kinds of processing. For example, one stage might execute integer arithmetic instructions; another, floating-point. One stage might add; another, multiply. If the multiplier is above the adder in the instruction pipeline, multiply-add sequences used to form inner products will take advantage of the result pipe's ability to forward a multiplication result quickly to the adder.

The pipeline can also use auxiliary sidings to execute instructions. A stage in the instruction pipeline launches an operation into a siding, and one or more later stages recover results. The sidings themselves are pipelined, so several operations can progress concurrently. While a siding is performing its job, the instruction that launched the operation proceeds normally along the instruction pipeline, recovering and handling results in proper sequence with other instructions in the pipe. Instructions invalidated by traps or conditional branches still must recover results from operations they have previously launched into sidings, to keep the sidings and the instruction pipeline coordinated.

Figure 3 shows two arithmetic sidings: a multiplier and an adder. When an instruction calling for an addition reaches the add-launch stage, it stalls if necessary to wait for all source operands to be valid and then launches the addition into the siding. The instruction proceeds up the instruction pipeline and recovers a result from the siding at the add-recover stage, waiting if necessary for the siding operation to finish. The add-recover stage puts the result value into its destination binding and launches a result into the result pipeline, just as if execution of the instruction required only the resources of the add-recover stage.

Figure 3 also shows a memory siding that has two recovery points. At the memory-launch stage, a memory load instruction launches a memory load operation into the bottom of the siding by providing the memory address to the siding. The memory siding responds in one of two ways at the first recovery point. If the value was present in a cache, the siding delivers it at this point. If not, the memory siding indicates a miss to the instruction pipeline, and the load instruction advances up the pipeline without a valid destination binding. At the final recovery point, the memory must return a value or indicate an error such as a page fault. This arrangement reports cached data values early so that they can move quickly down the result pipeline to serve as operands for instructions that need them.

Instructions can launch memory write operations at the same point as loads. Processing a write is complicated by the requirement that values in memory change only after traps and conditional branches for all earlier instructions have occurred. One way to meet this requirement is to have write operations commit at the final stage of the instruction pipeline, memory-recover, even though write data may have been supplied to the siding earlier.

Many siding arrangements are possible; Figure 3 shows some possibilities. One drawback of sidings is that they reduce the regularity of pipeline design and layout and may lead to delays or slow paths. On the other hand, sufficient pipelining of the sidings can usually sustain high throughput. The siding mechanism also may provide a simple way to integrate functionally specialized coprocessors into a machine architecture to achieve concurrent operation as well as perfect synchronization with the instruction stream.
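The launch/recover discipline is essentially a FIFO between two pipeline stages. The toy class below is our sketch of an adder siding; real sidings are themselves pipelined and must deliver results even to instructions that traps have killed, as the text notes.

```python
from collections import deque

class AdderSiding:
    """Toy siding: operations launched in order are recovered in order."""
    def __init__(self):
        self.in_flight = deque()

    def launch(self, a, b):
        # add-launch stage: fires only once both operands are valid (P3)
        self.in_flight.append(a + b)

    def recover(self):
        # add-recover stage: a real design waits for the siding to finish;
        # this sketch assumes the result is already available
        return self.in_flight.popleft()

siding = AdderSiding()
siding.launch(2, 3)        # instruction passes the add-launch stage
print(siding.recover())    # the same instruction reaches add-recover: 5
```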

Register files and register caches

The CFPP architecture permits several separate register files to coexist at different places along the pipeline, for example, separate fixed-point and floating-point register files. But any instruction that may alter the contents of a register must execute before passing the corresponding register file.

Augmenting the design in Figure 3 with a register cache helps reduce the latency entailed in fetching values from the register file and passing them down the result pipeline. Located just above the instruction decoder, the register cache stage can copy any result binding that passes. When an instruction passes through the cache stage, it garners values for source bindings from bindings held in the cache. Moreover, if a binding held in the cache matches the instruction's destination, the cache entry must be invalidated because execution of the instruction will change the value. Trap results and wrongly predicted branch results must sweep the cache of values computed by instructions subsequent to the trap or branch. Rather than trying to identify these registers exactly, the results can invalidate the entire cache.

A register cache also reduces the load on the register file and the result pipeline. Rather than fetching registers required for each instruction as it is decoded as shown in Figure 3, a modified design fetches registers only when the cache fails to supply a value for a binding in a passing instruction.
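The register cache's duties (copying passing results, supplying source values, invalidating entries about to be overwritten, and sweeping on traps) fit in a few lines. The sketch below is ours, under the same assumptions as the earlier fragments.

```python
class RegisterCache:
    """Sketch of the register cache stage above the instruction decoder."""
    def __init__(self):
        self.values = {}                       # register name -> cached value

    def observe_result(self, binding):
        if binding.valid:                      # copy any passing result binding
            self.values[binding.reg] = binding.value

    def pass_instruction(self, instr):
        for s in instr.sources:                # garner sources from the cache
            if not s.valid and s.reg in self.values:
                s.value, s.valid = self.values[s.reg], True
        for d in instr.destinations:           # destination will change the value
            self.values.pop(d.reg, None)

    def sweep(self):
        """Trap or wrong-branch result: simplest to invalidate everything."""
        self.values.clear()
```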

Implementations

There are numerous ways to implement CFPP structures. Some of the variations are immediately evident: the number of pipeline stages, the functional specialization of each stage, types of sidings and the locations of their launch and recovery points, and so on. This section describes less obvious variations: the packaging and control of the information that flows in the pipeline, and synchronous and asynchronous implementations.

Packet structure. One way to increase performance is to combine several result bindings into a fixed-size result packet that flows down the pipeline as a unit. The examples presented in Figures 1 and 2 and Table 1 use result packets containing two bindings. Alternatively, a result packet can contain three bindings, so that a single packet can carry all the operands used by a typical instruction: two full-word operands and a condition code. The packet structure can make special provision for the condition code binding, whose value requires only a few bits. On the other hand, one might argue that the instruction and result pipelines should operate at about the same speed, and therefore a result packet should have room for all the source and destination bindings for one instruction: perhaps two full-word operands, a condition code operand, and a full-word destination. Carrying a separate condition code destination binding is not necessary because the destination value will overwrite the source value as the packet passes any instruction that changes the condition code. Likewise, the instruction pipe can carry instruction packets containing more than one instruction; this would be analogous to a conventional superscalar design.

Each pipeline stage contains latches for holding an instruction packet and a result packet. These latches provide the storage essential for pipelined processing. Connected to the latches are circuits that enforce the pipeline rules, including the comparators that detect matching instruction and result bindings (Figure 2).
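A fixed-size result packet is just an array of binding slots with first-free insertion; when no slot is free, the stage must insert a fresh packet (the elastic control discussed next). Our sketch, with three slots as in the text's three-binding variant:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Slot:
    reg: Optional[str] = None
    value: Optional[int] = None
    valid: bool = False

@dataclass
class ResultPacket:
    """Fixed-size group of result bindings that flows down as a unit."""
    slots: List[Slot] = field(default_factory=lambda: [Slot() for _ in range(3)])

    def insert(self, reg, value):
        """Fill the first free slot; False means a fresh packet is needed."""
        for s in self.slots:
            if not s.valid:
                s.reg, s.value, s.valid = reg, value, True
                return True
        return False
```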

Pipeline control. The pipeline control we envision for CFPP designs is elastic. That is, packets can be inserted into or removed from a stream of packets. Thus, when an instruction executes, it inserts a new result into the result path. This result might occupy a previously invalid result binding in a result packet, or it might require inserting a fresh result packet into the result pipe. When a trap or conditional branch kills an instruction, it can delete the instruction packet from the instruction pipe unless the instruction must synchronize with a siding. Rule M1 can delete invalidated results from the result pipe.

We can, however, achieve correctness of the CFPP structure without insertion and deletion. Rather than deleting an instruction or result, we can simply mark it invalid and let it propagate through the rest of the pipe without causing side effects. Rather than inserting new result packets, we can send a continuous supply of packets containing empty result binding slots down the result pipe and insert a result binding in the first empty slot that passes after an instruction executes.

A major objective of the CFPP design is to allow local control of the pipeline. Whether a packet can advance from one stage to the next depends on only the state of the two stages. Three conditions must be met for advancement: 1) The packet must be ready to advance; for example, if an instruction is executing, the execution must be complete. 2) There must be space in the next stage. 3) Adjacent stages may not exchange an instruction and a result. These conditions require only local knowledge.

The need for a packet to obtain space in the next stage before advancing has an unexpected consequence: the pipeline achieves maximum throughput when it is half full. The left column of Figure 4a is a schematic of a unidirectional pipeline with storage latches shown as squares and combinational logic shown as circles. A darkened square represents a latch containing a packet. The bottom three latches are full and the top latch is empty. The column of squares labeled t2 shows the next state of the pipeline; the topmost packet has advanced upward. Columns t3 and t4 show subsequent states of the pipeline. Note that only one packet moves in each cycle. Figure 4b shows the same structure with empty and full stages alternating. This configuration achieves maximum throughput (each packet advances on each cycle), but the pipeline is only half full.

To recover full use of the logic, which may be extensive if it is performing floating-point arithmetic, we can provide two latches with each stage. For example, Figure 4c shows a standard master-slave register for each stage, and Figure 4d shows two symmetric latches and a multiplexer, which eliminate the need to transfer master to slave. The pipeline achieves the highest throughput when one of the two latches associated with each stage is full and one is empty. In these configurations, each stage is always processing a packet through its logic, and each packet can advance.

Figure 4. Pipelines operating at maximum throughput have half their latches empty.
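The half-full property is easy to check with a toy simulation (ours, not from the article). Model a ring of single latches in which a packet advances only into a latch that was empty at the start of the cycle; steady-state moves per cycle peak at 50 percent occupancy.

```python
def throughput(n_latches, n_packets, cycles=1000):
    """Average packet movements per cycle in a ring of single latches."""
    ring = [i < n_packets for i in range(n_latches)]
    moves = 0
    for _ in range(cycles):
        nxt = ring[:]
        for i in range(n_latches):
            j = (i + 1) % n_latches
            # advance only if the destination latch was empty this cycle
            if ring[i] and not ring[j]:
                nxt[i], nxt[j] = False, True
                moves += 1
        ring = nxt
    return moves / cycles

for k in range(1, 8):
    print(k, throughput(8, k))   # peaks at k = 4, i.e., a half-full ring
```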

Countersynchronized control. What distinguishes the CFPP from two independent counterflowing pipelines is that its two pipelines must interact to apply the pipeline rules. A stage applies the rules by inspecting the states of its adjacent stages and preventing certain items from advancing. Figure 5 shows a state diagram that reflects whether an instruction or a result or both are present at each pipeline stage. The following are the names of the states:

E: Empty. Neither an instruction nor a result is present.
I: Instruction. Only an instruction is present.
R: Result. Only a result is present.
F: Full. Both an instruction and a result are present.
C: Complete. The pipeline rules have been enforced, and both instruction and result are ready to move on. In practice, we might further divide this state to allow the result to advance while the instruction executes.

Transitions in Figure 5 that involve movement of instructions and results have the following labels: AI (accept instruction from below), PI (pass instruction upward), AR (accept result from above), and PR (pass result downward). No transitions that simultaneously advance instructions and results are allowed. For example, in state R, either the result can pass down or an instruction can be accepted from below, but not both.

Synchronous implementation. Synchronous implementations of the CFPP are straightforward. They achieve local control easily by means of a combinational function on the state of neighboring stages, each of which uses the representation shown in Figure 5. The control can use a fixed policy to decide between passing an instruction or a result packet when either is possible. One sensible policy is to pass instructions preferentially, because a processor's ultimate aim is to shove instructions through the pipe as rapidly as possible.

Many instructions and results passing through stages will require little or no processing: little to enforce the pipeline rules and none if instructions and results are not both present. When executing, an instruction may pause longer in a stage. Thus, a two-speed clocking strategy, using one very fast clock that takes a single clock cycle to handle a nonexecuting item and more cycles to execute an instruction, may be advantageous.
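One plausible encoding of Figure 5's diagram as a transition table (our reading of the text; in particular, the arc from F to C is the internal rule-enforcement step, and no state permits an instruction and a result to advance at once):

```python
# States: E(mpty), I(nstruction), R(esult), F(ull), C(omplete).
# Events: AI accept instruction, PI pass instruction,
#         AR accept result,      PR pass result,
#         "rules" = enforce the matching rules (internal step F -> C).
TRANSITIONS = {
    ("E", "AI"): "I", ("E", "AR"): "R",
    ("I", "AR"): "F", ("I", "PI"): "E",
    ("R", "AI"): "F", ("R", "PR"): "E",
    ("F", "rules"): "C",
    ("C", "PI"): "R", ("C", "PR"): "I",
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event} not permitted in state {state}")

s = "E"
for e in ["AI", "AR", "rules", "PR", "PI"]:  # one instruction meets one result
    s = step(s, e)
print(s)  # back to "E"
```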

Figure 5. State diagram for a CFPP stage.

Asynchronous implementation. We devised the CFPP architecture with an asynchronous control structure in mind. Because all the stages are similar and the interfaces between them are identical, we can illustrate the asynchronous approach by showing the design of a typical stage and how adjacent stages communicate. Figure 6 shows the interface between stages, which consists of two communication processes and a control element called a cop.

Figure 6. Interface between two adjacent stages of an asynchronous CFPP implementation.

Throughout this discussion, we use transition-signaling conventions (also called two-phase or nonreturn-to-zero signaling) and bundled-data transfer protocols illustrated in other work [1], [6]. Each stage sends the cop signals indicating what the stage is prepared to do. The signal AI? indicates willingness to accept an instruction, PI? readiness to pass an instruction forward, AR? willingness to accept a result, and PR? readiness to pass a result forward. The C elements in the cop match requests to pass with willingness to receive and grant permission for the appropriate communication to proceed (signal GI or GR).

Figure 7. Cop internals.

Figure 8. State diagram of five-wire arbiter in the cop.

Figure 9. Stage-to-stage communications: CFPP (a); micropipeline (b).

The communication, in turn, uses the signals PI! and AI! or PR! and AR! to indicate completion to the stages it links. If both instruction and result communications are possible, the cop chooses one. (We call it the cop because it controls traffic between stages, letting traffic move in only one direction at a time.) After each communication, both participating stages must announce willingness to participate in the next communication by signaling on the X and Y wires. Only after both stages have determined that the transfer is complete can the cop allow the next transfer.

This communication protocol adheres to the five-state diagram shown in Figure 5. The complete process AI?, PI?, GI, AI!, PI!, X, Y corresponds to the lower stage's passing an instruction to the upper stage. The diagram represents this transaction as a single event, called AI in the upper stage and PI in the lower stage. The interpretation of the event in the state diagram is like the interpretation of communicating sequential processes (CSP) events [5]; it requires agreement and synchronization of the two communicators. The expansion of this event into the signaling protocol used by the cop requires a delicate interplay of arbitration and synchronization of adjoining stages. It is not the straightforward two-wire handshake expansion of a CSP event [6].

Cop internals. Figure 7 shows the cop implemented with a five-wire arbiter and C elements. When a C element receives both the pass and accept signals for a communication, it signals a request to the arbiter. The corresponding grant enables the communication, which eventually occurs. Finally, both stages indicate they are finished with the transaction by signaling on X and Y, freeing the arbiter for the next transaction.

The arbiter exhibits the eight states shown in Figure 8. Only one of these states actually requires arbitration, as indicated in the figure. Notice that R1 and G1 alternate, as do R2 and G2, and either grant alternates with D.
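A Muller C-element is simple to model, and it shows how the cop forms a request: a request toward the arbiter rises only when one stage's readiness to pass matches its neighbor's willingness to accept. This level-based Python fragment is our simplification of the transition-signaled design; the signal names follow Figures 6 and 7.

```python
def c_element(a: bool, b: bool, prev: bool) -> bool:
    """Muller C-element: output follows the inputs once they agree;
    while they disagree, it holds its previous value."""
    return a if a == b else prev

# A request for an instruction transfer forms only when the lower
# stage offers PI? and the upper stage offers AI? (cf. Figure 7).
pi_ready, ai_ready, prev_r1 = True, True, False
r1 = c_element(pi_ready, ai_ready, prev_r1)
print(r1)   # True: a request now goes to the five-wire arbiter
```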

Communication processes. In Figures 6 and 9a, instructions and results flow from stage to stage in a form of micropipeline. The cop tells the instruction transfer process, represented by the shaded area in Figure 9a, to act (GI). The transfer process moves an instruction from one stage to the next, latches it, and informs both stages that the transfer is complete (AI! and PI!). Figure 9 shows the clear correspondence between the CFPP and micropipeline structures. Their only significant difference is the arbitration needed to enable a transfer. The communication process represented by the shading can take a variety of forms, such as serial communication, purely self-timed data, and others. The wiring shown in the illustration most closely matches the micropipeline form. The latch is initially transparent: that is, input data flows to its outputs (data paths are not shown). The GI signal makes the latch opaque, so as to hold its output values, and then generates the AI!/PI! output. Finally, a transition on the AI? signal (shown entering the latch diagonally in Figure 9a) returns the latch to a transparent state. The pipeline latches can be physically associated with either of the two stages or with the cop.

Stage internals. Figure 10 shows a simplified view of the connections that implement instruction and result pipeline stages. Control wiring connects the communication processes so that when a stage receives a new instruction (AI!), the stage asks immediately to pass it onward (PI?), and when it has been passed (PI!), the stage immediately requests another (AI?). Similar wiring handles results.

Figure 10. Partial sketch of control wiring within a pipeline stage.

The remaining component of each stage is a full-empty detector, shown in Figure 11. Taking as input the four communication completion signals, it announces that a stage has become full or empty. A full signal starts the garnering process, which leads to two subtle behaviors:

■ Until the garnering process is finished and the X and Y signals have reached the adjacent cops, the instruction and result are locked in the stage. The lock occurs because the arbiters above and below the stage are locked, preventing further communication. A locked stage can take as long as necessary to capture newly garnered source values in the instruction latch. The X and Y sequencing signals unlock the arbiters.
■ Meanwhile, PI? has already requested permission to transfer the instruction to the next stage. Remember that the upper arbiter chose to move the result instead, leading to a full stage. If the upper arbiter had chosen to transfer the instruction, the garnering would have occurred in the stage above. Once garnering is complete, the stage unlocks both arbiters, which then can allow the instruction to move up and the result to move down.

Figure 11. Full-empty detector, which generates the X and Y completion signals.

This design places all the required arbitration in the cops. It allows each stage to assume that a new result or instruction may arrive during processing, but that neither an instruction nor a result may leave until processing is complete and the X and Y signals unlock the cops.

Control of instruction execution requires additional circuitry not shown in Figure 10. The data path from instruction latch to instruction output implied in the figure contains suitable combinational logic for computing destination values. The control path linking AI! and PI? may need a delay that corresponds to the logic delay. To stall an instruction that must wait for operands, a stage needs additional control circuitry that inhibits PI? until the instruction has garnered all operands from passing results and the combinational logic has finished computing new values. Devising a complete control circuitry design that is simple, fast, and correct poses a challenge. Once obtained, however, the design applies with minor variations to all stages of the CFPP.
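In two-phase signaling, each wire's state flips on every event, so presence can be computed from parities: an instruction is in the stage when the AI! wire has seen one more transition than the PI! wire. The detector sketch below is ours; the real circuit of Figure 11 also generates the X and Y completion signals.

```python
def stage_occupancy(ai, pi, ar, pr):
    """ai/pi/ar/pr: current parity (toggle state) of the AI!, PI!,
    AR!, PR! completion wires for one stage."""
    has_instruction = ai != pi      # one more accept event than pass event
    has_result = ar != pr
    full = has_instruction and has_result    # triggers garnering (the rules)
    empty = not (has_instruction or has_result)
    return full, empty

print(stage_occupancy(True, False, True, False))   # (True, False): full
```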

Relation to other approaches

Readers familiar with various processor designs will recognize in the CFPP structure many elements of other designs, but arranged differently. The CFPP is an extreme attempt to replace physically long communication paths with pipelined transport of data and instructions. As chip densities and transistor performance increase, the relative delay of long wires increases. We hope that a few very fast local pipeline transfers with local control can replace the single longer delay required to send bits an equivalent distance.

Whether this aim is achievable is a question that awaits further simulation and implementation studies.

At the opposite extreme from the CFPP is a dataflow architecture, in which the processor broadcasts a new result to an associative memory that holds instructions waiting to execute [7]. The new result may complete the source values necessary to execute one or more of these instructions. The processor identifies the instructions, reads them out of the associative memory, and routes them to a suitable functional unit. Comparators in the associative memory perform the same function as the comparators that enforce the CFPP's pipeline rules. However, the dataflow mechanism's associative memory presents a large load and long pathways to the broadcast wires, slowing the operation. Ideally, we would like to compare the performance of these two designs in a closed form, like that provided by the logical effort method [8]. Perhaps the best design is a compromise between entirely dataflow and entirely pipelined extremes. For example, we can imagine treating small blocks of instructions in a dataflow fashion but pipelining the blocks through the processor; results would be presented in one cycle to a few immediately following instructions and then to subsequent blocks.

THE SKETCH PRESENTED HERE of the CFPP architecture and its implementation is merely a beginning. The structure admits a great many variations and raises a number of questions:

■ What organization of stages and their capabilities can efficiently process instruction streams acceptable to conventional RISC processors? Does the result pipeline's forwarding mechanism allow greater concurrency than conventional RISC designs?
■ What instruction-ordering and other compile-time measures will improve the CFPP's performance?
■ How can we improve the asynchronous implementation? Can we devise stage-to-stage signaling protocols that would allow higher instruction throughput? Should we attempt to use less arbitration, or is arbitration normally fast enough to be used freely?
■ Is the latency of operand delivery via the result pipeline too great to be practical? One view is that a processor's critical delay extends from the moment it computes a value until it identifies the next instruction that depends on that value and delivers both instruction and value to a computational unit that can execute the instruction. Is the counterflow pipeline a practical mechanism to achieve this control and communication?
■ What is the best performance we can expect from a CFPP design, and can the structure compete with alternative processor architectures?

The last question, of course, is critical. The CFPP design calls for more circuitry than a conventional design, for example, in the comparators that implement the pipeline rules. Does performance alone justify this cost? Or is the CFPP so much easier to design than a conventional processor that the reduced design time is the competitive selling point? The CFPP's simple, regular structure offers benefits such as modularity, ease of layout, and simple correctness arguments. Further work will tell us whether these benefits outweigh its apparent disadvantages.

Acknowledgments
The CFPP project would not have been possible without the energy, inspiration, and contributions of Ian Jones, Chris Kappler, Mike Wessler, and Robert Yung. We are extremely grateful to Wes Clark, who gave this article a meticulous critique.

References
1. I.E. Sutherland, "Micropipelines," Comm. ACM, Vol. 32, No. 6, 1989, pp. 720-738.
2. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, Calif., 1990, p. 261.
3. C.A. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980.
4. W.A. Clark and C.E. Molnar, "Macromodular Computer Systems," in Computers in Biomedical Research, Vol. 4, R. Stacy and B. Waxman, eds., Academic Press, New York, 1974, pp. 45-85.
5. C.A.R. Hoare, Communicating Sequential Processes, Prentice-Hall, Englewood Cliffs, N.J., 1985.
6. E.L. Brunvand and R.F. Sproull, "Translating Concurrent Communicating Programs into Delay-Insensitive Circuits," Proc. ICCAD, IEEE Computer Society Press, Los Alamitos, Calif., 1989.
7. V. Popescu et al., "The Metaflow Architecture," IEEE Micro, Vol. 11, No. 3, June 1991, pp. 10-13, 63-73.
8. I.E. Sutherland and R.F. Sproull, "Logical Effort: Designing for Speed on the Back of an Envelope," in Advanced Research in VLSI, Carlo H. Sequin, ed., MIT Press, Cambridge, Mass., 1991.
9. E.L. Brunvand, "Parts-r-Us: A Chip Apart," Tech. Report CMU-CS-87-119, Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, 1987.
10. D.L. Dill, Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits, MIT Press, Cambridge, Mass., 1989.

Robert F. Sproull is a vice president and fellow at Sun Microsystems Laboratories. At the company's laboratory in Chelmsford, Massachusetts, he leads a section that focuses on reducing the impedance between users, computers, and information. Since his undergraduate days, he has been building computer graphics hardware and software: early clipping hardware, an early device-independent graphics package, page description languages, laser printing software, and window systems. He has also been involved in VLSI design, especially of asynchronous circuits and systems. Prior to joining Sun, he was a principal of Sutherland, Sproull and Associates, an associate professor at Carnegie Mellon University, and a member of the Xerox Palo Alto Research Center. Sproull is the coauthor with William Newman of Principles of Interactive Computer Graphics. He is a member of the IEEE Computer Society.

Ivan E. Sutherland is a vice president and fellow at Sun Microsystems Laboratories. Sketchpad, his 1963 Massachusetts Institute of Technology PhD thesis, pioneered the computer graphics field. His 1966 work, with Sproull, on a head-mounted display anticipated today's virtual reality systems. He is the cofounder of Evans and Sutherland, which manufactures advanced computer image generators. As head of the Computer Science Department at Caltech, he helped make integrated circuit design an acceptable field of academic study. Sutherland is on the boards of several small companies and is a member of the National Academy of Engineering, the National Academy of Sciences, the ACM, and the IEEE. He received the ACM's Turing Award in 1988.

Charles E. Molnar is a professor in the Institute for Biomedical Computing of Washington University and a consultant to Sun Microsystems Laboratories. He has studied asynchronous systems since his involvement in the Washington University Macromodule Project. He has special interests in computer systems for biomedical research applications and in the relationship of computation models to computing mechanisms. He received BS and MS degrees from Rutgers University and an ScD from Massachusetts Institute of Technology, all in electrical engineering.

Send correspondence to Robert F. Sproull, Sun Microsystems Laboratories, 2 Elizabeth Dr., Chelmsford, MA 01824; or [email protected]. A somewhat more detailed technical report on the CFPP architecture, including a sketch of a correctness argument, is available from Sproull.
