Multicycle Datapath Implementation


Author: Bertha Flowers


Adapted from instructor’s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK, and from Computer Architecture: From Microprocessors to Supercomputers, B. Parhami, 2005, Oxford Press

Review: A Multicycle Data Path

[Figure: data path built from PC, Cache, Inst Reg, Data Reg, Reg file, x Reg, y Reg, ALU, z Reg, and Control. The instruction register supplies the jta, rs, rt, rd, imm, op, and fn fields; the register file delivers (rs) and (rt); Control is driven by op and fn.]

Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1.

Feb. 2011 Computer Architecture, Data Path and Control Slide 2

[Figure: control state machine, with states laid out by clock cycle (Cycle 1 to Cycle 5)]

State 0 (Cycle 1, Start): InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘+’, PCSrc = 3, PCWrite = 1
State 1 (Cycle 2): ALUSrcX = 0, ALUSrcY = 3, ALUFunc = ‘+’; next state is 5 for Jump/Branch, 2 for lw/sw, 7 for ALU-type
State 2 (Cycle 3): ALUSrcX = 1, ALUSrcY = 2, ALUFunc = ‘+’; next state is 3 for lw, 6 for sw
State 3 (Cycle 4): InstData = 1, MemRead = 1
State 4 (Cycle 5): RegDst = 0, RegInSrc = 0, RegWrite = 1
State 5 (Cycle 3): ALUSrcX = 1, ALUSrcY = 1, ALUFunc = ‘−’, JumpAddr = %, PCSrc = @, PCWrite = #
State 6 (Cycle 4): InstData = 1, MemWrite = 1
State 7 (Cycle 3): ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc = Varies
State 8 (Cycle 4): RegDst = 0 or 1, RegInSrc = 1, RegWrite = 1

Notes for State 5:
% 0 for j or jal, 1 for syscall, don’t-care for other instructions
@ 0 for j, jal, and syscall, 1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall, ALUZero (its complement) for beq (bne), bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1

Note for State 7: ALUFunc is determined based on the op and fn fields

[Figure: microprogrammed control unit. The MicroPC addresses the microprogram memory or PLA; its next value is selected (inputs 0 to 3) from Start, the incremented MicroPC (Incr), Dispatch table 1, and Dispatch table 2, the tables being indexed by op (from instruction register). The microinstruction register holds the data read out and supplies the control signals to the data path and the sequence control.]

Microinstruction fields:
• PC control: JumpAddr, PCSrc, PCWrite
• Cache control: InstData, MemRead, MemWrite, IRWrite
• Register control: RegWrite, RegDst, RegInSrc
• ALU inputs: ALUSrcX, ALUSrcY
• ALU function: AddSub, LogicFn, FnType
• Sequence control
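The state sequence each instruction class follows, and hence its cycle count, can be traced with a small next-state table. This is a minimal sketch that assumes the transitions read off the state diagram (State 1 dispatching to 5, 2, or 7, and State 2 dispatching to 3 or 6); the class labels "jump_branch", "lw", "sw", and "alu" and the function name are illustrative, not from the slides.

```python
# Next-state table read off the control state diagram.
# Class labels and function names are hypothetical, for illustration only.
NEXT_STATE = {
    0: lambda cls: 1,
    1: lambda cls: {"jump_branch": 5, "lw": 2, "sw": 2, "alu": 7}[cls],
    2: lambda cls: {"lw": 3, "sw": 6}[cls],
    3: lambda cls: 4,
    7: lambda cls: 8,
}
# States after which control returns to State 0 for the next fetch.
FINAL_STATES = {4, 5, 6, 8}


def state_sequence(cls):
    """Return the list of control states visited by one instruction."""
    states = [0]
    while states[-1] not in FINAL_STATES:
        states.append(NEXT_STATE[states[-1]](cls))
    return states


if __name__ == "__main__":
    for cls in ("jump_branch", "sw", "lw", "alu"):
        seq = state_sequence(cls)
        print(cls, seq, len(seq), "cycles")
```

Under these assumptions a jump or branch takes 3 cycles, sw and ALU-type instructions take 4, and lw takes 5.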

Review: Microprogramming

Microprogram, one microinstruction per line, with the corresponding control state on the right:

  fetch:     PCnext, CacheFetch           # State 0 (start)
             PC + 4imm, μPCdisp1          # State 1
  lui1:      lui(imm)                     # State 7lui
             rt ← z, μPCfetch             # State 8lui
  add1:      x + y                        # State 7add
             rd ← z, μPCfetch             # State 8add
  sub1:      x - y                        # State 7sub
             rd ← z, μPCfetch             # State 8sub
  slt1:      x - y                        # State 7slt
             rd ← z, μPCfetch             # State 8slt
  addi1:     x + imm                      # State 7addi
             rt ← z, μPCfetch             # State 8addi
  slti1:     x - imm                      # State 7slti
             rt ← z, μPCfetch             # State 8slti
  and1:      x ∧ y                        # State 7and
             rd ← z, μPCfetch             # State 8and
  or1:       x ∨ y                        # State 7or
             rd ← z, μPCfetch             # State 8or
  xor1:      x ⊕ y                        # State 7xor
             rd ← z, μPCfetch             # State 8xor
  nor1:      x nor y                      # State 7nor
             rd ← z, μPCfetch             # State 8nor
  andi1:     x ∧ imm                      # State 7andi
             rt ← z, μPCfetch             # State 8andi
  ori1:      x ∨ imm                      # State 7ori
             rt ← z, μPCfetch             # State 8ori
  xori:      x ⊕ imm                      # State 7xori
             rt ← z, μPCfetch             # State 8xori
  lwsw1:     x + imm, μPCdisp2            # State 2
  lw2:       CacheLoad                    # State 3
             rt ← Data, μPCfetch          # State 4
  sw2:       CacheStore, μPCfetch         # State 6
  j1:        PCjump, μPCfetch             # State 5j
  jr1:       PCjreg, μPCfetch             # State 5jr
  branch1:   PCbranch, μPCfetch           # State 5branch
  jal1:      PCjump, $31 ← PC, μPCfetch   # State 5jal
  syscall1:  PCsyscall, μPCfetch          # State 5syscall

Review: Exception Control States

[Figure: the control state machine with States 0 to 8 unchanged, plus two exception states entered on the transitions labeled "Overflow" and "Illegal operation".]

State 9: IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
State 10: IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1

Fig. 14.10 Exception states 9 and 10 added to the control state machine.

MIPS Pipelined Datapath and Control

Single-Cycle vs. Multicycle vs. Pipelined

[Figure: clock timing for Instr 1 to Instr 4. In the single-cycle design every instruction gets the same "Time allotted", whatever its "Time needed". In the multicycle design the four instructions take 3, 5, 3, and 4 cycles, and the difference shows up as "Time saved".]

[Figure: pipelined execution of seven instructions over cycles 1 to 11, drawn as (a) a task-time diagram (instruction vs. cycle) and (b) a space-time diagram (pipeline stage vs. cycle), with a start-up region at the beginning and a drainage region at the end.]

Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback

Pipelining Analogy

• Pipelined laundry: overlapping execution
  – Parallelism improves performance
• Four loads:
  – Speedup = 8/3.5 = 2.3
• Non-stop:
  – Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages

Chapter 4: The Processor (§4.5 An Overview of Pipelining)
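The two laundry speedup figures follow from one formula. The sketch below assumes four stages of half an hour each, which is what the 8-hour sequential time for four loads and the 2n/(0.5n + 1.5) expression imply; the function and parameter names are illustrative.

```python
def laundry_speedup(loads, stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry.

    Assumes `stages` steps of `stage_hours` each (an assumption inferred
    from the slide's numbers, not stated on it).
    """
    sequential = loads * stages * stage_hours        # 2n hours for the defaults
    pipelined = (stages + loads - 1) * stage_hours   # 0.5n + 1.5 hours
    return sequential / pipelined


if __name__ == "__main__":
    print(round(laundry_speedup(4), 2))       # four loads
    print(round(laundry_speedup(10_000), 2))  # approaches the stage count
```

As the number of loads grows, the fill time of the first load is amortized and the speedup approaches the number of stages.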

MIPS Pipeline

Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

[Figure: lw occupying IFetch, Dec, Exec, Mem, and WB in Cycle 1 through Cycle 5]

Pipeline Performance

• Assume time for stages is
  – 100ps for register read or write
  – 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

  Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
  lw         200ps         100ps           200ps    200ps           100ps            800ps
  sw         200ps         100ps           200ps    200ps                            700ps
  R-format   200ps         100ps           200ps                    100ps            600ps
  beq        200ps         100ps           200ps                                     500ps
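The table totals and the two clock periods can be recomputed from the stage times. This is a minimal sketch using the slide's 100ps/200ps assumptions; the dictionary layout and names are illustrative.

```python
# Stage latencies in picoseconds, as assumed on the slide.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses.
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def total_time(instr):
    """Time needed by one instruction if it ran on its own."""
    return sum(STAGE_PS[s] for s in STAGES_USED[instr])


# Single-cycle: the clock must fit the slowest instruction.
single_cycle_tc = max(total_time(i) for i in STAGES_USED)
# Pipelined: the clock must fit the slowest stage.
pipelined_tc = max(STAGE_PS.values())

if __name__ == "__main__":
    for instr in STAGES_USED:
        print(instr, total_time(instr), "ps")
    print("single-cycle Tc:", single_cycle_tc, "ps")
    print("pipelined Tc:", pipelined_tc, "ps")
```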

Pipeline Performance

[Figure: the same instruction sequence executed on the single-cycle datapath (Tc = 800ps) and on the pipelined datapath (Tc = 200ps)]

Pipeline Speedup

• If all stages are balanced
  – i.e., all take the same time
  – Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
  – Latency (time for each instruction) does not decrease
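A short calculation makes both points concrete: throughput improves, latency does not. The sketch reuses the 800ps single-cycle and 200ps pipelined clock periods from the performance comparison and assumes an ideal five-stage pipeline with no stalls; the function names are illustrative.

```python
SINGLE_CYCLE_TC_PS = 800   # clock period of the single-cycle datapath
PIPELINED_TC_PS = 200      # clock period of the pipelined datapath
STAGES = 5


def single_cycle_time(n):
    """Total time for n instructions, one per long clock cycle."""
    return n * SINGLE_CYCLE_TC_PS


def pipelined_time(n):
    """Total time for n instructions, assuming no stalls:
    STAGES cycles for the first, then one more cycle per instruction."""
    return (STAGES + n - 1) * PIPELINED_TC_PS


def speedup(n):
    return single_cycle_time(n) / pipelined_time(n)


# Latency of one instruction through the pipeline: all five stages at 200ps.
pipelined_latency = STAGES * PIPELINED_TC_PS

if __name__ == "__main__":
    print(single_cycle_time(3), pipelined_time(3))
    print(round(speedup(1_000_000), 3))
    print(pipelined_latency)
```

Because the stages are not balanced (100ps vs. 200ps), the speedup for long instruction streams approaches 800/200 = 4 rather than the stage count of 5, and a single instruction takes 1000ps in the pipeline versus at most 800ps on the single-cycle datapath.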

Pipelining and ISA Design

• MIPS ISA designed for pipelining
  – All instructions are 32 bits
    • Easier to fetch and decode in one cycle
    • c.f. x86: 1- to 17-byte instructions
  – Few and regular instruction formats
    • Can decode and read registers in one step
  – Load/store addressing
    • Can calculate address in 3rd stage, access memory in 4th stage
  – Alignment of memory operands
    • Memory access takes only one cycle

Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]

• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!

[Figure: Inst 0 to Inst 4 in program order (vertical axis: instruction order; horizontal axis: time in clock cycles), each passing through IM, Reg, ALU, DM, Reg one cycle behind its predecessor. The initial cycles are labeled "Time to fill the pipeline".]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1.

Hazards

• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
  – A required resource is busy
• Data hazard
  – Need to wait for previous instruction to complete its data read/write
• Control hazard
  – Deciding on control action depends on previous instruction

Structure Hazards

• Conflict for use of a resource
• In MIPS pipeline with a single memory
  – Load/store requires data access
  – Instruction fetch would have to stall for that cycle
    • Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
  – Or separate instruction/data caches
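The single-memory conflict can be located by lining up fetch and memory-access cycles. This is a minimal sketch assuming a five-stage pipeline in which instruction i (0-based) is fetched in cycle i + 1 and, if it is a load or store, accesses data memory in cycle i + 4; stalls are ignored and the function name is illustrative.

```python
def memory_conflicts(program):
    """Find cycles where a data access and an instruction fetch collide.

    `program` is a list of opcode strings in program order. Returns a list
    of (cycle, mem_instr_index, fetch_instr_index) tuples, with 1-based
    cycles, assuming one shared memory port and no stalls.
    """
    conflicts = []
    for i, op in enumerate(program):
        # Instruction i is in MEM during cycle i + 4; instruction i + 3
        # is being fetched in that same cycle.
        if op in ("lw", "sw") and i + 3 < len(program):
            conflicts.append((i + 4, i, i + 3))
    return conflicts


if __name__ == "__main__":
    print(memory_conflicts(["lw", "add", "sub", "and", "or"]))
```

With separate instruction and data memories (or caches) the two accesses use different ports, so the list of conflicts no longer corresponds to any stall.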

A Single Memory Would Be a Structural Hazard

[Figure: lw followed by Inst 1 to Inst 4 (vertical axis: instruction order; horizontal axis: time in clock cycles), each passing through Mem, Reg, ALU, Mem, Reg. In one cycle lw is "Reading data from memory" while a later instruction is "Reading instruction from memory".]

Fix with separate instr and data memories (I$ and D$)

Data Hazards

• An instruction depends on completion of data access by a previous instruction
  – add $s0, $t0, $t1
    sub $t2, $s0, $t3

Register Usage Can Cause Data Hazards

• Dependencies backward in time cause hazards

  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5

[Figure: the five instructions above, each passing through IM, Reg, ALU, DM, Reg, with the later reads of $1 drawn against the cycle in which add writes it]

Read before write data hazard
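Read-after-write dependences such as those on $1 can be found mechanically by scanning each instruction's sources for a recent writer. The sketch below assumes instructions are given as (op, dest, sources) tuples; that representation, the `window` parameter, and the function name are illustrative. The source operands of the first add are not shown on the slide, so they are left empty here.

```python
def raw_dependences(program, window=3):
    """List read-after-write dependences between nearby instructions.

    `program` is a list of (op, dest, sources) tuples. For each source
    register, the most recent writer within `window` earlier instructions
    is reported as (producer_index, consumer_index, register, distance).
    """
    deps = []
    for j, (_, _, srcs) in enumerate(program):
        for src in srcs:
            # Walk backward so the nearest producer is found first.
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i][1] == src:
                    deps.append((i, j, src, j - i))
                    break
    return deps


if __name__ == "__main__":
    program = [
        ("add", "$1", ()),            # source operands elided on the slide
        ("sub", "$4", ("$1", "$5")),
        ("and", "$6", ("$1", "$7")),
        ("or",  "$8", ("$1", "$9")),
        ("xor", "$4", ("$1", "$5")),
    ]
    for dep in raw_dependences(program):
        print(dep)
```

Whether a reported dependence is an actual hazard depends on its distance and on the datapath: in a five-stage pipeline without forwarding, distances 1 and 2 read a stale value, while distance 3 is safe only if the register file can be written and then read within the same cycle.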

Loads Can Cause Data Hazards

• Dependencies backward in time cause hazards

  lw  $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5

[Figure: the five instructions above (vertical axis: instruction order), each passing through IM, Reg, ALU, DM, Reg, with the later reads of $1 drawn against the cycle in which lw obtains it from DM]

Load-use data hazard

How About Register File Access?

[Figure: add $1, followed by Inst 1, Inst 2, and add $2,$1, (vertical axis: instruction order; horizontal axis: time in clock cycles), each passing through IM, Reg, ALU, DM, Reg. Two clock edges are marked: the clock edge that controls register writing, and the clock edge that controls loading of pipeline state registers.]

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

One Way to “Fix” a Data Hazard

  add $1,
  stall
  stall
  sub $4,$1,$5
  and $6,$1,$7

[Figure: the sequence above (vertical axis: instruction order), each instruction passing through IM, Reg, ALU, DM, Reg, with two stall cycles inserted between add and sub]

Can fix a data hazard by waiting (stalling), but this impacts CPI
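The cost of stalling can be estimated by computing when each instruction may start. This is a rough sketch assuming a five-stage pipeline without forwarding, in which a consumer's register read (stage 2) may share a cycle with the producer's write-back (stage 5) thanks to the split-cycle register file; a stall is modeled simply as delaying the start of the instruction. The (dest, sources) representation and names are illustrative.

```python
def start_cycles_no_forwarding(program):
    """Return the 1-based start cycle of each instruction.

    `program` is a list of (dest, sources) tuples. An instruction that
    reads a register must start at least 3 cycles after its producer, so
    that its register read lines up with the producer's write-back.
    """
    starts = []
    for j, (_, srcs) in enumerate(program):
        cycle = starts[-1] + 1 if starts else 1
        for i in range(j):
            dest = program[i][0]
            if dest is not None and dest in srcs:
                cycle = max(cycle, starts[i] + 3)
        starts.append(cycle)
    return starts


def stall_cycles(program):
    """Number of stall cycles inserted across the whole program."""
    return start_cycles_no_forwarding(program)[-1] - len(program)


if __name__ == "__main__":
    program = [
        ("$1", ()),            # add $1, (sources elided on the slide)
        ("$4", ("$1", "$5")),  # sub $4,$1,$5
        ("$6", ("$1", "$7")),  # and $6,$1,$7
    ]
    print(start_cycles_no_forwarding(program), stall_cycles(program))
```

For the three-instruction sequence on this slide the model inserts two stall cycles before sub and none before and, matching the two stalls drawn in the figure; each stall adds one cycle to the total, which is why CPI rises above 1.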

Forwarding (aka Bypassing)

• Use result when it is computed
  – Don’t wait for it to be stored in a register
  – Requires extra connections in the datapath

Another Way to “Fix” a Data Hazard

  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5

[Figure: the five instructions above (vertical axis: instruction order), each passing through IM, Reg, ALU, DM, Reg, with the result of add passed directly to the later instructions that need it]

Fix data hazards by forwarding results as soon as they are available to where they are needed

Forwarding Illustration

  add $1,
  sub $4,$1,$5
  and $6,$7,$1

[Figure: the three instructions above (vertical axis: instruction order), each passing through IM, Reg, ALU, DM, Reg, with two forwarding paths labeled "EX forwarding" and "MEM forwarding"]

Yet Another Complication!

• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded?

  add $1,$1,$2
  add $1,$1,$3
  add $1,$1,$4

[Figure: the three instructions above (vertical axis: instruction order), each passing through IM, Reg, ALU, DM, Reg]
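The usual answer is to forward the most recent result, that is, the one from the instruction currently in the MEM stage, and fall back to the WB-stage result only when the MEM-stage instruction does not write the needed register. The sketch below expresses that priority in the style of the textbook forwarding conditions; the dictionary fields (`reg_write`, `rd`), the return labels, and the function name are illustrative, not signal names from the slides.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand should come from.

    `ex_mem` describes the instruction now in MEM, `mem_wb` the one now in
    WB; each is a dict with 'reg_write' (bool) and 'rd' (register number).
    Register 0 is never forwarded because it always reads as zero in MIPS.
    """
    # Newest value first: the instruction one ahead in the pipeline.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    # Otherwise the older value from the instruction two ahead.
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # Third add in EX; second add is in MEM, first add is in WB.
    second_add = {"reg_write": True, "rd": 1}
    first_add = {"reg_write": True, "rd": 1}
    print(forward_select(1, second_add, first_add))
```

For the three back-to-back adds, both earlier instructions write $1, and the check order ensures the third add receives the second add's result rather than the stale one from the first.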

Load-Use Data Hazard

• Can’t always avoid stalls by forwarding
  – If value not computed when needed
  – Can’t forward backward in time!

Code Scheduling to Avoid Stalls

• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles):

          lw   $t1, 0($t0)
          lw   $t2, 4($t0)
  stall
          add  $t3, $t1, $t2
          sw   $t3, 12($t0)
          lw   $t4, 8($t0)
  stall
          add  $t5, $t1, $t4
          sw   $t5, 16($t0)

Reordered (11 cycles):

          lw   $t1, 0($t0)
          lw   $t2, 4($t0)
          lw   $t4, 8($t0)
          add  $t3, $t1, $t2
          sw   $t3, 12($t0)
          add  $t5, $t1, $t4
          sw   $t5, 16($t0)
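The 13-cycle and 11-cycle counts can be checked with a small model. This is a minimal sketch assuming a five-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the lw immediately before it; a store that follows the load of its own data would also be counted as a stall here, which is a simplification. The tuple representation and function name are illustrative.

```python
def cycles_with_forwarding(program, stages=5):
    """Total cycles for `program` on a pipeline with forwarding.

    `program` is a list of (op, dest, sources) tuples. One stall cycle is
    charged for each load-use pair in adjacent instructions.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    # First instruction takes `stages` cycles; each later one adds a cycle.
    return len(program) + (stages - 1) + stalls


ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

REORDERED = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

if __name__ == "__main__":
    print(cycles_with_forwarding(ORIGINAL), cycles_with_forwarding(REORDERED))
```

Moving the third lw ahead of the first add puts at least one instruction between every load and its first use, which removes both stalls.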