Pipelining: the laundry analogy

Author: Amelia Oliver
[Figure: two timing diagrams of the laundry analogy. Each has a time axis running from 6 PM to 2 AM and lists tasks A, B, C, D in task order.]

1998 Morgan Kaufmann Publishers


Pipelining

• Improve performance by increasing instruction throughput

[Figure: two timing diagrams, program execution order (in instructions) against time in ns, for the sequence

  lw $1, 100($0)
  lw $2, 200($0)
  lw $3, 300($0)

Each instruction passes through instruction fetch, Reg, ALU, data access, Reg. In the first diagram successive instructions start 8 ns apart; in the second (pipelined) diagram they start 2 ns apart.]

Ideal speedup is number of stages in the pipeline. Do we achieve this?
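The question can be checked with a little arithmetic. The sketch below assumes the numbers shown in the figure (8 ns per instruction without pipelining, a 2 ns clock with five stages); the function names are illustrative and not part of the slides.

```python
def unpipelined_time(n, instr_ns=8):
    """Total time when each instruction runs to completion before the next starts."""
    return n * instr_ns


def pipelined_time(n, stages=5, cycle_ns=2):
    """The first instruction needs `stages` cycles; each later one finishes one cycle after."""
    return (stages + n - 1) * cycle_ns


for n in (3, 1000):
    t_seq = unpipelined_time(n)
    t_pipe = pipelined_time(n)
    print(n, "instructions:", t_seq, "ns vs", t_pipe, "ns, speedup", round(t_seq / t_pipe, 2))
```

Under these assumptions the three loads take 24 ns unpipelined and 14 ns pipelined. For long instruction streams the speedup tends toward 8/2 = 4 rather than 5, because five 2 ns cycles (10 ns) is longer than the 8 ns an unpipelined instruction takes.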


Pipelining



• What makes it easy
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores

• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction

• We’ll build a simple pipeline and look at these issues


Basic Idea

IF: Instruction fetch

ID: Instruction decode/ register file read

EX: Execute/ address calculation

MEM: Memory access

WB: Write back

[Figure: the single-cycle datapath (PC, instruction memory, registers, sign extend, shift left 2, adders, ALU, data memory, and multiplexers) divided into the five stages listed above.]

What do we need to add to actually split the datapath into stages?

Pipelined Datapath

[Figure: the pipelined datapath, with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB placed between the stages.]

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

Corrected Datapath

[Figure: the corrected pipelined datapath, with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]

Graphically Representing Pipelines

[Figure: multiple-clock-cycle diagram, time in clock cycles (CC 1 to CC 6) against program execution order (in instructions). lw $10, 20($1) passes through IM, Reg, ALU, DM, Reg in CC 1 to CC 5; sub $11, $2, $3 passes through the same stages in CC 2 to CC 6.]

• Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?
  – use this representation to help understand datapaths
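The same diagram can be generated mechanically, which makes the two questions above easy to answer. This is a minimal sketch that assumes an ideal pipeline with no stalls, where instruction i enters IM in cycle i + 1; the helper names `diagram` and `in_stage` are made up for illustration.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def diagram(instructions):
    """Return (total cycles, rows) for an ideal pipeline with no stalls."""
    total = len(instructions) + len(STAGES) - 1
    rows = []
    for i, text in enumerate(instructions):
        cells = ["."] * total
        for s, name in enumerate(STAGES):
            cells[i + s] = name  # instruction i is in stage s during cycle i + s + 1
        rows.append((text, cells))
    return total, rows


def in_stage(rows, stage_index, cycle):
    """Instructions occupying stage `stage_index` (0-based) in 1-based `cycle`."""
    return [text for i, (text, _) in enumerate(rows) if cycle - 1 - i == stage_index]


total, rows = diagram(["lw $10, 20($1)", "sub $11, $2, $3"])
print("cycles:", total)
for text, cells in rows:
    print(f"{text:16}", " ".join(f"{c:4}" for c in cells))
print("ALU in cycle 4:", in_stage(rows, STAGES.index("ALU"), 4))
```

For the two-instruction example this reports 6 cycles, and shows that the ALU is working on the sub during cycle 4.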


Pipeline Control

[Figure: the pipelined datapath with its control signals identified: PCSrc, RegWrite, Branch, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp, and RegDst, plus the ALU control block fed by the 6-bit function field. The instruction fields [15–0], [20–16], and [15–11] are carried through ID/EX.]

Pipeline control

• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back

• How would control be handled in an automobile plant?
  – a fancy control center telling everyone what to do?
  – should we use a finite state machine?


Pipeline Control

• Pass control signals along just like the data

                Execution/address calculation     Memory access               Write-back
                stage control lines               stage control lines         stage control lines
Instruction     RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format        1       1       0       0         0       0        0          1         0
lw              0       0       0       1         0       1        0          1         1
sw              X       0       0       1         0       0        1          0         X
beq             X       0       1       0         1       0        0          0         X

[Figure: the control unit produces the EX, M, and WB control fields from the instruction in IF/ID. ID/EX holds all three fields, EX/MEM holds M and WB, and MEM/WB holds only WB.]
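One way to picture "passing control along like data" is to store the table as three fields per instruction class and drop a field once its stage has used it. The sketch below is only an illustration of that idea: the dictionary layout and the function name are assumptions, and `None` stands for the don't-care entries (X).

```python
# Control values per instruction class, grouped by the stage that consumes them.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
}


def control_in_pipeline_registers(instr_class):
    """Control fields still being carried in each pipeline register."""
    c = CONTROL[instr_class]
    return {
        "ID/EX": {"EX": c["EX"], "M": c["M"], "WB": c["WB"]},
        "EX/MEM": {"M": c["M"], "WB": c["WB"]},  # EX field already consumed
        "MEM/WB": {"WB": c["WB"]},               # M field already consumed
    }


for register, fields in control_in_pipeline_registers("lw").items():
    print(register, fields)
```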


Datapath with Control

[Figure: the pipelined datapath with the control unit attached. The WB, M, and EX control fields are stored in ID/EX, then M and WB in EX/MEM, then WB in MEM/WB. Labeled signals: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg.]

Dependencies

• Problem with starting next instruction before first is finished
  – dependencies that “go backward in time” are data hazards

[Figure: multiple-clock-cycle diagram, CC 1 to CC 9, in which each instruction passes through IM, Reg, ALU, DM, Reg. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9.]

Program execution order (in instructions):

  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)
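The hazards in this sequence can be found by checking each register write against the reads of the instructions that follow it. The sketch below assumes no forwarding and a register file that writes in the first half of a cycle and reads in the second half (the "10/-20" entry in CC 5), so only the next two instructions after a writer can read a stale value. The tuple layout and function name are illustrative.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]


def data_hazards(program, window=2):
    """Return (writer index, reader index, register) for each read that comes too early."""
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == "$0":
            continue  # nothing written, or a write to the constant-zero register
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j, dest))
    return hazards


for writer, reader, reg in data_hazards(PROGRAM):
    print(f"{PROGRAM[reader][0]!r} reads {reg} before {PROGRAM[writer][0]!r} has written it")
```

Under these assumptions the and and the or are flagged, while the add (which reads in CC 5) and the sw see the new value.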


Software Solution

• Have compiler guarantee no hazards
• Where do we insert the “nops”?

  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)

• Problem: this really slows us down!
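A compiler pass for this could look roughly like the sketch below. It assumes no forwarding and that a reader must be at least three instructions after the writer of a register it uses (write in the first half of a cycle, read in the second half); the tuple layout and the name `insert_nops` are made up for the example.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]
NOP = ("nop", None, [])


def insert_nops(program, gap=2):
    """Pad with nops so that `gap` instructions separate a writer from any reader."""
    out = []
    for instr in program:
        _, _, srcs = instr
        needed = 0
        for back in range(1, gap + 1):  # look at the last `gap` emitted instructions
            if back <= len(out):
                prev_dest = out[-back][1]
                if prev_dest is not None and prev_dest in srcs:
                    needed = max(needed, gap - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


for text, _, _ in insert_nops(PROGRAM):
    print(text)
```

For this sequence the sketch places two nops between the sub and the and, which lengthens the program from five instructions to seven.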


Forwarding

• Use temporary results, don’t wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding

[Figure: multiple-clock-cycle diagram for the sequence below, with these values in each clock cycle:]

                        CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:        X     X     X     -20   X       X     X     X     X
Value of MEM/WB:        X     X     X     X     -20     X     X     X     X

Program execution order (in instructions):

  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)

Figure annotation: what if this $2 was $13?


Forwarding

[Figure: the pipelined datapath with a forwarding unit. The register numbers IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are carried into ID/EX as Rs, Rt, and Rd. The forwarding unit also receives EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the multiplexers in front of the ALU inputs.]
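The slide shows the forwarding unit's inputs but not its logic. The sketch below uses the usual textbook conditions as an assumption: forward from EX/MEM when that instruction writes the register being read, otherwise from MEM/WB, and never forward register 0. The function name and the dictionary representation of the pipeline registers are illustrative.

```python
def forward_source(reg, ex_mem, mem_wb):
    """Choose where one ALU operand should come from.

    `reg` is the Rs or Rt number of the instruction now in EX; `ex_mem` and
    `mem_wb` are dicts with the keys "RegWrite" and "RegisterRd".
    """
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == reg:
        return "EX/MEM"   # most recent result wins
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == reg:
        return "MEM/WB"
    return "REGFILE"      # no hazard: use the value read in ID


# and $12, $2, $5 is in EX while sub $2, $1, $3 is in MEM
ex_mem = {"RegWrite": 1, "RegisterRd": 2}
mem_wb = {"RegWrite": 0, "RegisterRd": 0}
print(forward_source(2, ex_mem, mem_wb), forward_source(5, ex_mem, mem_wb))

# or $13, $6, $2 is in EX: the and is in MEM, the sub is in WB
ex_mem = {"RegWrite": 1, "RegisterRd": 12}
mem_wb = {"RegWrite": 1, "RegisterRd": 2}
print(forward_source(6, ex_mem, mem_wb), forward_source(2, ex_mem, mem_wb))
```

Checking EX/MEM first matters when both pipeline registers hold a pending write to the same register: the younger result is the one the program order requires.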


Can't always forward

• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.

[Figure: multiple-clock-cycle diagram, CC 1 to CC 9, for the sequence below. Each instruction passes through IM, Reg, ALU, DM, Reg.]

Program execution order (in instructions):

  lw $2, 20($1)
  and $4, $2, $5
  or $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7

• Thus, we need a hazard detection unit to “stall” the instruction that follows the load
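The check such a hazard detection unit performs can be sketched as a single condition. The version below follows the common textbook formulation and is an assumption, not something the slide spells out: stall when the instruction in EX is a load (MemRead set) and its destination, Rt, matches either source register of the instruction in ID. The function name is illustrative.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs a value the load in EX has not fetched yet."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) is in EX (Rt = 2); and $4, $2, $5 is in ID (Rs = 2, Rt = 5)
print(must_stall(1, 2, 2, 5))

# and $4, $2, $5 is in EX (not a load); or $8, $2, $6 is in ID
print(must_stall(0, 5, 2, 6))
```

Only the instruction directly after the load needs the stall; by the time the or reaches EX, the loaded value is available for forwarding from MEM/WB.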


Stalling

• We can stall the pipeline by keeping an instruction in the same stage

[Figure: multiple-clock-cycle diagram, CC 1 to CC 10, for the sequence below. The and repeats its Reg stage and the or repeats its IM stage while a bubble is inserted behind the lw.]

Program execution order (in instructions):

  lw $2, 20($1)
  and $4, $2, $5
  or $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
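The effect of the bubble on the cycle count can be reproduced with a small timing model. This sketch assumes that forwarding handles every dependence except a load followed immediately by a user of its result, which costs one bubble; it tracks the cycle in which each instruction occupies the ALU stage. The tuple layout and function name are illustrative.

```python
# (opcode, destination register, source registers)
PROGRAM = [
    ("lw",  "$2", ["$1"]),
    ("and", "$4", ["$2", "$5"]),
    ("or",  "$8", ["$2", "$6"]),
    ("add", "$9", ["$4", "$2"]),
    ("slt", "$1", ["$6", "$7"]),
]


def alu_cycles(program):
    """Clock cycle in which each instruction is in the ALU stage."""
    cycles = []
    for i, (_, _, srcs) in enumerate(program):
        if i == 0:
            cycles.append(3)  # IM in CC 1, Reg in CC 2, ALU in CC 3
            continue
        prev_op, prev_dest, _ = program[i - 1]
        bubble = 1 if prev_op == "lw" and prev_dest in srcs else 0
        cycles.append(cycles[-1] + 1 + bubble)
    return cycles


cycles = alu_cycles(PROGRAM)
total = cycles[-1] + 2  # DM and the final Reg stage follow the ALU stage
print(cycles, "total cycles:", total)
```

With the single load-use bubble the five instructions finish in 10 cycles, one more than the 9 cycles of the diagram without the stall.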


Branch Hazards

• When we decide to branch, other instructions are in the pipeline!

[Figure: multiple-clock-cycle diagram, CC 1 to CC 9, for the instructions below (shown with their addresses). Each instruction passes through IM, Reg, ALU, DM, Reg.]

Program execution order (in instructions):

  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or $13, $6, $2
  52 add $14, $2, $2
  72 lw $4, 50($7)

• We are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong
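The addresses in the example follow from how the datapath forms the branch target: the sign-extended offset is shifted left by 2 and added to PC + 4. The sketch below reproduces that arithmetic, and also lists the sequentially fetched instructions that would have to be flushed, assuming the branch is resolved in the fourth (MEM) stage, where the control table places the Branch signal. The function names are illustrative.

```python
def branch_target(pc, offset_words):
    """Target address of a taken branch: (PC + 4) + (offset << 2)."""
    return pc + 4 + (offset_words << 2)


def wrongly_fetched(pc, resolve_stage=4):
    """Addresses fetched after the branch and before it is resolved (predict not taken)."""
    return [pc + 4 * k for k in range(1, resolve_stage)]


print(branch_target(40, 7))   # address of the lw in the example
print(wrongly_fetched(40))    # the and, or, and add that follow the beq
```

So if the beq at address 40 is taken, the next useful instruction is the lw at 72, and the three instructions at 44, 48, and 52 are the ones the flushing hardware must cancel.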


Improving Performance

• Try and avoid stalls! E.g., reorder these instructions:

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)

• Add a “branch delay slot”
  – the next instruction after a branch is always executed
  – rely on compiler to “fill” the slot with something useful
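For the reordering exercise, one possible answer is to swap the two stores so that the store of $t2 no longer sits directly behind the load of $t2. The sketch below counts load-use stalls before and after, assuming a stall is needed only when an instruction uses the result of the load immediately before it. The tuple layout and function name are made up for the example.

```python
# (opcode, destination register or None, source registers, text)
ORIGINAL = [
    ("lw", "$t0", ["$t1"],        "lw $t0, 0($t1)"),
    ("lw", "$t2", ["$t1"],        "lw $t2, 4($t1)"),
    ("sw", None,  ["$t2", "$t1"], "sw $t2, 0($t1)"),
    ("sw", None,  ["$t0", "$t1"], "sw $t0, 4($t1)"),
]
# The two stores write different addresses, so swapping them keeps the result the same.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]


def load_use_stalls(program):
    """Count instructions that use the result of the load directly before them."""
    return sum(
        1
        for prev, cur in zip(program, program[1:])
        if prev[0] == "lw" and prev[1] in cur[2]
    )


print("original: ", load_use_stalls(ORIGINAL), "stall(s)")
print("reordered:", load_use_stalls(REORDERED), "stall(s)")
```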


Dynamic Scheduling



• The hardware performs the “scheduling”
  – hardware tries to find instructions to execute
  – out of order execution is possible
  – dynamic branch prediction

• All modern processors are very complicated
  – DEC Alpha 21264: 9 stage pipeline
  – PowerPC and Pentium: branch history table for branch prediction
  – Compiler technology is important
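To give a feel for what a branch history table does, here is a minimal sketch of one common scheme: a small table of 2-bit saturating counters indexed by the low bits of the branch address. This is an illustrative assumption, not a description of how the PowerPC or Pentium implement their predictors; the class name, table size, and initial counter value are all made up.

```python
class BranchHistoryTable:
    """2-bit saturating counters: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset within a word

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(outcomes, pc=40):
    """Run one branch through the table and count wrong predictions."""
    table = BranchHistoryTable()
    wrong = 0
    for taken in outcomes:
        if table.predict(pc) != taken:
            wrong += 1
        table.update(pc, taken)
    return wrong


# A loop branch taken nine times and then not taken, executed twice.
loop = ([True] * 9 + [False]) * 2
print(mispredictions(loop), "mispredictions out of", len(loop))
```

The two-bit counter means a single loop exit does not flip the prediction, so the second run of the loop mispredicts only at its exit.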
