CMSC 411 Computer Systems Architecture Lecture 4 MIPS ISA & Basic Pipelining
COMPUTER ARCHITECTURE VS. INSTRUCTION SET ARCHITECTURE
CMSC 411 - 1
CS252 S05
Instruction Set Architecture: Critical Interface
[Figure: the instruction set as the interface between software and hardware]
• Properties of a good abstraction
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels
Example: MIPS
Programmable storage
– 2^32 x bytes
– 31 x 32-bit GPRs (R0=0): r0, r1, ..., r31
– 32 x 32-bit FP regs (paired DP)
– PC
Data types? Format? Addressing modes?
Arithmetic logical
– Add, AddU, Sub, SubU, And, Or, Xor, SLT, SLTU
– AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI
– SLL, SRL, SRA
Memory Access
– LB, LBU, LH, LHU, LW, SB, SH, SW
Control
– J, JAL, JR, JALR
– BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ
32-bit instructions on word boundary
Instruction Set Architecture (ISA)
“... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” – Amdahl, Blaauw, and Brooks, 1964
What the ISA exposes to SOFTWARE:
– Organization of Programmable Storage
– Data Types & Data Structures: Encodings & Representations
– Instruction Formats
– Instruction (or Operation Code) Set
– Modes of Addressing and Accessing Data Items and Instructions
– Exceptional Conditions
ISA vs. Computer Architecture
• Old definition of computer architecture = instruction set design
– Other aspects of computer design called implementation
– Insinuates implementation is uninteresting or less challenging
• H&P’s view is computer architecture >> ISA
• Architect’s job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
• Since instruction set design not where action is, some conclude computer architecture (using old definition) is not where action is
– H&P disagree on conclusion
– Agree that ISA not where action is (ISA in CA:AQA 4/e appendix)
Computer Architecture Is An Integrated Approach
• What really matters is the functioning of the complete system
– hardware, runtime system, compiler, operating system, and application
– In networking, this is called the “End to End argument”
• Computer architecture is not just about transistors, individual instructions, or particular implementations
– E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions
Computer Architecture Is Design & Analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: Design / Analysis loop. Labels: Creativity, Cost / Performance Analysis, Good Ideas, Mediocre Ideas, Bad Ideas]
MIPS INSTRUCTION SET ARCHITECTURE
A "Typical" RISC ISA
• 32-bit fixed format instruction (4 formats)
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store: base + displacement
– no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
CMSC 411 - 3 (from Patterson)
Example: MIPS

Register-Register (rd ← rs OP rt)
  bits:   31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0
  field:  op    | rs    | rt    | rd    |      | opx

Register-Immediate (rt ← rs OP immed)
  bits:   31-26 | 25-21 | 20-16 | 15-0
  field:  op    | rs    | rt    | immediate

Jump / Call
  bits:   31-26 | 25-0
  field:  op    | target
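The field boundaries in these formats map directly onto shifts and masks. The following is a minimal Python sketch, not part of the lecture: the names `decode` and `sign_extend16` are hypothetical, and it assumes the standard MIPS32 layout in which opx (the function code) occupies bits 5-0.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields of all three formats.

    Which fields are meaningful depends on the format selected by `op`;
    this sketch simply extracts every field.
    """
    return {
        "op": (word >> 26) & 0x3F,       # bits 31-26, present in every format
        "rs": (word >> 21) & 0x1F,       # bits 25-21
        "rt": (word >> 16) & 0x1F,       # bits 20-16
        "rd": (word >> 11) & 0x1F,       # bits 15-11 (register-register only)
        "opx": word & 0x3F,              # bits 5-0 (assumed position of opx)
        "immediate": word & 0xFFFF,      # bits 15-0 (register-immediate only)
        "target": word & 0x3FFFFFF,      # bits 25-0 (jump / call only)
    }


def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit immediate as a signed two's-complement value."""
    return imm - 0x10000 if imm & 0x8000 else imm


# add r1, r2, r3 in the standard MIPS32 encoding (op = 0, opx = 0x20)
ADD_R1_R2_R3 = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 0x20
fields = decode(ADD_R1_R2_R3)
```

Because the op field sits in the same position in all three formats, a decoder can read it first and only then decide how to interpret the remaining 26 bits.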
5 Steps of MIPS Datapath
[Figure: MIPS datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, Next SEQ PC, Adder (+4), instruction Memory, Reg File (RS, RT, RD), Sign Extend (Imm), ALU, Zero?, Data Memory, LMD, MUXes, WB Data]
IR ← mem[PC];
PC ← PC + 4
Reg[IR_rd] ← Reg[IR_rs] op_IRop Reg[IR_rt]
CMSC 411 - 4 (from Patterson)
5 Steps of MIPS Datapath
[Figure: the five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the steps; Next SEQ PC and RD are carried along through the pipeline registers]
Instruction Fetch: IR ← mem[PC]; PC ← PC + 4
Instr. Decode / Reg. Fetch: A ← Reg[IR_rs]; B ← Reg[IR_rt]
Execute: rslt ← A op_IRop B
Memory Access: WB ← rslt
Write Back: Reg[IR_rd] ← WB
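The register-transfer steps for a register-register instruction can be traced in a few lines of Python. This is a sketch under stated assumptions: instructions are stored pre-decoded as a hypothetical `RRInstr` tuple rather than as 32-bit words, and `step_rr` and `ALU_OPS` are illustrative names, not anything defined in the lecture.

```python
import operator
from typing import NamedTuple


class RRInstr(NamedTuple):
    """A pre-decoded register-register instruction (hypothetical representation)."""
    op: str
    rd: int
    rs: int
    rt: int


ALU_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def step_rr(imem: dict, regs: list, pc: int) -> int:
    """Carry one register-register instruction through the five steps; return the new PC."""
    ir = imem[pc]                          # Instruction Fetch: IR <- mem[PC]
    pc = pc + 4                            #                    PC <- PC + 4
    a = regs[ir.rs]                        # Decode / Reg. Fetch: A <- Reg[IR_rs]
    b = regs[ir.rt]                        #                      B <- Reg[IR_rt]
    rslt = ALU_OPS[ir.op](a, b) & MASK32   # Execute: rslt <- A op B
    wb = rslt                              # Memory Access: WB <- rslt (no memory use)
    if ir.rd != 0:                         # Write Back: Reg[IR_rd] <- WB
        regs[ir.rd] = wb                   # r0 stays zero
    return pc


regs = [0] * 32
regs[2], regs[3] = 5, 7
imem = {0: RRInstr("add", rd=1, rs=2, rt=3), 4: RRInstr("sub", rd=4, rs=2, rt=3)}
pc = step_rr(imem, regs, 0)
pc = step_rr(imem, regs, pc)
```

In the pipelined datapath each of these assignments happens in a different clock cycle, with the intermediate values (A, B, rslt, WB) held in the pipeline registers.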
Instruction Set Processor Controller
[Figure: controller state diagram; the states and their register transfers are listed below]
Ifetch: IR ← mem[PC]; PC ← PC + 4
opFetch-DCD: A ← Reg[IR_rs]; B ← Reg[IR_rt]
Then, by instruction class:
– br: if bop(A,b) PC ← PC + IR_im
– jmp: PC ← IR_jaddr
– RR: r ← A op_IRop B; WB ← r; Reg[IR_rd] ← WB
– RI: r ← A op_IRop IR_im; WB ← r; Reg[IR_rt] ← WB
– LD: r ← A + IR_im; WB ← Mem[r]; Reg[IR_rt] ← WB
Visualizing Pipelining
[Figure: pipeline diagram. Horizontal axis: time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: instruction order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the instruction before it]
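The staircase shape of the pipeline diagram follows from one rule: instruction i enters stage s in cycle i + s. A small Python sketch of that rule, assuming an ideal pipeline with no stalls (the name `pipeline_table` is illustrative):

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_table(n_instr: int) -> list:
    """Return one row per instruction; row[c] names the stage occupied in cycle c + 1."""
    n_cycles = n_instr + len(STAGES) - 1
    table = []
    for i in range(n_instr):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i reaches stage s in cycle i + s + 1
        table.append(row)
    return table


table = pipeline_table(3)
for row in table:
    print(" ".join(f"{cell:<6}" for cell in row))
```

Three instructions finish in 7 cycles instead of the 15 that five-cycle, one-at-a-time execution would need; once the pipeline is full, one instruction completes per cycle.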
Pipelining Is Not Quite That Easy!
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
One Memory Port/Structural Hazards
[Figure: pipeline diagram over Cycle 1 to Cycle 7 for Load, Instr 1, Instr 2, Instr 3, Instr 4. With one memory port, the Load's DMem access in Cycle 4 coincides with the Ifetch of Instr 3]
One Memory Port/Structural Hazards
[Figure: pipeline diagram over Cycle 1 to Cycle 7 for Load, Instr 1, Instr 2, Stall, Instr 3. A row of Bubbles occupies the stalled slot, and the Ifetch of Instr 3 moves from Cycle 4 to Cycle 5]
How do you “bubble” the pipe?
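One way to picture the stall is as a scheduling rule: an instruction may not be fetched in a cycle in which an earlier instruction is using the single memory port for data. The Python sketch below applies that rule. It assumes the data access happens three cycles after the fetch and that no other stalls occur; `fetch_cycles` is a hypothetical helper, not the hardware mechanism itself.

```python
def fetch_cycles(uses_data_memory: list) -> list:
    """Return the fetch cycle of each instruction on a machine with one memory port.

    uses_data_memory[i] is True when instruction i is a load or store.
    """
    busy = set()   # cycles in which the port is taken by a data access
    cycles = []
    cycle = 0
    for uses_mem in uses_data_memory:
        cycle += 1
        while cycle in busy:       # port busy: insert a bubble and retry
            cycle += 1
        cycles.append(cycle)
        if uses_mem:
            busy.add(cycle + 3)    # DMem is the 4th stage: fetch cycle + 3
    return cycles


# Load, Instr 1, Instr 2, Instr 3 as on the slide
schedule = fetch_cycles([True, False, False, False])
```

For the sequence on the slide this yields fetches in cycles 1, 2, 3 and 5: the load occupies the port in cycle 4, so Instr 3 waits one cycle.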
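Speed Up Equation for Pipelining

\[ \mathrm{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction} \]

\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]

For simple RISC pipeline, ideal CPI = 1:

\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]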
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe) = Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe / 1.05)) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
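The arithmetic in this example can be checked with a short Python sketch of the speedup formula. The function name `pipeline_speedup` and the pipeline depth of 5 are assumptions made for illustration; the depth cancels in the final ratio.

```python
def pipeline_speedup(depth: float, stall_cpi: float,
                     clock_ratio: float = 1.0, ideal_cpi: float = 1.0) -> float:
    """Speedup over the unpipelined machine.

    clock_ratio is cycle_time_unpipelined / cycle_time_pipelined.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5  # assumed pipeline depth; any positive value gives the same ratio

# Machine A: dual-ported memory, no structural stalls
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: single port, 40% loads each costing 1 stall cycle, 1.05x faster clock
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)

ratio = speedup_a / speedup_b
print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {ratio:.2f}")
```

The 5% clock advantage of Machine B does not make up for a stall on 40% of its instructions.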
Data Hazard on R1
[Figure: pipeline diagram, time (clock cycles) across stages IF, ID/RF, EX, MEM, WB, for the instruction sequence below. add writes r1; every later instruction reads r1]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “true / flow dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
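The three cases differ only in which instruction writes and which reads the shared register, so they can be told apart by comparing destination and source sets. A minimal Python sketch, assuming each instruction is given as a hypothetical (destination, sources) pair and that r0 is not special-cased; it reports dependences, which become hazards only if the pipeline can reorder the accesses:

```python
def classify(instr_i: tuple, instr_j: tuple) -> set:
    """Return the dependences from earlier instruction I to later instruction J.

    Each instruction is (dest, sources); dest is None if nothing is written.
    """
    i_dest, i_srcs = instr_i
    j_dest, j_srcs = instr_j
    kinds = set()
    if i_dest is not None and i_dest in j_srcs:
        kinds.add("RAW")   # J reads what I writes (true / flow dependence)
    if j_dest is not None and j_dest in i_srcs:
        kinds.add("WAR")   # J writes what I reads (anti-dependence)
    if i_dest is not None and i_dest == j_dest:
        kinds.add("WAW")   # both write the same register (output dependence)
    return kinds


# The I/J pairs from the three slides
raw = classify(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"]))  # add r1,r2,r3 ; sub r4,r1,r3
war = classify(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"]))  # sub r4,r1,r3 ; add r1,r2,r3
waw = classify(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"]))  # sub r1,r4,r3 ; add r1,r2,r3
```

Only the RAW case reflects a real flow of data; WAR and WAW arise purely from reusing the register name, which is why renaming can remove them.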
Forwarding to Avoid Data Hazard
[Figure: pipeline diagram (time in clock cycles vs. instruction order) for the instruction sequence below, with the result of add forwarded to the later instructions that read r1]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
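HW Change for Forwarding
[Figure: datapath with forwarding paths. Labeled components: NextPC, Registers, ID/EX, muxes at the ALU inputs, Immediate, ALU, EX/MEM, Data Memory, MEM/WR, and a mux on the write-back path]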
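The muxes in front of the ALU need a selection rule. The Python sketch below uses the usual textbook conditions (forward from the youngest in-flight instruction that writes the register being read); the slide does not spell these out, so the function `forward_select` and its parameter names are assumptions for illustration.

```python
def forward_select(src_reg: int,
                   ex_mem_rd: int, ex_mem_writes: bool,
                   mem_wr_rd: int, mem_wr_writes: bool) -> str:
    """Choose where an ALU input should come from for source register src_reg."""
    # The most recent result wins: check EX/MEM before MEM/WR.
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wr_writes and mem_wr_rd != 0 and mem_wr_rd == src_reg:
        return "MEM/WR"
    return "REGFILE"   # no in-flight producer: use the value read in ID


# sub r4,r1,r3 in EX while add r1,r2,r3 sits in EX/MEM
sel_sub = forward_select(1, ex_mem_rd=1, ex_mem_writes=True,
                         mem_wr_rd=0, mem_wr_writes=False)

# and r6,r1,r7 in EX: sub (writes r4) is in EX/MEM, add (writes r1) is in MEM/WR
sel_and = forward_select(1, ex_mem_rd=4, ex_mem_writes=True,
                         mem_wr_rd=1, mem_wr_writes=True)
```

Register 0 is excluded because it always reads as zero, so a write to it must never be forwarded.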
Forwarding to Avoid LW-SW Data Hazard
[Figure: pipeline diagram (time in clock cycles vs. instruction order) for the instruction sequence below]
add r1,r2,r3
lw r4, 0(r1)
sw r4,12(r1)
or r8,r6,r9
xor r10,r9,r11
CMSC 411 - 5 (from Patterson)
Data Hazard Even with Forwarding
[Figure: pipeline diagram (time in clock cycles vs. instruction order) for the instruction sequence below. The loaded value of r1 is not available until after lw's DMem stage, which is too late for the ALU stage of sub]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Data Hazard Even with Forwarding
[Figure: pipeline diagram (time in clock cycles vs. instruction order) for the instruction sequence below, with a Bubble inserted after lw so that sub, and, and or each proceed one cycle later]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
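The bubble is inserted when a load is immediately followed by an instruction that reads the loaded register. A sketch of that check in Python, using the usual textbook condition rather than anything stated on the slide; `must_stall` and its parameters are hypothetical names.

```python
def must_stall(prev_is_load: bool, prev_dest: int, next_sources: list) -> bool:
    """True if the instruction in decode must wait one cycle for a load result.

    prev_* describes the instruction one stage ahead (in ID/EX);
    next_sources are the registers read by the instruction being decoded.
    """
    return prev_is_load and prev_dest != 0 and prev_dest in next_sources


# lw r1,0(r2) followed by sub r4,r1,r6: load-use hazard, one bubble
stall_sub = must_stall(True, 1, [1, 6])

# sub r4,r1,r6 followed by and r6,r1,r7: producer is not a load, forwarding suffices
stall_and = must_stall(False, 4, [1, 7])
```

Only the instruction directly after the load is affected; by the time and reaches its ALU stage, the loaded value can be forwarded.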
Software Scheduling Instead
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d, e, and f in memory.

Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

Compiler optimizes for performance. Hardware checks for safety.
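The benefit of the reordering can be counted directly: each load immediately followed by a user of its result costs one stall cycle. The Python sketch below does that count for both listings. It assumes full forwarding, one stall per load-use pair, and the operand order used on the slide (`SW a,Ra` stores register Ra); as a simplification, a store is treated like any other consumer of its data register. `parse` and `count_load_use_stalls` are illustrative names.

```python
def parse(line: str) -> tuple:
    """Return (opcode, destination register or None, source registers)."""
    op, operands = line.split()
    fields = operands.split(",")
    if op == "LW":
        return op, fields[0], []        # LW Rx,var: writes Rx, reads memory only
    if op == "SW":
        return op, None, [fields[1]]    # SW var,Rx: reads Rx, writes memory only
    return op, fields[0], fields[1:]    # ADD/SUB Rd,Rs,Rt


def count_load_use_stalls(program: list) -> int:
    """Count loads whose result is needed by the very next instruction."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, sources = parse(line)
        if prev_op == "LW" and prev_dest in sources:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


SLOW = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
FAST = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

slow_stalls = count_load_use_stalls(SLOW)
fast_stalls = count_load_use_stalls(FAST)
```

Under these assumptions the slow version stalls twice (ADD after LW Rc, SUB after LW Rf), while the fast version has an independent instruction after every load and does not stall.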