Chapter 5: A Closer Look at Instruction Set Architectures

Author: Caitlin Jordan
Here we look at a few design decisions for computer instruction sets. Some basic issues for consideration today:
- Basic instruction set design
- Fixed-length vs. variable-length instructions
- Word addressing in a byte-addressable architecture
- Big-endian vs. little-endian designs
- Stack machines, accumulator machines, and multi-register machines
- Modern design principles

NOTE:

We shall not discuss expanding opcodes, covered in Section 5.2.5 of the textbook. The discussion is overly complex.

Design Considerations: Instruction Size

The MARIE has a fixed instruction size: 16 bits, divided as follows:

    Bits     Content
    15 - 12  Opcode
    11 - 0   Operand address (if used)

Instructions of a fixed size are easier to decode. Moreover, they facilitate prefetching of instructions, in which one instruction is fetched while the previous one is executing. Instructions with variable size make more efficient use of memory. When memory was costly, this was an important design consideration. For example, the VAX-11/780 had a number of addition instructions:
- Add register to register
- Add memory to register
- Add memory to memory
Suppose an 8-bit opcode, 16 registers (so that 4 bits identify each register), and 32-bit addressing.

Instruction lengths:

    Register to register   8 + 4 + 4   = 16 bits (two bytes)
    Memory to register     8 + 32 + 4  = 44 bits (six bytes)
    Memory to memory       8 + 32 + 32 = 72 bits (nine bytes)

With fixed-length instructions, each instruction would have to be at least 9 bytes long.
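The length arithmetic above can be checked with a short sketch. The field widths follow the assumptions stated on the slide (8-bit opcode, 4-bit register numbers, 32-bit addresses); the function name is ours, not VAX terminology.

```python
# Assumed field widths from the example: 8-bit opcode, 16 registers
# (4 bits each), 32-bit memory addresses.
OPCODE_BITS = 8
REG_BITS = 4
ADDR_BITS = 32

def length_in_bytes(*field_bits):
    """Sum the field widths in bits and round up to whole bytes."""
    bits = sum(field_bits)
    return (bits + 7) // 8

reg_to_reg = length_in_bytes(OPCODE_BITS, REG_BITS, REG_BITS)    # 16 bits -> 2 bytes
mem_to_reg = length_in_bytes(OPCODE_BITS, ADDR_BITS, REG_BITS)   # 44 bits -> 6 bytes
mem_to_mem = length_in_bytes(OPCODE_BITS, ADDR_BITS, ADDR_BITS)  # 72 bits -> 9 bytes
```

This reproduces the two-, six-, and nine-byte figures, and shows why a fixed-length format would have to pad every instruction to the nine-byte worst case.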

RISC vs. CISC Considerations

The RISC (Reduced Instruction Set Computer) movement advocated a simpler design with fewer options for the instructions. Simpler instructions could execute faster. Machines that were not RISC were called CISC (Complex Instruction Set Computers). The VAX series was definitely CISC, a fact that led to its demise. One of the main motivations for the RISC movement is the fact that computer memory is no longer very expensive. In fact, it is a "commodity"; that is, a commercial item that is readily and cheaply available. If we have fewer and simpler instructions, we can speed up the computer significantly. True, each instruction might do less, but they are all faster.

The Load-Store Design

One of the slowest operations is the access of memory, either to read values from it or to write values to it. A load-store design restricts memory access to two instructions:
1) Load a register from memory
2) Store a register into memory
A moment's reflection will show that this works only if we have more than one register, possibly 8, 16, or 32 registers. More on this when discussing the options for operands.

NOTE: It is easier to write a good compiler for an architecture with a lot of registers.

Word Addressing vs. Byte Addressing

Each architecture devotes a number of bits to addressing memory. The MARIE has a 12-bit memory address.

Question: What is the size of an addressable unit? In byte-addressable machines, each byte is individually addressable. In word-addressable machines, bytes are not individually addressable.

Advantages of word-addressable designs: The CPU can address a larger memory; in the MARIE it is 2^12 words (not 2^12 bytes). Word addressing is more natural for a number of problems, such as numerical simulations that use a large amount of real-number arithmetic. The CDC-6600 used addressable 60-bit words to store real numbers.

Advantages of byte-addressable designs: Direct access to the bytes for easy manipulation. This is good for any string-oriented processing, such as editing and message passing.

Question: How are multi-byte entities (words, longwords, etc.) handled?

Word Addressing in a Byte Addressable Machine

Each 8-bit byte has a distinct address. A 16-bit word at address Z contains bytes at addresses Z and Z + 1. A 32-bit word at address Z contains bytes at addresses Z, Z + 1, Z + 2, and Z + 3.

Note that computer architecture refers to addresses, rather than variables. In a high-level programming language we use the term "variable" to indicate the contents of a specific memory address. Consider the statement

    Y = X

Its execution involves three steps:
1) Go to the memory address associated with variable X.
2) Get the contents.
3) Copy the contents into the address associated with variable Y.
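The three steps above can be sketched with a toy memory model; the addresses 0x10 and 0x20 for X and Y are made up for illustration.

```python
# Hypothetical byte-addressable memory: a dict from address to 8-bit value.
memory = {0x10: 0x2A, 0x20: 0x00}

ADDR_X = 0x10  # assumed address bound to variable X
ADDR_Y = 0x20  # assumed address bound to variable Y

# Y = X: go to X's address, get the contents, copy them to Y's address.
memory[ADDR_Y] = memory[ADDR_X]
```

The compiler's job is exactly this translation: from variable names in the source program to memory addresses in the generated code.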

Question: What is the Address of the Last Longword (Word)?

We shall ask the question in general for an N-bit address space and specifically for a 16-bit address space. For byte addresses, this is easy to answer.

Byte addresses: N-bit byte addresses run from 0 to 2^N - 1. For N = 16, the byte addresses run from 0 to 65535.

Word addresses: If a word is at address Z, its last byte is at address Z + 1. The last byte of the last word has address Z + 1 = 2^N - 1, so the last word has address Z = 2^N - 2. For N-bit byte addresses, word addresses run from 0 to 2^N - 2. For N = 16, the word addresses run from 0 to 65534.

Longword addresses: If a longword is at address Z, its last byte is at address Z + 3. The last byte of the last longword has address Z + 3 = 2^N - 1, so the last longword has address Z = 2^N - 4. For N-bit byte addresses, longword addresses run from 0 to 2^N - 4. For N = 16, the longword addresses run from 0 to 65532.
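The arithmetic above can be captured in a few lines; the function name is ours.

```python
def last_addresses(n_bits):
    """For an n-bit byte-addressable space, return the highest valid
    byte address, word (2-byte) start address, and longword (4-byte)
    start address."""
    top = 2**n_bits - 1       # last byte address: 2^N - 1
    return top, top - 1, top - 3

# For a 16-bit address space:
byte_top, word_top, long_top = last_addresses(16)
# byte_top = 65535, word_top = 65534, long_top = 65532
```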

Big–Endian vs. Little–Endian Addressing

Storing the four bytes 01 02 03 04 (the 32-bit value 0x01020304) at address Z:

    Address        Z    Z+1  Z+2  Z+3
    Big-endian     01   02   03   04
    Little-endian  04   03   02   01

Note that, within the byte, the hexadecimal digits are never reversed.

Words and Longwords in Byte Addressable Memory: Example #1

You are given the following memory map for a byte addressable memory. All values are given in hexadecimal.

    Address   FC  FD  FE  FF  100 101 102 103 104
    Contents  99  88  66  66  10  20  30  40  55

Each addressable unit contains one byte, expressed as two hexadecimal digits. A 32-bit longword occupies four bytes.

Question: What is the 32-bit longword associated with address 0x100?
Answer: It occupies addresses 0x100, 0x101, 0x102, and 0x103.
Big-endian: the value, in hexadecimal, is 0x1020 3040.
Little-endian: the value, in hexadecimal, is 0x4030 2010.

Note: In little-endian, we read the bytes backwards. We do not read the hexadecimal digits backwards.
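Python's built-in int.from_bytes can check this example: given the four bytes at 0x100 through 0x103, it assembles the value under either byte order.

```python
# The four bytes at addresses 0x100, 0x101, 0x102, 0x103 from the memory map.
mem = bytes([0x10, 0x20, 0x30, 0x40])

as_big = int.from_bytes(mem, "big")        # read bytes in address order
as_little = int.from_bytes(mem, "little")  # read bytes in reverse order

# as_big == 0x10203040, as_little == 0x40302010
```

Note that 0x40302010, not 0x04030201, is the little-endian value: whole bytes are reversed, never the hexadecimal digits within a byte.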

Words and Longwords in Byte Addressable Memory: Example #1 (Continued)

You are given the same memory map for a byte addressable memory.

    Address   FC  FD  FE  FF  100 101 102 103 104
    Contents  99  88  66  66  10  20  30  40  55

Question: What is the 32-bit longword associated with address 0x100?
Answer: It occupies addresses 0x100, 0x101, 0x102, and 0x103.
Big-endian: the value, in hexadecimal, is 0x1020 3040.
In decimal: 1*16^7 + 0*16^6 + 2*16^5 + 0*16^4 + 3*16^3 + 0*16^2 + 4*16^1 + 0*16^0
Little-endian: the value, in hexadecimal, is 0x4030 2010.
In decimal: 4*16^7 + 0*16^6 + 3*16^5 + 0*16^4 + 2*16^3 + 0*16^2 + 1*16^1 + 0*16^0

Words and Longwords in Byte Addressable Memory: Example #2

You are given the following memory map for a byte addressable memory.

    Address   FC  FD  FE  FF  100 101 102 103 104
    Contents  99  88  66  66  10  20  30  40  55

Question: What is the 16-bit word associated with address 0x100?
Answer: It occupies addresses 0x100 and 0x101.
Big-endian: the value, in hexadecimal, is 0x1020.
In decimal: 1*16^3 + 0*16^2 + 2*16^1 + 0*16^0
Little-endian: the value, in hexadecimal, is 0x2010.
In decimal: 2*16^3 + 0*16^2 + 1*16^1 + 0*16^0

One Classification of Architectures

How do we handle the operands? Consider a simple addition, specifically C = A + B.

Stack architecture
In this, all operands are found on a stack. These designs have good code density (they make good use of memory), but have problems with operand access. Typical instructions would include:
    Push X   // Push the value at address X onto the top of the stack
    Pop Y    // Pop the top of the stack into address Y
    Add      // Pop the top two values, add them, and push the result
Program implementation of C = A + B:
    Push A
    Push B
    Add
    Pop C
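A minimal sketch of such a stack machine follows; the class and its three methods are our own illustration, not an instruction set from the text.

```python
class StackMachine:
    """Toy stack architecture: all arithmetic operands live on a stack."""
    def __init__(self, memory):
        self.mem = dict(memory)   # address (here, a name) -> value
        self.stack = []

    def push(self, addr):         # Push X
        self.stack.append(self.mem[addr])

    def pop(self, addr):          # Pop Y
        self.mem[addr] = self.stack.pop()

    def add(self):                # Add: pop two values, push their sum
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a + b)

# C = A + B
m = StackMachine({"A": 2, "B": 3, "C": 0})
m.push("A"); m.push("B"); m.add(); m.pop("C")
```

Notice that Add names no operands at all, which is the source of the good code density.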

Single Accumulator Architectures

The MARIE is an example of this. There is a single register used to accumulate the results of arithmetic operations. The MARIE realization of the instruction C = A + B:
    Load A    // Load the AC from address A
    Add B     // Add the value at address B
    Store C   // Store the result into address C
In each of these instructions, the accumulator is implicitly one of the operands and need not be specified in the instruction. This saves space.

Extension of this for multiplication: if we multiply two 16-bit signed integers, we can get a 32-bit result. Consider squaring 24576 (2^14 + 2^13), or 0110 0000 0000 0000 in 16-bit binary. The result is 603,979,776, or 0010 0100 0000 0000 0000 0000 0000 0000. We need two 16-bit registers to hold the result. The PDP-9 had the MQ and AC. The results of this multiplication would be:
    MQ  0010 0100 0000 0000   // High 16 bits
    AC  0000 0000 0000 0000   // Low 16 bits
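The split of the 32-bit product into two 16-bit registers can be sketched as follows; for simplicity the sketch handles only non-negative values, and the function name is ours.

```python
def multiply_split(x, y):
    """Return the high and low 16-bit halves of the 32-bit product x*y.
    Sketch for non-negative operands only."""
    product = (x * y) & 0xFFFFFFFF
    high = product >> 16      # goes to the MQ in the slide's example
    low = product & 0xFFFF    # goes to the AC
    return high, low

mq, ac = multiply_split(24576, 24576)
# mq == 0x2400 (0010 0100 0000 0000), ac == 0x0000
```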

General Purpose Register Architectures

These have a number of general purpose registers, normally identified by number. The number of registers is often a power of 2: 8, 16, or 32 being common. (The Intel architecture with its four general purpose registers is different. These are called EAX, EBX, ECX, and EDX; there is a lot of history here.) The names of the registers often follow an assembly language notation designed to differentiate register names from variable names. An architecture with eight general purpose registers might name them %R0, %R1, …, %R7. The prefix "%" here indicates to the assembler that we are referring to a register, not to a variable that has a name such as R0. The latter name would be poor coding practice. Designers might choose to have register %R0 identically set to 0. Having this constant register considerably simplifies a number of circuits in the CPU control unit. We shall return to this %R0 ≡ 0 when discussing addressing modes.

General Purpose Registers with Load-Store

A Load-Store architecture is one with a number of general purpose registers in which the only memory references are:
1) Loading a register from memory
2) Storing a register to memory
The realization of our programming statement C = A + B might be something like:
    Load  %R1, A         // Load register 1 from memory location A
    Load  %R2, B         // Load register 2 from memory location B
    Add   %R3, %R1, %R2  // Add contents of registers %R1 and %R2,
                         // placing the result into register %R3
    Store %R3, C         // Store register 3 into memory location C
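The four-instruction sequence above can be simulated with a toy machine; the class and method names are our own sketch of the idea, not a real ISA.

```python
class LoadStoreMachine:
    """Toy load-store machine: memory is touched only by load and store;
    arithmetic works register-to-register."""
    def __init__(self, memory, num_regs=8):
        self.mem = dict(memory)      # address (here, a name) -> value
        self.reg = [0] * num_regs    # %R0 .. %R7

    def load(self, rd, addr):        # Load %Rd, addr
        self.reg[rd] = self.mem[addr]

    def store(self, rs, addr):       # Store %Rs, addr
        self.mem[addr] = self.reg[rs]

    def add(self, rd, rs1, rs2):     # Add %Rd, %Rs1, %Rs2
        self.reg[rd] = self.reg[rs1] + self.reg[rs2]

# C = A + B
m = LoadStoreMachine({"A": 2, "B": 3, "C": 0})
m.load(1, "A"); m.load(2, "B"); m.add(3, 1, 2); m.store(3, "C")
```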

General Purpose Registers: Register-Memory

In a load-store architecture, the operands for any arithmetic operation must all be in CPU registers. The register-memory design relaxes this requirement, requiring only one of the operands to be in a register. We might have two types of addition instructions:
    Add register to register
    Add memory to register
The realization of the above simple program statement, C = A + B, might be:
    Load  %R4, A   // Get M[A] into register 4
    Add   %R4, B   // Add M[B] to register 4
    Store %R4, C   // Place the result into memory location C

General Purpose Registers: Memory-Memory

In this design, there are no restrictions on the location of operands. Our instruction C = A + B might be encoded simply as

    Add C, A, B

The VAX series supported this mode. The VAX had at least three different addition instructions for each data length:
    Add register to register
    Add memory to register
    Add memory to memory
There were these three for each of the following data types: 8-bit bytes, 16-bit integers, 32-bit long integers, 32-bit floating point numbers, and 64-bit floating point numbers. Here we see at least 15 different instructions that perform addition. This is complex.

Modern Design Principles: Basic Assumptions

Some assumptions that drive current design practice include:
1. The fact that most programs are written in high-level compiled languages. Provision of a large general purpose register set greatly facilitates compiler design.
2. The fact that current CPU clock cycle times (0.25 - 0.50 nanoseconds) are much faster than memory access times.
3. The fact that a simpler instruction set implies a smaller control unit, thus freeing chip area for more registers and on-chip cache.
4. The fact that execution is more efficient when a two-level cache system is implemented on-chip. We have a split L1 cache (with an I-Cache for instructions and a D-Cache for data) and an L2 cache.
5. The fact that memory is so cheap that it is a commodity item.

Modern Design Principles
1. Allow only fixed-length instructions. This may waste memory, but modern designs have plenty of it, and it is cheap.
2. Minimize the number of instruction formats and make them simpler, so that the instructions are more easily and quickly decoded.
3. Provide plenty of registers and the largest possible on-chip cache memory.
4. Minimize the number of instructions that reference memory. The preferred practice is called "Load/Store", in which the only operations to reference primary memory are register loads from memory and register stores into memory.
5. Use pipelined and superscalar approaches that attempt to keep each unit in the CPU busy all the time. At the very least, provide for fetching one instruction while the previous instruction is being executed.
6. Push the complexity onto the compiler. This is called moving the DSI (Dynamic-Static Interface). Let compilation (the static phase) handle any issue that it can, so that execution (the dynamic phase) is simplified.

The Fetch-Execute Cycle (Again)

This cycle is the logical basis of all stored program computers. Instructions are stored in memory as machine language. Instructions are fetched from memory and then executed. The common fetch cycle can be expressed in the following control sequence:

    MAR ← PC.   // The PC contains the address of the instruction.
    READ.       // Put the address into the MAR and read memory.
    IR ← MBR.   // The read placed the instruction into the MBR;
                // copy it into the IR.

This cycle is described in many different ways, most of which serve to highlight additional steps required to execute the instruction. Examples of additional steps are: Decode the Instruction, Fetch the Arguments, Store the Result, etc. A stored program computer is often called a "von Neumann machine" after one of the originators of the EDVAC. This fetch-execute cycle is often called the "von Neumann bottleneck", as the necessity for fetching every instruction from memory slows the computer.
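The cycle can be sketched as a loop over a toy accumulator machine; the instruction tuples and opcode names here are made up for illustration, not MARIE's actual encoding.

```python
def run(program, data):
    """Fetch-execute loop for a made-up accumulator machine.
    Each instruction is an (opcode, operand) pair."""
    mem = dict(data)
    pc, acc = 0, 0
    while True:
        opcode, operand = program[pc]   # fetch: MAR <- PC; READ; IR <- MBR
        pc += 1                         # PC now points at the next instruction
        # execute: decode the opcode and act on the operand
        if opcode == "LOAD":
            acc = mem[operand]
        elif opcode == "ADD":
            acc += mem[operand]
        elif opcode == "STORE":
            mem[operand] = acc
        elif opcode == "HALT":
            return mem

# C = A + B, in the accumulator style seen earlier
result = run([("LOAD", "A"), ("ADD", "B"), ("STORE", "C"), ("HALT", None)],
             {"A": 2, "B": 3, "C": 0})
```

Every trip around the loop performs one memory fetch before any work is done, which is exactly the bottleneck discussed next.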

Avoiding the Bottleneck

In the simple stored program machine, the following loop is executed:

    Fetch the next instruction
    Loop Until Stop
        Execute the instruction
        Fetch the next instruction
    End Loop

The first attempt to break out of this endless cycle was "instruction prefetch": fetch the next instruction at the same time the current one is executing. As we can easily see, this concept can be extended.

Instruction-Level Parallelism: Instruction Prefetch

Break up the fetch-execute cycle and do the two in parallel. This dates to the IBM Stretch (1959).

The prefetch buffer is implemented in the CPU with on-chip registers, either as a single register or as a queue. The CDC-6600 buffer had a queue of length 8 (I think). Think of the prefetch buffer as containing the IR (Instruction Register). When the execution of one instruction completes, the next one is already in the buffer and does not need to be fetched. Any program branch (loop structure, conditional branch, etc.) will invalidate the contents of the prefetch buffer, which must be reloaded.

Instruction-Level Parallelism: Pipelining

Pipelining is better considered as an "assembly line" for instructions: each stage handles one phase of one instruction, and the stages work in parallel.

Note that the throughput is distinct from the time required for the execution of a single instruction. With a five-stage pipeline kept full, the throughput is five times the single-instruction rate.

What About Two Pipelines?

Code emitted by a compiler tailored for this architecture has the possibility of running twice as fast as code emitted by a generic compiler. Some pairs of instructions are not candidates for dual pipelining:

    C = A + B
    D = A - C   // Need the new value of C here

This is called a RAW (Read After Write) dependency, in that the value for C must be written to a register before it can be read for the next operation. Stopping the pipeline to wait for a needed value is called stalling.

Superscalar Architectures

Having 2, 4, or 8 completely independent pipelines on a CPU is very resource-intensive, and careful analysis does not directly support it. Often, the execution units are the slowest units by a large margin. It is usually a better use of resources to replicate only the execution units.