Von Neumann computer
• Modern electronic computers started with a design by von Neumann (~1945)
Basic machine architecture
Von Neumann computer
• Control + Arithmetic logic unit
  – "Central Processing Unit" where all data processing work is done
• Memory
  – Stores both program and data
  – Some earlier proto-computers had different memories (sometimes using differing hardware technologies) for program and data
• In the Von Neumann design, all I/O data transfers have to pass through registers ("accumulators") in the arithmetic-logic unit
  – This design was adequate for the earliest computers with their very small memories and slow, limited I/O devices
  – It proved less appropriate when, by the mid-1950s, machines had somewhat larger memories and faster I/O using magnetic tapes and then disks
• Transfer of a block of data from tape to memory, using a machine with 3 registers in its ALU, would have been programmed something like:
        Load r0,512
        Load r1,destination
  Loop: Read-word from tape-control into r2
        Store r2,@r1
        Increment r1
        Decrement r0
        Test r0
        Skip if zero
        Goto Loop
        // Continue now block has loaded
– The program might actually be “hard wired”
• Input and output
  – Input of data and output of results of computation
With this programmed scheme, the disk/tape transfer occupies the CPU.
Autonomous transfers
• By the late 1950s, logic circuitry had become a little cheaper and it was practical to put control logic (still implemented using vacuum tubes) into device controllers for things like disks and tapes – potentially relieving the CPU of the detailed work of the transfer
  – The code in the CPU would have been more like:
        Load r0,destination
        Copy r0,tape-control register
        Start transfer of block on tape
        …
        Check if transfer complete
  – The tape/disk unit could have its own counter register counting the bytes transferred, its own address register holding the location where data are to be stored, and circuitry to update these registers as the data transfers took place
• Of course, there had to be a data pathway from the device directly to memory – "direct memory access"
• From the late 1950s onwards, this became the basic structure of a computer:
Simple modern architecture
• CPU (Central Processing Unit)
  – Timing and control circuitry
  – High speed data registers (older term for register – "accumulator")
  – Although some computer architectures allow a little more flexibility, it's typical for instructions for data processing operations (add, multiply, xor, …) to take data from registers and place results in registers
  – Variables in memory have to have their values copied into registers before the values can be manipulated; the results must be stored back into memory
• Memory
  – Magnetic core (1950s–1970s), later semi-conductor (1970…) memory for OS, programs, and data
• Peripheral device controllers
  – Sophisticated, complex controllers for disks and tapes
  – Simpler controllers for slow devices like printers, terminals, card-readers, …
• Bus
  – A common communications highway

Disk working with direct memory access
[A sequence of six diagrams shows the CPU (PC, IR, flags, ALU, registers), the disk controller (with its own block number, byte counter, destination address, and flags registers, plus a disk cache), and memory, all connected by the bus. The exchange proceeds:]
1. CPU to disk: load block number XXX and start seek
2. Disk moving heads (seeking); CPU executing other instructions
3. CPU to disk: copy into memory starting at address ******; disk to CPU: got it
4. Data transferred "directly" into memory – block of memory being filled
5. Disk to CPU: transfer complete – block of memory now filled
CPU
• The CPU of a modern small computer is physically implemented as a single silicon "chip".
  – This chip will have engraved on it the millions of transistors and the interconnecting "wiring" that define the CPU's circuits.
  – The chip will have one hundred or more pins around its rim – some of these pins are connection points for the signal lines from the bus, others will be the points where electrical power is supplied to the chip.
[Photos: Intel 4004 – one of the first single-chip CPUs (~1971) – and more modern CPU chips]
CPU
• Although physically a single component, the CPU is logically made up from a number of subparts.
• The three most important, which will be present in every CPU, are the timing and control circuits, the ALU, and the registers.

Fetch-decode-execute
• The timing and control circuits are the heart of the system.
• A controlling circuit defines the computer's basic processing cycle:
    repeat
        fetch next instruction from memory
        decode instruction (i.e. determine which data manipulation circuit is to be activated)
        fetch from memory any additional data that are needed
        execute the instruction (feed the data to the appropriate manipulation circuit)
    until "halt" instruction has been executed;
CPU registers
• Timing & control
  – Program counter
    • Address of next instruction to be executed
    • By default, it's the address following the current instruction
    • Changed if current instruction is a subroutine-call, jump (goto), branch, or "skip"
  – Instruction register
    • Holds current instruction for the decoding circuits
  – Flags
    • Did last operation result in a zero value, +ve value, -ve value, …?
• ALU
  – Lots of anonymous registers that hold data temporarily during operations like multiplications, shifts etc.

[Diagram: Timing and Control Unit with program counter (PC), instruction register (IR), and flags; ALU with various anonymous registers; high speed registers]
CPU registers
• High speed registers
  – Hold data values and/or addresses of data values
  – On most modern machines:
    • One register reserved as "stack pointer"
    • One register reserved as "stack frame pointer"
• At assembly language level (hand-written or compiler-generated), use of registers is explicit – code at this level is all about:
  • Copy data to this register
  • Combine data in these registers
  • Store data from this register back into main memory
  • Use contents of this register as address of some data value
  • …
C++ has provision for specifying use of registers in high-level code. Rarely useful! It makes code very machine specific, and anyway use of registers is better left to the optimisation phase of the compiler.

Instruction repertoire
• Each distinct machine architecture has its own "instruction set"
  – Instructions correspond to circuits built into the ALU¶
  – In the machine, instructions are represented by "op-codes" – specific bit patterns
  – Assembly language uses mnemonic names
    • e.g. Add for an addition instruction
    • ("mnemonic" – designed to aid the memory, the names chosen to remind the programmer of the effect of the instruction)

¶Not always true. Some CPUs are "micro-coded". Their built-in hardware circuits implement a different, usually much simpler architecture and instruction set. They have "read-only memory" subroutines that simulate the supposed instruction set using the simpler circuits actually present. Such approaches are old – the IBM360 series (1964) had a range of machines varying in power by 1..100 at least, all supposedly with the same instruction set; actually, the smaller machines simulated all the complex instructions using simpler circuitry. Modern Intel CPU chips have to be backward compatible with the 386/486 chips introduced in the 1980s; again, they simulate some of the instructions of those old designs using their newer circuitry.
Instructions & high level languages
Instructions
• Example: the Motorola 68000 CPU chip's instruction repertoire includes
    ADD   Add two integer values
    AND   Perform an AND operation on two bit patterns
    Bcc   Test a condition flag, and possibly branch to another instruction (variants like BEQ testing equality, BLT testing less than)
    CLR   Clear, i.e. set to 0
    CMP   Compare two values
    JMP   Jump or goto
    JSR   Call a subroutine
    SUB   Subtract second value from first
    RTS   Return from subroutine
  (The original 1984 Macintosh used this Motorola CPU.)
• It's not hard to envisage how simple expressions in a high level language might get coded using a given instruction set:
Registers are mostly the same size (same number of bits); something like "flags" might have fewer bits. The size of a register is the size of the data element most readily manipulated – 1 byte, 2 bytes, 4 bytes, or some arbitrary "word size" (12-bit, 18-bit, 36-bit, 40-bit, 60-bit). Operations, e.g. additions, on larger data elements will need to be done by sequences of instructions that manipulate register-sized portions.
int main(int argc, char** argv) {
    int total = 0;
    int data[] = { 1, 2, 3, 4, 5 };
    int len = sizeof(data)/sizeof(int);
    int* ptr = data;
    int i = 0;
    while (i < len) {
        total += *ptr;
        ptr++;
        i++;
    }
    return total;
}

• "Floating point" unit
  – Circuits implement floating point add, subtract, multiply, divide
  – > 50 times faster than subroutines
  – Exploited by re-written subroutines for faster floating point arithmetic
• Expanding the instruction set is no longer really an option. Modern chips typically have all heavily used operations implemented in circuitry; exotic instructions to perform special operations aren't readily exploited by compilers, so they only prove worthwhile in specialist chips – e.g. a chip for decoding compressed movies might be derived from a standard chip extended with additional specialized instructions.
Bus
• CPU and direct memory access devices have to compete for use of the bus
  – Transfers on the bus will take (at least) one clock cycle, so the CPU may have to wait a whole clock cycle when it wants to fetch the next instruction but the bus is busy with a transfer
  – Wait for a whole clock cycle? But we want speed!

Multiple bus
• Memory can be "multi-ported" so that it works with multiple buses.
• Instructions to disks (and more sophisticated things like auxiliary I/O processors or channels) can go on the same bus as used by the CPU, but the data can be transferred on a different bus
  – CPU doesn't have to wait even a single bus cycle
Multiple memory modules
• Memory can be organized in multiple modules so as to allow more than one simultaneous transfer
  – (Memory will have read and write speeds; maybe these are not fast enough)
  – Different schemes – can use either high-order or low-order bits of the address to identify the module
• Low order bits will place "successive bytes/words" in different modules
  – Could be useful if the disk and bus could potentially transfer data faster than a memory module can write words – by having successive words in different modules, transfers can utilize full speed.

[Diagram: CPU and memory on Bus-1; a device controller making DMA data transfers on a second bus, Bus-2]
Multi‐module memory
[Diagram: multiple reads/writes in progress simultaneously on different modules – addresses ending 000b, 001b, …, 111b illustrate the case where low order bits select the module; CPU and memory on Bus-1, a device controller making DMA data transfers on a second bus]

"Cache" memories in device controllers
• Can speed up some I/O by providing device controllers with their own memory
  – "Disk write"
    • Give the disk controller details of the disk address and immediately copy the contents of the main-memory data buffer into the disk controller's memory.
    • The disk controller will write the data to disk when the opportunity arises
  – Useful when bus speed is faster than memory write cycle time
• "Cache" – a secret hiding place or store
CPU & Memory
• Add a "cache" of high-speed memory to the CPU
  – Instruction cache
  – Data cache
• Cache may be "on chip" with the CPU circuitry, or accessed via some separate higher speed bus

Instruction cache
• Useful for code like loops that are executed a large number of times
  – (Operations like matrix multiplication)
• Code runs faster if instructions are in the cache
  – How did they get there?
    • Combination of hardware and (OS) software (mainly hardware)
      – Detect that the same small area of instruction memory has been in use for a large number of cycles
      – Copy those instructions into the cache
      – Fudge the address decoding/instruction fetch mechanism so that these instructions are subsequently fetched from the cache
Data cache
• Matrix multiplication again
  – Large chunks of memory holding 2D arrays of floating point numbers
  – Code repeatedly reading values
  – Code runs faster if the data are brought up into cache memory.
• How? Again, a combination of hardware and software (mainly hardware) detects repeated access to the same regions of memory, copies that block of memory into the cache, and fixes the addressing mechanism so that the cache is used on data fetches.

Data cache
• Writes to the data cache?
  – Matrix multiplication: ResultMatrix = MatrixA x MatrixB
  – This would again run faster if the "store" operations used the cache rather than real memory – so hold the ResultMatrix in the cache as well
  – Of course, it will still have to be written back into main memory!
CPU caches
• CPU caches add lots of complexity, especially in multi-CPU designs where the same data sets may be manipulated by code running in parallel on different CPUs
• Extra hardware
  – Cache loading has to be done largely by hardware
  – Locking and coherence mechanisms
    • With multiple CPUs, need to avoid the problems that could arise if a CPU tries to update data in its cache when that data is shared with other CPUs
  – Need to re-load instruction and data caches whenever there is a process switch

Instruction pipeline
• A computer with an instruction pipeline doesn't have a single "instruction register" – in effect it has several "instruction registers" in a pipe-line (queue)
• Instructions are "pre-fetched" and appended to the back of the pipeline
• The instruction at the front of the pipe-line – and some of the others in the pipe-line – are being executed
  – Extra hardware in the CPU determines which instructions can run concurrently
Instruction pipeline
• Don't have to work on a single instruction
  – Code like x = a[i][j]*b[j][k]; j++; if(j …
• Pipelined CPU
  – Start the floating point multiply
  – Store?