Chap. 3 - Parallel Architectures

Types in an overview
Multiprocessor systems with shared memory
Programming of shared memory systems
Cache & Memory Coherency
Multiprocessor systems with distributed memory
Programming of distributed memory systems
Networks for parallel computers
Vector Processors
Array Computers
New Trends

Types of Parallel Computers (1)

Rough classification scheme according to the number of control (instruction) streams and the number of data streams:

                         Data Streams
  Instruction Streams    Single (SD)    Multiple (MD)
  Single (SI)            SISD           SIMD
  Multiple (MI)          (MISD)         MIMD

Classification principle by Flynn, 1966
Classical 'von Neumann' computers are covered by the class SISD

Types of Parallel Computers (2) - Multiple Instruction, Multiple Data (MIMD)

All multiprocessor systems: each processor can work with an individual instruction stream on an individual stream of operands.

Subclasses:
Multiple processors with shared memory, close to the PRAM model but without step-wise synchronization
Multiple processors with local memory, connected via a network (Distributed Memory)
Mixed architecture: Distributed Shared Memory (DSM) - hardware structured like distributed memory, but a shared address space via MMU address translation plus software

Types of Parallel Computers (3) - Single Instruction, Multiple Data (SIMD)

Each instruction causes operations on multiple pairs of data.

Subclasses:
Vector processors (some of the number crunchers, e.g. CRAY, NEC)
Array processors
Early parallel computers (massively parallel)
Nowadays a few special-purpose architectures
ISA extensions: MMX, SSE, AltiVec for a small set of parallel units (see the sketch below)
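
A minimal sketch of such an ISA extension in use, written with SSE intrinsics in C; the function name, the array names, and the assumption that n is a multiple of 4 are illustrative, not taken from the slides:

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers holding 4 floats */

    /* Adds two float arrays; n is assumed to be a multiple of 4 for brevity. */
    void add4(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);             /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));  /* 4 additions in one instruction */
        }
    }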


Shared Memory Multiprocessors

Structure:

[Figure: processors P0, P1, P2, P3, ..., P(p-1), each with a cache, connected via a communication network to several memory modules (MEM)]

Coordination and cooperation using shared variables in memory
A single instance of the operating system


Programming Shared Memory (1)

Options:
Multiple processes (created with fork) communicating via shared memory segments (shmem), as sketched below
Explicit message passing among multiple processes: Unix pipes, MPI
Multithreading: threads run on different processors and so utilize the parallel machine; all threads share one address space
OpenMP: a set of compiler directives for controlling multi-threaded execution with a partitioned iteration space
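
A minimal sketch of the first option: processes created with fork communicating through a shared memory segment (created here with an anonymous mmap mapping; System V shmget/shmat would serve the same purpose). The variable names and the transferred value are illustrative:

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        /* Anonymous shared mapping: visible to parent and child after fork(). */
        int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        *shared = 0;
        if (fork() == 0) {           /* child process */
            *shared = 42;            /* communicate via the shared segment */
            _exit(0);
        }
        wait(NULL);                  /* parent waits for the child */
        printf("value written by child: %d\n", *shared);
        return 0;
    }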


Programming Shared Memory (2)

Several threads run on several processors under control of the operating system
OS-specific thread functions, e.g. Solaris threads
Portability standard: POSIX Threads (pthread library)

Basic functions:

    int  pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                        void *(*start_routine)(void *), void *arg);
    void pthread_exit(void *value_ptr);
    int  pthread_join(pthread_t thread, void **value_ptr);
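
A minimal sketch using these three functions; the worker routine, the number of threads, and the way the thread id is passed are illustrative assumptions:

    #include <pthread.h>
    #include <stdio.h>

    /* Worker: each thread receives its id via the void* argument. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld running\n", id);
        pthread_exit(NULL);                        /* same effect as returning NULL */
    }

    int main(void)
    {
        pthread_t tid[4];

        for (long i = 0; i < 4; i++)               /* create 4 threads */
            pthread_create(&tid[i], NULL, worker, (void *)i);

        for (int i = 0; i < 4; i++)                /* wait for all of them */
            pthread_join(tid[i], NULL);

        return 0;
    }

Such a program is linked against the pthread library, e.g. with gcc -pthread.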


Programming Shared Memory (3)

OpenMP:

Example of loop parallelization: a loop of the form for (i = 0; i < n; i++) ... is distributed over the threads with a single compiler directive, as sketched below.
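
A minimal sketch of such a loop parallelization, assuming a simple vector addition as the loop body; the function and array names are illustrative:

    #include <omp.h>

    /* The directive splits the iteration space across the available threads. */
    void vector_add(double *a, const double *b, const double *c, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

With gcc such code is compiled with -fopenmp; the number of threads can be set via the OMP_NUM_THREADS environment variable.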

Vector Processors (2)

Speedup of a vector pipeline with k stages: for long vectors (n >> k) the speedup Sp approaches k, i.e. it is bounded by the pipeline depth.

Vector Processors (3)

Generic structure of a vector computer. Components:
Control unit
At least one scalar processing unit
Vector unit, composed of many (specialized) vector pipelines
Registers: scalar and vector
Interleaved main memory
Load/Store (L/S) units

[Figure: block diagram with control unit, instruction buffer, execution control, scalar processing unit, vector unit, L/S units, and interleaved main memory]

Vector Processors (4)

Vector computers are mostly Load/Store architectures.

Vector registers:
Act as source and destination for the vector pipelines
Store temporary data in chained vector operations
Allow overlapping of memory access and operand flow to the vector pipelines

[Figure: interleaved main memory connected via L/S units to the vector registers; continuous re-fill for load operations, continuous store-back for write operations, sequential access by the pipeline at a high clock rate]

Vector Processors (5)

Chaining of vector operations: VMA V0, V1, V2, V4 computes V0 * V1 + V2 -> V4

[Figure: the multiply pipeline reads V0 and V1; its result is fed directly (chained) into the add pipeline together with V2, and the result is written to V4; L/S units fill and drain the vector registers]

Chaining makes it possible to increase k and thus to obtain a higher speedup.
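
For reference, a minimal C sketch (not an actual vector instruction) of the element-wise computation that the chained VMA above performs; the array names mirror the vector register names and are illustrative:

    /* Element-wise multiply-add, the operation realized by the chained
     * pipelines: V0 * V1 + V2 -> V4. */
    void vma(const double *v0, const double *v1, const double *v2,
             double *v4, int n)
    {
        for (int i = 0; i < n; i++)
            v4[i] = v0[i] * v1[i] + v2[i];
    }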


Vector Processors (6)

Types of parallelism in vector computers:
Vector-pipeline parallelism: iterations on the same type of operands can be executed as pipelined vector instructions
Usage of multiple vector pipelines: execute several independent vector operations in parallel
Split large vector pairs and execute them in parallel using multiple pipelines (see the sketch below)
Chaining of vector operations
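
A minimal sketch of the splitting idea: a long vector operation is processed in chunks of an assumed vector length VLEN; on a machine with several pipelines the chunks can be issued to different pipelines, on a single pipeline they are processed one after another (strip-mining). The chunk length, function, and array names are illustrative assumptions:

    #define VLEN 64   /* assumed vector register length */

    /* y = alpha * x + a, processed in chunks of at most VLEN elements;
     * each inner loop corresponds to one vector operation. */
    void scaled_add(double *y, const double *a, const double *x,
                    double alpha, int n)
    {
        for (int i = 0; i < n; i += VLEN) {
            int len = (n - i < VLEN) ? (n - i) : VLEN;
            for (int j = 0; j < len; j++)
                y[i + j] = alpha * x[i + j] + a[i + j];
        }
    }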


Array Computers

SIMD computer in a real array structure
A single control unit decodes instructions and generates control signals
A large number of processing elements execute the same instructions (step-synchronized), but on different data

[Figure: the control unit receives the instruction stream from memory and broadcasts control signals to an array of execution units with local registers, connected by a neighborhood network; data flows between memory and the execution units]
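
As an illustration of the kind of regular, step-synchronized computation an array computer targets, a minimal C sketch of a neighborhood (stencil) operation; the array size and the smoothing formula are illustrative assumptions:

    /* Every interior element applies the same operation to its own data and
     * its four neighbors (cf. the neighborhood network); on an array computer
     * all elements are updated in lockstep rather than in a sequential loop. */
    void smooth(const float in[64][64], float out[64][64])
    {
        for (int i = 1; i < 63; i++)
            for (int j = 1; j < 63; j++)
                out[i][j] = 0.25f * (in[i-1][j] + in[i+1][j] +
                                     in[i][j-1] + in[i][j+1]);
    }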

New Trends (1)

Parallel processing moves into modern processors.

Shared-memory multiprocessors:
(1) Multicore processors
(2) Multithreaded processors with many virtual processors (e.g. Hyper-Threading)
Combinations of (1) and (2) have been announced

Another trend: many 'small' processors on a chip
Relatively small local memory
Connected via an on-chip network
DMA engines for remote memory transfers
Example: IBM Cell

New Trends (2)

[Figure: Cell architecture]


New Trends (3)

[Figure: Synergistic Processor Element (SPE) architecture]


Summary

Several classes of parallel computers:

SIMD - vector and array processors
Programming with vector or array instructions
Vector processors are suited for problems with a huge fraction of floating-point calculations
Array computers for regularly structured problems, e.g. image processing

MIMD - multiprocessors
The most universal class
Needs explicit parallel programming or compiler tools for automatic code parallelization
Shared memory requires techniques for consistency, but programming is easier