COSC 6385 Computer Architecture - Data Level Parallelism (II) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

Edgar Gabriel, Fall 2014

References

• Intel Larrabee:
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15. http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell processor:
  [2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, Vol. 51, No. 5, pp. 503-519. http://www.research.ibm.com/journal/rd/515/johns.pdf
  [3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, Vol. 26, No. 3, pp. 10-23. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf

Larrabee Motivation
• Comparison of two architectures with the same number of transistors
  – The simplified in-order core delivers half the single-stream performance (2 vs. 4 per clock)
  – 20x higher peak vector throughput for multi-stream execution (160 vs. 8 per clock)

                      2 out-of-order cores   10 in-order cores
Instruction issue     4                      2
VPU per core          4-wide SSE             16-wide
L2 cache size         4 MB                   4 MB
Single stream         4 per clock            2 per clock
Vector throughput     8 per clock            160 per clock
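
The vector throughput figures in the comparison can be reproduced with simple arithmetic. The sketch below assumes that peak vector throughput is the number of cores times the VPU width, with every lane completing one operation per clock (an idealization); the function name is made up for illustration.

```python
# Peak vector throughput = number of cores x VPU lanes per core,
# assuming every lane completes one operation per clock (an idealization).
def peak_vector_throughput(cores: int, vpu_width: int) -> int:
    return cores * vpu_width

out_of_order = peak_vector_throughput(cores=2, vpu_width=4)   # 4-wide SSE
in_order = peak_vector_throughput(cores=10, vpu_width=16)     # 16-wide VPU
ratio = in_order / out_of_order

print(out_of_order, in_order, ratio)   # prints "8 160 20.0"
```

The ratio of the two results, 160 / 8 = 20, is the throughput advantage of the in-order design in this comparison.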

Larrabee Overview
• Many-core visual computing architecture
• Based on x86 CPU cores
  – Extended version of the regular x86 instruction set
  – Supports subroutines and page faulting
• Number of x86 cores can vary depending on the implementation and processor version
• Fixed-function units for texture filtering
  – Other graphics operations such as rasterization or post-shader blending are done in software

Larrabee Overview (II)

Image Source: [1]

Overview of a Larrabee Core (I)

Image Source: [1]

Overview of a Larrabee Core (II)
• x86 core derived from the Pentium processor
  – No out-of-order execution
• Standard Pentium instruction set with the addition of
  – 64-bit instructions
  – Instructions for pre-fetching data into L1 and L2 cache
  – Support for 4 simultaneous threads, with separate registers for each thread
• Each core is augmented with a wide vector processing unit (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
  – Coherent L2 cache across all cores

Vector Processing Unit in Larrabee
• 16-wide VPU executing integer, single- and double-precision floating-point operations
• VPU supports gather-scatter operations
  – The 16 elements can be loaded from or stored to up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)
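
To make gather-scatter and predication concrete, here is a minimal software model in plain Python. It is not Larrabee code: the helper names `gather`, `scatter`, and `select` are hypothetical, and a real VPU performs each of these steps as a single vector instruction rather than a loop.

```python
VPU_WIDTH = 16  # lanes per vector instruction, as on the slide

def gather(memory, indices):
    """Load one element per lane from up to 16 different addresses."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values, mask):
    """Store lane values to their addresses; lanes whose mask bit is False are skipped."""
    for i, v, m in zip(indices, values, mask):
        if m:
            memory[i] = v

def select(mask, then_vals, else_vals):
    """Predicated merge: per lane, take then_vals where the mask is True, else else_vals."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

memory = list(range(100))
indices = [3 * lane for lane in range(VPU_WIDTH)]   # 16 non-contiguous addresses
x = gather(memory, indices)                          # 0, 3, 6, ..., 45

# if (x is even) y = x // 2 else y = 3 * x + 1, evaluated for all lanes at once
mask = [v % 2 == 0 for v in x]
y = select(mask, [v // 2 for v in x], [3 * v + 1 for v in x])

scatter(memory, indices, y, [True] * VPU_WIDTH)
```

Both branches of the if-then-else are computed for every lane, and the mask decides which result each lane keeps. This is how predication avoids a per-element branch.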

Inter-Processor Ring Network
• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before a message is injected into the network

Larrabee Programming Models
• Most applications can be executed without modification due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
  – API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
  – Intel Threading Building Blocks
  – Larrabee-specific OpenMP directives

Larrabee Performance

Image Source: [1]

Intel Xeon Phi Processor
• First generation of the Intel MIC (Many Integrated Core) architecture
• 60 cores / 1.0 GHz
• 512-bit wide vector engine
• 32 KB L1 I/D cache, 512 KB L2 cache (per core)
• Up to 1 TFLOPS double-precision performance
• 8 GB GDDR5 memory and 320 GB/s bandwidth
• Standard PCIe x16 form factor
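
The "up to 1 TFLOPS" figure can be approximated from the other numbers on this slide. The sketch below assumes that each 64-bit lane of the 512-bit vector engine completes one fused multiply-add per cycle, counted as 2 floating-point operations; that factor is an assumption and is not stated on the slide.

```python
def peak_dp_gflops(cores, clock_ghz, vector_bits, flops_per_lane_per_cycle=2):
    """Peak double-precision GFLOPS; the default of 2 assumes one FMA per lane per cycle."""
    lanes = vector_bits // 64        # 64-bit double-precision lanes per vector
    return cores * clock_ghz * lanes * flops_per_lane_per_cycle

xeon_phi = peak_dp_gflops(cores=60, clock_ghz=1.0, vector_bits=512)
print(f"{xeon_phi:.0f} GFLOPS")      # prints "960 GFLOPS"
```

With 60 cores at 1.0 GHz this gives 960 GFLOPS, which is close to the quoted 1 TFLOPS.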

IBM Cell Overview (I)
• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
  – E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
  – IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
  – one (or more) general-purpose processor elements (PPE) and
  – one (or more) synergistic processor elements (SPEs)

Cell Architecture block diagram

Image Source: [2]

• Two generations available so far:
  – Cell BE:
    • 204.8 GFLOPS single-precision peak performance
    • 14.6 GFLOPS double-precision peak performance
  – PowerXCell 8i (2008):
    • 204.8 GFLOPS single-precision peak performance
    • 102.4 GFLOPS double-precision peak performance
  – Both have 1 PPE and 8 SPEs
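
The peak numbers above can be reproduced under a few assumptions that are not on the slides: an SPE clock of 3.2 GHz, one fused multiply-add (2 floating-point operations) per lane per instruction, and, for the first-generation Cell BE, a double-precision instruction that can issue only once every 7 cycles. The sketch counts the SPEs only; the function name is hypothetical.

```python
CLOCK_GHZ = 3.2       # assumed SPE clock; not stated on the slides
NUM_SPES = 8          # from the slide: both generations have 8 SPEs
REGISTER_BITS = 128   # width of an SPE register

def spe_peak_gflops(element_bits, issue_interval_cycles=1):
    """Aggregate SPE peak, assuming one FMA (2 flops) per lane per issued instruction."""
    lanes = REGISTER_BITS // element_bits
    flops_per_instruction = lanes * 2
    return NUM_SPES * CLOCK_GHZ * flops_per_instruction / issue_interval_cycles

single = spe_peak_gflops(32)                                    # about 204.8
double_powerxcell = spe_peak_gflops(64)                         # about 102.4
double_cell_be = spe_peak_gflops(64, issue_interval_cycles=7)   # about 14.6

print(f"{single:.1f} {double_powerxcell:.1f} {double_cell_be:.1f}")
```

Under these assumptions the large double-precision gap between the two generations comes entirely from the issue interval, not from the vector width.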

General Purpose Processor (PPE)
• Based on the IBM PowerPC processor
  – Supports multiple simultaneous operating environments (virtualization)
  – E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system
• Performs management and application control functions

Synergistic Processor Element (SPE)
• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from the local storage
  – Current versions of the Cell processors: 256 KB local storage
• The local storage is connected to the main memory through a Memory Flow Controller (MFC)
  – MFC moves data between main memory and local storage or between two SPEs
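
Because an SPE computes only on data in its local storage, a large array has to be streamed through it in chunks. The following is a minimal software model of that pattern, not Cell SDK code: `dma_get` and `dma_put` are hypothetical stand-ins for MFC transfers, and the decision to reserve half of the local storage for data is an assumption made for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024   # from the slide
ELEMENT_BYTES = 4                # assumed: 32-bit elements
# Hypothetical budget: half of the local storage for data, the rest for code and stack
CHUNK_ELEMENTS = (LOCAL_STORE_BYTES // 2) // ELEMENT_BYTES

def dma_get(main_memory, start, count):
    """Stand-in for an MFC transfer from main memory into local storage."""
    return main_memory[start:start + count]

def dma_put(main_memory, start, local_buffer):
    """Stand-in for an MFC transfer from local storage back to main memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer

def spe_scale(main_memory, factor):
    """Scale an array larger than the local storage, one chunk at a time."""
    for start in range(0, len(main_memory), CHUNK_ELEMENTS):
        local = dma_get(main_memory, start, CHUNK_ELEMENTS)
        local = [factor * v for v in local]   # the SPE computes only on local data
        dma_put(main_memory, start, local)

data = list(range(100_000))   # about 400 KB of 32-bit values: does not fit in 256 KB
spe_scale(data, 2)
```

A real implementation would typically overlap the transfer of the next chunk with the computation on the current one (double buffering); this sketch omits that for brevity.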

MFC commands

Image Source: [2]

Synergistic Processor Element (SPE) (II)
• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
  – Sixteen 8-bit integers or
  – Eight 16-bit integers or
  – Four 32-bit integers or single-precision floating-point numbers or
  – Two 64-bit integers or double-precision floating-point numbers
• Most instructions supported by the synergistic processor unit operate on all elements in a register -> SIMD
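
The register layouts listed above all fill exactly 16 bytes. The sketch below checks this with Python's `struct` module and then emulates one SIMD add on four 32-bit lanes. It is a plain-Python illustration of the data layout, not SPE code; the big-endian byte order is chosen here only to obtain fixed standard sizes.

```python
import struct

REGISTER_BYTES = 128 // 8   # one 128-bit SPE register

layouts = {
    "sixteen 8-bit integers": "16b",
    "eight 16-bit integers": "8h",
    "four 32-bit integers": "4i",
    "four single-precision floats": "4f",
    "two 64-bit integers": "2q",
    "two double-precision floats": "2d",
}
for name, fmt in layouts.items():
    # ">" selects standard sizes without padding
    assert struct.calcsize(">" + fmt) == REGISTER_BYTES, name

# One SIMD add on four 32-bit lanes packed into register-sized values
a = struct.pack(">4i", 1, 2, 3, 4)
b = struct.pack(">4i", 10, 20, 30, 40)
lanes = [x + y for x, y in zip(struct.unpack(">4i", a), struct.unpack(">4i", b))]
result = struct.pack(">4i", *lanes)

print(struct.unpack(">4i", result))   # prints "(11, 22, 33, 44)"
```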

Simplified representation of a current Cell processor

Image Source: [3]

Element Interconnect Bus
• PPE and SPEs communicate through the Element Interconnect Bus
  – Contains a shared command bus
    • Sets up end-to-end transactions
    • Used for coherence protocols
  – Point-to-point data interconnect
    • Four 16-byte-wide rings, two used for clockwise data transfers, two for counter-clockwise data transfers
    • Each ring transfers 128-byte packets (= cache line size of the PPE)
    • Communication costs between two SPEs can vary between 1 hop and 6 hops
  – Overall bandwidth: 204.8 GB/s
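
The range of 1 to 6 hops is what one would expect from a bidirectional ring in which a transfer always takes the shorter way around. The sketch below assumes 12 ring stops (the PPE, 8 SPEs, the memory controller and 2 I/O interfaces); that count and the numbering of the stops are assumptions, not taken from the slides, and the code counts hops between any two stops rather than only between SPEs.

```python
RING_STOPS = 12   # assumption: PPE, 8 SPEs, memory controller and 2 I/O interfaces

def hops(src, dst, stops=RING_STOPS):
    """Return (hop count, direction) for the shorter way around a bidirectional ring."""
    clockwise = (dst - src) % stops
    counter = (src - dst) % stops
    if clockwise <= counter:
        return clockwise, "clockwise"
    return counter, "counter-clockwise"

all_hops = [hops(s, d)[0]
            for s in range(RING_STOPS)
            for d in range(RING_STOPS) if s != d]
print(min(all_hops), max(all_hops))   # prints "1 6"
```

For example, `hops(0, 9)` returns `(3, "counter-clockwise")`, because going the other way around would take 9 hops.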

Comparison of IBM Cell and Intel Larrabee
• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus to communicate between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
• IBM Cell requires data to be moved explicitly to the ‘local store’, while Larrabee can address any memory area
  – Programs for the Cell have to be written taking the limited amount of memory available to an SPE into account
