COSC 6385 Computer Architecture - Data Level Parallelism (II) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors


Author: Millicent Webster


COSC 6385 Computer Architecture - Data Level Parallelism (II) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2014

References
• Intel Larrabee:
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15. http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell processor:
  [2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, Vol. 51, No. 5, pp. 503-519. http://www.research.ibm.com/journal/rd/515/johns.pdf
  [3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, Vol. 26, No. 3, pp. 10-23. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf


Larrabee Motivation
• Comparison of two architectures with the same number of transistors
  – The simplified in-order cores deliver half the single-stream performance
  – 20x higher peak vector throughput for multi-stream execution (160 vs. 8 per clock)

                       2 out-of-order cores    10 in-order cores
  Instruction issue    4                       2
  VPU per core         4-wide SSE              16-wide
  L2 cache size        4 MB                    4 MB
  Single stream        4 per clock             2 per clock
  Vector throughput    8 per clock             160 per clock
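The vector throughput row of the comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes each core issues one full-width vector operation per clock; the dictionary layout and function name are illustrative only.

```python
# Peak vector throughput = number of cores * VPU lanes per core,
# assuming one full-width vector operation per core per clock.
configs = {
    "2 out-of-order cores": {"cores": 2, "vpu_lanes": 4, "single_stream": 4},
    "10 in-order cores": {"cores": 10, "vpu_lanes": 16, "single_stream": 2},
}

def vector_throughput(cfg):
    """Vector elements processed per clock across all cores."""
    return cfg["cores"] * cfg["vpu_lanes"]

ooo = configs["2 out-of-order cores"]
ino = configs["10 in-order cores"]

vector_ratio = vector_throughput(ino) / vector_throughput(ooo)
single_stream_ratio = ino["single_stream"] / ooo["single_stream"]

print(vector_throughput(ooo), vector_throughput(ino))  # per-clock vector throughput
print(vector_ratio, single_stream_ratio)               # in-order relative to out-of-order
```

The in-order design gives up single-stream speed and gains vector throughput from having both more cores and wider VPUs.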

Larrabee Overview
• Many-core visual computing architecture
• Based on x86 CPU cores
  – Extended version of the regular x86 instruction set
  – Supports subroutines and page faulting
• Number of x86 cores can vary depending on the implementation and processor version
• Fixed-function units for texture filtering
  – Other graphical operations such as rasterization or post-shader blending are done in software


Larrabee Overview (II)

Image Source: [1]

Overview of a Larrabee Core (I)

Image Source: [1]


Overview of a Larrabee Core (II)
• x86 core derived from the Pentium processor
  – No out-of-order execution
• Standard Pentium instruction set with the addition of
  – 64-bit instructions
  – Instructions for pre-fetching data into L1 and L2 cache
  – Support for 4 simultaneous threads, separate registers for each thread
• Each core is augmented with a wide vector processor (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
  – Coherent L2 cache across all cores

Vector Processing Unit in Larrabee
• 16-wide VPU executing integer, single- and double-precision floating-point operations
• VPU supports gather-scatter operations
  – The 16 elements can be loaded from, or stored to, up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)
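Gather-scatter and predication can be emulated lane by lane in plain Python. This is a sketch of the idea only, not Larrabee code: the helper names (gather, scatter, masked_select) and the toy memory are hypothetical, and a real VPU performs each of these steps as a single vector instruction.

```python
VLEN = 16  # lanes per vector register, as stated on the slide

def gather(memory, indices):
    """Load one element per lane from up to VLEN different addresses."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values, mask):
    """Store lane values to their addresses; lanes with a False mask bit are skipped."""
    for i, v, m in zip(indices, values, mask):
        if m:
            memory[i] = v

def masked_select(mask, then_vals, else_vals):
    """Per-lane if-then-else: both branches are computed, the mask picks the result."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

memory = list(range(100, 164))                  # toy 64-element memory
indices = [(3 * k) % 64 for k in range(VLEN)]   # 16 distinct, non-contiguous addresses

x = gather(memory, indices)

# Scalar code being vectorized:
#   if x[i] is even: y[i] = x[i] // 2
#   else:            y[i] = x[i] * 3
mask = [v % 2 == 0 for v in x]
y = masked_select(mask, [v // 2 for v in x], [v * 3 for v in x])

scatter(memory, indices, y, [True] * VLEN)
```

Because every lane evaluates both branches and the mask only selects which result is kept, no lane has to take a different control-flow path from its neighbors.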


Inter-Processor Ring Network
• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before injecting a message into the network

Larrabee Programming Models
• Most applications can be executed without modification due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
  – API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
  – Intel Threading Building Blocks
  – Larrabee-specific OpenMP directives


Larrabee Performance

Image Source: [1]

Intel Xeon Phi Processor
• First generation of the Intel MIC (Many Integrated Core) architecture
• 60 cores / 1.0 GHz
• 512-bit wide vector engine
• 32 KB L1 I/D cache, 512 KB L2 cache (per core)
• Up to 1 TFLOPS double-precision performance
• 8 GB GDDR5 memory and 320 GB/s bandwidth
• Standard PCIe x16 form factor
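The "up to 1 TFLOPS" figure is roughly what the core count, clock rate and vector width give when multiplied out. The sketch below assumes one fused multiply-add (2 floating-point operations) per 64-bit lane per cycle; that factor is an assumption that does not appear on the slide.

```python
cores = 60
clock_ghz = 1.0
vector_bits = 512

dp_lanes = vector_bits // 64      # 64-bit doubles per vector register
flops_per_lane_per_cycle = 2      # assumption: one fused multiply-add per lane per cycle

peak_dp_gflops = cores * clock_ghz * dp_lanes * flops_per_lane_per_cycle
print(dp_lanes, peak_dp_gflops)
```

Under these assumptions the estimate is 960 GFLOPS, in line with the quoted peak.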


IBM Cell Overview (I)
• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
  – E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
  – IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
  – one (or more) general purpose processor elements (PPE) and
  – one (or more) synergistic processor elements (SPEs)


Cell Architecture block diagram

Image Source: [2]

• Two generations available so far:
  – Cell BE:
    • 204.8 GFLOPS single precision peak performance
    • 14.6 GFLOPS double precision peak performance
  – PowerXCell 8i (2008):
    • 204.8 GFLOPS single precision peak performance
    • 102.4 GFLOPS double precision peak performance
  – Both have 1 PPE and 8 SPEs
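The 204.8 and 102.4 GFLOPS figures are consistent with a simple peak-rate estimate over the 8 SPEs. The sketch below assumes a 3.2 GHz clock and one fused multiply-add per SIMD lane per cycle on 128-bit registers; neither assumption is stated on the slide, and the PPE's contribution is ignored.

```python
CLOCK_GHZ = 3.2      # assumption: clock rate is not given on the slide
SPES = 8
REGISTER_BITS = 128  # SPE SIMD register width
FLOPS_PER_FMA = 2    # assumption: a fused multiply-add counts as two operations

def spe_peak_gflops(element_bits):
    """Peak GFLOPS over all SPEs if every lane retires one FMA per cycle."""
    lanes = REGISTER_BITS // element_bits
    return SPES * CLOCK_GHZ * lanes * FLOPS_PER_FMA

single_precision = spe_peak_gflops(32)   # 4 lanes per register
double_precision = spe_peak_gflops(64)   # 2 lanes per register
print(single_precision, double_precision)
```

The 14.6 GFLOPS double-precision figure of the first Cell BE is well below what this formula gives, which suggests that its double-precision unit cannot start an operation every cycle; the sketch does not model that.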


General Purpose Processor (PPE)
• Based on the IBM PowerPC processor
  – Supports multiple simultaneous operating environments (virtualization)
  – E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system
• Performs management and application control functions

Synergistic Processor Element (SPE)
• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from the local storage
  – Current versions of the Cell processors: 256 KB local storage
• The local storage is connected to the main memory through a Memory Flow Controller (MFC)
  – MFC moves data between main memory and local storage, or between two SPEs
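Because an SPE can only compute on data in its local storage, a large array has to be streamed through it in chunks. The sketch below simulates that pattern in Python. The functions dma_get and dma_put are hypothetical stand-ins for MFC transfers (not a real Cell API), and the buffer budget is an illustrative choice.

```python
LOCAL_STORE_BYTES = 256 * 1024
ELEMENT_BYTES = 4  # single-precision floats

# Illustrative budget: half of the local store is left for code and stack,
# the other half is split evenly into an input and an output buffer.
CHUNK_ELEMS = LOCAL_STORE_BYTES // 2 // 2 // ELEMENT_BYTES

def dma_get(main_memory, start, count):
    """Stand-in for an MFC transfer: main memory -> local storage."""
    return main_memory[start:start + count]

def dma_put(main_memory, start, local_buffer):
    """Stand-in for an MFC transfer: local storage -> main memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer

def spe_scale(main_memory, factor):
    """Scale a large array one local-store-sized chunk at a time."""
    for start in range(0, len(main_memory), CHUNK_ELEMS):
        local_in = dma_get(main_memory, start, CHUNK_ELEMS)
        local_out = [factor * v for v in local_in]  # compute touches local buffers only
        dma_put(main_memory, start, local_out)

data = [float(i) for i in range(40000)]  # larger than one chunk
spe_scale(data, 2.0)
```

A real implementation would typically overlap the transfer of the next chunk with the computation on the current one (double buffering); this sketch keeps the steps sequential for clarity.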


MFC commands

Image Source: [2]

Synergistic Processor Element (SPE) (II)
• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
  – Sixteen 8-bit integers or
  – Eight 16-bit integers or
  – Four 32-bit integers or single precision floating-point numbers or
  – Two 64-bit integers or double precision floating-point numbers
• Most instructions supported by the synergistic processor unit operate on all elements in a register -> SIMD
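The different interpretations of one 128-bit register can be illustrated with Python's struct module, which reinterprets the same 16 bytes under several element widths. This is only an illustration of the lane layout; the simd_add_u32 helper is a hypothetical emulation of a lane-wise add, not an SPE instruction.

```python
import struct

REGISTER_BYTES = 16                 # 128 bits
raw = bytes(range(REGISTER_BYTES))  # arbitrary register contents: 0x00, 0x01, ..., 0x0f

# The same 16 bytes viewed under each element type (big-endian byte order).
views = {
    "16 x 8-bit int": struct.unpack(">16B", raw),
    "8 x 16-bit int": struct.unpack(">8H", raw),
    "4 x 32-bit int": struct.unpack(">4I", raw),
    "4 x float": struct.unpack(">4f", raw),
    "2 x 64-bit int": struct.unpack(">2Q", raw),
    "2 x double": struct.unpack(">2d", raw),
}
lane_counts = {name: len(values) for name, values in views.items()}

def simd_add_u32(a, b):
    """Lane-wise add of four 32-bit unsigned integers with wrap-around."""
    return tuple((x + y) & 0xFFFFFFFF for x, y in zip(a, b))

print(lane_counts)
print(simd_add_u32((1, 2, 3, 0xFFFFFFFF), (1, 1, 1, 1)))
```

A single SIMD instruction applies the operation to every lane at once, so narrower element types yield more results per instruction.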


Simplified representation of a current Cell processor

Image Source: [3]

Element Interconnect Bus
• PPE and SPEs communicate through the Element Interconnect Bus
  – Contains a shared command bus
    • Sets up end-to-end transactions
    • Used for coherence protocols
  – Point-to-point data interconnect
    • Four 16-byte-wide rings, two used for clockwise data transfers, two for counter-clockwise data transfers
    • Each ring transfers 128-byte packets (= cache block size of an SPE)
    • Communication costs between two SPEs can vary between 1 hop and 6 hops
  – Overall bandwidth: 204.8 GB/s
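The hop range and the bandwidth figure can both be reproduced with a small model of the ring. The sketch below rests on assumptions that are not on the slide: 12 elements attached to the ring, a bus clock of 1.6 GHz, and at most one 128-byte transfer started per bus cycle through the shared command bus.

```python
RING_ELEMENTS = 12    # assumption: PPE, 8 SPEs, memory controller, 2 I/O interfaces
BUS_CLOCK_GHZ = 1.6   # assumption: the bus runs at half of a 3.2 GHz core clock
PACKET_BYTES = 128

def hops(src, dst, n=RING_ELEMENTS):
    """Hop count when a transfer takes the shorter of the two ring directions."""
    d = (dst - src) % n
    return min(d, n - d)

hop_counts = [hops(0, dst) for dst in range(1, RING_ELEMENTS)]
min_hops, max_hops = min(hop_counts), max(hop_counts)

# If the shared command bus can start one 128-byte transfer per bus cycle,
# that rate bounds the data bandwidth of all rings together.
peak_gb_per_s = PACKET_BYTES * BUS_CLOCK_GHZ

print(min_hops, max_hops, peak_gb_per_s)
```

With clockwise and counter-clockwise rings available, no transfer has to travel more than halfway around the ring, which is why the worst case is 6 hops rather than 11.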


Comparison IBM Cell and Intel Larrabee
• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus to communicate between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
• IBM Cell requires data to be moved explicitly to the ‘local store’, while Larrabee can address any memory area
  – Programs for the Cell have to be written taking the limited amount of memory available to an SPE into account
