Intel Xeon Phi MIC Offload Programming Models


Author: Gervase Lyons


ver DJ2013-03 as of 4 Oct 2013 PowerPoint original available on request

Intel Xeon Phi MIC Offload Programming Models Doug James Oct 2013

© The University of Texas at Austin, 2013 Please see the final slide for copyright and licensing information.

Key References

• Jeffers and Reinders, Intel Xeon Phi...
  – but some material is no longer current
• Intel Developer Zone
  – http://software.intel.com/en-us/mic-developer
  – http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
• Stampede User Guide and related TACC resources
  – Search User Guide for "Advanced Offload" and follow link

Other specific recommendations appear throughout this presentation.

Overview

• Basic Concepts
• Three Offload Models
• Issues and Recommendations

Source code available on Stampede: tar xvf ~train00/offload_demos.tar
Project codes: TG-TRA120007 (XSEDE Portal), 20131004MIC (TACC Portal)

Offloading: MIC as assistant processor

A program running on the host “offloads” work by directing the MIC to execute a specified block of code. The host also directs the exchange of data between host and MIC.

“...do work and deliver results as directed...”

[Figure: app running on host, linked to the MIC over x16 PCIe]

Ideally, the host stays active while the MIC coprocessor does its assigned work.

Offload Models

• Compiler Assisted Offload
  – Explicit
    • Programmer explicitly directs data movement and code execution
  – Implicit
    • Programmer marks some data as “shared” in the virtual sense
    • Runtime automatically synchronizes values between host and MIC
• Automatic Offload (AO)
  – Computationally intensive calls to Intel Math Kernel Library (MKL)
  – MKL automatically manages details
  – More than offload: work division across host and MIC!
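To make the Implicit model concrete, here is a minimal sketch assuming the Intel compiler's `_Cilk_shared` and `_Cilk_offload` keywords. The names `SHARED`, `OFFLOAD`, `shared_total`, `add_to_total`, and `implicit_offload_demo` are hypothetical and do not come from the offload_demos codes. The macros compile the keywords away when `__INTEL_OFFLOAD` is not defined, so the same source still builds and runs entirely on the host.

```c
/* __INTEL_OFFLOAD is defined by the Intel compiler when offload is enabled.
   Without it, the keywords vanish and everything runs on the host. */
#ifdef __INTEL_OFFLOAD
#define SHARED  _Cilk_shared
#define OFFLOAD _Cilk_offload
#else
#define SHARED
#define OFFLOAD
#endif

/* Hypothetical shared variable: visible to both host and MIC code. */
SHARED int shared_total = 0;

/* Hypothetical shared function: compiled for both host and MIC. */
SHARED void add_to_total(int value)
{
    shared_total += value;
}

int implicit_offload_demo(void)
{
    shared_total = 10;        /* written on the host */
    OFFLOAD add_to_total(5);  /* runs on the MIC when offload is enabled */
    return shared_total;      /* runtime has synchronized the value back */
}
```

Note that there are no data-movement clauses here: the programmer only marks what is shared and the runtime keeps the values in sync, which is the contrast with the Explicit model.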

Explicit Model: Direct Control of Data Movement

• aka Copyin/Copyout, Non-Shared, COI*
• Available for C/C++ and Fortran
• Supports simple (“bitwise copyable”) data structures (think 1d arrays of scalars)

*Coprocessor Offload Infrastructure
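As an illustration of "direct control of data movement" on a 1d array of scalars, the sketch below uses the Intel compiler's `inout` clause with a `length` modifier. The function name `scale_on_mic` is hypothetical and is not one of the offload_demos codes. A compiler without offload support ignores the pragma, in which case the loop simply runs on the host.

```c
/* Hypothetical helper: scale a 1D array of doubles in place. */
void scale_on_mic(double *a, int n, double factor)
{
    int i;

    /* inout(...) asks for n doubles to be copied to the MIC before the
       block runs and copied back to the host when it finishes. The
       scalars n, factor, and i are handled automatically. */
    #pragma offload target(mic) inout(a : length(n))
    {
        for (i = 0; i < n; i++)
            a[i] *= factor;
    }
}
```

The explicit `length(n)` is needed because `a` is a pointer: the compiler cannot tell from the declaration how many elements to transfer.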

Explicit Offload

Simple Fortran and C codes that each return "procs: 16" on Sandy Bridge host…

F90 (ifort -openmp off00host.f90)

    program main
      use omp_lib
      integer :: nprocs

      nprocs = omp_get_num_procs()
      print*, "procs: ", nprocs
    end program

C/C++ (icc -openmp off00host.c)

    #include <stdio.h>
    #include <omp.h>

    int main( void )
    {
      int totalProcs;

      totalProcs = omp_get_num_procs();
      printf( "procs: %d\n", totalProcs );
      return 0;
    }

Explicit Offload

Add a one-line directive/pragma that offloads to the MIC the one line of executable code that occurs below it… codes now return "procs: 240"…

F90 (ifort -openmp off01simple.f90)

    program main
      use omp_lib
      integer :: nprocs

      !dir$ offload target(mic)        ! offload directive
      nprocs = omp_get_num_procs()     ! runs on MIC

      print*, "procs: ", nprocs        ! runs on host
    end program

C/C++ (icc -openmp off01simple.c)

    #include <stdio.h>
    #include <omp.h>

    int main( void )
    {
      int totalProcs;

      #pragma offload target(mic)              // offload pragma
      totalProcs = omp_get_num_procs();        // runs on MIC

      printf( "procs: %d\n", totalProcs );     // runs on host
      return 0;
    }
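The processor count is one way to see where the offloaded line ran. Another is sketched below, assuming the `__MIC__` macro that the Intel compiler defines only when it builds the coprocessor version of the code. The function name `offload_ran_on_mic` is hypothetical and is not one of the offload_demos codes.

```c
/* Hypothetical helper: report whether the offloaded block ran on the MIC. */
int offload_ran_on_mic(void)
{
    int on_mic = 0;   /* scalar, so it is copied to the MIC and back */

    #pragma offload target(mic)
    {
#ifdef __MIC__
        /* This assignment exists only in the MIC build of the block. */
        on_mic = 1;
#endif
    }

    return on_mic;    /* 1 if the block executed on the coprocessor */
}
```

With a compiler that has no offload support the pragma is ignored, the block is empty on the host, and the function returns 0.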

Explicit Offload

Don't even need to change the compile line for the same off01simple codes…

    ifort -openmp off01simple.f90
    icc -openmp off01simple.c

…and don't use "-mmic".

Explicit Offload (off01simple)

Not asynchronous (yet): in the same off01simple codes, the host pauses at the offload directive/pragma until MIC is finished.

Explicit Offload (off02block)

Can offload a block of code (generally safer than the one-line approach)…

F90

    !dir$ offload begin target(mic)
    nprocs = omp_get_num_procs()
    maxthreads = omp_get_max_threads()
    !dir$ end offload

C/C++

    #pragma offload target(mic)
    {
      totalProcs = omp_get_num_procs();
      maxThreads = omp_get_max_threads();
    }

Explicit Offload (off03omp)

…or an OpenMP region defined by an omp directive…

F90

    program main
      integer, parameter :: N = 500000   ! constant
      real :: a(N)                       ! on stack

      !dir$ offload target(mic)
      !$omp parallel do
      do i=1,N
        a(i) = real(i)
      end do
      !$omp end parallel do
      ...

C/C++

    int main( void )
    {
      double a[500000];   // on the stack; literal here is important
      int i;

      #pragma offload target(mic)
      #pragma omp parallel for
      for ( i=0; i