Offload Code to the Intel Xeon Phi Coprocessor

ABSTRACT

This article describes the concept of "Offload," which refers to writing a program from the point of view of running code on processor(s) and offloading work from the processor to one or more Intel® Xeon Phi™ coprocessors. Execution begins on the host processor and, based on user-defined code, some sections are offloaded to the coprocessor if present, or run on the host processor if not. Examples for a variety of languages are provided.

==========================================================================

"Offload" refers to writing a program from the point of view of running on processor(s) and offloading work from the processor to one or more Intel® Xeon Phi™ coprocessors. Execution begins on the host processor and, based on user-defined code, some sections are offloaded to the coprocessor if present, or run on the host processor if not. A key feature of offload is that the resulting binary runs whether or not a coprocessor is present (unless you choose to use #pragma offload target(mic:coproc-num) or _Cilk_offload_to to specify that a coprocessor is required).

Offloading can be thought of simply as inline code that may be run on a coprocessor, as shown in Figure 1. When needed, the compiler produces the required activity to copy data to the coprocessor's memory, run the code on the coprocessor, and then copy the results back to the processor's memory. This simplistic use will work, but performance may be limited by the lack of any concurrency between the processor and coprocessor, as well as by the implied movement of data before and after the offloaded code. Many additional controls are available to manage the movement of data, and they are important to understand if programming using offload appeals to you.

Offload can be a good programming approach as long as the code spends a lot of time doing computation without I/O, the computationally intensive portion of the code and the data on which it works are relatively easy to encapsulate, the computation time is substantially higher than the data transfer time (that is, N² computation for N data), and the data fits in coprocessor memory or can be partitioned for coprocessor memory. The best fits also have the ability to structure computation and data so that multiple agents can perform computations and data transfers asynchronously, especially to overlap them.

Two offload models

A number of offload controls and features are needed because the host processor and the coprocessor do not share common system memory. This leads to the need to move data back and forth between the host processor and the coprocessor. The compiler supports two distinct programming models, which differ in their approach to dealing with the lack of shared system memory: the non-shared memory model and the virtual-shared memory model. You can use both models of offloading in a single program; however, the data manipulated by the two models should be distinct. Table 1 provides a comparison of the offload options.

// code running on processor
foo(9);
x[i] = 9;
#pragma offload
foo(1202);   // runs on the coprocessor
bar(19);     // back on the processor

Figure 1 - Running Code on a Processor and Offloading to a Coprocessor.


Table 1 - Comparison of the Two Offload Models

Programming languages
  Pragma offload: Fortran, C, C++ (C++ functions may be called, but C++ classes cannot be transferred)
  Shared VM model: C, C++

Syntax
  Pragma offload: C/C++: #pragma offload; Fortran: !dir$ omp offload (more details in Tables 4 and 5)
  Shared VM model: _Cilk_shared, _Cilk_offload

Used for
  Pragma offload: Offloads that transfer contiguous blocks of data back and forth with complete user control
  Shared VM model: Offloads that transfer all or parts of complex data structures, or many small pieces of data, on demand (no explicit controls)

Offloaded data allowed
  Pragma offload: Scalars, arrays, and bit-wise structures that can be copied; no simple way to offload non-sequence derived types
  Shared VM model: All data types (pointer-based data, structures, classes, locks, and so on)

When data movement occurs
  Pragma offload: User has explicit control of data movement at the start of each offload directive
  Shared VM model: _Cilk_shared data synchronized on demand inside _Cilk_offload statements

When offload code is copied to coprocessor
  Pragma offload: At program start-up
  Shared VM model: At first offload pragma or directive

Functions and variables available
  Pragma offload: All functions and variables are always available on host processor(s); functions and variables are available on coprocessors only if marked using an offload pragma, attribute, or directive
  Shared VM model: Functions and variables are available on coprocessors only if marked _Cilk_offload or _Cilk_shared

Performance
  Pragma offload: Offers the highest performance of the two offload models, because no overhead from software coherency occurs
  Shared VM model: Offers convenience (software coherency) that imposes some overhead

In both models, specifying that something should run on the coprocessor does not guarantee that it will. The availability of an Intel® Xeon Phi™ coprocessor at the offload point determines whether the offload succeeds. When an offload fails, the construct executes on a host processor instead. This fallback, and the shared programming models between coprocessors and processors, make portable programming a straightforward effort.
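As a minimal sketch of how fallback can be observed at runtime (assuming, as I believe is the case, that _Offload_get_device_number() in offload.h returns a device number of -1 when the calling code is executing on the host):

#include <stdio.h>
#include <offload.h>

int main(void)
{
    int where = -2;   // sentinel, overwritten inside the offload region

    #pragma offload target(mic) out(where)
    {
        where = _Offload_get_device_number();   // -1 when running on the host
    }

    if (where >= 0)
        printf("offload ran on coprocessor %d\n", where);
    else
        printf("offload fell back to the host processor\n");
    return 0;
}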

Choosing offload vs. native execution

When to use an offload model versus a native execution model is likely to be a lively debate for some time. We can offer some insight into the key criteria to consider. If you are using MPI, you should read Chapter 12 of the book Intel® Xeon Phi™ Coprocessor High Performance Programming by Jim Jeffers and James Reinders for insights into choosing offload vs. native modes while using MPI.

Offload is most appropriate when the program cannot be made highly parallel consistently throughout most of the application:

• I/O-intensive codes that have hot computational sections. The processor best handles I/O.
• Large, complex applications with a reasonably small number of hotspots, such as compute phases or computationally time-intensive filters or effects modules. Note that the computation in the offload has to justify the data transfer costs.
• Programs with memory needs that are too high, where offloading computational kernels can be done with more bounded memory needs.

Native execution is good for programs that largely perform operations that map to parallelism, in either threads or vectors, and do not do significant amounts of I/O or serial execution. Offload has some additional concerns. Asynchronous allocation, copies, and deallocation of data are possible with asynchronous offload execution. Another challenge of offloading is that it requires two levels of memory blocking: one to fit the input data into the coprocessor's memory, and another within the offload code to fit within the processor caches and not oversaturate the processor memory subsystem when all cores are busy.

Non-shared memory model: using offload pragmas/directives

The non-shared memory model uses the offload pragma or directive, and other pragmas or directives with the prefix offload_. This model is appropriate for flat data structures such as scalars, arrays, and structs that are bit-wise copyable (they contain no pointers and invoke no constructors or destructors). Data in this model is copied back and forth between the processor and the coprocessor around regions of offloaded code. The data selected for transfer is a combination of variables implicitly transferred because they are lexically referenced within offload constructs, and variables explicitly listed in clauses in the pragma. C/C++ simple offload is done via #pragma offload target(mic), while Fortran simple offload is done via !dir$ offload target(mic).
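For instance, a minimal sketch of a simple offload (array names and sizes are illustrative) looks like this; the in and out clauses give explicit control over what is copied in each direction:

#include <stdio.h>
#define N 1000

int main(void)
{
    float in_data[N], out_data[N];
    for (int i = 0; i < N; i++)
        in_data[i] = (float)i;

    // in() copies in_data to the coprocessor before the loop runs there;
    // out() copies out_data back when it finishes. Statically sized arrays
    // need no length() clause because the compiler knows their extent.
    #pragma offload target(mic) in(in_data) out(out_data)
    for (int i = 0; i < N; i++)
        out_data[i] = 2.0f * in_data[i];

    printf("%f\n", out_data[N-1]);
    return 0;
}

Without a coprocessor present, the same binary runs the loop on the host, as described above.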

Shared virtual memory model: using offload with shared VM

The shared virtual memory (VM) model is integrated with Intel® Cilk™ Plus and uses the _Cilk_shared and _Cilk_offload keywords for C/C++ programming. There is no Fortran support for this shared virtual memory offload model. In this model, variables that need to be shared between processor(s) and coprocessor(s) are marked _Cilk_shared; such variables can then be used in both host and coprocessor code. Dynamically allocated memory you wish to share must be allocated with special functions: _Offload_shared_malloc, _Offload_shared_aligned_malloc, _Offload_shared_free, and _Offload_shared_aligned_free. The compiler and runtime automatically maintain coherence at the beginning and end of offload statements, and only modified data is transferred when using this model. This model is appropriate for complex pointer-based data structures such as linked lists, trees, and the like. It uses a software implementation of virtual memory that is shared between the host processor and the coprocessor. Code looks like:

_Cilk_shared double foo;

and

_Cilk_offload func(y);

Shared virtual memory offers convenience (software coherency) that imposes some overhead when updates are made and need to be transferred. An update on a coprocessor will need transferring before it can be used on a processor or another coprocessor. Not accessing shared data simultaneously will improve performance, as data tracking is then not required.
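As a minimal sketch (names illustrative), marking a function and a variable _Cilk_shared and invoking the function with _Cilk_offload looks like this:

#include <stdio.h>

_Cilk_shared int result;             // lives in the shared address space

_Cilk_shared int square(int x)       // compiled for both host and coprocessor
{
    return x * x;
}

int main(void)
{
    // Runs square() on a coprocessor if one is available, else on the host;
    // coherence for result is maintained at the offload boundaries.
    result = _Cilk_offload square(21);
    printf("%d\n", result);
    return 0;
}

Heap data to be shared would come from _Offload_shared_malloc rather than malloc, as noted above.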


Intel® Math Kernel Library (Intel MKL) automatic offload

The Intel® Math Kernel Library (Intel MKL) has automatic offload capabilities. These are discussed in Chapter 11 of the book Intel® Xeon Phi™ Coprocessor High Performance Programming by Jim Jeffers and James Reinders, which gives a complete discussion of Intel MKL.
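As a brief, hedged sketch: automatic offload is typically enabled either by setting the MKL_MIC_ENABLE=1 environment variable or programmatically with mkl_mic_enable(); once enabled, suitably large MKL calls (such as ?GEMM) may offload with no other source changes:

#include <stdio.h>
#include "mkl.h"

int main(void)
{
    // Ask MKL to use Automatic Offload; returns 0 on success.
    if (mkl_mic_enable() != 0)
        printf("Automatic Offload unavailable; MKL will run host-only\n");

    /* ... ordinary MKL calls (for example, cblas_dgemm on large matrices)
       may now transparently offload to available coprocessors ... */
    return 0;
}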

Language extensions for offload

Intel introduced two models for offloading with Intel Xeon Phi coprocessors: offload pragmas and shared virtual memory offload. At the same time, Intel and many others have participated in discussions of adding what Intel calls pragma offload functionality to a future OpenMP specification. nVidia's OpenACC was introduced during this time as a subset suitable for nVidia GPUs, with loss of the generality of an OpenMP-based model, including loss of key features needed to fully utilize Intel Xeon Phi coprocessors. Resolving these two approaches remained work for the OpenMP committee, to create a convergence that will supersede nVidia's OpenACC and Intel's original offload implementation. In November 2012, OpenMP released its first public review document detailing the proposed solution, called "Technical Report 1 on Directives for Attached Accelerators," which was discussed at the Birds-of-a-Feather (BoF) session at the Supercomputing 2012 (SC12) conference in Salt Lake City, Utah. Intel compilers added support for this draft in January 2013, and these directives are expected to be included in the next OpenMP specification in 2013 (expected to be called OpenMP 4.0). The concepts are the same and, for the most part, the syntax differs only slightly; the keyword "offload" is expected to become "target." You will be able to use the original Intel syntax or the new OpenMP syntax in the Intel compilers for the foreseeable future. The Intel compilers provide the language extensions listed in Table 2 to facilitate programming for Intel® MIC Architecture. Table 3 is a translation table comparing the newly proposed OpenMP target model described in OpenMP TR1, nVidia's OpenACC, and Intel's Language Extensions for Offload (LEO). The Intel compilers support OpenMP TR1 and Intel's LEO.
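As a hedged illustration of the correspondence (the TR1 clause spellings were still draft at this writing), here is the same offload expressed in Intel LEO and in the proposed OpenMP target syntax:

#define N 100
float a[N], b[N];

// Intel LEO spelling: explicit in/out clauses name the data to move.
void leo_version(void)
{
    #pragma offload target(mic) in(a) out(b)
    for (int i = 0; i < N; i++)
        b[i] = a[i] + 1.0f;
}

// Proposed OpenMP TR1 spelling: "target" replaces "offload", and
// map(to:)/map(from:) take the place of in()/out().
void openmp_target_version(void)
{
    #pragma omp target map(to: a) map(from: b)
    for (int i = 0; i < N; i++)
        b[i] = a[i] + 1.0f;
}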


Table 2 - Intel Language Extensions for Offload

C/C++ pragmas (start with #pragma): offload, offload_attribute, offload_transfer, offload_wait
  Pragmas to control the data transfer between the processor and the coprocessor.

Fortran directives (start with !dir$): OFFLOAD, OFFLOAD_ATTRIBUTE, OFFLOAD_TRANSFER, OFFLOAD_WAIT, OFFLOAD BEGIN / END OFFLOAD, ATTRIBUTES OFFLOAD
  Directives to control the data transfer between the processor and the coprocessor.

_Cilk_offload keyword, _Cilk_shared keyword
  C/C++ only. Keywords to control the data transfer between the processor and the coprocessor. The data to be exchanged between the processor and the coprocessor can be arbitrarily complex.

__MIC__ macro, __KNC__ macro, __INTEL_OFFLOAD macro
  Predefined macros for Intel® MIC Architecture. See Table 4 for more information.

APIs in offload.h or mic_lib.f90
  A set of functions for:
  • dealing with multiple coprocessors
  • calling functions on the processor to modify the coprocessor's execution environment
  • writing code that should not be built for processor-only execution
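For example, a minimal sketch of how the predefined macros steer code (function name illustrative): __MIC__ is defined only in the coprocessor-side compilation, so the same source yields different behavior in the host and coprocessor binaries:

#include <stdio.h>

__attribute__((target(mic)))        // compile for both host and coprocessor
const char *where(void)
{
#ifdef __MIC__
    return "coprocessor";           // present only in the coprocessor binary
#else
    return "host";                  // present only in the host binary
#endif
}

int main(void)
{
    printf("outside offload: %s\n", where());

    #pragma offload target(mic)
    printf("inside offload:  %s\n", where());   // I/O is proxied to the host

    return 0;
}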

Compiler options and environment variables for offload

The compiler provides several compiler options and environment variables that you can use when building a binary for Intel Xeon Phi coprocessors. You can use the compiler options to:

1. Ignore language constructs for offloading (-no-offload)
2. Build an application that runs natively on Intel MIC Architecture (-mmic)
3. Flag every global routine and global data object in the source file with the offload attribute target(mic) (-offload-attribute-target)
4. Specify options to be used for the specified target and tool (-offload-option)
5. Specify the offload optimizer phase to use when optimization reports are generated (-opt-report-phase=offload)

You can use environment variables for a variety of tasks, including:

• Setting the stack size on the coprocessor (MIC_STACKSIZE)
• Controlling environment variables passed to the coprocessor (MIC_ENV_PREFIX)
• Controlling the MIC I/O proxy (MIC_PROXY_FS_ROOT and MIC_PROXY_IO)
• Diagnostic information printed during execution (OFFLOAD_REPORT). Set to a value of 1 for a condensed report, or 2 for a more verbose report.

Offload-specific arguments to the Intel compilers:

• -offload-build: Generate host + coprocessor code (by default only host code is generated). This activates -openmp.
• -opt-report-phase:offload: Produce a report of offload data transfers at compile time (not runtime).
• -offload-copts:"switches": Add Intel MIC Architecture compiler switches.
• -offload-aropts:"switches": Add Intel MIC Architecture archiver switches.
• -offload-ldopts:"switches": Add Intel MIC Architecture linker switches.

Caveats:

• Standalone programs and their data need to be copied manually by the user using ftp or scp.
• Shared libraries, such as libiomp5.so (which has no static counterpart), may need to be copied manually, even if you link your program statically.
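Putting a few of these together, a hedged sketch of typical build and run lines (the source file name is illustrative):

icc -openmp -o myapp myapp.c            # host binary with offload sections
icc -mmic -o myapp.mic myapp.c          # native Intel MIC Architecture binary
icc -no-offload -o myapp.host myapp.c   # ignore offload constructs entirely
OFFLOAD_REPORT=2 ./myapp                # run with a verbose data-transfer report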

Table 3 – Offload Models: OpenMP Target (TR1), OpenACC, Intel LEO
(The Intel compiler supports OpenMP target, new in 2013, and Intel LEO, which is documented and used in this chapter.)

Memory model
  OpenMP target: host and target-region data environment on device memory
  OpenACC: host and device memory
  Intel LEO: host and offload-region data environment on device memory

Data placement

  data clauses
    OpenMP target: map-{to,from}, scratch
    OpenACC: present_or_copy-{in,out}, nocopy
    Intel LEO: inout, in, out, nocopy, alloc_if, free_if

  data construct (structured block data placement)
    OpenMP target: target data
    OpenACC: acc data
    Intel LEO: paired offload_transfer

  update construct (data motion initiated by the host)
    OpenMP target: target update
    OpenACC: acc update
    Intel LEO: offload_transfer

  resident / mirror / declspec directive (unstructured data placement)
    OpenMP target: declare target mirror
    OpenACC: acc mirror
    Intel LEO: __declspec(target(mic)) / __attribute__((target(mic)))

  link
    OpenMP target: declare target linkable
    OpenACC: acc linkable
    Intel LEO: __declspec(target(mic)) / __attribute__((target(mic)))

  free/alloc API routines
    OpenMP target: yes
    OpenACC: yes
    Intel LEO: yes

Code placement / Execution model
  host-centric; the device executes the region

  Offloading
    OpenMP target: #pragma omp target
    OpenACC: #pragma acc parallel
    Intel LEO: #pragma offload

  parallel for / sections
    OpenMP target: #pragma omp parallel for / #pragma omp parallel sections
    OpenACC: acc loop
    Intel LEO: (supports full OpenMP inside the offload)

  resident/declspec function declare construct
    OpenMP target: declare target function
    OpenACC: declare acc function
    Intel LEO: __declspec(target(mic)) / __attribute__((target(mic)))

  device clause
    OpenMP target: device clause, ICV
    OpenACC: device clause, ICV
    Intel LEO: target(mic)

  API routines
    OpenMP target: get/set devnum
    OpenACC: get/set devnum
    Intel LEO: get/set devnum

  Asynchronous / synchronous control
    OpenMP target: thread waits on device at a task scheduling point (use tasking model)
    OpenACC: async clause and API functions
    Intel LEO: async/signal clause / offload_wait

  array shape/size
    OpenMP target: array sections
    OpenACC: array sections
    Intel LEO: array sections

  Multiple device support of same type

Sharing environment variables for offload

By default, all environment variables set on the host are passed to the coprocessor and affect execution there during offload. You can modify this by using MIC_ENV_PREFIX to control which environment variables are passed to the coprocessor. The common usage is to set MIC_ENV_PREFIX=MIC so that the system passes only environment variables that have a prefix of MIC. Note that this does not change the names of the variables that are transferred, so the MIC_LD_LIBRARY_PATH variable is not stripped and passed to the coprocessor; this means you cannot use MIC_ENV_PREFIX=MIC to change LD_LIBRARY_PATH on the coprocessor.
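A hedged sketch of the common usage (variable names illustrative):

export MIC_ENV_PREFIX=MIC     # forward only MIC-prefixed variables to the card
export MIC_MY_SETTING=1       # forwarded during offload execution
export MY_SETTING=0           # not forwarded: it lacks the MIC prefix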

Offloading to multiple coprocessors

A system may have multiple Intel Xeon Phi coprocessors. Offloading to each coprocessor can be programmed with a parallel loop that distributes work to each coprocessor explicitly, as shown in Figure 2. A parallel loop like that is one way of doing it. An alternative is to do it without threads on the host (remove the OpenMP directive), use a serial loop, and add a signal clause, signal(&var), to the offload. Using signal makes the offloads asynchronous: each launches work on the coprocessor and gives control back to the host right after the launch, not after the work completes, so a simple serial loop can launch work on multiple cards. A later pragma, or a loop of pragmas, can then use a wait(&var) clause with the offload construct, as in the sketch below.
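A hedged sketch of that serial-loop alternative (variable names illustrative); the first loop launches one asynchronous offload per card, and the second loop waits for each to complete:

#include <offload.h>

#define MAX_CARDS 8
static int done[MAX_CARDS];            // one signal tag per coprocessor

void run_on_all_cards(void)
{
    int num = _Offload_number_of_devices();

    for (int k = 0; k < num && k < MAX_CARDS; k++) {
        // signal(&done[k]) makes this offload asynchronous: control returns
        // to the host as soon as the work is launched on card k.
        #pragma offload target(mic : k) signal(&done[k])
        {
            /* work for coprocessor k */
        }
    }

    for (int k = 0; k < num && k < MAX_CARDS; k++) {
        // Block until the offload that used signal(&done[k]) has finished.
        #pragma offload_wait target(mic : k) wait(&done[k])
    }
}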


Using pragma/directive offload

The first of the two offload models we'll examine might be called the non-shared memory model, because shared memory is not available across all the cores and nothing in this model tries to provide a common address space. However, there is shared memory amongst the processor's cores, and a separate shared memory space on each coprocessor that is shared by all cores on a single Intel Xeon Phi coprocessor.

int num_coprocessors = _Offload_number_of_devices();
// add code here to handle what to do if there are
// no coprocessors in the system; the following code
// assumes there are one or more coprocessors.
#pragma omp parallel for num_threads(num_coprocessors) \
    schedule(static,1)
for ( int k = 0; k < num_coprocessors; k++ ) {
    #pragma offload ...
    {
        // code to run on a coprocessor
    }
}

Figure 2 - Sample Code to Offload to Multiple Coprocessors.

The advantage of the pragma/directive offload model is the ability to control data movement very precisely, but with some loss of generality because the data to transfer has to be specified. If you want to exchange more complex or dynamic data structures, you should consider the offload with shared virtual memory model that we explore later in this article. The pragma/directive offload model is suitable when the data exchanged between the processor and the coprocessor consists of scalars, arrays, and structs that could be copied from one variable to another using a simple memcpy. This model puts you in control of the data transfer between the processor and the coprocessor, with help from the compiler. You can select the data to be transferred at the point of offload, without needing to declare or allocate it in any special way. This focus on flat, or noncomplex, data structures allows us to precisely specify what blocks of data need to be transferred to and from the coprocessor. Of course, data that is not exchanged has no restrictions and can be arbitrarily complex, including multidimensional arrays, C++ classes of any type, and any composition of data structures using pointers, arrays, and structs.

We can place an offload pragma before any statement, including a compound statement, as shown in Figure 3. The statement prefixed with the offload pragma can also be an OpenMP parallel pragma. We can place the Fortran OFFLOAD directive before a subroutine call statement, a function call statement of the form x = func(), or an OpenMP parallel directive, as shown in Figure 4. The code in Figure 3 and Figure 4 finds the first ten even numbers and puts those numbers into an array. At the start of each code excerpt is the offload pragma or directive; the compiler builds the code block to run on both the processor and the coprocessor. Table 4 shows the support available in C, C++, and Fortran for offload. Code using OpenMP and offload together may look like Figure 5.

The host processor and coprocessors do not share the same system memory. An implication of this is that the variables used by the code block must be duplicated so that distinct copies exist on both the host processor and the coprocessor. The pragmas use specifiers to define the variables to copy between the host processor and the coprocessor:

• The in specifier defines a variable as strictly an input to the coprocessor. The value is not copied back to the host processor.
• The out specifier defines a variable as strictly an output of the coprocessor. The host processor does not copy the variable to the coprocessor.
• The inout specifier defines a variable that is both copied from the host processor to the coprocessor and back from the coprocessor to the host processor.

#pragma offload target(mic : target_id) \
    in(all_Vals : length(MAXSZ)) \
    inout(numEEs) \
    out(EE_vals : length(MAXSZ/2))
for (k=0; k < MAXSZ; k++) {
    if ( all_Vals[k] % 2 == 0 ) {
        EE_vals[numEEs] = all_Vals[k];
        numEEs++;
    }
}

Figure 3 - An Offload Pragma Can Be Placed Before Any Statement.

!DIR$ OFFLOAD BEGIN target(mic : target_id) &
    inout(numEs) in(all_Vals) out(E_vals)
do k = 1, MAXSZ
    if ( MODULO(all_Vals(k),2) == 0 ) then
        numEs = numEs + 1
        E_vals(numEs) = all_Vals(k)
    endif
enddo
!DIR$ END OFFLOAD

Figure 4 - An Offload Directive Can Be Placed Before a Subroutine Call Statement, a Function Call Statement of the Form x = func(), or an OpenMP Parallel Directive.

Table 4 – Offload Support

C/C++

Offload pragma: #pragma offload
  Semantics: The next statement can execute on the coprocessor if available, else on the processor.

Function and variable declarations: __attribute__((target(mic)))
  Semantics: Compile the function for, or allocate the variable on, both coprocessors and processors.

Whole blocks of code:
  #pragma offload_attribute(push, target(mic))
  // code
  #pragma offload_attribute(pop)
  Semantics: Mark entire files or large blocks of code as available for both coprocessor and processor.

Fortran

Offload directive: !dir$ omp offload
  Semantics: The next OpenMP parallel construct can execute on the coprocessor if available, else on the processor.

Offload directive: !dir$ offload
  Semantics: The next statement can execute on the coprocessor if available, else on the processor.

Function and variable declarations: !dir$ attributes offload:mic :: <routine-or-variable-names>
  Semantics: Compile the function for, or allocate the variable on, both coprocessors and processors.

// C/C++ OpenMP
#pragma offload target(mic)
#pragma omp parallel for
for (i=0; i < MAXSZ; i++) {
    ...
}

Figure 5 - Code Using OpenMP and Offload Together.
